Abstract
Researchers are increasingly confronted with the challenges of large-scale data computing. Quantile regression on a massive dataset is difficult because of the limitations of computer primary memory. Our proposed block average quantile regression (BAQR) provides a simple and efficient way to implement quantile regression on massive datasets. The key idea is to split the entire dataset into a few blocks, apply conventional quantile regression to the data within each block, and derive the final estimate by averaging the block-level quantile regression estimates. While our approach significantly reduces the storage required for estimation, the resulting estimator is theoretically as efficient as traditional quantile regression applied to the entire dataset. On the statistical side, we investigate the asymptotic properties of the resulting estimator. We verify and illustrate the proposed method through extensive Monte Carlo simulation studies as well as a real-world application.
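The split-average recipe described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it fits each block's quantile regression by solving the standard Koenker–Bassett linear program with `scipy.optimize.linprog`, then averages the block estimates. The function names `quantile_regression` and `baqr` are our own labels for this sketch.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression(X, y, tau):
    """Fit the tau-th quantile regression via the Koenker-Bassett LP:
    min tau*1'u + (1-tau)*1'v  s.t.  X(b+ - b-) + u - v = y,
    with variables [b+, b-, u, v] all nonnegative."""
    n, p = X.shape
    c = np.concatenate([np.zeros(2 * p), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:2 * p]  # beta = beta_plus - beta_minus

def baqr(X, y, tau, K):
    """Block average quantile regression: split the N observations into
    K blocks, fit quantile regression within each block, and return the
    simple average of the K block estimates."""
    blocks = np.array_split(np.arange(X.shape[0]), K)
    estimates = [quantile_regression(X[idx], y[idx], tau) for idx in blocks]
    return np.mean(estimates, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, K = 2000, 10
    X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
    beta = np.array([1.0, 2.0, -1.0])
    y = X @ beta + rng.normal(size=N)  # symmetric errors: median fit recovers beta
    print(baqr(X, y, tau=0.5, K=K))    # close to [1, 2, -1]
```

Only the K block-level fits ever touch raw data, so each LP involves n = N/K observations rather than all N, which is the memory saving the abstract refers to.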
References
Alhamzawi R (2015) Model selection in quantile regression models. J Appl Stat 42(2):445–458
Arcones MA (1996) The Bahadur–Kiefer representation of Lp regression estimators. Econ Theor 12(2):257–283
Briollais L, Durrieu G (2014) Application of quantile regression to recent genetic and -omic studies. Hum Genet 133(8):951–966
Chen X, Xie M (2014) A split-and-conquer approach for analysis of extraordinarily large data. Stat Sin 24(4):1655–1684
Cook LP (2014) Gendered parenthood penalties and premiums across the earnings distribution in Australia, the United Kingdom, and the United States. Eur Sociol Rev 30(3):360–372
El Bantli F, Hallin M (1999) L1-estimation in linear models with heterogeneous white noise. Stat Prob Lett 45(4):305–315
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360
Fan TH, Cheng KF (2007) Tests and variables selection on regression analysis for massive datasets. Data Knowl Eng 63(3):811–819
Fan TH, Lin DKJ, Cheng KF (2007) Regression analysis for massive datasets. Data Knowl Eng 61(3):554–562
He X, Shao QM (1996) A general Bahadur representation of M-estimators and its application to linear regression with nonstochastic designs. Ann Stat 24(6):2608–2630
Jiang R, Qian WM, Zhou ZG (2016) Single-index composite quantile regression with heteroscedasticity and general error distributions. Stat Pap 57:185–203
Killewald A, Bearak J (2014) Is the motherhood penalty larger for low-wage women? A comment on quantile regression. Am Sociol Rev 79(2):350–357
Knight K (1998) Limiting distributions for L1 regression estimators under general conditions. Ann Stat 26(2):755–770
Koenker R (2005) Quantile regression. Cambridge University Press, New York
Koenker R, Bassett GW (1978) Regression quantiles. Econometrica 46(1):33–50
Koenker R, Geling O (2001) Reappraising medfly longevity: a quantile regression survival analysis. J Am Stat Assoc 96(454):458–468
Koenker R, Portnoy S (1987) L-estimation for linear models. J Am Stat Assoc 82(399):851–857
Koenker R, Zhao Q (1994) L-estimation for linear heteroscedastic models. J Nonparametr Stat 3(3–4):223–235
Li R, Lin DK, Li B (2013) Statistical inference in massive data sets. Appl Stoch Models Bus Ind 29(5):399–409
Ning Z, Tang L (2014) Estimation and test procedures for composite quantile regression with covariates missing at random. Stat Prob Lett 95:15–25
Okada K, Samreth S (2012) The effect of foreign aid on corruption: a quantile regression approach. Econ Lett 115(2):240–243
Peng L, Huang Y (2008) Survival analysis with quantile regression models. J Am Stat Assoc 103(482):637–649
Powell D, Wagner J (2014) The exporter productivity premium along the productivity distribution: evidence from quantile regression with nonadditive firm fixed effects. Rev World Econ 150(4):763–785
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58(1):267–288
van der Vaart AW (1998) Asymptotic statistics. Cambridge University Press, New York
Wang H, He X (2007) Detecting differential expressions in genechip microarray studies: a quantile approach. J Am Stat Assoc 102(477):104–112
Xu Q, Niu X, Jiang C, Huang X (2015) The Phillips curve in the US: a nonlinear quantile regression approach. Econ Model 49:186–197
Yang H, Liu H (2016) Penalized weighted composite quantile estimators with missing covariates. Stat Pap 57:69–88
Yang J, Meng X, Mahoney MW (2013) Quantile regression for large-scale applications. In: Proceedings of the 30th international conference on machine learning, pp 881–887
Yang J, Meng X, Mahoney MW (2014) Quantile regression for large-scale applications. SIAM J Sci Comput 36(5):S78–S110
Zhang Y, Duchi J, Wainwright M (2013) Divide and conquer kernel ridge regression. J Mach Learn Res 30:592–617
Zhao T, Kolar M, Liu H (2015) A general framework for robust testing and confidence regions in high-dimensional quantile regression. Technical report
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429
Acknowledgements
The authors are grateful to the Editor-in-Chief, an associate editor, and three anonymous referees for their helpful comments and constructive guidance. The authors also gratefully acknowledge financial support from the National Natural Science Foundation of PR China (71671056, 71490725), the National Social Science Foundation of PR China (15BJY008), and the Humanity and Social Science Foundation of the Ministry of Education of PR China (14YJA790015).
Appendix
Proof of Theorem 1
Note that \(\hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau )=\frac{1}{K}\sum ^{K}_{k=1}\big (\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\big )\).
From Propositions 1 and 2 of El Bantli and Hallin (1999), \(\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\xrightarrow {p}\varvec{0}\) as \(n\rightarrow \infty \) for each \(k\). Since a finite average of terms that each converge in probability to zero also converges in probability to zero, we have \(\hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau )\xrightarrow {p}\varvec{0}\).
Consequently, the consistency holds.
By Knight (1998) and Koenker (2005), \(\Vert \hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\Vert _{2} = O_p(1/\sqrt{n})\). Then, from (18) and the triangle inequality, we may derive \(\Vert \hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau )\Vert _{2}\le \frac{1}{K}\sum ^{K}_{k=1}\Vert \hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\Vert _{2}=O_p(1/\sqrt{n})\).
Thus, the estimator \(\hat{\varvec{\beta }}(\tau )\) converges at rate \(O_p(1/\sqrt{n})\).
Proof of Theorem 2
From Theorem 4.1 of Koenker (2005), the asymptotic distribution of the quantile regression estimator \(\hat{\varvec{\beta }}^{(k)}(\tau )\) takes the form \(\sqrt{n}\big (\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\big )\xrightarrow {d}N\big (\varvec{0},\varvec{\Sigma }^{(k)}\big )\),
where \(\varvec{\Sigma }^{(k)}=\omega ^{2}_{k}[\varvec{D}^{(k)}_{0}]^{-1}\) for i.i.d. errors and \(\varvec{\Sigma }^{(k)}=\tau (1-\tau )[\varvec{D}^{(k)}_{1}(\tau )]^{-1}\varvec{D}^{(k)}_{0}[\varvec{D}^{(k)}_{1}(\tau )]^{-1}\) for non-i.i.d. errors.
From C3, the Bahadur representation for the estimator \(\hat{\varvec{\beta }}^{(k)}(\tau )\) is \(\sqrt{n}(\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau ))=\varvec{W}^{(k)}+R^{(k)}_{n}\). By Arcones (1996) and He and Shao (1996), the remainder term satisfies
Hence, combining (21) with (22) and using Theorem 2.7 of van der Vaart (1998), we can infer that \(\varvec{W}^{(k)}= \sqrt{n}(\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )) - R^{(k)}_{n}\) has the same asymptotic distribution as \(\sqrt{n}(\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau ))\), i.e. \(\varvec{W}^{(k)}\xrightarrow {d}N\big (\varvec{0},\varvec{\Sigma }^{(k)}\big )\).
First, based on C2, it is easy to show that \(\varvec{\Sigma }^{(1)}=\varvec{\Sigma }^{(2)}=\cdots =\varvec{\Sigma }^{(K)}=\varvec{\Sigma }\) for both i.i.d. and non-i.i.d. cases. Using the BAQR estimator \(\hat{\varvec{\beta }}(\tau )=\frac{1}{K}\sum ^{K}_{k=1}\hat{\varvec{\beta }}^{(k)}(\tau )\) and \(N=nK\), we derive \(\sqrt{N}\big (\hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau )\big )=\frac{1}{\sqrt{K}}\sum ^{K}_{k=1}\sqrt{n}\big (\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\big )=\frac{1}{\sqrt{K}}\sum ^{K}_{k=1}\varvec{W}^{(k)}+\frac{1}{\sqrt{K}}\sum ^{K}_{k=1}R^{(k)}_{n}\).
Second, we show that \(\frac{1}{\sqrt{K}}\sum \limits ^K_{k=1}\varvec{W}^{(k)}\) follows a multivariate normal distribution. Since \((\varvec{x}_{i},y_{i})\) are independent for \(i=1,2,\ldots ,N\), \(\varvec{W}^{(k)}\) defined in (9) or (10) is therefore independent for \(k=1,2,\ldots ,K\). From (23), we know that \(\varvec{W}^{(k)}\) follows a multivariate normal distribution. Hence, we conclude that \(\frac{1}{\sqrt{K}}\sum \limits ^K_{k=1}\varvec{W}^{(k)}\) also follows a multivariate normal distribution.
Third, we calculate the expectation and covariance matrix of \(\frac{1}{\sqrt{K}}\sum ^K_{k=1}\varvec{W}^{(k)}\). From (23), we have \(\mathrm {E}\big [\frac{1}{\sqrt{K}}\sum ^K_{k=1}\varvec{W}^{(k)}\big ]=\varvec{0}\) and, by the independence of the blocks, \(\mathrm {Cov}\big (\frac{1}{\sqrt{K}}\sum ^K_{k=1}\varvec{W}^{(k)}\big )=\frac{1}{K}\sum ^K_{k=1}\varvec{\Sigma }=\varvec{\Sigma }\).
Based on the results of the second and third parts, we can infer that \(\frac{1}{\sqrt{K}}\sum ^K_{k=1}\varvec{W}^{(k)}\xrightarrow {d}N\big (\varvec{0},\varvec{\Sigma }\big )\).
Fourth, we show that \(\sqrt{N}(\hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau ))\) follows a multivariate normal distribution. If \(\frac{1}{\sqrt{K}}\sum ^K_{k=1}R^{(k)}_{n}=o_p(1)\)
holds (see Eq. (22)), then, combining (27) with (28) and using Theorem 2.7 of van der Vaart (1998), we can infer that \(\sqrt{N}(\hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau ))\) has the same asymptotic distribution as \(\frac{1}{\sqrt{K}}\sum ^K_{k=1}\varvec{W}^{(k)}\), i.e. \(\sqrt{N}\big (\hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau )\big )\xrightarrow {d}N\big (\varvec{0},\varvec{\Sigma }\big )\).
To ensure (29), we suggest \(N=Ke^{K^{1/2}}\). In addition, using \(N=nK\) we then get \(K=\log ^2n\). Hence, as \(n\rightarrow \infty \), we have
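The relation between the suggested sample size and the number of blocks follows directly by eliminating \(N\):

```latex
\[
N = K e^{\sqrt{K}}, \qquad N = nK
\;\Longrightarrow\; n = e^{\sqrt{K}}
\;\Longrightarrow\; \sqrt{K} = \log n
\;\Longrightarrow\; K = \log^{2} n .
\]
```

Thus \(K\) grows only polylogarithmically in the block size \(n\), so the per-block sample size still diverges while the number of blocks remains modest.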
Therefore, the result of asymptotic normality in (12) holds when \(N=Ke^{K^{1/2}}\), and we complete the whole proof.
Cite this article
Xu, Q., Cai, C., Jiang, C. et al. Block average quantile regression for massive dataset. Stat Papers 61, 141–165 (2020). https://doi.org/10.1007/s00362-017-0932-6