Abstract
This paper focuses on the high-dimensional additive quantile model, allowing both dimension and sparsity to increase with the sample size. We propose a new sparsity-smoothness penalty over a reproducing kernel Hilbert space (RKHS), which includes linear functions and spline-based nonlinear functions as special cases. The combination of sparsity and smoothness is crucial for the asymptotic theory as well as for computational efficiency. Oracle inequalities on the excess risk of the proposed method are established under weaker conditions than most existing results. Furthermore, we develop a majorize-minimization forward splitting iterative algorithm (MMFIA) for efficient computation and investigate its numerical convergence properties. Numerical experiments are conducted on simulated and real data examples, which support the effectiveness of the proposed method.
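As a rough illustration of the algorithmic ingredients named above — the check loss and a forward (gradient) step combined with a backward (proximal) group-thresholding step — the following sketch implements a generic smoothed proximal-gradient solver for a group-penalized quantile model. This is not the paper's MMFIA (the RKHS smoothness penalty is omitted, and `tau`, `lam`, `step`, and the smoothing width `h` are illustrative choices):

```python
import numpy as np

def pinball_loss(t, tau):
    """Check loss rho_tau(t) = t * (tau - 1{t <= 0})."""
    return t * (tau - (t <= 0))

def smoothed_pinball_grad(t, tau, h=0.05):
    """Gradient of a Huber-type smooth surrogate of the pinball loss:
    linear interpolation of the slopes tau - 1 and tau on [-h, h]."""
    return np.where(np.abs(t) <= h, t / (2 * h) + (tau - 0.5), tau - (t < 0))

def group_soft_threshold(beta, groups, thr):
    """Proximal map of thr * sum_g ||beta_g||_2 (group soft-thresholding)."""
    out = beta.copy()
    for g in groups:
        norm = np.linalg.norm(beta[g])
        out[g] = 0.0 if norm <= thr else (1.0 - thr / norm) * beta[g]
    return out

def prox_grad_quantile(X, y, groups, tau=0.5, lam=0.05, step=0.02, iters=2000):
    """Forward-backward splitting for group-penalized quantile regression:
    a gradient (forward) step on the smoothed loss, then the
    group-thresholding (backward/proximal) step."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        r = y - X @ beta
        grad = -(X.T @ smoothed_pinball_grad(r, tau)) / n
        beta = group_soft_threshold(beta - step * grad, groups, step * lam)
    return beta
```

On synthetic data with one active group, such a scheme recovers the active coefficients while the penalized groups are driven to zero; the smoothing step plays the role of the majorizing surrogate in an MM scheme.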
References
Bartlett, P. L., Bousquet, O., Mendelson, S. (2005). Local Rademacher complexities. Annals of Statistics, 33, 1497–1537.
Beck, A., Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2, 183–202.
Belloni, A., Chernozhukov, V. (2011). \(\ell _1\) penalized quantile regression in high-dimensional sparse models. Annals of Statistics, 39, 83–130.
Combettes, P., Wajs, V. (2005). Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4, 1168–1200.
Donoho, D. L., Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90, 1200–1224.
Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.
Hastie, T., Tibshirani, R. (1990). Generalized Additive Models (1st ed.). Monographs on Statistics and Applied Probability. London: Chapman and Hall.
He, X. M., Wang, L., Hong, H. G. (2013). Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Annals of Statistics, 41, 324–369.
Hunter, D. R., Lange, K. (2000). Quantile regression via an MM algorithm. Journal of Computational and Graphical Statistics, 9, 60–77.
Jaakkola, T., Diekhans, M., Haussler, D. (1999). Using the Fisher kernel method to detect remote protein homologies. In Proceedings of Seventh International Conference on Intelligent Systems for Molecular Biology, 149–158.
Kato, K. (2016). Group Lasso for high dimensional sparse quantile regression models. Manuscript.
Koenker, R. (2011). Additive models for quantile regression: Model selection and confidence bandaids. Brazilian Journal of Probability and Statistics, 25, 239–262.
Koenker, R., Bassett, G. (1978). Regression quantiles. Econometrica, 46, 33–50.
Koltchinskii, V., Yuan, M. (2008). Sparse recovery in large ensembles of kernel machines. In: 21st Annual Conference on Learning Theory, Helsinki, 229–238.
Koltchinskii, V., Yuan, M. (2010). Sparsity in multiple kernel learning. Annals of Statistics, 38, 3660–3695.
Li, Y., Zhu, J. (2008). \(l^1\)-norm quantile regressions. Journal of Computational and Graphical Statistics, 17, 163–185.
Li, Y., Liu, Y., Zhu, J. (2007). Quantile regression in reproducing kernel Hilbert spaces. Journal of the American Statistical Association, 102, 255–268.
Lian, H. (2012). Estimation of additive quantile regression models by two-fold penalty. Journal of Business and Economic Statistics, 30, 337–350.
Lv, S. G., Lin, H. Z., Lian, H., Huang, J. (2016). Oracle inequalities for sparse additive quantile regression models in reproducing kernel Hilbert space. Manuscript.
Meier, L., van de Geer, S., Bühlmann, P. (2009). High-dimensional additive modeling. Annals of Statistics, 37, 3779–3821.
Meinshausen, N., Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37, 246–270.
Moreau, J. J. (1962). Fonctions convexes duales et points proximaux dans un espace Hilbertien. Comptes Rendus de l'Académie des Sciences de Paris, Série A, 255, 2897–2899.
Mosci, S., Rosasco, L., Santoro, M., Verri, A., Villa, S. (2010). Solving structured sparsity regularization with proximal methods. Machine Learning and Knowledge Discovery in Databases, 6322, 418–433.
Negahban, S., Ravikumar, P., Wainwright, M. J., Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27, 538–557.
Pearce, N. D., Wand, M. P. (2006). Penalized splines and reproducing kernel methods. The American Statistician, 60, 233–240.
Raskutti, G., Wainwright, M., Yu, B. (2012). Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 13, 389–427.
Ravikumar, P., Liu, H., Lafferty, J., Wasserman, L. (2009). SpAM: Sparse additive models. Journal of the Royal Statistical Society: Series B, 71, 1009–1030.
Rosasco, L., Villa, S., Mosci, S., Santoro, M., Verri, A. (2013). Nonparametric sparsity and regularization. Journal of Machine Learning Research, 14, 1665–1714.
Steinwart, I., Christmann, A. (2011). Estimating conditional quantiles with the help of pinball loss. Bernoulli, 17, 211–225.
Tseng, P. (2010). Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, 125, 263–295.
van de Geer, S. (2000). Empirical Processes in M-estimation. Cambridge: Cambridge University Press.
van de Geer, S. (2008). High-dimensional generalized linear models and the Lasso. Annals of Statistics, 36, 614–645.
Wahba, G. (1999). Support vector machines, reproducing kernel Hilbert spaces, and randomized GACV. In Advances in Kernel Methods: Support Vector Learning, pp. 69–88. Cambridge: MIT Press.
Wang, L., Wu, Y. C., Li, R. Z. (2012). Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association, 107, 214–222.
Xue, L. (2009). Consistent variable selection in additive models. Statistica Sinica, 19, 1281–1296.
Yafeh, Y., Yosha, O. (2003). Large Shareholders and banks: Who monitors and how? The Economic Journal, 113, 128–146.
Yuan, M. (2006). GACV for quantile smoothing splines. Computational Statistics and Data Analysis, 50, 813–829.
Acknowledgments
SL’s research is partially supported by NSFC-11301421, JBK141111, JBK14TD0046, JBK140210, and KLAS-130026507; JW’s research is partially supported by HK GRF-11302615 and CityU SRG-7004244. The authors also thank Professor Fukumizu for providing an immensely hospitable and fruitful environment during SL’s visit to the ISM in Japan; this work was also partially supported by MEXT Grant-in-Aid for Scientific Research on Innovative Areas of Japan (25120012).
Appendix: Main Proofs
To simplify the proof, we only consider the special case where \(\mu _{\tau }=0\) in our model (1). Lemma 1 presents the behavior of a weighted empirical process (see Lemma 8.4 of van de Geer 2000).
Lemma 1
Let \(\mathcal {G}\) be a collection of functions \(g : \{z_1,\ldots ,z_n\}\rightarrow \mathbb {R}\), endowed with a metric induced by the norm \(\Vert g\Vert _n\). Let \(H(\cdot )\) be the entropy of \(\mathcal {G}\). Suppose that
where A is some constant and \(\alpha \in (0,1)\). In addition, let \(\epsilon _1, \ldots , \epsilon _n\) be independent centered random variables, satisfying
Denote \(\langle \epsilon ,g\rangle _n=\frac{1}{n}\sum _{i=1}^n\epsilon _ig(z_i)\) for any given \(g\in \mathcal {G}\). Then, for a constant \(c_0\) depending on \(\alpha , A, L\), and M, we have for all \(T \ge c_0\)
According to Lemma 1, we can establish the following technical lemma, which shows that the key quantity in the empirical process can be bounded by the proposed regularization term; as a consequence, the corresponding oracle rates are improved.
Lemma 2
Assume the same conditions as in Lemma 1. Define the following event as
where \(c_0\) is some universal constant, which may differ from that of Lemma 1. When \(2\log p\ge c_0\), we have
Proof
Let \(\mathcal {G}=\{g_j: \Vert g_j\Vert _{\mathcal {H}}=1\}\) be the function class in Lemma 1. Then, applying Lemma 1, it follows that
with probability at least \(1-c_0\exp (-T^2/c_0^2)\). Let \(T=\sqrt{2c_0\log p}\), and the assumption \(2\log p\ge c_0\) implies that \(T\ge c_0\). Then, we have
In other words, with probability at least \(1- c_0\exp \left( -\frac{\log p}{c_0}\right) \), there holds
Thus, we derive our first desired conclusion for \(\Theta \) based on the basic inequality:
\(\square \)
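The scaling behind the event \(\Theta \) — the maximum over p coordinates of the empirical inner products \(\langle \epsilon ,g_j\rangle _n\) living at the \(\sqrt{\log p/n}\) scale — can be illustrated in the simplest case of fixed, unit-norm stand-in vectors. The class, sample sizes, and noise distribution below are illustrative, not those of the lemma:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 200

# p fixed "functions" g_j, represented as vectors normalized so ||g_j||_n = 1.
G = rng.standard_normal((p, n))
G /= np.sqrt((G ** 2).mean(axis=1, keepdims=True))

# Bounded, centered noise epsilon_i and the inner products <eps, g_j>_n.
eps = rng.uniform(-1.0, 1.0, size=n)
inner = G @ eps / n

# Sub-Gaussian maximal inequality: max_j |<eps, g_j>_n| = O(sqrt(log p / n)).
rate = np.sqrt(np.log(p) / n)
ratio = np.abs(inner).max() / rate
```

With these sizes the observed maximum stays within a small constant multiple of \(\sqrt{\log p/n}\), matching the probability bound of Lemma 2.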
Similar results on the Rademacher complexity and Gaussian complexity have been established in Koltchinskii and Yuan (2010) and Raskutti et al. (2012), respectively.
The next lemma shows that the quantity \(\sum _{j=1}^p \sqrt{\Vert \widehat{\Delta }_j\Vert _n^2+\rho _n \Vert \widehat{\Delta }_j\Vert _{\mathcal {H}}^2}\) can be controlled by its counterpart restricted to the active set S. This provides a way to prove sparsity oracle inequalities for the estimator (2).
Proposition 3
Conditioned on the event \(\Theta \), with the choices \(\lambda _n\ge 2\mu _n\) and \(\rho _n\ge \mu _n^{2}\), we have
Proof
Define the functional
and note that, by the definition of our M-estimator, the error function \(\widehat{\Delta }:=\widehat{f}-f^*\) minimizes \(\widetilde{\mathcal {L}}\). In particular, \(\widetilde{\mathcal {L}}(\widehat{\Delta })\le \widetilde{\mathcal {L}}(0)\), that is,
Denote \(a(t)=\tau -1_{\{t\le 0\}}\). Recall that \(\rho _{\tau }\) is a convex function and \(a(t)\in \partial \rho _{\tau }(t)\), where \(\partial \rho _{\tau }(t)\) denotes the subdifferential of \(\rho _{\tau }\) at the point t. By the definition of the subgradient, we have
This in connection with (16) shows that
It is easy to check that \(J_n(f_j):=\sqrt{\Vert f_j\Vert _{n}^2+\rho _n\Vert f_j\Vert _{\mathcal {H}}^2}\) defines a mixed norm for any \(f_j\in \mathcal {H}\). For any \(j\in S\), by the triangle inequality with respect to this norm, we have
On the other hand, for any \(j\in S^c\), we have
This in connection with (18) implies that
In addition, it is clear that \(\{a(\epsilon _i)\}_{i=1}^n\) are bounded, independent variables with zero mean, so condition (15) is satisfied. Thus, applying Lemma 2 on \(\Theta \), one gets
With the choices \(\lambda _n\ge 2\mu _n\) and \(\rho _n\ge \mu _n^{2}\), plugging the above quantity into (19) yields the desired result immediately. \(\square \)
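Two elementary facts used in the proof above — that \(a(t)=\tau -1_{\{t\le 0\}}\) is a subgradient of \(\rho _{\tau }\), and that \(J_n\) satisfies the triangle inequality — can be checked numerically on stand-in finite-dimensional vectors. The grid, the weights `w` (playing the role of the RKHS norm), and the value of \(\rho _n\) are illustrative:

```python
import numpy as np

def rho(t, tau):
    """Pinball loss rho_tau(t) = t * (tau - 1{t <= 0})."""
    return t * (tau - (t <= 0))

def a(t, tau):
    """Candidate subgradient a(t) = tau - 1{t <= 0}."""
    return tau - (t <= 0)

# Subgradient inequality rho(u) >= rho(t) + a(t) * (u - t), checked on a grid.
tau = 0.3
grid = np.linspace(-2.0, 2.0, 81)
T, U = np.meshgrid(grid, grid)
gap = rho(U, tau) - rho(T, tau) - a(T, tau) * (U - T)
subgradient_ok = bool((gap >= -1e-9).all())

# Mixed norm J(x) = sqrt(||x||_n^2 + rho_n * ||x||_H^2): triangle inequality.
# Here ||.||_n and ||.||_H are played by two different quadratic norms.
rng = np.random.default_rng(0)
w = rng.uniform(0.5, 2.0, size=8)          # stand-in "RKHS" weights
rho_n = 0.1

def J(x):
    return np.sqrt(np.dot(x, x) / len(x) + rho_n * np.sum(w * x * x))

x, y = rng.standard_normal(8), rng.standard_normal(8)
triangle_ok = bool(J(x + y) <= J(x) + J(y) + 1e-12)
```

Both checks succeed because \(J_n\) is induced by a positive definite quadratic form, and the pinball loss is piecewise linear with slopes \(\tau -1\) and \(\tau \) bracketing a(t) everywhere.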
Now, we introduce the local Rademacher complexity, which is critical to our derived results. Let \(\mathcal {G}\) be a bounded function class with the star-shaped property [see Bartlett et al. (2005)], satisfying \(\Vert g\Vert _{\infty }\le b\) (\(b\ge 1\)) for all \(g\in \mathcal {G}\). Let \(\{x_i\}_{i=1}^n\) be an i.i.d. sequence of variables from X, drawn according to some distribution \(\mathbb {Q}\). For each \(a>0\), we define the local Rademacher complexity:
where \(\{\sigma _i\}_{i=1}^n\) is an i.i.d. sequence of Rademacher variables, taking values \(\pm 1\) with probability 1/2. Let \(\nu _n\) be the smallest solution to the inequality:
Note that such a \(\nu _n\) exists, since the star-shaped property ensures that the function \(\mathcal {R}_n(\mathcal {G};a)/a\) is non-increasing in a.
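The monotonicity of \(a\mapsto \mathcal {R}_n(\mathcal {G};a)/a\), which guarantees the existence of \(\nu _n\), can be illustrated by Monte Carlo for the simplest star-shaped class \(\{\theta g_0:\theta \in [0,1]\}\). The class, sample sizes, and radius grid are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_mc = 400, 2000
g0 = rng.standard_normal(n)
g0 /= np.sqrt((g0 ** 2).mean())             # normalize so ||g0||_n = 1

sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))
S = sigma @ g0 / n                          # <sigma, g0>_n for each draw

def local_rademacher(a):
    """E_sigma sup {<sigma, g>_n : g = theta * g0, theta in [0,1], ||g||_n <= a}.
    For this class the sup is attained at theta = min(a, 1) when the
    inner product is positive, and at theta = 0 otherwise."""
    return min(a, 1.0) * np.maximum(S, 0.0).mean()

grid = [0.1, 0.5, 1.0, 2.0, 4.0]
ratios = [local_rademacher(a) / a for a in grid]
```

The ratio is constant for \(a\le 1\) (the localization constraint is not binding beyond the class itself) and decays like 1/a afterward, so it is non-increasing throughout, as the star-shaped argument predicts.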
Lemma 3
For any \(j\in \{1,2,\ldots ,p\}\), suppose that \(\Vert f_j\Vert _{\infty }\le b\) for all \(f_j\in \mathcal {H}\). For any \(t\ge \nu _n\), define
Denote \(\mathrm {E}(t):=\bigcap _{j=1}^p\mathrm {E}_j(t)\). If \(t\ge \sqrt{\frac{\log p}{n}}\) also holds, then there exist universal constants \((c_1,c_2)\), such that
To relate the exponent \(\alpha \) of the empirical covering number to the quantity \(\nu _n\) of the local Rademacher complexity, we need the following conclusion, which shows that local Rademacher averages can be estimated by empirical covering numbers.
Lemma 4
Let \(\mathcal {G}\) be a class of measurable functions from X to \([-1,1]\). Suppose that Assumption A1 holds for some \(\alpha \in (0, 1)\). Then, there exists a constant \(c_{\alpha }\) depending only on \(\alpha \), such that
Furthermore, for the case of a single RKHS \(\mathcal {H}\), we need the relationship between the empirical and \(\Vert \cdot \Vert _2\) norms for functions in \(\mathcal {H}\). The following conclusion is obtained immediately by combining Theorem 4 of Koltchinskii and Yuan (2010) with Lemma 3 above.
Lemma 5
Suppose that \(N \ge 4\) and \(p \ge 2 \log n\). Then, there exists a universal constant \(c >0\), such that with probability at least \(1-p^{-N}\), for all \(f\in \mathcal {H}\)
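The empirical–population norm comparison in Lemma 5 rests on the law-of-large-numbers fact that \(\Vert f\Vert _n^2\) concentrates around \(\Vert f\Vert _2^2\). A minimal illustration for a single fixed bounded function, with an illustrative distribution and sample size:

```python
import numpy as np

rng = np.random.default_rng(3)

f = np.cos           # a fixed bounded function on [0, 1]
# Population norm: ||f||_2^2 = int_0^1 cos(x)^2 dx = 1/2 + sin(2)/4.
l2_sq = 0.5 + np.sin(2.0) / 4.0

n = 20000
x = rng.uniform(0.0, 1.0, size=n)
emp_sq = (f(x) ** 2).mean()      # empirical norm ||f||_n^2
rel_err = abs(emp_sq - l2_sq) / l2_sq
```

For a single function the relative error decays at the \(n^{-1/2}\) rate; the substance of Lemma 5 is that, with high probability, a comparable statement holds uniformly over the unit ball of \(\mathcal {H}\).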
For any given \(\Delta _{-},\,\Delta _{+}>0\), we define the function subset of \(\mathcal {F}\) as
where \(\Vert f\Vert _{2,1}=\sum _{j=1}^p\Vert f_j\Vert _{2}\) and \(\Vert f\Vert _{\mathcal {H},1}=\sum _{j=1}^p\Vert f_j\Vert _{\mathcal {H}}\) for any \(f=\sum _{j=1}^pf_j\). Equipped with this result, we can then prove a refined uniform convergence rate.
Proposition 4
Let \(\mathcal {F}(\Delta _{-},\Delta _{+})\) be a measurable function subset defined as above. Suppose that assumption (14) holds for each univariate \(\mathcal {H}\). For some \(N>4\) entering the constant \(c_0\), with confidence at least \(1-c_0\exp \left( -\frac{\log p}{c_0}\right) -2p^{-N/2}\), the following bound holds uniformly over \(\Delta _{-}\le e^p\) and \(\Delta _{+}\le e^p\):
Proof of Theorem 1
By the definition of \(\hat{f}\), it follows that
This can be rewritten as
By the triangle inequality, we get
Note that for \(j\in S^c\), we have \(\Vert \hat{f}_j\Vert _n=\Vert \hat{f}_j-f^*_j\Vert _n\) and \(\Vert \hat{f}_j\Vert _{\mathcal {H}}=\Vert \hat{f}_j-f^*_j\Vert _{\mathcal {H}}\). Adding \(\sum _{j\in S}\sqrt{\Vert \hat{f}_j\Vert _n^2+\rho _n\Vert \hat{f}_j\Vert ^2_{\mathcal {H}}}\) to both sides of (22) implies that
Applying Lemma 5 for \(\Vert \hat{f}_j-f_j^*\Vert _{n}\), \(j=1,\ldots ,p\), with probability at least \(1-p^{-N}\), we have
When \(\zeta >2\) is satisfied, the quantity (23) can be further bounded as
We can claim that
with probability 1. For simplicity, we only verify the first term. Note that \(\Vert f_j\Vert _n\le \Vert f_j\Vert _{\mathcal {H}}\le 1\) for any \(f_j\in \mathcal {H}\), and we see that
This together with Proposition 4 implies that, with probability at least \(1-c_0\exp \left( -\frac{\log p}{c_0}\right) -3p^{-N/2}\),
Let \(\eta \) be sufficiently large, such that \(\max \{2\sqrt{2}c c_1,1\}\le \eta \); then, with the same probability as above, we have
On the other hand, with the choices \(\rho _n=\eta \mu _n\) and \(\lambda _n^2 =\eta \mu _n^2\), it follows that
where we used the fact \(\Vert f_j\Vert _n\le \Vert f_j\Vert _{\mathcal {H}}\le 1\) for any \(f_j\in \mathcal {H}\), \(j=1,\ldots ,p\). Plugging the above quantity into the right side of (24) yields
It is easily verified that \(p\ge \log n\) implies \(e^{-p}\le 4\sqrt{2}s\eta ^{3/2}\mu _n \sqrt{1+\mu ^2_n}\); then, we have
\(\square \)
Lv, S., He, X. & Wang, J. A unified penalized method for sparse additive quantile models: an RKHS approach. Ann Inst Stat Math 69, 897–923 (2017). https://doi.org/10.1007/s10463-016-0566-9