
A unified penalized method for sparse additive quantile models: an RKHS approach

Annals of the Institute of Statistical Mathematics

Abstract

This paper focuses on the high-dimensional additive quantile model, allowing both the dimension and the sparsity level to increase with the sample size. We propose a new sparsity-smoothness penalty over a reproducing kernel Hilbert space (RKHS), which includes linear functions and spline-based nonlinear functions as special cases. The combination of sparsity and smoothness is crucial for the asymptotic theory as well as for computational efficiency. Oracle inequalities on the excess risk of the proposed method are established under weaker conditions than most existing results. Furthermore, we develop a majorize-minimization forward splitting iterative algorithm (MMFIA) for efficient computation and investigate its numerical convergence properties. Numerical experiments on simulated and real data examples support the effectiveness of the proposed method.
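Although the MMFIA itself is developed in the main text, a rough picture of the two ingredients it combines may help: a quadratic majorization of the pinball loss in the spirit of Hunter and Lange (2000), and a forward-backward (proximal) splitting step as in Combettes and Wajs (2005). The following minimal Python sketch applies these ingredients to a plain linear quantile model with a group-lasso penalty; the function names, the smoothing constant eps, and the step-size rule are our simplifications, not the paper's exact algorithm or objective.

```python
import numpy as np

def group_soft_threshold(beta, thresh, groups):
    """Proximal operator of the group-lasso penalty: shrink each group toward 0."""
    out = beta.copy()
    for g in groups:
        norm_g = np.linalg.norm(beta[g])
        out[g] = 0.0 if norm_g <= thresh else (1.0 - thresh / norm_g) * beta[g]
    return out

def mm_forward_backward(X, y, tau, lam, groups, n_iter=200, eps=1e-4):
    """Sketch of an MM + forward-backward loop for group-penalized linear
    quantile regression (an illustration, not the paper's exact MMFIA).

    MM step: majorize rho_tau(r) by the quadratic of Hunter and Lange (2000),
        rho_tau(r) <= r^2 / (4(|r_k| + eps)) + (tau - 1/2) r + const,
    where r_k is the current residual. Forward-backward step: one gradient
    step on the majorizer followed by the group proximal map.
    """
    n, p = X.shape
    beta = np.zeros(p)
    op2 = np.linalg.norm(X, 2) ** 2  # squared spectral norm of the design
    for _ in range(n_iter):
        r = y - X @ beta
        w = 1.0 / (4.0 * (np.abs(r) + eps))  # MM weights at current residuals
        step = n / (2.0 * w.max() * op2)     # 1 / (Lipschitz bound of majorizer)
        grad = -X.T @ (2.0 * w * r + (tau - 0.5)) / n
        beta = group_soft_threshold(beta - step * grad, step * lam, groups)
    return beta

# Toy usage: 100 observations, 10 coefficients in 5 groups of 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - X[:, 1] + rng.normal(size=100)
groups = [np.arange(2 * k, 2 * k + 2) for k in range(5)]
print(mm_forward_backward(X, y, tau=0.5, lam=0.1, groups=groups))
```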


References

  • Bartlett, P. L., Bousquet, O., Mendelson, S. (2005). Local Rademacher complexities. Annals of Statistics, 33, 1497–1537.

  • Beck, A., Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2, 183–202.

  • Belloni, A., Chernozhukov, V. (2011). \(\ell _1\) penalized quantile regression in high-dimensional sparse models. Annals of Statistics, 39, 83–130.

  • Combettes, P., Wajs, V. (2005). Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4, 1168–1200.

  • Donoho, D. L., Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90, 1200–1224.

  • Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.

  • Hastie, T., Tibshirani, R. (1990). Generalized additive models (1st ed.). Monographs on Statistics and Applied Probability. London: Chapman and Hall.

  • He, X. M., Wang, L., Hong, H. G. (2013). Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Annals of Statistics, 41, 324–369.

  • Hunter, D. R., Lange, K. (2000). Quantile regression via an MM algorithm. Journal of Computational and Graphical Statistics, 9, 60–77.

  • Jaakkola, T., Diekhans, M., Haussler, D. (1999). Using the Fisher kernel method to detect remote protein homologies. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, 149–158.

  • Kato, K. (2016). Group Lasso for high dimensional sparse quantile regression models. Manuscript.

  • Koenker, R. (2011). Additive models for quantile regression: Model selection and confidence bandaids. Brazilian Journal of Probability and Statistics, 25, 239–262.


  • Koenker, R., Bassett, G. (1978). Regression quantiles. Econometrica, 46, 33–50.

  • Koltchinskii, V., Yuan, M. (2008). Sparse recovery in large ensembles of kernel machines. In Proceedings of the 21st Annual Conference on Learning Theory, Helsinki, 229–238.

  • Koltchinskii, V., Yuan, M. (2010). Sparsity in multiple kernel learning. Annals of Statistics, 38, 3660–3695.

  • Li, Y., Zhu, J. (2008). \(\ell _1\)-norm quantile regression. Journal of Computational and Graphical Statistics, 17, 163–185.

  • Li, Y., Liu, Y., Zhu, J. (2007). Quantile regression in reproducing kernel Hilbert spaces. Journal of the American Statistical Association, 102, 255–268.

  • Lian, H. (2012). Semiparametric estimation of additive quantile regression models by two-fold penalty. Journal of Business and Economic Statistics, 30, 337–350.


  • Lv, S. G., Lin, H. Z., Lian, H., Huang, J. (2016). Oracle inequalities for sparse additive quantile regression models in reproducing kernel Hilbert space. Manuscript.

  • Meier, L., van de Geer, S., Bühlmann, P. (2009). High-dimensional additive modeling. Annals of Statistics, 37, 3779–3821.

  • Meinshausen, N., Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37, 246–270.

  • Moreau, J. J. (1962). Fonctions convexes duales et points proximaux dans un espace Hilbertien. Comptes Rendus de l'Académie des Sciences de Paris, Série A, 255, 2897–2899.


  • Mosci, S., Rosasco, L., Santoro, M., Verri, A., Villa, S. (2010). Solving structured sparsity regularization with proximal methods. Machine Learning and Knowledge Discovery in Databases, 6322, 418–433.

  • Negahban, S., Ravikumar, P., Wainwright, M. J., Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27, 538–557.

  • Pearce, N. D., Wand, M. P. (2006). Penalized splines and reproducing kernel methods. The American Statistician, 60, 233–240.

  • Raskutti, G., Wainwright, M., Yu, B. (2012). Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 13, 389–427.

  • Ravikumar, P., Liu, H., Lafferty, J., Wasserman, L. (2009). SpAM: Sparse additive models. Journal of the Royal Statistical Society: Series B, 71, 1009–1030.

  • Rosasco, L., Villa, S., Mosci, S., Santoro, M., Verri, A. (2013). Nonparametric sparsity and regularization. Journal of Machine Learning Research, 14, 1665–1714.

  • Steinwart, I., Christmann, A. (2011). Estimating conditional quantiles with the help of pinball loss. Bernoulli, 17, 211–225.

  • Tseng, P. (2010). Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, 125, 263–295.


  • van de Geer, S. (2000). Empirical processes in M-estimation. Cambridge: Cambridge University Press.

  • van de Geer, S. (2008). High-dimensional generalized linear models and the Lasso. Annals of Statistics, 36, 614–645.

  • Wahba, G. (1999). Support vector machines, reproducing kernel Hilbert spaces, and randomized GACV. In Advances in kernel methods: Support vector learning (pp. 69–88). Cambridge, MA: MIT Press.

  • Wang, L., Wu, Y. C., Li, R. Z. (2012). Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association, 107, 214–222.

  • Xue, L. (2009). Consistent variable selection in additive models. Statistica Sinica, 19, 1281–1296.


  • Yafeh, Y., Yosha, O. (2003). Large shareholders and banks: Who monitors and how? The Economic Journal, 113, 128–146.

  • Yuan, M. (2006). GACV for quantile smoothing splines. Computational Statistics and Data Analysis, 50, 813–829.



Acknowledgments

SL’s research is supported partially by NSFC-11301421, JBK141111, JBK14TD0046, JBK140210, and KLAS-130026507, and JW’s research is supported partially by HK GRF-11302615 and CityU SRG-7004244. The authors also thank Professor Fukumizu for providing a hospitable and fruitful environment during SL’s visit to the ISM in Japan; this work is partially supported by the MEXT Grant-in-Aid for Scientific Research on Innovative Areas of Japan (25120012).

Author information

Correspondence to Junhui Wang.

Appendix: Main Proofs

To simplify the proof, we only consider the special case where \(\mu _{\tau }=0\) in our model (1). Lemma 1 describes the behavior of a weighted empirical process (see Lemma 8.4 of van de Geer (2000)).

Lemma 1

Let \(\mathcal {G}\) be a collection of functions \(g : \{z_1,\ldots ,z_n\}\rightarrow \mathbb {R}\), endowed with the metric induced by the norm \(\Vert g\Vert _n\). Let \(H(\cdot )\) be the entropy of \(\mathcal {G}\) with respect to this metric. Suppose that

$$\begin{aligned} H(\varepsilon )\le A\varepsilon ^{-2(1-\alpha )},\quad \forall \, \varepsilon >0, \end{aligned}$$

where A is some constant and \(\alpha \in (0,1)\). In addition, let \(\epsilon _1, \ldots , \epsilon _n\) be independent centered random variables, satisfying

$$\begin{aligned} \max _i\mathbb {E}[\exp (\epsilon _i^2 /L)]\le M. \end{aligned}$$
(15)

Denote \(\langle \epsilon ,g\rangle _n=\frac{1}{n}\sum _{i=1}^n\epsilon _ig(z_i)\) for any given \(g\in \mathcal {G}\). Then, for a constant \(c_0\) depending on \(\alpha , A, L\), and M, we have, for all \(T \ge c_0\),

$$\begin{aligned} \mathbb {P}\left( \sup _{g\in \mathcal {G}}\frac{2\langle \epsilon ,g\rangle _n}{\Vert g\Vert _n^\alpha }>\frac{T}{\sqrt{n}}\right) \le c_0\exp \left( -\frac{T^2}{c_0^2}\right) . \end{aligned}$$

Based on Lemma 1, we can establish the following technical lemma, which shows that the key quantity in the empirical process can be bounded by the proposed regularization term. As a consequence, the corresponding oracle rates are improved.

Lemma 2

Assume the same conditions as in Lemma 1, and define the event

$$\begin{aligned} \Theta :=\left\{ \left| \langle \epsilon ,f_j\rangle _n\right| \le \mu _n\sqrt{\Vert f_j\Vert _n^2+\mu ^2_n \Vert f_j\Vert _{\mathcal {H}}^2}\quad \hbox {for all}\, f_j\in \mathcal {H}\,\hbox {and all}\, j=1,2,\ldots ,p\right\} , \end{aligned}$$

where \(c_0\) below is some universal constant, which may differ from that in Lemma 1. When \(2\log p\ge c_0\), we have

$$\begin{aligned} \mathbb {P}(\Theta )\ge 1- c_0\exp \left( -\frac{\log p}{c_0}\right) . \end{aligned}$$

Proof

Apply Lemma 1 to the class \(\mathcal {G}=\{g_j: \Vert g_j\Vert _{\mathcal {H}}=1\}\). It follows that

$$\begin{aligned} \sup _{f_j}\frac{2\langle \epsilon ,f_j\rangle _n}{\Vert f_j\Vert _n^\alpha \Vert f_j\Vert _{\mathcal {H}}^{1-\alpha }}=\sup _{f_j}\frac{2\langle \epsilon , f_j/\Vert f_j\Vert _{\mathcal {H}}\rangle _n}{\Vert f_j/\Vert f_j\Vert _{\mathcal {H}}\Vert _n^\alpha }\le \frac{T}{\sqrt{n}} \end{aligned}$$

with probability at least \(1-c_0\exp (-T^2/c_0^2)\). Let \(T=\sqrt{2c_0\log p}\), and the assumption \(2\log p\ge c_0\) implies that \(T\ge c_0\). Then, we have

$$\begin{aligned} \mathbb {P}\left( \max _{j}\sup _{f_j}\frac{2\langle \epsilon ,f_j\rangle _n}{\Vert f_j\Vert _n^\alpha \Vert f_j\Vert _{\mathcal {H}}^{1-\alpha }}>\sqrt{\frac{2c_0\log p}{n}}\right) \le c_0p\exp \left( -\frac{T^2}{c_0^2}\right) \le c_0\exp \left( -\frac{\log p}{c_0}\right) . \end{aligned}$$

In other words, with probability at least \(1- c_0\exp \left( -\frac{\log p}{c_0}\right) \), there holds

$$\begin{aligned} \sup _{f_j\in \mathcal {H}}\frac{\langle \epsilon ,f_j\rangle _n}{\Vert f_j\Vert _n^\alpha \Vert f_j\Vert _{\mathcal {H}}^{1-\alpha }}\le \sqrt{\frac{c_0\log p}{2n}},\quad \hbox {for all}\, j \in \{1,2,\ldots ,p\}. \end{aligned}$$

Thus, taking \(x=\Vert f_j\Vert _n\) and \(y=\mu _n\Vert f_j\Vert _{\mathcal {H}}\), we derive the desired conclusion for \(\Theta \) from the basic inequality:

$$\begin{aligned} x^\alpha y^{1-\alpha }\le \sqrt{x^2+y^2},\quad \hbox {for any}\,\alpha \in (0,1)\,\hbox {and}\, x,y>0. \end{aligned}$$
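For completeness, this basic inequality has a one-line proof:

$$\begin{aligned} x^{\alpha }y^{1-\alpha }\le \max \{x,y\}^{\alpha }\max \{x,y\}^{1-\alpha }=\max \{x,y\}\le \sqrt{x^2+y^2}. \end{aligned}$$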

\(\square \)

Similar results on the Rademacher complexity and Gaussian complexity have been established in Koltchinskii and Yuan (2010) and Raskutti et al. (2012), respectively.

The next proposition shows that the quantity \(\sum _{j=1}^p \sqrt{\Vert \widehat{\Delta }_j\Vert _n^2+\rho _n \Vert \widehat{\Delta }_j\Vert _{\mathcal {H}}^2}\) can be controlled by the corresponding sum over the active set S. This provides a way to prove sparsity oracle inequalities for the estimator (2).

Proposition 3

Conditioned on the event \(\Theta \), with the choices \(\lambda _n\ge 2\mu _n\) and \(\rho _n\ge \mu _n^{2}\), we have

$$\begin{aligned} \sum _{j=1}^p \sqrt{\Vert \widehat{\Delta }_j\Vert _n^2+\rho _n \Vert \widehat{\Delta }_j\Vert _{\mathcal {H}}^2}\le 4\sum _{j\in S} \sqrt{\Vert \widehat{\Delta }_j\Vert _{n}^2+\rho _n\Vert \widehat{\Delta }_j\Vert _{\mathcal {H}}^2 }.\end{aligned}$$

Proof

Define the functional

$$\begin{aligned} \widetilde{\mathcal {L}}(\Delta )=\frac{1}{n}\sum _{i=1}^n\rho _{\tau }\left( \epsilon _i-\Delta (X_i)\right) + \lambda _n\sum _{j=1}^p\sqrt{\Vert f^*_j+\Delta _j\Vert _{n}^2+\rho _n\Vert f^*_j+\Delta _j\Vert _{\mathcal {H}}^2} \end{aligned}$$

and note that, by the definition of our M-estimator, the error function \(\widehat{\Delta }:=\widehat{f}-f^*\) minimizes \(\widetilde{\mathcal {L}}\). From the inequality \(\widetilde{\mathcal {L}}(\widehat{\Delta })\le \widetilde{\mathcal {L}}(0)\), we obtain

$$\begin{aligned}&\frac{1}{n}\sum _{i=1}^n\rho _{\tau }\left( \epsilon _i -\widehat{\Delta }(X_i)\right) -\frac{1}{n}\sum _{i=1}^n\rho _{\tau }\left( \epsilon _i\right) \nonumber \\&\quad \le \lambda _n\sum _{j=1}^p\sqrt{\Vert f^*_j\Vert _{n}^2+\rho _n\Vert f^*_j\Vert _{\mathcal {H}}^2} -\lambda _n\sum _{j=1}^p\sqrt{\Vert f^*_j+\widehat{\Delta }_j\Vert _{n}^2+\rho _n\Vert f^*_j +\widehat{\Delta }_j\Vert _{\mathcal {H}}^2}.\nonumber \\ \end{aligned}$$
(16)

Denote \(a(t)=\tau -1_{\{t\le 0\}}\). Recall that \(\rho _{\tau }\) is a convex function and \(a(t)\in \partial \rho _{\tau }(t)\), where \(\partial \rho _{\tau }(t)\) denotes the subdifferential of \(\rho _{\tau }\) at the point t. By the definition of the subgradient, we have

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n\rho _{\tau }\left( \epsilon _i- \widehat{\Delta }(X_i)\right) -\frac{1}{n}\sum _{i=1}^n\rho _{\tau }\left( \epsilon _i\right) \ge -\frac{1}{n}\sum _{i=1}^n a(\epsilon _i)\widehat{\Delta }(X_i). \end{aligned}$$
(17)

This in connection with (16) shows that

$$\begin{aligned} -\frac{1}{n}\sum _{i=1}^n a(\epsilon _i)\widehat{\Delta }(X_i) \le \lambda _n\sum _{j=1}^p \left( \sqrt{\Vert f^*_j\Vert _{n}^2+\rho _n\Vert f^*_j\Vert _{\mathcal {H}}^2} -\sqrt{\Vert f^*_j+\widehat{\Delta }_j\Vert _{n}^2+\rho _n\Vert f^*_j +\widehat{\Delta }_j\Vert _{\mathcal {H}}^2}\right) . \end{aligned}$$
(18)
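As an aside, the subgradient relation behind (17) can be sanity-checked numerically; the Python snippet below (our illustration, not part of the proof) verifies the subgradient inequality \(\rho _{\tau }(s)\ge \rho _{\tau }(t)+a(t)(s-t)\) at random points:

```python
import numpy as np

def pinball(t, tau):
    """Pinball (check) loss: rho_tau(t) = t * (tau - 1{t <= 0})."""
    return t * (tau - (t <= 0))

def subgrad(t, tau):
    """a(t) = tau - 1{t <= 0}, an element of the subdifferential of rho_tau."""
    return tau - (t <= 0)

rng = np.random.default_rng(0)
tau, t, s = 0.3, rng.normal(size=1000), rng.normal(size=1000)
# Convexity: rho_tau(s) >= rho_tau(t) + a(t) * (s - t) must hold everywhere.
gap = pinball(s, tau) - pinball(t, tau) - subgrad(t, tau) * (s - t)
assert gap.min() >= -1e-12
print("subgradient inequality holds; min gap =", gap.min())
```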

It is easy to check that \(J_n(f_j):=\sqrt{\Vert f_j\Vert _{n}^2+\rho _n\Vert f_j\Vert _{\mathcal {H}}^2}\) defines a mixed norm on \(\mathcal {H}\). For any \(j\in S\), the triangle inequality for this norm gives

$$\begin{aligned} \sqrt{\Vert f^*_j\Vert _{n}^2+\rho _n\Vert f^*_j\Vert _{\mathcal {H}}^2} -\sqrt{\Vert f^*_j+\widehat{\Delta }_j\Vert _{n}^2+\rho _n\Vert f^*_j +\widehat{\Delta }_j\Vert _{\mathcal {H}}^2} \le \sqrt{\Vert \widehat{\Delta }_j\Vert _{n}^2+\rho _n\Vert \widehat{\Delta }_j\Vert _{\mathcal {H}}^2 }. \end{aligned}$$

On the other hand, for any \(j\in S^c\), we have

$$\begin{aligned} \sqrt{\Vert f^*_j\Vert _{n}^2+\rho _n\Vert f^*_j\Vert _{\mathcal {H}}^2} -\sqrt{\Vert f^*_j+\widehat{\Delta }_j\Vert _{n}^2+\rho _n\Vert f^*_j +\widehat{\Delta }_j\Vert _{\mathcal {H}}^2}=-\sqrt{\Vert \widehat{\Delta }_j\Vert _{n}^2+ \rho _n\Vert \widehat{\Delta }_j\Vert _{\mathcal {H}}^2}. \end{aligned}$$

This in connection with (18) implies that

$$\begin{aligned} -\frac{1}{n}\sum _{i=1}^n a(\epsilon _i)\widehat{\Delta }(X_i)\le \lambda _n\sum _{j\in S} \sqrt{\Vert \widehat{\Delta }_j\Vert _{n}^2+\rho _n\Vert \widehat{\Delta }_j\Vert _{\mathcal {H}}^2}- \lambda _n\sum _{j\in S^c}\sqrt{\Vert \widehat{\Delta }_j\Vert _{n}^2+\rho _n \Vert \widehat{\Delta }_j\Vert _{\mathcal {H}}^2}. \end{aligned}$$
(19)

In addition, it is clear that \(\{a(\epsilon _i)\}_{i=1}^n\) are bounded, independent, zero-mean variables, so condition (15) is satisfied. Thus, by Lemma 2 on \(\Theta \), one gets

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n a(\epsilon _i)\widehat{\Delta }(X_i)\le \mu _n \sum _{j=1}^p \sqrt{\Vert \widehat{\Delta }_j\Vert _n^2+\mu _n^2 \Vert \widehat{\Delta }_j\Vert _{\mathcal {H}}^2}. \end{aligned}$$

With the choices \(\lambda _n\ge 2\mu _n\) and \(\rho _n\ge \mu _n^{2}\), plugging this bound into (19) yields the desired result immediately. \(\square \)
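To make the mixed norm \(J_n\) concrete, here is a minimal numerical sketch (our illustration; the kernel, sample, and function names are hypothetical). For \(f_j=\sum _{i=1}^n c_i k_j(x_i,\cdot )\), the representer identities give \(\Vert f_j\Vert _n^2=\Vert K_jc\Vert ^2/n\) and \(\Vert f_j\Vert _{\mathcal {H}}^2=c^{\top }K_jc\), where \(K_j\) is the kernel matrix:

```python
import numpy as np

def mixed_norm(K, c, rho_n):
    """J_n(f) = sqrt(||f||_n^2 + rho_n * ||f||_H^2) for f = sum_i c_i k(x_i, .).

    K is the kernel matrix K[i, l] = k(x_i, x_l), so that
      ||f||_n^2 = (1/n) * ||K c||^2  (empirical norm at the sample points),
      ||f||_H^2 = c' K c             (RKHS norm of the kernel expansion).
    """
    n = K.shape[0]
    f_vals = K @ c
    emp_sq = f_vals @ f_vals / n
    rkhs_sq = c @ K @ c
    return float(np.sqrt(emp_sq + rho_n * rkhs_sq))

# Toy check with a Gaussian kernel on n = 50 sample points.
rng = np.random.default_rng(1)
x = rng.uniform(size=50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)
c = rng.normal(size=50) / 50
print(mixed_norm(K, c, rho_n=0.05))
```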

Now, we introduce the local Rademacher complexity, which is critical to our derived results. Let \(\mathcal {G}\) be a bounded function class with the star-shaped property [see Bartlett et al. (2005)], satisfying \(\Vert g\Vert _{\infty }\le b\) (\(b\ge 1\)) for all \(g\in \mathcal {G}\). Let \(\{x_i\}_{i=1}^n\) be an i.i.d. sequence of variables from X, drawn according to some distribution \(\mathbb {Q}\). For each \(a>0\), we define the local Rademacher complexity:

$$\begin{aligned} \mathcal {R}_n(\mathcal {G};a):=\mathbb {E}_{x,\sigma } \left[ \sup _{g\in \mathcal {G},\Vert g\Vert _2\le a}\frac{1}{n}\left| \sum _{i=1}^n\sigma _ig(x_i)\right| \right] , \end{aligned}$$

where \(\{\sigma _i\}_{i=1}^n\) is an i.i.d. sequence of Rademacher variables, taking values \(\pm 1\) with probability 1/2 each. Denote by \(\nu _n\) the smallest solution to the equation:

$$\begin{aligned} \mathcal {R}_n(\{f_j:\,\Vert f_j\Vert \le 1\};\nu _n )=\frac{\nu _n^2}{40}. \end{aligned}$$
(20)

Note that such a \(\nu _n\) exists, since the star-shaped property ensures that the function \(\mathcal {R}_n(\mathcal {G};a)/a\) is non-increasing in a.
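For intuition, \(\mathcal {R}_n(\mathcal {G};a)\) can be approximated by Monte Carlo when the supremum is tractable, e.g., when \(\mathcal {G}\) is a finite dictionary. The Python sketch below is our illustration, with the constraint \(\Vert g\Vert _2\le a\) approximated by the empirical norm:

```python
import numpy as np

def local_rademacher(G_vals, a, n_draws=2000, seed=0):
    """Monte Carlo estimate of R_n(G; a) for a finite dictionary G.

    G_vals: (m, n) array whose k-th row holds g_k(x_1), ..., g_k(x_n).
    The population constraint ||g||_2 <= a is approximated here by the
    empirical norm ||g||_n <= a.
    """
    rng = np.random.default_rng(seed)
    m, n = G_vals.shape
    in_ball = np.sqrt((G_vals ** 2).mean(axis=1)) <= a
    G_loc = G_vals[in_ball]
    if G_loc.shape[0] == 0:
        return 0.0
    sup_vals = np.empty(n_draws)
    for d in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
        sup_vals[d] = np.abs(G_loc @ sigma).max() / n
    return float(sup_vals.mean())
```

With such an estimate, the fixed point \(\nu _n\) in (20) can be located by bisection on \(a\mapsto \mathcal {R}_n(\mathcal {G};a)-a^2/40\), using the monotonicity of \(\mathcal {R}_n(\mathcal {G};a)/a\) noted above.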

Lemma 3

For any \(j\in \{1,2,\ldots ,p\}\), suppose that \(\Vert f_j\Vert _{\infty }\le b\) for all \(f_j\in \mathcal {H}\). For any \(t\ge \nu _n\), define

$$\begin{aligned} \mathrm {E}_j(t):=\bigg \{\frac{1}{2}\Vert f_j\Vert _2\le \Vert f_j\Vert _n\le \frac{3}{2}\Vert f_j\Vert _2, \quad \hbox {for all}\, f_j\in \mathcal {H} \,\hbox {with}\,\Vert f_j\Vert _2\ge bt \bigg \}. \end{aligned}$$
(21)

Denote \(\mathrm {E}(t):=\bigcap _{j=1}^p\mathrm {E}_j(t)\). If \(t\ge \sqrt{\frac{\log p}{n}}\) also holds, then there exist universal constants \((c_1,c_2)\), such that

$$\begin{aligned} \mathbb {P}[\mathrm {E}(t)]\ge 1-c_1\exp (-c_2nt^2). \end{aligned}$$

To relate the exponent \(\alpha \) of the empirical covering number to \(\nu _n\) of the local Rademacher complexity, we need the following result, which shows that local Rademacher averages can be bounded by empirical covering numbers.

Lemma 4

Let \(\mathcal {G}\) be a class of measurable functions from X to \([-1,1]\). Suppose that Assumption A1 holds for some \(\alpha \in (0, 1)\). Then, there exists a constant \(c_{\alpha }\) depending only on \(\alpha \), such that

$$\begin{aligned} \mathcal {R}_n(\mathcal {H}; r)\le c_{\alpha } \max \left\{ r^{\alpha }\left( \frac{A}{n}\right) ^{1/2}, \,\left( \frac{A}{n}\right) ^{1/(2-\alpha )}\right\} . \end{aligned}$$

Furthermore, for the case of a single RKHS \(\mathcal {H}\), we need the relationship between the empirical and \(\Vert \cdot \Vert _2\) norms for functions in \(\mathcal {H}\). The following conclusion follows immediately by combining Theorem 4 of Koltchinskii and Yuan (2010) with Lemma 3 above.

Lemma 5

Suppose that \(N \ge 4\) and \(p \ge 2 \log n\). Then, there exists a universal constant \(c >0\), such that with probability at least \(1-p^{-N}\), for all \(f\in \mathcal {H}\)

$$\begin{aligned} \Vert f\Vert _2\le c(\Vert f\Vert _n+\mu _n\Vert f\Vert _{\mathcal {H}}),\\ \Vert f\Vert _n\le c(\Vert f\Vert _2+\mu _n\Vert f\Vert _{\mathcal {H}}). \end{aligned}$$

For any given \(\Delta _{-},\,\Delta _{+}>0\), we define the function subset of \(\mathcal {F}\) as

$$\begin{aligned} \mathcal {F}(\Delta _{-},\Delta _{+}):=\{f:\mu _n\Vert f-f^*\Vert _{2,1}\le \Delta _{-}, \mu _n^2\Vert f-f^*\Vert _{\mathcal {H},1}\le \Delta _{+}\}, \end{aligned}$$

where \(\Vert f\Vert _{2,1}=\sum _{j=1}^p\Vert f_j\Vert _{2}\) and \(\Vert f\Vert _{\mathcal {H},1}=\sum _{j=1}^p\Vert f_j\Vert _{\mathcal {H}}\) for any \(f=\sum _{j=1}^pf_j\). Equipped with this result, we can then prove a refined uniform convergence rate.

Proposition 4

Let \(\mathcal {F}(\Delta _{-},\Delta _{+})\) be a measurable function subset defined as above. Suppose that assumption (14) holds for each univariate \(\mathcal {H}\). For some \(N>4\) entering the constant \(c_0\), with confidence at least \(1-c_0\exp \left( -\frac{\log p}{c_0}\right) -2p^{-N/2}\), the following bound holds uniformly over \(\Delta _{-}\le e^p\) and \(\Delta _{+}\le e^p\):

$$\begin{aligned}{}[\mathcal {E}(f)-\mathcal {E}(f^*)]-[\mathcal {E}_n(f)-\mathcal {E}_n(f^*)] \le c_1(\Delta _{-}+\Delta _{+})+e^{-p},\quad \forall \,f\in \mathcal {F}(\Delta _{-},\Delta _{+}). \end{aligned}$$

Proof of Theorem 1

By the definition of \(\hat{f}\), it follows that

$$\begin{aligned} \mathcal {E}_n\big (\hat{f}\big )+\lambda _n\sum _{j=1}^p\sqrt{\Vert \hat{f}_j\Vert _n^2+\rho _n \Vert \hat{f}_j\Vert ^2_{\mathcal {H}}} \le \mathcal {E}_n(f^*)+\lambda _n\sum _{j=1}^p\sqrt{\Vert f^*_j\Vert _n^2+\rho _n \Vert f^*_j\Vert ^2_{\mathcal {H}}}. \end{aligned}$$

This can be rewritten as

$$\begin{aligned}&\mathcal {E}\big (\hat{f}\big )-\mathcal {E}(f^*)+\lambda _n\sum _{j=1}^p\sqrt{\Vert \hat{f}_j\Vert _n^2+\rho _n \Vert \hat{f}_j\Vert ^2_{\mathcal {H}}}\\&\quad \le [\mathcal {E}\big (\hat{f}\big )-\mathcal {E}(f^*)]-[\mathcal {E}_n\big (\hat{f}\big )-\mathcal {E}_n(f^*)]+ \lambda _n\sum _{j=1}^p\sqrt{\Vert f^*_j\Vert _n^2+\rho _n \Vert f^*_j\Vert ^2_{\mathcal {H}}}. \end{aligned}$$

By the triangle inequality, we get

$$\begin{aligned}&\mathcal {E}\big (\hat{f}\big )-\mathcal {E}(f^*)+\lambda _n\sum _{j\in S^c}\sqrt{\Vert \hat{f}_j\Vert _n^2+\rho _n \Vert \hat{f}_j\Vert ^2_{\mathcal {H}}}\nonumber \\&\quad \le [\mathcal {E}\big (\hat{f}\big )-\mathcal {E}(f^*)]- [\mathcal {E}_n\big (\hat{f}\big )-\mathcal {E}_n(f^*)]\nonumber \\&\qquad +\lambda _n\sum _{j\in S}\sqrt{\Vert \hat{f}_j-f^*_j\Vert _n^2+\rho _n \Vert \hat{f}_j-f^*_j\Vert ^2_{\mathcal {H}}}. \end{aligned}$$
(22)

Note that for \(j\in S^c\), we have \(\Vert \hat{f}_j\Vert _n=\Vert \hat{f}_j-f^*_j\Vert _n\) and \(\Vert \hat{f}_j\Vert _{\mathcal {H}}=\Vert \hat{f}_j-f^*_j\Vert _{\mathcal {H}}\), since \(f^*_j=0\) for \(j\in S^c\). Adding \(\lambda _n\sum _{j\in S}\sqrt{\Vert \hat{f}_j-f^*_j\Vert _n^2+\rho _n\Vert \hat{f}_j-f^*_j\Vert ^2_{\mathcal {H}}}\) to both sides of (22) implies that

$$\begin{aligned}&\mathcal {E}\big (\hat{f}\big )-\mathcal {E}(f^*)+\lambda _n\sum _{j=1}^p\sqrt{\Vert \hat{f}_j-f^*_j\Vert _n^2+\rho _n \Vert \hat{f}_j-f^*_j\Vert ^2_{\mathcal {H}}}\nonumber \\&\quad \le [\mathcal {E}\big (\hat{f}\big )-\mathcal {E}(f^*)]- [\mathcal {E}_n\big (\hat{f}\big )-\mathcal {E}_n(f^*)]\nonumber \\&\qquad +2\lambda _n\sum _{j\in S}\sqrt{\Vert \hat{f}_j-f^*_j\Vert _n^2+\rho _n \Vert \hat{f}_j-f^*_j\Vert ^2_{\mathcal {H}}}. \end{aligned}$$
(23)

Applying Lemma 5 to \(\Vert \hat{f}_j-f_j^*\Vert _{n}\), \(j=1,\ldots ,p\), with probability at least \(1-p^{-N}\), we have

$$\begin{aligned} \big \Vert \hat{f}_j-f^*_j\big \Vert _{n}^2\ge c^{-2}/2\big \Vert \hat{f}_j-f_j^*\big \Vert _{2}^2- \mu _n^2\big \Vert \hat{f}_j-f_j^*\big \Vert ^2_{\mathcal {H}}. \end{aligned}$$
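This display follows from the first inequality of Lemma 5 together with the elementary bound \((a+b)^2\le 2a^2+2b^2\): for \(f=\hat{f}_j-f^*_j\),

$$\begin{aligned} \Vert f\Vert _2^2\le c^2\left( \Vert f\Vert _n+\mu _n\Vert f\Vert _{\mathcal {H}}\right) ^2\le 2c^2\left( \Vert f\Vert _n^2+\mu _n^2\Vert f\Vert _{\mathcal {H}}^2\right) , \end{aligned}$$

and rearranging for \(\Vert f\Vert _n^2\) gives the stated lower bound.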

When \(\zeta >2\) is satisfied, (23) can be further bounded as

$$\begin{aligned}&\mathcal {E}\big (\hat{f}\big )-\mathcal {E}(f^*)+\lambda _n\sum _{j=1}^p\sqrt{c^{-2}/2\Vert \hat{f}_j-f^*_j\Vert _2^2+\rho _n/2 \Vert \hat{f}_j-f^*_j\Vert ^2_{\mathcal {H}}}\\&\quad \le [\mathcal {E}\big (\hat{f}\big )-\mathcal {E}(f^*)]- [\mathcal {E}_n\big (\hat{f}\big )-\mathcal {E}_n(f^*)] \\&\qquad +2\lambda _n\sum _{j\in S}\sqrt{\Vert \hat{f}_j-f^*_j\Vert _n^2+\rho _n \Vert \hat{f}_j-f^*_j\Vert ^2_{\mathcal {H}}}. \end{aligned}$$

We can claim that

$$\begin{aligned} \mu _n\big \Vert \hat{f}-f^*\big \Vert _{2,1}\le e^p,\quad \mu ^2_n\big \Vert \hat{f}-f^*\big \Vert _{\mathcal {H},1}\le e^p, \end{aligned}$$

with probability 1. For simplicity, we only verify the first term. Note that \(\Vert f_j\Vert _n\le \Vert f_j\Vert _{\mathcal {H}}\le 1\) for any \(f_j\in \mathcal {H}\), and we see that

$$\begin{aligned} \mu _n\big \Vert \hat{f}-f^*\big \Vert _{2,1}\le 2p\left( \frac{\log p}{n}\right) ^{\frac{1}{2(2-\alpha )}}\le 2p\left( \frac{\log p}{n}\right) ^{\frac{1}{4}}\le e^p,\quad \hbox {for all}\,n\ge 1,\,\alpha \in (0,1). \end{aligned}$$

This together with Proposition 4 implies that, with probability at least \(1-c_0\exp \left( -\frac{\log p}{c_0}\right) -3p^{-N/2}\),

$$\begin{aligned}&\mathcal {E}\big (\hat{f}\big )-\mathcal {E}(f^*)+\lambda _n/\sqrt{2}\sum _{j=1}^p\sqrt{c^{-2} \Vert \hat{f}_j-f^*_j\Vert _2^2+\rho _n \Vert \hat{f}_j-f^*_j\Vert ^2_{\mathcal {H}}}\\&\quad \le c_1\mu _n\sum _{j=1}^p \sqrt{\Vert \hat{f}_j-f^*_j\Vert _2^2+\mu ^2_n\Vert \hat{f}_j-f^*_j\Vert ^2_{\mathcal {H}}}\\&\qquad +e^{-p}+2\lambda _n\sum _{j\in S}\sqrt{\Vert \hat{f}_j-f^*_j\Vert _n^2+\rho _n \Vert \hat{f}_j-f^*_j\Vert ^2_{\mathcal {H}}}. \end{aligned}$$

Let \(\eta \) be sufficiently large that \(\max \{2\sqrt{2}c c_1,1\}\le \eta \); then, with the same probability as above, we have

$$\begin{aligned}&\mathcal {E}\big (\hat{f}\big )-\mathcal {E}(f^*) +\lambda _n/4\sum _{j=1}^p \sqrt{c^{-2}\Vert \hat{f}_j-f^*_j\Vert _2^2 +\rho _n \Vert \hat{f}_j-f^*_j\Vert ^2_{\mathcal {H}}} \nonumber \\&\quad \le e^{-p}+2\lambda _n\sum _{j\in S}\sqrt{\Vert \hat{f}_j-f^*_j\Vert _n^2+\rho _n \Vert \hat{f}_j-f^*_j\Vert ^2_{\mathcal {H}}}. \end{aligned}$$
(24)

On the other hand, with the choices \(\rho _n=\eta \mu _n\) and \(\lambda _n^2 =\eta \mu _n^2\), it follows that

$$\begin{aligned} \lambda _n\sum _{j\in S}\sqrt{\Vert \hat{f}_j-f^*_j\Vert _2^2+\rho _n \Vert \hat{f}_j-f^*_j\Vert ^2_{\mathcal {H}}}\le 4\sqrt{2}s\eta ^{3/2}\mu _n \sqrt{1+\mu ^2_n}, \end{aligned}$$

where we used the fact that \(\Vert f_j\Vert _n\le \Vert f_j\Vert _{\mathcal {H}}\le 1\) for any \(f_j\in \mathcal {H}\), \(j=1,\ldots ,p\). Plugging the above quantity into the right-hand side of (24) yields

$$\begin{aligned} \mathcal {E}\big (\hat{f}\big )-\mathcal {E}(f^*) \le 4\sqrt{2}s\eta ^{3/2}\mu _n \sqrt{1+\mu ^2_n}+e^{-p}. \end{aligned}$$

It is easily verified that \(p\ge \log n\) implies \(e^{-p}\le 4\sqrt{2}s\eta ^{3/2}\mu _n \sqrt{1+\mu ^2_n}\); then, we have

$$\begin{aligned} \mathcal {E}\big (\hat{f}\big )-\mathcal {E}(f^*) \le 8\sqrt{2}s\eta ^{3/2}\mu _n \sqrt{1+\mu ^2_n}. \end{aligned}$$

\(\square \)

About this article


Cite this article

Lv, S., He, X. & Wang, J. A unified penalized method for sparse additive quantile models: an RKHS approach. Ann Inst Stat Math 69, 897–923 (2017). https://doi.org/10.1007/s10463-016-0566-9
