Abstract
We study the asymptotic properties of a new version of the Sparse Group Lasso estimator (SGL), called adaptive SGL. This new version includes two distinct regularization parameters, one for the Lasso penalty and one for the Group Lasso penalty, and we consider the adaptive version of this regularization, where both penalties are weighted by preliminary random coefficients. The asymptotic properties are established in a general framework, where the data are dependent and the loss function is convex. We prove that this estimator satisfies the oracle property: the sparsity-based estimator recovers the true underlying sparse model and is asymptotically normally distributed. We also study its asymptotic properties in a double-asymptotic framework, where the number of parameters diverges with the sample size. We show by simulations and on real data that the adaptive SGL outperforms other oracle-like methods in terms of estimation precision and variable selection.
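As a concrete reading of the penalty described above, the following sketch evaluates an adaptive SGL penalty with the two regularization parameters \(\lambda \) (Lasso part) and \(\gamma \) (Group Lasso part), using the preliminary-estimator weights \(\alpha ^{(k)}_i = |{\tilde{\theta }}^{(k)}_i|^{-\eta }\) and \(\xi _k = \Vert {\tilde{\varvec{\theta }}}^{(k)}\Vert ^{-\mu }_2\) that appear in the appendix; the function and variable names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def adaptive_sgl_penalty(theta, groups, lam, gam, theta_tilde, eta=1.0, mu=1.0):
    """Adaptive SGL penalty with two regularization parameters:
    lam * sum_k sum_i alpha_{k,i} |theta_{k,i}|      (weighted Lasso part)
    + gam * sum_k xi_k ||theta^{(k)}||_2             (weighted Group Lasso part),
    where alpha_{k,i} = |theta_tilde_{k,i}|^{-eta} and
    xi_k = ||theta_tilde^{(k)}||_2^{-mu} come from a preliminary estimator
    (assumed to have no exactly zero entries)."""
    total = 0.0
    for idx in groups:                      # idx: coordinate indices of group k
        t, t0 = theta[idx], theta_tilde[idx]
        alpha = np.abs(t0) ** (-eta)        # coordinate-wise adaptive weights
        xi = np.linalg.norm(t0) ** (-mu)    # group-wise adaptive weight
        total += lam * np.sum(alpha * np.abs(t)) + gam * xi * np.linalg.norm(t)
    return total

# Toy example: two groups, with a preliminary estimate theta_tilde.
theta = np.array([1.0, 0.0, 2.0])
theta_tilde = np.array([1.0, 1.0, 2.0])
groups = [np.array([0, 1]), np.array([2])]
pen = adaptive_sgl_penalty(theta, groups, lam=1.0, gam=1.0, theta_tilde=theta_tilde)
```

Coordinates with small preliminary estimates receive large weights, which is the mechanism behind the oracle property studied below.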
References
Andersen, P. K., Gill, R. D. (1982). Cox’s regression model for counting processes: A large sample study. The Annals of Statistics, 10(4), 1100–1120.
Bertsekas, D. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Billingsley, P. (1961). The Lindeberg–Levy theorem for martingales. Proceedings of the American Mathematical Society, 12, 788–792.
Billingsley, P. (1995). Probability and measure. New York: Wiley.
Bühlmann, P., van de Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. Springer Series in Statistics. Berlin: Springer.
Chernozhukov, V. (2005). Extremal quantile regression. The Annals of Statistics, 33(2), 806–839.
Chernozhukov, V., Hong, H. (2004). Likelihood estimation and inference in a class of nonregular econometric models. Econometrica, 72(5), 1445–1480.
Davis, R. A., Knight, K., Liu, J. (1992). M-estimation for autoregressions with infinite variance. Stochastic Processes and Their Applications, 40, 145–180.
Fan, J. (1997). Comments on wavelets in statistics: A review by A. Antoniadis. Journal of the Italian Statistical Association, 6, 131–138.
Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
Fan, J., Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32(3), 928–961.
Francq, C., Thieu, L. Q. (2015). QML inference for volatility models with covariates. MPRA paper no. 63198.
Francq, C., Zakoïan, J. M. (2010). GARCH models. Chichester: Wiley.
Fu, W. J. (1998). Penalized regression: the Bridge versus the Lasso. Journal of Computational and Graphical Statistics, 7, 397–416.
Geyer, C. J. (1996). On the asymptotics of convex stochastic optimization. Unpublished manuscript.
Hjort, N. L., Pollard, D. (1993). Asymptotics for minimisers of convex processes. Unpublished manuscript.
Huber, P. J. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 1(5), 799–821.
Hunter, D. R., Li, R. (2005). Variable selection using MM algorithms. The Annals of Statistics, 33(4), 1617–1642.
Kato, K. (2009). Asymptotics for argmin processes: Convexity arguments. Journal of Multivariate Analysis, 100, 1816–1829.
Knight, K., Fu, W. (2000). Asymptotics for Lasso-type estimators. The Annals of Statistics, 28(5), 1356–1378.
Li, X., Mo, L., Yuan, X., Zhang, J. (2014). Linearized alternating direction method of multipliers for Sparse Group and Fused Lasso models. Computational Statistics and Data Analysis, 79, 203–221.
Nardi, Y., Rinaldo, A. (2008). On the asymptotic properties of the Group Lasso estimator for linear models. Electronic Journal of Statistics, 2, 605–633.
Neumann, M. H. (2013). A central limit theorem for triangular arrays of weakly dependent random variables, with applications in statistics. ESAIM: Probability and Statistics, 17, 120–134.
Newey, W. K., Powell, J. L. (1987). Asymmetric least squares estimation and testing. Econometrica, 55(4), 819–847.
Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory, 7(2), 186–199.
Racine, J. (2000). Consistent cross-validatory model-selection for dependent data: hv-block cross-validation. Journal of Econometrics, 99, 39–61.
Rio, E. (2013). Inequalities and limit theorems for weakly dependent sequences. Lecture notes, 3ème cycle, cel-00867106, 170.
Rockafellar, R. T. (1970). Convex analysis. Princeton: Princeton University Press.
Shiryaev, A. N. (1991). Probability. Berlin: Springer.
Simon, N., Friedman, J., Hastie, T., Tibshirani, R. (2013). A Sparse Group Lasso. Journal of Computational and Graphical Statistics, 22(2), 231–245.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B, 58(1), 267–288.
Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using \(l^1\)-constrained quadratic programming. IEEE Transactions on Information Theory, 55(5), 2183–2202.
Wellner, J. A., van der Vaart, A. W. (1996). Weak convergence and empirical processes. With applications to statistics. New York, NY: Springer.
Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society. Series B, 68(1), 49–67.
Zou, H. (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.
Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301–320.
Zou, H., Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 37(4), 1733–1751.
Acknowledgements
I would like to thank Alexandre Tsybakov, Arnak Dalalyan, Jean-Michel Zakoïan and Christian Francq for all the theoretical references they provided, and I warmly thank Jean-David Fermanian for his significant help and helpful comments. I gratefully acknowledge the support of the Ecodec Laboratory and of the Japan Society for the Promotion of Science.
Appendix
We first introduce some preliminary results. The dependent setting requires more sophisticated probabilistic tools to derive asymptotic results than the i.i.d. case. Assumptions 1 and 4 allow us to use the central limit theorem of Billingsley (1961). We restate this result, which appears as a corollary in Billingsley (1961).
Corollary 1
(Billingsley 1961) If \((x_t,{{\mathcal {F}}}_t)\) is a stationary and ergodic sequence of square integrable martingale increments such that \(\sigma ^2_x = \text {Var}(x_t) \ne 0\), then \(T^{-1/2} \sum ^{T}_{t=1} x_t \overset{d}{\rightarrow } {{\mathcal {N}}}(0,\sigma ^2_x)\).
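Corollary 1 can be checked numerically. The sketch below (an illustration under assumed parameter values, not part of the paper) simulates an ARCH(1) sequence, a standard example of a stationary, ergodic, square-integrable martingale difference, and verifies that \(T^{-1/2}\sum _t x_t\) has approximately the stationary variance \(a_0/(1-a_1)\).

```python
import numpy as np

rng = np.random.default_rng(0)

def arch_md(T, a0=0.2, a1=0.3, burn=200):
    """Simulate x_t = sigma_t * z_t with sigma_t^2 = a0 + a1 * x_{t-1}^2, an
    ARCH(1) process: a stationary, ergodic, square-integrable martingale
    difference sequence with Var(x_t) = a0 / (1 - a1)."""
    z = rng.standard_normal(T + burn)
    x = np.zeros(T + burn)
    for t in range(1, T + burn):
        x[t] = np.sqrt(a0 + a1 * x[t - 1] ** 2) * z[t]
    return x[burn:]          # drop the burn-in to approximate stationarity

# Monte Carlo check of Corollary 1: T^{-1/2} sum_t x_t is approximately
# N(0, sigma_x^2) with sigma_x^2 = a0 / (1 - a1).
T, R = 1500, 300
sums = np.array([arch_md(T).sum() / np.sqrt(T) for _ in range(R)])
sigma2 = 0.2 / (1 - 0.3)
```

The empirical mean and variance of `sums` should be close to 0 and `sigma2`, even though the \(x_t\) are dependent.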
Note that the square-integrable martingale difference condition can be replaced by \(\alpha \)-mixing and moment conditions. For instance, Rio (2013) provides a central limit theorem for strongly mixing stationary sequences.
To prove Theorem 1, we recall Theorem II.1 of Andersen and Gill (1982), which shows that pointwise convergence in probability of random concave functions implies uniform convergence on compact subsets.
Theorem 9
(Andersen and Gill 1982) Let E be an open convex subset of \({{\mathbb {R}}}^p\), and let \(F_1, F_2,\ldots ,\) be a sequence of random concave functions on E such that \(F_n(x) \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} f(x)\) for every \(x \in E\), where f is some real function on E. Then f is also concave, and for all compact \(A \subset E\), \(\underset{x \in A}{\sup }\, |F_n(x) - f(x)| \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} 0.\)
The proof of this theorem is based on a diagonal argument and Theorem 10.8 of Rockafellar (1970): the pointwise convergence of concave random functions on a dense and countable subset of an open set implies uniform convergence on any compact subset of that open set. The following corollary then holds.
Corollary 2
(Andersen and Gill 1982) Assume \(F_n(x) \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} f(x)\), for every \(x \in E\), an open convex subset of \({{\mathbb {R}}}^p\). Suppose f has a unique maximum at \(x_0 \in E\). Let \({\hat{X}}_n\) maximize \(F_n\). Then \({\hat{X}}_n \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} x_0\).
Newey and Powell (1987) use a similar theorem to prove the consistency of asymmetric least squares estimators without any compactness assumption on \(\varTheta \). We apply these results in our framework, where the parameter set \(\varTheta \) is assumed to be convex.
We use a convexity argument to derive the asymptotic distribution of the SGL estimator. Chernozhukov and Hong (2004) and Chernozhukov (2005) use this argument to obtain the asymptotic distribution of quantile-regression-type estimators. It relies on the convexity lemma, a key result for obtaining an asymptotic distribution when the objective function is not differentiable; it only requires lower semicontinuity and convexity of the empirical criterion. The convexity lemma, as in the proof of Theorem 4.1 of Chernozhukov (2005), can be stated as follows:
Lemma 1
(Chernozhukov 2005) Suppose
- (i)
a sequence of convex lower-semicontinuous \({{\mathbb {F}}}_T: {{\mathbb {R}}}^d \rightarrow {\bar{{{\mathbb {R}}}}}\) marginally converges to \({{\mathbb {F}}}_{\infty }: {{\mathbb {R}}}^d \rightarrow {\bar{{{\mathbb {R}}}}}\) over a dense subset of \({{\mathbb {R}}}^d\);
- (ii)
\({{\mathbb {F}}}_{\infty }\) is finite over a non-empty open set \(E \subset {{\mathbb {R}}}^d\);
- (iii)
\({{\mathbb {F}}}_{\infty }\) is uniquely minimized at a random vector \(\varvec{u}_{\infty }\).
Then \(\underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min }\, \{{{\mathbb {F}}}_T(\varvec{u})\} \overset{d}{\longrightarrow } \varvec{u}_{\infty }\).
This is a key argument used in Theorem 3, Proposition 1 and Theorem 5.
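A minimal numerical illustration of why no smoothness is needed (a toy example under assumed distributions, not the paper's criterion): the least-absolute-deviation criterion is convex and lower-semicontinuous but not differentiable, yet pointwise convergence of the criterion plus uniqueness of the limit minimizer is enough for the minimizer, the sample median, to converge.

```python
import numpy as np

rng = np.random.default_rng(1)

# The criterion G_T(theta) = (1/T) sum_t |x_t - theta| is convex and
# lower-semicontinuous but not differentiable at the data points. Its
# pointwise limit E|x - theta| is uniquely minimized at the population
# median, so the argmin (the sample median) converges -- exactly the
# mechanism of the convexity lemma, with no smoothness required.
estimates = {}
for T in (100, 10_000):
    x = rng.standard_normal(T) + 0.5        # population median = 0.5
    estimates[T] = np.median(x)             # argmin of the convex criterion
```

With the larger sample, the minimizer of the nondifferentiable criterion is close to the population median 0.5.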
When we consider a diverging number of parameters, the empirical criterion can be viewed as a sequence of dependent triangular arrays, for which we need refined asymptotic results. Shiryaev (1991) proposed a version of the central limit theorem for dependent arrays, provided the sequence is a square-integrable martingale difference satisfying the so-called Lindeberg condition. A similar theorem can be found in Billingsley (1995, Theorem 35.12, p. 476). We state here the theorem of Shiryaev (Theorem 4, p. 543 of Shiryaev 1991) that we use to derive the asymptotic distribution of the adaptive SGL estimator.
Theorem 10
(Shiryaev 1991) Let a sequence of square integrable martingale differences \(\xi ^n = (\xi _{nk},{{\mathcal {F}}}^n_k),n \ge 1\), with \({{\mathcal {F}}}^n_k = \sigma (\xi _{ns},s \le k)\), satisfy the Lindeberg condition: for any \(0<t\le 1\) and every \(\epsilon > 0\), \(\overset{\lfloor nt \rfloor }{\underset{k=0}{\sum }} {{\mathbb {E}}}[\xi ^2_{nk} {\mathbf {1}}_{\{|\xi _{nk}| > \epsilon \}} | {{\mathcal {F}}}^n_{k-1} ] \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} 0;\)
then if \(\overset{\lfloor nt \rfloor }{\underset{k=0}{\sum }} {{\mathbb {E}}}[\xi ^2_{nk}| {{\mathcal {F}}}^n_{k-1} ] \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} \sigma ^2_t\), or \(\overset{\lfloor nt \rfloor }{\underset{k=0}{\sum }} \xi ^2_{nk} \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} \sigma ^2_t\), then \(\overset{\lfloor nt \rfloor }{\underset{k=0}{\sum }} \xi _{nk} \overset{d}{\longrightarrow } {{\mathcal {N}}}(0,\sigma ^2_t).\)
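Theorem 10 can be illustrated on the simplest triangular array (a hedged toy sketch with i.i.d. Gaussian increments, chosen for transparency): \(\xi _{nk} = z_k/\sqrt{n}\), for which the Lindeberg condition holds and \(\sigma ^2_t = t\,\text {Var}(z)\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Triangular array of martingale differences xi_{nk} = z_k / sqrt(n), with
# z_k i.i.d. standard normal. Then sum_{k <= nt} xi_{nk}^2 -> t in
# probability, so by the theorem the partial sum at time t is asymptotically
# N(0, sigma_t^2) with sigma_t^2 = t.
n, R, t = 4000, 400, 0.5
m = int(n * t)
partial_sums = np.array(
    [rng.standard_normal(m).sum() / np.sqrt(n) for _ in range(R)]
)
```

The Monte Carlo mean and variance of `partial_sums` should be close to 0 and \(t = 0.5\), matching the limit \({{\mathcal {N}}}(0,\sigma ^2_t)\).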
There exist central limit results relaxing the stationarity and martingale difference assumptions for sequences of arrays. Neumann (2013) proposed such a central limit theorem for weakly dependent sequences of arrays. Such sequences should also satisfy a Lindeberg condition and conditions on covariances. Equipped with these preliminary results, we now report the proofs of Sect. 4.
Proof of Theorem 1
By definition, \({\hat{\varvec{\theta }}} = \underset{\varvec{\theta }\in \varTheta }{\arg \, \min } \, \{{{\mathbb {G}}}_T \varphi (\varvec{\theta })\}\). As a first step, we prove the uniform convergence of \({{\mathbb {G}}}_T \varphi (.)\) to the limit quantity \({{\mathbb {G}}}_{\infty }\varphi (.)\) on any compact set \(\varvec{B}\subset \varTheta \), that is, \(\underset{\varvec{\theta }\in \varvec{B}}{\sup }\, |{{\mathbb {G}}}_T \varphi (\varvec{\theta }) - {{\mathbb {G}}}_{\infty }\varphi (\varvec{\theta })| \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} 0.\) (7)
We define an open convex set \({{\mathcal {C}}}\subset \varTheta \) and pick \(\varvec{x}\in {{\mathcal {C}}}\). Then, by Assumption 1, the law of large numbers implies
Consequently, if \(\lambda _T / T \rightarrow \lambda _0 \ge 0\) and \(\gamma _T / T \rightarrow \gamma _0 \ge 0\), we obtain the pointwise convergence
By Theorem 9 of Andersen and Gill (1982), \({{\mathbb {G}}}_{\infty } \varphi (.)\) is a convex function, and we deduce the desired uniform convergence over any compact subset of \(\varTheta \), that is, (7).
We now want to show that \(\arg \, \min \, \{{{\mathbb {G}}}_T \varphi (.)\} \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \arg \, \min \, \{{{\mathbb {G}}}_{\infty } \varphi (.)\}\). By Assumption 3, \(\varphi (.)\) is convex, which implies
Consequently, \(\arg \, \min \{{{\mathbb {G}}}_T \varphi (\varvec{x})\} = O(1)\), so that \({\hat{\varvec{\theta }}} \in {{\mathcal {B}}}_o(\varvec{\theta }_0,C)\) with probability approaching one for C large enough, where \({{\mathcal {B}}}_o(\varvec{\theta }_0,C)\) is an open ball centered at \(\varvec{\theta }_0\) with radius C. Furthermore, as \({{\mathbb {G}}}_{\infty } \varphi (.)\) is convex and continuous, \(\underset{\varvec{x}\in B}{\arg \, \min } \, \{{{\mathbb {G}}}_{\infty } \varphi (\varvec{x})\}\) exists and is unique. Then, by Corollary 2 of Andersen and Gill (1982), we obtain
\(\square \)
Proof of Theorem 2
We denote \(\nu _T = T^{-1/2} + \lambda _T T^{-1} a + \gamma _T T^{-1} b\), with \(a = \text {card}({{\mathcal {A}}})(\underset{k}{\max } \; \alpha _k)\) and \(b = \text {card}({{\mathcal {A}}})(\underset{l}{\max } \; \xi _l)\). We would like to prove that for any \(\varvec{\epsilon }> 0\), there exists \(C_{\varvec{\epsilon }} > 0\) such that \({{\mathbb {P}}}(\nu ^{-1}_T\Vert {\hat{\varvec{\theta }}} - \varvec{\theta }_0\Vert > C_{\varvec{\epsilon }}) < \varvec{\epsilon }\). We have
\(\Vert \varvec{u}\Vert _2\) can potentially be large as it represents the discrepancy \({\hat{\varvec{\theta }}}-\varvec{\theta }_0\) normalized by \(\nu _T\). Now based on the convexity of the objective function, we have
a relationship that allows us to work with a fixed \(\Vert \varvec{u}\Vert _2\). Let us define \(\varvec{\theta }_1 = \varvec{\theta }_0 + \nu _T \varvec{u}^*\) such that \({{\mathbb {G}}}_T \varphi (\varvec{\theta }_1) \le {{\mathbb {G}}}_T \varphi (\varvec{\theta }_0)\). Let \(\alpha \in (0,1)\) and \(\varvec{\theta }= \alpha \varvec{\theta }_1 + (1-\alpha ) \varvec{\theta }_0\). Then by convexity of \({{\mathbb {G}}}_T \varphi (.)\), we obtain
We pick \(\alpha \) such that \(\Vert {\bar{\varvec{u}}}\Vert = C_{\varvec{\epsilon }}\) with \({\bar{\varvec{u}}} := \alpha \varvec{\theta }_1 + (1-\alpha ) \varvec{\theta }_0\). Hence (8) holds, which implies
Hence, we pick a \(\varvec{u}\) such that \(\Vert \varvec{u}\Vert _2 = C_{\varvec{\epsilon }}\). Using \(\varvec{p}_1(\lambda _T,\alpha ,0) = 0\) and \(\varvec{p}_2(\gamma _T,\xi ,0) = 0\), by a Taylor expansion to \({{\mathbb {G}}}_T l(\varvec{\theta }_0 + \nu _T \varvec{u})\), we obtain
where \({\bar{\varvec{\theta }}}\) is defined as \(\Vert {\bar{\varvec{\theta }}} - \varvec{\theta }_0\Vert \le \Vert \varvec{\theta }_T - \varvec{\theta }_0\Vert \). We want to prove
where \({{\mathcal {R}}}_T(\varvec{\theta }_0) = \overset{d}{\underset{k,l=1}{\sum }} \varvec{u}_k \varvec{u}_l \{\partial ^2_{\theta _k \theta _l} {{\mathbb {G}}}_T l(\varvec{\theta }_0) - {{\mathbb {E}}}[\partial ^2_{\theta _k \theta _l} {{\mathbb {G}}}_T l(\varvec{\theta }_0)]\}\). By Assumption 1, \((\varvec{\epsilon }_t)\) is a non-anticipative stationary solution and is ergodic. As a square integrable martingale difference by Assumption 4,
by the central limit theorem of Billingsley (1961), which implies \({\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) \varvec{u}= O_p(T^{-1/2} \varvec{u}' {{\mathbb {M}}}\varvec{u})\). By the ergodic theorem of Billingsley (1995), we have
This implies \({{\mathcal {R}}}_T(\varvec{\theta }_0) = o_p(1)\). Furthermore, by the Markov inequality, for \(b > 0\)
where \(\eta (C_{\varvec{\epsilon }})\) is defined in Assumption 6. We now focus on the penalty terms. As \(\varvec{p}_1(\lambda _T,\alpha ,0)=0\), for the \(l^1\) norm penalty, we have
As for the \(l^1/l^2\) norm, we obtain
Then denoting by \(\delta _T = \lambda _{\min }({{\mathbb {H}}}) C^2_{\varvec{\epsilon }} \nu _T/2\), and using \(\frac{\nu _T}{2} {{\mathbb {E}}}[\varvec{u}' \ddot{{{\mathbb {G}}}}_T l(\varvec{\theta }_0) \varvec{u}] \ge \delta _T\), we deduce that (9) can be bounded as
We also have for \(C_{\varvec{\epsilon }}\) and T large enough, and using norm equivalences that
Moreover, if \(\nu _T = T^{-1/2} + \lambda _T T^{-1} a + \gamma _T T^{-1} b\), then for \(C_{\varvec{\epsilon }}\) large enough
Moreover
where \(C_{st} > 0\) is a generic constant. We obtain
for \(C_{\varvec{\epsilon }}\) and T large enough. We then deduce \(\Vert {\hat{\varvec{\theta }}} - \varvec{\theta }_0\Vert = O_p(\nu _T)\). \(\square \)
Proof of Theorem 3
Let \(\varvec{u}\in {{\mathbb {R}}}^d\) be such that \(\varvec{\theta }= \varvec{\theta }_0 + \varvec{u}/T^{1/2}\), and define the empirical criterion \({{\mathbb {F}}}_T(\varvec{u}) = T {{\mathbb {G}}}_T (\varphi (\varvec{\theta }_0 + \varvec{u}/T^{1/2}) - \varphi (\varvec{\theta }_0))\). First, we prove the convergence of the finite-dimensional distributions of \({{\mathbb {F}}}_T\) to those of \({{\mathbb {F}}}_{\infty }\). Then we use the convexity of \({{\mathbb {F}}}_T(.)\) to obtain the convergence in distribution of the \(\arg \, \min \) of the empirical criterion to the \(\arg \, \min \) of the limit process. To do so, let \(\varvec{u}= \sqrt{T}(\varvec{\theta }- \varvec{\theta }_0)\). We have
where \({{\mathbb {F}}}_T(.)\) is convex and belongs to \(C^0({{\mathbb {R}}}^d)\). We now prove the convergence of the finite-dimensional distributions of \({{\mathbb {F}}}_T\) to those of \({{\mathbb {F}}}_{\infty }\) in order to apply Lemma 1. For the \(l^1\) penalty, for any group k, we have, for T sufficiently large,
which implies that
under the condition that \(\lambda _T / \sqrt{T} \rightarrow \lambda _0\). As for the \(l^1/l^2\) quantity, for any group l, we have
Consequently, if \(\gamma _T T^{-1/2} \rightarrow \gamma _0 \ge 0\), we obtain
Now for the unpenalized criterion \({{\mathbb {G}}}_T l(.)\), by a Taylor expansion, we have
where \({\bar{\varvec{\theta }}}\) satisfies \(\Vert {\bar{\varvec{\theta }}} - \varvec{\theta }_0\Vert \le \Vert \varvec{u}\Vert /\sqrt{T}\). Then, by Assumption 4 and the central limit theorem of Billingsley (1961), \(\sqrt{T} {\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) \overset{d}{\longrightarrow } {{\mathcal {N}}}(0,{{\mathbb {M}}})\), and by the ergodic theorem, \(\ddot{{{\mathbb {G}}}}_T l(\varvec{\theta }_0) \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} {{\mathbb {H}}}\). Furthermore, we have by Assumption 6
for C large enough, such that \(\upsilon _t(C) = \underset{k,l,m=1,\ldots ,d}{\sup } \{ \underset{\varvec{\theta }:\Vert \varvec{\theta }-\varvec{\theta }_0\Vert _2 \le \nu _T C}{\sup } |\partial ^3_{\theta _k \theta _l \theta _m} l(\varvec{\epsilon }_t;\varvec{\theta })|\}\) with \(\nu _T = T^{-1/2} + \lambda _T T^{-1} a_T + \gamma _T T^{-1} b_T\). We deduce \(\nabla '\{\varvec{u}' \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}}) \varvec{u}\} \varvec{u}= O_p(\Vert \varvec{u}\Vert ^3_2 \eta (C))\). We obtain
We have thus proved that \({{\mathbb {F}}}_T(\varvec{u}) \overset{d}{\longrightarrow } {{\mathbb {F}}}_{\infty }(\varvec{u})\) for fixed \(\varvec{u}\). Let us observe that
and \({{\mathbb {F}}}_T(.)\) admits the minimizer \(\varvec{u}^*_T = \sqrt{T}({\hat{\varvec{\theta }}} - \varvec{\theta }_0)\). As \({{\mathbb {F}}}_T\) is convex and \({{\mathbb {F}}}_{\infty }\) is continuous, convex and has a unique minimum by Assumption 5, the convexity lemma (Lemma 1) yields
\(\square \)
Proof of Proposition 1
In Theorem 3, we proved \(\sqrt{T}({\hat{\varvec{\theta }}} - \varvec{\theta }_0) := \underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min } \{{{\mathbb {F}}}_T\} \overset{d}{\longrightarrow } \underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min } \{{{\mathbb {F}}}_{\infty }\}\) for \(\lambda _T/\sqrt{T} \rightarrow \lambda _0\) and \(\gamma _T / \sqrt{T} \rightarrow \gamma _0\). The limit random function is
First, let us observe that
Both sets describing \(\{{\hat{{{\mathcal {A}}}}} = {{\mathcal {A}}}\}\) are symmetric, and thus we can focus on
Hence
Denoting by \( \varvec{u}^*:= \underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min } \{{{\mathbb {F}}}_{\infty }(\varvec{u})\}\), Theorem 3 corresponds to \(\sqrt{T}({\hat{\varvec{\theta }}}_{{{\mathcal {A}}}} - \varvec{\theta }_{0,{{\mathcal {A}}}}) \overset{d}{\longrightarrow } \varvec{u}^*_{{{\mathcal {A}}}}\). By the Portmanteau theorem (see Wellner and van der Vaart 1996), we have
as \(\varvec{\theta }_{0,{{\mathcal {A}}}^c} = {\mathbf {0}}\). Consequently, we need to prove that the probability on the right-hand side is strictly less than 1; it is upper-bounded by
If \(\lambda _0 = \gamma _0 = 0\), then \(\varvec{u}^* = -{{\mathbb {H}}}^{-1} \varvec{Z}\) so that \({{\mathbb {P}}}_{\varvec{u}^*} = {{\mathcal {N}}}(0,{{\mathbb {H}}}^{-1} {{\mathbb {M}}}{{\mathbb {H}}}^{-1})\). Hence \(c = 0\).
If \(\lambda _0 \ne 0\) or \(\gamma _0 \ne 0\), the necessary and sufficient optimality conditions for a group k tell us that \(\varvec{u}^*\) satisfies
where \(\varvec{w}^{(k)}\) and \(\varvec{z}^{(k)}\) are the subgradients of \(\Vert \varvec{u}^{(k)}\Vert _1\) and \(\Vert \varvec{u}^{(k)}\Vert _2\) given by
and \(\varvec{p}^{(k)}_i = \partial _{\varvec{u}_i} \{ |\varvec{u}^{(k)}_i| {\mathbf {1}}_{\theta ^{(k)}_{0,i} = 0} + \varvec{u}^{(k)}_i \text {sgn}(\theta ^{(k)}_{0,i}){\mathbf {1}}_{\theta ^{(k)}_{0,i} \ne 0} \}\).
If \(\varvec{u}^{(m) *} = 0, \forall m \notin {{\mathcal {S}}}\), then the optimality conditions (11) become
with \(\tau _{{{\mathcal {S}}}} = \text {vec}(k \in {{\mathcal {S}}}, \alpha _k \varvec{p}^{(k)})\) and \(\zeta _{{{\mathcal {S}}}} = \text {vec}(k \in {{\mathcal {S}}}, \xi _k \frac{\varvec{\theta }^{(k)}_0}{\Vert \varvec{\theta }^{(k)}_0\Vert _2})\), which are vectors of \({{\mathbb {R}}}^{\text {card}({{\mathcal {S}}})}\).
For \(k \in {{\mathcal {S}}}\), that is, when the vector \(\varvec{\theta }^{(k)}_0\) has at least one nonzero component, we have
Consequently, if \(\varvec{u}^{(k) *}_i = 0, \forall i \in {{\mathcal {A}}}^c_k\), with \(k \in {{\mathcal {S}}}\), then the conditions (13) become
Combining relationships in (12), we obtain
The same reasoning applies for active groups with inactive components, so that combining relationships in (13), we obtain
Hence we deduce
Under the assumption that \(\lambda _0 < \infty \) and \(\gamma _0 < \infty \), we obtain
Thus \(c < 1\), which proves (10), that is, Proposition 1. \(\square \)
Proof of Theorem 4
The proof relies on the same steps as in the proof of Theorem 2.
\(\square \)
Proof of Theorem 5
We start with the asymptotic distribution and proceed as in the proof of Theorem 3, where we used Lemma 1. To do so, we prove the finite dimensional convergence in distribution of the empirical criterion \({{\mathbb {F}}}_T(\varvec{u})\) to \({{\mathbb {F}}}_{\infty }(\varvec{u})\) with \(\varvec{u}\in {{\mathbb {R}}}^d\), where these quantities are, respectively, defined as
and
with \(\varvec{Z}_{{{\mathcal {A}}}} \sim {{\mathcal {N}}}(0,{{\mathbb {M}}}_{{{\mathcal {A}}}{{\mathcal {A}}}})\). By Lemma 1, the finite-dimensional convergence in distribution implies \(\underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min }\{{{\mathbb {F}}}_T(\varvec{u})\} \overset{d}{\longrightarrow } \underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min }\{{{\mathbb {F}}}_{\infty }(\varvec{u})\}\). We first consider the unpenalized part of \({{\mathbb {F}}}_T(.)\), which can be expanded as
where \({\bar{\varvec{\theta }}}\) lies between \(\varvec{\theta }_0\) and \(\varvec{\theta }_0 + \varvec{u}/\sqrt{T}\). First, using the same reasoning on the third-order term, we obtain \(\frac{1}{6 \sqrt{T}} \nabla '\{\varvec{u}' \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}}) \varvec{u}\}\varvec{u}\overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} 0\). By the ergodic theorem, we deduce \(\ddot{{{\mathbb {G}}}}_T l(\varvec{\theta }_0) \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} {{\mathbb {H}}}\), and by Assumption 4, \(\sqrt{T}{\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) \overset{d}{\longrightarrow } {{\mathcal {N}}}(0,{{\mathbb {M}}})\).
We now focus on the penalty terms of (4). Recall that \(\alpha ^{(k)}_{T,i} = |{\tilde{\theta }}^{(k)}_i|^{-\eta }\), so that for \(i \in {{\mathcal {A}}}_k, k \in {{\mathcal {S}}}\), \({\tilde{\theta }}^{(k)}_i \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \theta ^{(k)}_{0,i} \ne 0\). Note that
This implies that, for \(i \in {{\mathcal {A}}}_k\), \(k \in {{\mathcal {S}}}\), we have
under the condition \(\lambda _T T^{-1/2} \rightarrow 0\). For \(i \in {{\mathcal {A}}}^c_k\), \(\theta ^{(k)}_{0,i} = 0\), then \(T^{\eta /2} (|{\tilde{\theta }}^{(k)}_i|)^{\eta } = O_p(1)\). Hence under the assumption \(\lambda _T T^{(\eta -1)/2} \rightarrow \infty \), we obtain
As for the \(l^1/l^2\) quantity, recall that \(\xi _{T,l} = \Vert {\tilde{\varvec{\theta }}}^{(l)}\Vert ^{-\mu }_2\), so that for \(l \in {{\mathcal {S}}}\), \({\tilde{\varvec{\theta }}}^{(l)} \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \varvec{\theta }^{(l)}_0\), and in this case
Consequently, using \(\gamma _T T^{-1/2} \rightarrow 0\), and for \(l \in {{\mathcal {S}}}\), we obtain
Combining the fact that \(k \in {{\mathcal {S}}}\) with \(\varvec{\theta }^{(k)}_0\) being partially zero, that is, \(i \in {{\mathcal {A}}}^c_k\), we obtain the divergence given in (15). Furthermore, if \(l \notin {{\mathcal {S}}}\), that is, \(\varvec{\theta }^{(l)}_0 = 0\), then
and \(T^{\mu /2} (\Vert {\tilde{\varvec{\theta }}}^{(l)}\Vert _2)^{\mu } = O_p(1)\). Then by \(\gamma _T T^{(\mu -1)/2} \rightarrow \infty \) we have
We deduce the pointwise convergence \({{\mathbb {F}}}_T (\varvec{u}) \overset{d}{\longrightarrow } {{\mathbb {F}}}_{\infty }(\varvec{u})\), where \({{\mathbb {F}}}_{\infty }(.)\) is given in (14). As \({{\mathbb {F}}}_T(.)\) is convex and \({{\mathbb {F}}}_{\infty }(.)\) is convex and has a unique minimum \(({{\mathbb {H}}}^{-1}_{{{\mathcal {A}}}{{\mathcal {A}}}} \varvec{Z}_{{{\mathcal {A}}}},{\mathbf {0}}_{{{\mathcal {A}}}^c})\) since \({{\mathbb {H}}}\) is positive definite, by Lemma 1, we obtain
that is to say \(\sqrt{T}({\hat{\varvec{\theta }}}_{{{\mathcal {A}}}} - \varvec{\theta }_{0,{{\mathcal {A}}}}) \overset{d}{\longrightarrow } {{\mathbb {H}}}^{-1}_{{{\mathcal {A}}}{{\mathcal {A}}}} \varvec{Z}_{{{\mathcal {A}}}}, \; \text {and} \; \sqrt{T}({\hat{\varvec{\theta }}}_{{{\mathcal {A}}}^c} - \varvec{\theta }_{0,{{\mathcal {A}}}^c}) \overset{d}{\longrightarrow } \mathbf {0}_{{{\mathcal {A}}}^c}\).
We now prove the model selection consistency. Let \(i \in {{\mathcal {A}}}_k\); then, by the asymptotic normality result, \({\hat{\theta }}^{(k)}_i \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \theta ^{(k)}_{0,i}\), which implies \({{\mathbb {P}}}(i \in {\hat{{{\mathcal {A}}}}}_k) \rightarrow 1\). It thus remains to prove
This problem can be split into two parts as
Let us start with the case \(k \notin {{\mathcal {S}}}\). If \(k \in {\hat{{{\mathcal {S}}}}}\), by the optimality conditions given by the Karush–Kuhn–Tucker theorem applied to \({{\mathbb {G}}}_T \psi ({\hat{\varvec{\theta }}})\), we have
where \(\odot \) is the element-by-element vector product, and
Multiplying the unpenalized part by \(T^{1/2}\), we have the expansion
which is asymptotically normal by consistency, by Assumption 6 (which bounds the third-order term), the Slutsky theorem and the central limit theorem of Billingsley (1961). Furthermore, we have
Then using \(T^{(\mu -\eta )/2} \gamma _T \lambda ^{-1}_T \rightarrow \infty \), we have
We now pick \(k \in {{\mathcal {S}}}\) and consider the event \(\{i \in {\hat{{{\mathcal {A}}}}}_k\}\). Then the Karush–Kuhn–Tucker conditions for \({{\mathbb {G}}}_T \psi ({\hat{\varvec{\theta }}})\) are given by
Using the same reasoning as previously, \(T^{1/2}({\dot{{{\mathbb {G}}}}}_T l({\hat{\varvec{\theta }}}))_{(k),i}\) is also asymptotically normal, and \({\tilde{\varvec{\theta }}}^{(k)} \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \varvec{\theta }^{(k)}_0\) for \(k \in {{\mathcal {S}}}\), and besides
so that we obtain the same conclusion when adding \(\gamma _T T^{-1/2}\xi _{T,k} \frac{{\hat{\theta }}^{(k)}_i}{\Vert {\hat{\varvec{\theta }}}^{(k)}\Vert _2}\). Therefore, we have, for any \(k \in {{\mathcal {S}}}\) and \(i \notin {{\mathcal {A}}}_k\),
We have proved (16). \(\square \)
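The selection mechanism underlying this consistency result can be illustrated in the simplest possible setting. The sketch below is a toy example for the Lasso part alone under an orthonormal design (all names and parameter values are illustrative assumptions, not the estimator studied in the paper): the adaptive solution is a coordinate-wise soft-thresholding whose data-driven thresholds \(\lambda |{\tilde{\theta }}_i|^{-\eta }\) blow up on true zeros and vanish on signal coordinates.

```python
import numpy as np

def adaptive_soft_threshold(theta_tilde, lam, eta=1.0):
    """Orthonormal-design adaptive Lasso: coordinate-wise soft-thresholding
    with data-driven thresholds lam * |theta_tilde_i|^{-eta}. Coordinates
    with small preliminary estimates face a huge threshold and are set
    exactly to zero; large coordinates are shrunk by a vanishing amount."""
    thresh = lam * np.abs(theta_tilde) ** (-eta)
    return np.sign(theta_tilde) * np.maximum(np.abs(theta_tilde) - thresh, 0.0)

# Preliminary (e.g. least-squares) estimates: two signal coordinates
# followed by two noise coordinates near zero.
theta_tilde = np.array([2.0, -1.5, 0.05, -0.02])
theta_hat = adaptive_soft_threshold(theta_tilde, lam=0.01)
```

Here the noise coordinates are thresholded exactly to zero while the signal coordinates are barely shrunk, mirroring the support recovery established above.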
Cite this article
Poignard, B. Asymptotic theory of the adaptive Sparse Group Lasso. Ann Inst Stat Math 72, 297–328 (2020). https://doi.org/10.1007/s10463-018-0692-7