Abstract
We study the asymptotic properties of a new version of the Sparse Group Lasso estimator (SGL), called adaptive SGL. This new version includes two distinct regularization parameters, one for the Lasso penalty and one for the Group Lasso penalty, and we consider the adaptive version of this regularization, where both penalties are weighted by preliminary random coefficients. The asymptotic properties are established in a general framework, where the data are dependent and the loss function is convex. We prove that this estimator satisfies the oracle property: the sparsity-based estimator recovers the true underlying sparse model and is asymptotically normally distributed. We also study its asymptotic properties in a double-asymptotic framework, where the number of parameters diverges with the sample size. We show by simulations and on real data that the adaptive SGL outperforms other oracle-like methods in terms of estimation precision and variable selection.
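As a concrete reading of the penalty described above, the following sketch evaluates an adaptive SGL penalty with the two regularization parameters \(\lambda \) (Lasso part) and \(\gamma \) (Group Lasso part), using the preliminary-estimator weights \(\alpha ^{(k)}_i = |{\tilde{\theta }}^{(k)}_i|^{-\eta }\) and \(\xi _k = \Vert {\tilde{\varvec{\theta }}}^{(k)}\Vert ^{-\mu }_2\) that appear in the appendix; the function and variable names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def adaptive_sgl_penalty(theta, groups, lam, gam, theta_tilde, eta=1.0, mu=1.0):
    """Adaptive SGL penalty with two regularization parameters:
    lam * sum_k sum_i alpha_{k,i} |theta_{k,i}|      (weighted Lasso part)
    + gam * sum_k xi_k ||theta^{(k)}||_2             (weighted Group Lasso part),
    where alpha_{k,i} = |theta_tilde_{k,i}|^{-eta} and
    xi_k = ||theta_tilde^{(k)}||_2^{-mu} come from a preliminary estimator
    (assumed to have no exactly zero entries)."""
    total = 0.0
    for idx in groups:                      # idx: coordinate indices of group k
        t, t0 = theta[idx], theta_tilde[idx]
        alpha = np.abs(t0) ** (-eta)        # coordinate-wise adaptive weights
        xi = np.linalg.norm(t0) ** (-mu)    # group-wise adaptive weight
        total += lam * np.sum(alpha * np.abs(t)) + gam * xi * np.linalg.norm(t)
    return total

# Toy example: two groups, with a preliminary estimate theta_tilde.
theta = np.array([1.0, 0.0, 2.0])
theta_tilde = np.array([1.0, 1.0, 2.0])
groups = [np.array([0, 1]), np.array([2])]
pen = adaptive_sgl_penalty(theta, groups, lam=1.0, gam=1.0, theta_tilde=theta_tilde)
```

Coordinates with small preliminary estimates receive large weights, which is the mechanism behind the oracle property studied below.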
References
Andersen, P. K., Gill, R. D. (1982). Cox’s regression model for counting processes: A large sample study. The Annals of Statistics, 10(4), 1100–1120.
Bertsekas, D. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Billingsley, P. (1961). The Lindeberg–Levy theorem for martingales. Proceedings of the American Mathematical Society, 12, 788–792.
Billingsley, P. (1995). Probability and measure. New York: Wiley.
Bühlmann, P., van de Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. Springer Series in Statistics. Berlin: Springer.
Chernozhukov, V. (2005). Extremal quantile regression. The Annals of Statistics, 33(2), 806–839.
Chernozhukov, V., Hong, H. (2004). Likelihood estimation and inference in a class of nonregular econometric models. Econometrica, 72(5), 1445–1480.
Davis, R. A., Knight, K., Liu, J. (1992). M-estimation for autoregressions with infinite variance. Stochastic Processes and Their Applications, 40, 145–180.
Fan, J. (1997). Comments on wavelets in statistics: A review by A. Antoniadis. Journal of the Italian Statistical Association, 6, 131–138.
Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
Fan, J., Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32(3), 928–961.
Francq, C., Thieu, L. Q. (2015). QML inference for volatility models with covariates. MPRA paper no. 63198.
Francq, C., Zakoïan, J. M. (2010). GARCH models. Chichester: Wiley.
Fu, W. J. (1998). Penalized regression: the Bridge versus the Lasso. Journal of Computational and Graphical Statistics, 7, 397–416.
Geyer, C. J. (1996). On the asymptotics of convex stochastic optimization. Unpublished manuscript.
Hjort, N. L., Pollard, D. (1993). Asymptotics for minimisers of convex processes. Unpublished manuscript.
Huber, P. J. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 1(5), 799–821.
Hunter, D. R., Li, R. (2005). Variable selection using MM algorithms. The Annals of Statistics, 33(4), 1617–1642.
Kato, K. (2009). Asymptotics for argmin processes: Convexity arguments. Journal of Multivariate Analysis, 100, 1816–1829.
Knight, K., Fu, W. (2000). Asymptotics for Lasso-type estimators. The Annals of Statistics, 28(5), 1356–1378.
Li, X., Mo, L., Yuan, X., Zhang, J. (2014). Linearized alternating direction method of multipliers for Sparse Group and Fused Lasso models. Computational Statistics and Data Analysis, 79, 203–221.
Nardi, Y., Rinaldo, A. (2008). On the asymptotic properties of the Group Lasso estimator for linear models. Electronic Journal of Statistics, 2, 605–633.
Neumann, M. H. (2013). A central limit theorem for triangular arrays of weakly dependent random variables, with applications in statistics. ESAIM: Probability and Statistics, 17, 120–134.
Newey, W. K., Powell, J. L. (1987). Asymmetric least squares estimation and testing. Econometrica, 55(4), 819–847.
Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory, 7(2), 186–199.
Racine, J. (2000). Consistent cross-validatory model-selection for dependent data: hv-block cross-validation. Journal of Econometrics, 99, 39–61.
Rio, E. (2013). Inequalities and limit theorems for weakly dependent sequences. Lecture notes, 3ème cycle, cel-00867106, 170.
Rockafellar, R. T. (1970). Convex analysis. Princeton: Princeton University Press.
Shiryaev, A. N. (1991). Probability. Berlin: Springer.
Simon, N., Friedman, J., Hastie, T., Tibshirani, R. (2013). A Sparse Group Lasso. Journal of Computational and Graphical Statistics, 22(2), 231–245.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B, 58(1), 267–288.
Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using \(l^1\)-constrained quadratic programming. IEEE Transactions on Information Theory, 55(5), 2183–2202.
Wellner, J. A., van der Vaart, A. W. (1996). Weak convergence and empirical processes. With applications to statistics. New York, NY: Springer.
Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society. Series B, 68(1), 49–67.
Zou, H. (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.
Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301–320.
Zou, H., Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 37(4), 1733–1751.
Acknowledgements
I would like to thank Alexandre Tsybakov, Arnak Dalalyan, Jean-Michel Zakoïan and Christian Francq for all the theoretical references they provided, and I warmly thank Jean-David Fermanian for his significant help and helpful comments. I gratefully acknowledge the support of the Ecodec Laboratory and of the Japan Society for the Promotion of Science.
Appendix
We first introduce some preliminary results. The dependent setting requires more sophisticated probabilistic tools to derive asymptotic results than the i.i.d. case. Assumptions 1 and 4 allow us to use the central limit theorem of Billingsley (1961). We restate this result, which appears as a corollary in Billingsley (1961).
Corollary 1
(Billingsley 1961) If \((x_t,{{\mathcal {F}}}_t)\) is a stationary and ergodic sequence of square integrable martingale increments such that \(\sigma ^2_x = \text {Var}(x_t) \ne 0\), then \(T^{-1/2} \sum ^{T}_{t=1} x_t \overset{d}{\rightarrow } {{\mathcal {N}}}(0,\sigma ^2_x)\).
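Corollary 1 can be checked numerically. The sketch below (an illustration under assumed parameter values, not part of the paper) simulates an ARCH(1) sequence, a standard example of a stationary, ergodic, square-integrable martingale difference, and verifies that \(T^{-1/2}\sum _t x_t\) has approximately the stationary variance \(a_0/(1-a_1)\).

```python
import numpy as np

rng = np.random.default_rng(0)

def arch_md(T, a0=0.2, a1=0.3, burn=200):
    """Simulate x_t = sigma_t * z_t with sigma_t^2 = a0 + a1 * x_{t-1}^2, an
    ARCH(1) process: a stationary, ergodic, square-integrable martingale
    difference sequence with Var(x_t) = a0 / (1 - a1)."""
    z = rng.standard_normal(T + burn)
    x = np.zeros(T + burn)
    for t in range(1, T + burn):
        x[t] = np.sqrt(a0 + a1 * x[t - 1] ** 2) * z[t]
    return x[burn:]          # drop the burn-in to approximate stationarity

# Monte Carlo check of Corollary 1: T^{-1/2} sum_t x_t is approximately
# N(0, sigma_x^2) with sigma_x^2 = a0 / (1 - a1).
T, R = 1500, 300
sums = np.array([arch_md(T).sum() / np.sqrt(T) for _ in range(R)])
sigma2 = 0.2 / (1 - 0.3)
```

The empirical mean and variance of `sums` should be close to 0 and `sigma2`, even though the \(x_t\) are dependent.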
Note that the square-integrable martingale difference condition can be replaced by \(\alpha \)-mixing and moment conditions. For instance, Rio (2013) provides a central limit theorem for strongly mixing stationary sequences.
To prove Theorem 1, we recall Theorem II.1 of Andersen and Gill (1982), which shows that pointwise convergence in probability of random concave functions implies uniform convergence on compact subsets.
Theorem 9
(Andersen and Gill 1982) Let E be an open convex subset of \({{\mathbb {R}}}^p\), and let \(F_1, F_2,\ldots ,\) be a sequence of random concave functions on E such that \(F_n(x) \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} f(x)\) for every \(x \in E\), where f is some real function on E. Then f is also concave, and for all compact \(A \subset E\), \(\underset{x \in A}{\sup }\, |F_n(x) - f(x)| \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} 0.\)
The proof of this theorem is based on a diagonal argument and Theorem 10.8 of Rockafellar (1970): the pointwise convergence of concave random functions on a dense and countable subset of an open set implies uniform convergence on any compact subset of that open set. The following corollary then holds.
Corollary 2
(Andersen and Gill 1982) Assume \(F_n(x) \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} f(x)\), for every \(x \in E\), an open convex subset of \({{\mathbb {R}}}^p\). Suppose f has a unique maximum at \(x_0 \in E\). Let \({\hat{X}}_n\) maximize \(F_n\). Then \({\hat{X}}_n \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} x_0\).
Newey and Powell (1987) use a similar theorem to prove the consistency of asymmetric least squares estimators without any compactness assumption on \(\varTheta \). We apply these results in our framework, where the parameter set \(\varTheta \) is assumed to be convex.
We use a convexity argument to derive the asymptotic distribution of the SGL estimator. Chernozhukov and Hong (2004) and Chernozhukov (2005) use this argument to obtain the asymptotic distribution of quantile-regression-type estimators. It relies on the convexity lemma, a key result for obtaining an asymptotic distribution when the objective function is not differentiable; it only requires lower semicontinuity and convexity of the empirical criterion. The convexity lemma, as in the proof of Theorem 4.1 of Chernozhukov (2005), can be stated as follows:
Lemma 1
(Chernozhukov 2005) Suppose
- (i)
a sequence of convex lower-semicontinuous \({{\mathbb {F}}}_T: {{\mathbb {R}}}^d \rightarrow {\bar{{{\mathbb {R}}}}}\) marginally converges to \({{\mathbb {F}}}_{\infty }: {{\mathbb {R}}}^d \rightarrow {\bar{{{\mathbb {R}}}}}\) over a dense subset of \({{\mathbb {R}}}^d\);
- (ii)
\({{\mathbb {F}}}_{\infty }\) is finite over a non-empty open set \(E \subset {{\mathbb {R}}}^d\);
- (iii)
\({{\mathbb {F}}}_{\infty }\) is uniquely minimized at a random vector \(\varvec{u}_{\infty }\).
Then \(\underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min }\, \{{{\mathbb {F}}}_T(\varvec{u})\} \overset{d}{\longrightarrow } \varvec{u}_{\infty }\).
This is a key argument used in Theorem 3, Proposition 1 and Theorem 5.
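A minimal numerical illustration of why no smoothness is needed (a toy example under assumed distributions, not the paper's criterion): the least-absolute-deviation criterion is convex and lower-semicontinuous but not differentiable, yet pointwise convergence of the criterion plus uniqueness of the limit minimizer is enough for the minimizer, the sample median, to converge.

```python
import numpy as np

rng = np.random.default_rng(1)

# The criterion G_T(theta) = (1/T) sum_t |x_t - theta| is convex and
# lower-semicontinuous but not differentiable at the data points. Its
# pointwise limit E|x - theta| is uniquely minimized at the population
# median, so the argmin (the sample median) converges -- exactly the
# mechanism of the convexity lemma, with no smoothness required.
estimates = {}
for T in (100, 10_000):
    x = rng.standard_normal(T) + 0.5        # population median = 0.5
    estimates[T] = np.median(x)             # argmin of the convex criterion
```

With the larger sample, the minimizer of the nondifferentiable criterion is close to the population median 0.5.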
When we consider a diverging number of parameters, the empirical criterion can be viewed as a sequence of dependent triangular arrays, for which we need refined asymptotic results. Shiryaev (1991) proposed a version of the central limit theorem for dependent arrays, provided the sequence is a square-integrable martingale difference satisfying the so-called Lindeberg condition. A similar theorem can be found in Billingsley (1995, Theorem 35.12, p. 476). We state here the theorem of Shiryaev (Theorem 4, p. 543 of Shiryaev 1991) that we use to derive the asymptotic distribution of the adaptive SGL estimator.
Theorem 10
(Shiryaev 1991) Let a sequence of square integrable martingale differences \(\xi ^n = (\xi _{nk},{{\mathcal {F}}}^n_k),n \ge 1\), with \({{\mathcal {F}}}^n_k = \sigma (\xi _{ns},s \le k)\), satisfy the Lindeberg condition: for any \(0<t\le 1\) and every \(\epsilon > 0\), \(\overset{\lfloor nt \rfloor }{\underset{k=0}{\sum }} {{\mathbb {E}}}[\xi ^2_{nk} {\mathbf {1}}_{\{|\xi _{nk}| > \epsilon \}} | {{\mathcal {F}}}^n_{k-1} ] \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} 0;\)
then if \(\overset{\lfloor nt \rfloor }{\underset{k=0}{\sum }} {{\mathbb {E}}}[\xi ^2_{nk}| {{\mathcal {F}}}^n_{k-1} ] \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} \sigma ^2_t\), or \(\overset{\lfloor nt \rfloor }{\underset{k=0}{\sum }} \xi ^2_{nk} \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} \sigma ^2_t\), then \(\overset{\lfloor nt \rfloor }{\underset{k=0}{\sum }} \xi _{nk} \overset{d}{\longrightarrow } {{\mathcal {N}}}(0,\sigma ^2_t).\)
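Theorem 10 can be illustrated on the simplest triangular array (a hedged toy sketch with i.i.d. Gaussian increments, chosen for transparency): \(\xi _{nk} = z_k/\sqrt{n}\), for which the Lindeberg condition holds and \(\sigma ^2_t = t\,\text {Var}(z)\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Triangular array of martingale differences xi_{nk} = z_k / sqrt(n), with
# z_k i.i.d. standard normal. Then sum_{k <= nt} xi_{nk}^2 -> t in
# probability, so by the theorem the partial sum at time t is asymptotically
# N(0, sigma_t^2) with sigma_t^2 = t.
n, R, t = 4000, 400, 0.5
m = int(n * t)
partial_sums = np.array(
    [rng.standard_normal(m).sum() / np.sqrt(n) for _ in range(R)]
)
```

The Monte Carlo mean and variance of `partial_sums` should be close to 0 and \(t = 0.5\), matching the limit \({{\mathcal {N}}}(0,\sigma ^2_t)\).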
There exist central limit results relaxing the stationarity and martingale difference assumptions for sequences of arrays. Neumann (2013) proposed such a central limit theorem for weakly dependent sequences of arrays. Such sequences should also satisfy a Lindeberg condition and conditions on covariances. Equipped with these preliminary results, we now report the proofs of Sect. 4.
Proof of Theorem 1
By definition, \({\hat{\varvec{\theta }}} = \underset{\varvec{\theta }\in \varTheta }{\arg \, \min } \, \{{{\mathbb {G}}}_T \varphi (\varvec{\theta })\}\). As a first step, we prove the uniform convergence of \({{\mathbb {G}}}_T \varphi (.)\) to the limit quantity \({{\mathbb {G}}}_{\infty }\varphi (.)\) on any compact set \(\varvec{B}\subset \varTheta \), that is, \(\underset{\varvec{\theta }\in \varvec{B}}{\sup }\, |{{\mathbb {G}}}_T \varphi (\varvec{\theta }) - {{\mathbb {G}}}_{\infty }\varphi (\varvec{\theta })| \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} 0.\) (7)
We define an open convex set \({{\mathcal {C}}}\subset \varTheta \) and pick \(\varvec{x}\in {{\mathcal {C}}}\). Then, by Assumption 1, the law of large numbers implies
Consequently, if \(\lambda _T / T \rightarrow \lambda _0 \ge 0\) and \(\gamma _T / T \rightarrow \gamma _0 \ge 0\), we obtain the pointwise convergence
By Theorem 9 of Andersen and Gill (1982), \({{\mathbb {G}}}_{\infty } \varphi (.)\) is a convex function, and we deduce the desired uniform convergence over any compact subset of \(\varTheta \), that is, (7).
We now want to show that \(\arg \, \min \, \{{{\mathbb {G}}}_T \varphi (.)\} \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \arg \, \min \, \{{{\mathbb {G}}}_{\infty } \varphi (.)\}\). By Assumption 3, \(\varphi (.)\) is convex, which implies
Consequently, \(\arg \, \min \{{{\mathbb {G}}}_T \varphi (\varvec{x})\} = O(1)\), so that \({\hat{\varvec{\theta }}} \in {{\mathcal {B}}}_o(\varvec{\theta }_0,C)\) with probability approaching one for C large enough, where \({{\mathcal {B}}}_o(\varvec{\theta }_0,C)\) is an open ball centered at \(\varvec{\theta }_0\) with radius C. Furthermore, as \({{\mathbb {G}}}_{\infty } \varphi (.)\) is convex and continuous, \(\underset{\varvec{x}\in B}{\arg \, \min } \, \{{{\mathbb {G}}}_{\infty } \varphi (\varvec{x})\}\) exists and is unique. Then, by Corollary 2 of Andersen and Gill (1982), we obtain
\(\square \)
Proof of Theorem 2
We denote \(\nu _T = T^{-1/2} + \lambda _T T^{-1} a + \gamma _T T^{-1} b\), with \(a = \text {card}({{\mathcal {A}}})(\underset{k}{\max } \; \alpha _k)\) and \(b = \text {card}({{\mathcal {A}}})(\underset{l}{\max } \; \xi _l)\). We would like to prove that for any \(\varvec{\epsilon }> 0\), there exists \(C_{\varvec{\epsilon }} > 0\) such that \({{\mathbb {P}}}(\nu ^{-1}_T\Vert {\hat{\varvec{\theta }}} - \varvec{\theta }_0\Vert > C_{\varvec{\epsilon }}) < \varvec{\epsilon }\). We have
\(\Vert \varvec{u}\Vert _2\) can potentially be large as it represents the discrepancy \({\hat{\varvec{\theta }}}-\varvec{\theta }_0\) normalized by \(\nu _T\). Now based on the convexity of the objective function, we have
a relationship that allows us to work with a fixed \(\Vert \varvec{u}\Vert _2\). Let us define \(\varvec{\theta }_1 = \varvec{\theta }_0 + \nu _T \varvec{u}^*\) such that \({{\mathbb {G}}}_T \varphi (\varvec{\theta }_1) \le {{\mathbb {G}}}_T \varphi (\varvec{\theta }_0)\). Let \(\alpha \in (0,1)\) and \(\varvec{\theta }= \alpha \varvec{\theta }_1 + (1-\alpha ) \varvec{\theta }_0\). Then by convexity of \({{\mathbb {G}}}_T \varphi (.)\), we obtain
We pick \(\alpha \) such that \(\Vert {\bar{\varvec{u}}}\Vert = C_{\varvec{\epsilon }}\) with \({\bar{\varvec{u}}} := \alpha \varvec{\theta }_1 + (1-\alpha ) \varvec{\theta }_0\). Hence (8) holds, which implies
Hence, we pick a \(\varvec{u}\) such that \(\Vert \varvec{u}\Vert _2 = C_{\varvec{\epsilon }}\). Using \(\varvec{p}_1(\lambda _T,\alpha ,0) = 0\) and \(\varvec{p}_2(\gamma _T,\xi ,0) = 0\), by a Taylor expansion to \({{\mathbb {G}}}_T l(\varvec{\theta }_0 + \nu _T \varvec{u})\), we obtain
where \({\bar{\varvec{\theta }}}\) is defined as \(\Vert {\bar{\varvec{\theta }}} - \varvec{\theta }_0\Vert \le \Vert \varvec{\theta }_T - \varvec{\theta }_0\Vert \). We want to prove
where \({{\mathcal {R}}}_T(\varvec{\theta }_0) = \overset{d}{\underset{k,l=1}{\sum }} \varvec{u}_k \varvec{u}_l \{\partial ^2_{\theta _k \theta _l} {{\mathbb {G}}}_T l(\varvec{\theta }_0) - {{\mathbb {E}}}[\partial ^2_{\theta _k \theta _l} {{\mathbb {G}}}_T l(\varvec{\theta }_0)]\}\). By Assumption 1, \((\varvec{\epsilon }_t)\) is a non-anticipative stationary solution and is ergodic. As a square integrable martingale difference by Assumption 4,
by the central limit theorem of Billingsley (1961), which implies \({\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) \varvec{u}= O_p(T^{-1/2} \varvec{u}' {{\mathbb {M}}}\varvec{u})\). By the ergodic theorem of Billingsley (1995), we have
This implies \({{\mathcal {R}}}_T(\varvec{\theta }_0) = o_p(1)\). Furthermore, by the Markov inequality, for \(b > 0\)
where \(\eta (C_{\varvec{\epsilon }})\) is defined in Assumption 6. We now focus on the penalty terms. As \(\varvec{p}_1(\lambda _T,\alpha ,0)=0\), for the \(l^1\) norm penalty, we have
As for the \(l^1/l^2\) norm, we obtain
Then denoting by \(\delta _T = \lambda _{\min }({{\mathbb {H}}}) C^2_{\varvec{\epsilon }} \nu _T/2\), and using \(\frac{\nu _T}{2} {{\mathbb {E}}}[\varvec{u}' \ddot{{{\mathbb {G}}}}_T l(\varvec{\theta }_0) \varvec{u}] \ge \delta _T\), we deduce that (9) can be bounded as
We also have for \(C_{\varvec{\epsilon }}\) and T large enough, and using norm equivalences that
Moreover, if \(\nu _T = T^{-1/2} + \lambda _T T^{-1} a + \gamma _T T^{-1} b\), then for \(C_{\varvec{\epsilon }}\) large enough
Moreover
where \(C_{st} > 0\) is a generic constant. We obtain
for \(C_{\varvec{\epsilon }}\) and T large enough. We then deduce \(\Vert {\hat{\varvec{\theta }}} - \varvec{\theta }_0\Vert = O_p(\nu _T)\). \(\square \)
Proof of Theorem 3
Let \(\varvec{u}\in {{\mathbb {R}}}^d\) be such that \(\varvec{\theta }= \varvec{\theta }_0 + \varvec{u}/T^{1/2}\), and define the empirical criterion \({{\mathbb {F}}}_T(\varvec{u}) = T {{\mathbb {G}}}_T (\varphi (\varvec{\theta }_0 + \varvec{u}/T^{1/2}) - \varphi (\varvec{\theta }_0))\). First, we prove the convergence of the finite-dimensional distributions of \({{\mathbb {F}}}_T\) to those of \({{\mathbb {F}}}_{\infty }\). Then we use the convexity of \({{\mathbb {F}}}_T(.)\) to obtain the convergence in distribution of the \(\arg \, \min \) of the empirical criterion to the \(\arg \, \min \) of the limit process. To do so, let \(\varvec{u}= \sqrt{T}(\varvec{\theta }- \varvec{\theta }_0)\). We have
where \({{\mathbb {F}}}_T(.)\) is convex and belongs to \(C^0({{\mathbb {R}}}^d)\). We now prove the convergence of the finite-dimensional distributions of \({{\mathbb {F}}}_T\) to those of \({{\mathbb {F}}}_{\infty }\) in order to apply Lemma 1. For the \(l^1\) penalty, for any group k, we have, for T sufficiently large,
which implies that
under the condition that \(\lambda _T / \sqrt{T} \rightarrow \lambda _0\). As for the \(l^1/l^2\) quantity, for any group l, we have
Consequently, if \(\gamma _T T^{-1/2} \rightarrow \gamma _0 \ge 0\), we obtain
Now for the unpenalized criterion \({{\mathbb {G}}}_T l(.)\), by a Taylor expansion, we have
where \({\bar{\varvec{\theta }}}\) satisfies \(\Vert {\bar{\varvec{\theta }}} - \varvec{\theta }_0\Vert \le \Vert \varvec{u}\Vert /\sqrt{T}\). Then, by Assumption 4 and the central limit theorem of Billingsley (1961), \(\sqrt{T} {\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) \overset{d}{\longrightarrow } {{\mathcal {N}}}(0,{{\mathbb {M}}})\), and by the ergodic theorem, \(\ddot{{{\mathbb {G}}}}_T l(\varvec{\theta }_0) \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} {{\mathbb {H}}}\). Furthermore, we have by Assumption 6
for C large enough, such that \(\upsilon _t(C) = \underset{k,l,m=1,\ldots ,d}{\sup } \{ \underset{\varvec{\theta }:\Vert \varvec{\theta }-\varvec{\theta }_0\Vert _2 \le \nu _T C}{\sup } |\partial ^3_{\theta _k \theta _l \theta _m} l(\varvec{\epsilon }_t;\varvec{\theta })|\}\) with \(\nu _T = T^{-1/2} + \lambda _T T^{-1} a_T + \gamma _T T^{-1} b_T\). We deduce \(\nabla '\{\varvec{u}' \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}}) \varvec{u}\} \varvec{u}= O_p(\Vert \varvec{u}\Vert ^3_2 \eta (C))\). We obtain
We have thus proved that \({{\mathbb {F}}}_T(\varvec{u}) \overset{d}{\longrightarrow } {{\mathbb {F}}}_{\infty }(\varvec{u})\) for fixed \(\varvec{u}\). Let us observe that
and \({{\mathbb {F}}}_T(.)\) admits the minimizer \(\varvec{u}^*_T = \sqrt{T}({\hat{\varvec{\theta }}} - \varvec{\theta }_0)\). As \({{\mathbb {F}}}_T\) is convex and \({{\mathbb {F}}}_{\infty }\) is continuous, convex and has a unique minimum by Assumption 5, the convexity lemma (Lemma 1) yields
\(\square \)
Proof of Proposition 1
In Theorem 3, we proved \(\sqrt{T}({\hat{\varvec{\theta }}} - \varvec{\theta }_0) := \underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min } \{{{\mathbb {F}}}_T\} \overset{d}{\longrightarrow } \underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min } \{{{\mathbb {F}}}_{\infty }\}\) for \(\lambda _T/\sqrt{T} \rightarrow \lambda _0\) and \(\gamma _T / \sqrt{T} \rightarrow \gamma _0\). The limit random function is
First, let us observe that
Both sets describing \(\{{\hat{{{\mathcal {A}}}}} = {{\mathcal {A}}}\}\) are symmetric, and thus we can focus on
Hence
Denoting by \( \varvec{u}^*:= \underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min } \{{{\mathbb {F}}}_{\infty }(\varvec{u})\}\), Theorem 3 corresponds to \(\sqrt{T}({\hat{\varvec{\theta }}}_{{{\mathcal {A}}}} - \varvec{\theta }_{0,{{\mathcal {A}}}}) \overset{d}{\longrightarrow } \varvec{u}^*_{{{\mathcal {A}}}}\). By the Portmanteau theorem (see Wellner and van der Vaart 1996), we have
as \(\varvec{\theta }_{0,{{\mathcal {A}}}^c} = {\mathbf {0}}\). Consequently, we need to prove that the probability on the right-hand side is strictly less than 1; it is upper-bounded by
If \(\lambda _0 = \gamma _0 = 0\), then \(\varvec{u}^* = -{{\mathbb {H}}}^{-1} \varvec{Z}\) so that \({{\mathbb {P}}}_{\varvec{u}^*} = {{\mathcal {N}}}(0,{{\mathbb {H}}}^{-1} {{\mathbb {M}}}{{\mathbb {H}}}^{-1})\). Hence \(c = 0\).
If \(\lambda _0 \ne 0\) or \(\gamma _0 \ne 0\), the necessary and sufficient optimality conditions for a group k tell us that \(\varvec{u}^*\) satisfies
where \(\varvec{w}^{(k)}\) and \(\varvec{z}^{(k)}\) are the subgradients of \(\Vert \varvec{u}^{(k)}\Vert _1\) and \(\Vert \varvec{u}^{(k)}\Vert _2\) given by
and \(\varvec{p}^{(k)}_i = \partial _{\varvec{u}_i} \{ |\varvec{u}^{(k)}_i| {\mathbf {1}}_{\theta ^{(k)}_{0,i} = 0} + \varvec{u}^{(k)}_i \text {sgn}(\theta ^{(k)}_{0,i}){\mathbf {1}}_{\theta ^{(k)}_{0,i} \ne 0} \}\).
If \(\varvec{u}^{(m) *} = 0, \forall m \notin {{\mathcal {S}}}\), then the optimality conditions (11) become
with \(\tau _{{{\mathcal {S}}}} = \text {vec}(k \in {{\mathcal {S}}}, \alpha _k \varvec{p}^{(k)})\) and \(\zeta _{{{\mathcal {S}}}} = \text {vec}(k \in {{\mathcal {S}}}, \xi _k \frac{\varvec{\theta }^{(k)}_0}{\Vert \varvec{\theta }^{(k)}_0\Vert _2})\), which are vectors of \({{\mathbb {R}}}^{\text {card}({{\mathcal {S}}})}\).
For \(k \in {{\mathcal {S}}}\), that is, when the vector \(\varvec{\theta }^{(k)}_0\) has at least one nonzero component, we have
Consequently, if \(\varvec{u}^{(k) *}_i = 0, \forall i \in {{\mathcal {A}}}^c_k\), with \(k \in {{\mathcal {S}}}\), then the conditions (13) become
Combining relationships in (12), we obtain
The same reasoning applies for active groups with inactive components, so that combining relationships in (13), we obtain
Hence we deduce
Under the assumption that \(\lambda _0 < \infty \) and \(\gamma _0 < \infty \), we obtain
Thus \(c < 1\), which proves (10), that is, Proposition 1. \(\square \)
Proof of Theorem 4
The proof relies on the same steps as in the proof of Theorem 2.
\(\square \)
Proof of Theorem 5
We start with the asymptotic distribution and proceed as in the proof of Theorem 3, where we used Lemma 1. To do so, we prove the finite dimensional convergence in distribution of the empirical criterion \({{\mathbb {F}}}_T(\varvec{u})\) to \({{\mathbb {F}}}_{\infty }(\varvec{u})\) with \(\varvec{u}\in {{\mathbb {R}}}^d\), where these quantities are, respectively, defined as
and
with \(\varvec{Z}_{{{\mathcal {A}}}} \sim {{\mathcal {N}}}(0,{{\mathbb {M}}}_{{{\mathcal {A}}}{{\mathcal {A}}}})\). By Lemma 1, the finite-dimensional convergence in distribution implies \(\underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min }\{{{\mathbb {F}}}_T(\varvec{u})\} \overset{d}{\longrightarrow } \underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min }\{{{\mathbb {F}}}_{\infty }(\varvec{u})\}\). We first consider the unpenalized part of \({{\mathbb {F}}}_T(.)\), which can be expanded as
where \({\bar{\varvec{\theta }}}\) lies between \(\varvec{\theta }_0\) and \(\varvec{\theta }_0 + \varvec{u}/\sqrt{T}\). First, using the same reasoning on the third-order term, we obtain \(\frac{1}{6 \sqrt{T}} \nabla '\{\varvec{u}' \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}}) \varvec{u}\}\varvec{u}\overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} 0\). By the ergodic theorem, we deduce \(\ddot{{{\mathbb {G}}}}_T l(\varvec{\theta }_0) \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} {{\mathbb {H}}}\), and by Assumption 4, \(\sqrt{T}{\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) \overset{d}{\longrightarrow } {{\mathcal {N}}}(0,{{\mathbb {M}}})\).
We now focus on the penalty terms of (4). Recall that \(\alpha ^{(k)}_{T,i} = |{\tilde{\theta }}^{(k)}_i|^{-\eta }\), so that for \(i \in {{\mathcal {A}}}_k, k \in {{\mathcal {S}}}\), \({\tilde{\theta }}^{(k)}_i \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \theta ^{(k)}_{0,i} \ne 0\). Note that
This implies that, for \(i \in {{\mathcal {A}}}_k\), \(k \in {{\mathcal {S}}}\), we have
under the condition \(\lambda _T T^{-1/2} \rightarrow 0\). For \(i \in {{\mathcal {A}}}^c_k\), \(\theta ^{(k)}_{0,i} = 0\), then \(T^{\eta /2} (|{\tilde{\theta }}^{(k)}_i|)^{\eta } = O_p(1)\). Hence under the assumption \(\lambda _T T^{(\eta -1)/2} \rightarrow \infty \), we obtain
As for the \(l^1/l^2\) quantity, recall that \(\xi _{T,l} = \Vert {\tilde{\varvec{\theta }}}^{(l)}\Vert ^{-\mu }_2\), so that for \(l \in {{\mathcal {S}}}\), \({\tilde{\varvec{\theta }}}^{(l)} \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \varvec{\theta }^{(l)}_0\), and in this case
Consequently, using \(\gamma _T T^{-1/2} \rightarrow 0\), and for \(l \in {{\mathcal {S}}}\), we obtain
Combining the fact that \(k \in {{\mathcal {S}}}\) with \(\varvec{\theta }^{(k)}_0\) being partially zero, that is, \(i \in {{\mathcal {A}}}^c_k\), we obtain the divergence given in (15). Furthermore, if \(l \notin {{\mathcal {S}}}\), that is, \(\varvec{\theta }^{(l)}_0 = 0\), then
and \(T^{\mu /2} (\Vert {\tilde{\varvec{\theta }}}^{(l)}\Vert _2)^{\mu } = O_p(1)\). Then by \(\gamma _T T^{(\mu -1)/2} \rightarrow \infty \) we have
We deduce the pointwise convergence \({{\mathbb {F}}}_T (\varvec{u}) \overset{d}{\longrightarrow } {{\mathbb {F}}}_{\infty }(\varvec{u})\), where \({{\mathbb {F}}}_{\infty }(.)\) is given in (14). As \({{\mathbb {F}}}_T(.)\) is convex and \({{\mathbb {F}}}_{\infty }(.)\) is convex and has a unique minimum \(({{\mathbb {H}}}^{-1}_{{{\mathcal {A}}}{{\mathcal {A}}}} \varvec{Z}_{{{\mathcal {A}}}},{\mathbf {0}}_{{{\mathcal {A}}}^c})\) since \({{\mathbb {H}}}\) is positive definite, by Lemma 1, we obtain
that is to say \(\sqrt{T}({\hat{\varvec{\theta }}}_{{{\mathcal {A}}}} - \varvec{\theta }_{0,{{\mathcal {A}}}}) \overset{d}{\longrightarrow } {{\mathbb {H}}}^{-1}_{{{\mathcal {A}}}{{\mathcal {A}}}} \varvec{Z}_{{{\mathcal {A}}}}, \; \text {and} \; \sqrt{T}({\hat{\varvec{\theta }}}_{{{\mathcal {A}}}^c} - \varvec{\theta }_{0,{{\mathcal {A}}}^c}) \overset{d}{\longrightarrow } \mathbf {0}_{{{\mathcal {A}}}^c}\).
We now prove the model selection consistency. Let \(i \in {{\mathcal {A}}}_k\); then, by the asymptotic normality result, \({\hat{\theta }}^{(k)}_i \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \theta ^{(k)}_{0,i}\), which implies \({{\mathbb {P}}}(i \in {\hat{{{\mathcal {A}}}}}_k) \rightarrow 1\). It thus remains to prove
This problem can be split into two parts as
Let us start with the case \(k \notin {{\mathcal {S}}}\). If \(k \in {\hat{{{\mathcal {S}}}}}\), by the optimality conditions given by the Karush–Kuhn–Tucker theorem applied to \({{\mathbb {G}}}_T \psi ({\hat{\varvec{\theta }}})\), we have
where \(\odot \) is the element-by-element vector product, and
Multiplying the unpenalized part by \(T^{1/2}\), we have the expansion
which is asymptotically normal by consistency, by Assumption 6 (which bounds the third-order term), the Slutsky theorem and the central limit theorem of Billingsley (1961). Furthermore, we have
Then using \(T^{(\mu -\eta )/2} \gamma _T \lambda ^{-1}_T \rightarrow \infty \), we have
We now pick \(k \in {{\mathcal {S}}}\) and consider the event \(\{i \in {\hat{{{\mathcal {A}}}}}_k\}\). Then the Karush–Kuhn–Tucker conditions for \({{\mathbb {G}}}_T \psi ({\hat{\varvec{\theta }}})\) are given by
Using the same reasoning as previously, \(T^{1/2}({\dot{{{\mathbb {G}}}}}_T l({\hat{\varvec{\theta }}}))_{(k),i}\) is also asymptotically normal, and \({\tilde{\varvec{\theta }}}^{(k)} \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \varvec{\theta }^{(k)}_0\) for \(k \in {{\mathcal {S}}}\), and besides
so that we obtain the same conclusion when adding \(\gamma _T T^{-1/2}\xi _{T,k} \frac{{\hat{\theta }}^{(k)}_i}{\Vert {\hat{\varvec{\theta }}}^{(k)}\Vert _2}\). Therefore, we have, for any \(k \in {{\mathcal {S}}}\) and \(i \notin {{\mathcal {A}}}_k\),
We have proved (16). \(\square \)
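The selection mechanism underlying this consistency result can be illustrated in the simplest possible setting. The sketch below is a toy example for the Lasso part alone under an orthonormal design (all names and parameter values are illustrative assumptions, not the estimator studied in the paper): the adaptive solution is a coordinate-wise soft-thresholding whose data-driven thresholds \(\lambda |{\tilde{\theta }}_i|^{-\eta }\) blow up on true zeros and vanish on signal coordinates.

```python
import numpy as np

def adaptive_soft_threshold(theta_tilde, lam, eta=1.0):
    """Orthonormal-design adaptive Lasso: coordinate-wise soft-thresholding
    with data-driven thresholds lam * |theta_tilde_i|^{-eta}. Coordinates
    with small preliminary estimates face a huge threshold and are set
    exactly to zero; large coordinates are shrunk by a vanishing amount."""
    thresh = lam * np.abs(theta_tilde) ** (-eta)
    return np.sign(theta_tilde) * np.maximum(np.abs(theta_tilde) - thresh, 0.0)

# Preliminary (e.g. least-squares) estimates: two signal coordinates
# followed by two noise coordinates near zero.
theta_tilde = np.array([2.0, -1.5, 0.05, -0.02])
theta_hat = adaptive_soft_threshold(theta_tilde, lam=0.01)
```

Here the noise coordinates are thresholded exactly to zero while the signal coordinates are barely shrunk, mirroring the support recovery established above.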
Cite this article
Poignard, B. Asymptotic theory of the adaptive Sparse Group Lasso. Ann Inst Stat Math 72, 297–328 (2020). https://doi.org/10.1007/s10463-018-0692-7