
Asymptotic theory of the adaptive Sparse Group Lasso

Annals of the Institute of Statistical Mathematics

Abstract

We study the asymptotic properties of a new version of the Sparse Group Lasso estimator (SGL), called adaptive SGL. This new version includes two distinct regularization parameters, one for the Lasso penalty and one for the Group Lasso penalty, and we consider the adaptive version of this regularization, where both penalties are weighted by preliminary random coefficients. The asymptotic properties are established in a general framework, where the data are dependent and the loss function is convex. We prove that this estimator satisfies the oracle property: the sparsity-based estimator recovers the true underlying sparse model and is asymptotically normally distributed. We also study its asymptotic properties in a double-asymptotic framework, where the number of parameters diverges with the sample size. We show by simulations and on real data that the adaptive SGL outperforms other oracle-like methods in terms of estimation precision and variable selection.


References

  • Andersen, P. K., Gill, R. D. (1982). Cox’s regression model for counting processes: A large sample study. The Annals of Statistics, 10(4), 1100–1120.

  • Bertsekas, D. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.

  • Billingsley, P. (1961). The Lindeberg–Lévy theorem for martingales. Proceedings of the American Mathematical Society, 12, 788–792.

  • Billingsley, P. (1995). Probability and measure. New York: Wiley.

  • Bühlmann, P., van de Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. Springer series in statistics. Berlin: Springer.

  • Chernozhukov, V. (2005). Extremal quantile regression. The Annals of Statistics, 33(2), 806–839.

  • Chernozhukov, V., Hong, H. (2004). Likelihood estimation and inference in a class of nonregular econometric models. Econometrica, 72(5), 1445–1480.

  • Davis, R. A., Knight, K., Liu, J. (1992). M-estimation for autoregressions with infinite variance. Stochastic Processes and Their Applications, 40, 145–180.

  • Fan, J. (1997). Comments on “Wavelets in statistics: A review” by A. Antoniadis. Journal of the Italian Statistical Association, 6, 131–138.

  • Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.

  • Fan, J., Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32(3), 928–961.

  • Francq, C., Thieu, L. Q. (2015). QML inference for volatility models with covariates. MPRA paper no. 63198.

  • Francq, C., Zakoïan, J. M. (2010). GARCH models. Chichester: Wiley.

  • Fu, W. J. (1998). Penalized regressions: The Bridge versus the Lasso. Journal of Computational and Graphical Statistics, 7, 397–416.

  • Geyer, C. J. (1996). On the asymptotics of convex stochastic optimization. Unpublished manuscript.

  • Hjort, N. L., Pollard, D. (1993). Asymptotics for minimisers of convex processes. Unpublished manuscript.

  • Huber, P. J. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 1(5), 799–821.

  • Hunter, D. R., Li, R. (2005). Variable selection using MM algorithms. The Annals of Statistics, 33(4), 1617–1642.

  • Kato, K. (2009). Asymptotics for argmin processes: Convexity arguments. Journal of Multivariate Analysis, 100, 1816–1829.

  • Knight, K., Fu, W. (2000). Asymptotics for Lasso-type estimators. The Annals of Statistics, 28(5), 1356–1378.

  • Li, X., Mo, L., Yuan, X., Zhang, J. (2014). Linearized alternating direction method of multipliers for Sparse Group and Fused Lasso models. Computational Statistics and Data Analysis, 79, 203–221.

  • Nardi, Y., Rinaldo, A. (2008). On the asymptotic properties of the Group Lasso estimator for linear models. Electronic Journal of Statistics, 2, 605–633.

  • Neumann, M. H. (2013). A central limit theorem for triangular arrays of weakly dependent random variables, with applications in statistics. ESAIM: Probability and Statistics, 17, 120–134.

  • Newey, W. K., Powell, J. L. (1987). Asymmetric least squares estimation and testing. Econometrica, 55(4), 819–847.

  • Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory, 7(2), 186–199.

  • Racine, J. (2000). Consistent cross-validatory model-selection for dependent data: hv-block cross-validation. Journal of Econometrics, 99, 39–61.

  • Rio, E. (2013). Inequalities and limit theorems for weakly dependent sequences. Lecture notes (3ème cycle), cel-00867106, 170 pp.

  • Rockafellar, R. T. (1970). Convex analysis. Princeton: Princeton University Press.

  • Shiryaev, A. N. (1991). Probability. Berlin: Springer.

  • Simon, N., Friedman, J., Hastie, T., Tibshirani, R. (2013). A Sparse Group Lasso. Journal of Computational and Graphical Statistics, 22(2), 231–245.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.

  • Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using \(l^1\)-constrained quadratic programming. IEEE Transactions on Information Theory, 55(5), 2183–2202.

  • Wellner, J. A., van der Vaart, A. W. (1996). Weak convergence and empirical processes: With applications to statistics. New York, NY: Springer.

  • Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1), 49–67.

  • Zou, H. (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.

  • Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2), 301–320.

  • Zou, H., Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 37(4), 1733–1751.


Acknowledgements

I would like to thank Alexandre Tsybakov, Arnak Dalalyan, Jean-Michel Zakoïan and Christian Francq for all the theoretical references they provided. I also warmly thank Jean-David Fermanian for his significant help and helpful comments. I gratefully acknowledge the support of the Ecodec Laboratory and of the Japan Society for the Promotion of Science.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Benjamin Poignard.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 279 KB)

Appendix

We first introduce some preliminary results. The dependent setting requires more sophisticated probabilistic tools to derive asymptotic results than the i.i.d. case. Assumptions 1 and 4 allow for using the central limit theorem of Billingsley (1961). We recall this result, stated as a corollary in Billingsley (1961).

Corollary 1

(Billingsley 1961) If \((x_t,{{\mathcal {F}}}_t)\) is a stationary and ergodic sequence of square integrable martingale increments such that \(\sigma ^2_x = \text {Var}(x_t) \ne 0\), then \(T^{-1/2} \sum ^{T}_{t=1} x_t \overset{d}{\rightarrow } {{\mathcal {N}}}(0,\sigma ^2_x)\).

Note that the square integrable martingale difference condition can be relaxed to \(\alpha \)-mixing and moment conditions. For instance, Rio (2013) provides a central limit theorem for strongly mixing and stationary sequences.
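As a minimal illustration (an example added here, in the spirit of the GARCH setting of Francq and Zakoïan 2010), let \(x_t = \sigma _t z_t\), where \((z_t)\) is i.i.d. with zero mean and unit variance, \(z_t\) is independent of \({{\mathcal {F}}}_{t-1}\), \(\sigma _t\) is \({{\mathcal {F}}}_{t-1}\)-measurable, and \((\sigma _t,z_t)\) is stationary and ergodic with \(0< {{\mathbb {E}}}[\sigma ^2_t] < \infty \). Then \((x_t,{{\mathcal {F}}}_t)\) is a stationary and ergodic sequence of square integrable martingale increments, and Corollary 1 yields

$$\begin{aligned} T^{-1/2} \overset{T}{\underset{t=1}{\sum }} x_t \overset{d}{\longrightarrow } {{\mathcal {N}}}(0,{{\mathbb {E}}}[\sigma ^2_t]). \end{aligned}$$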

To prove Theorem 1, we recall Theorem II.1 of Andersen and Gill (1982), which proves that pointwise convergence in probability of random concave functions implies uniform convergence on compact subsets.

Theorem 9

(Andersen and Gill 1982) Let E be an open convex subset of \({{\mathbb {R}}}^p\), and let \(F_1, F_2,\ldots ,\) be a sequence of random concave functions on E such that \(F_n(x) \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} f(x)\) for every \(x \in E\), where f is some real function on E. Then f is also concave, and for all compact \(A \subset E\),

$$\begin{aligned} \underset{x \in A}{\sup } |F_n(x) - f(x)| \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} 0. \end{aligned}$$

The proof of this theorem is based on a diagonal argument and Theorem 10.8 of Rockafellar (1970), that is, the pointwise convergence of concave random functions on a dense and countable subset of an open set implies uniform convergence on any compact subset of the open set. This leads to the following corollary.

Corollary 2

(Andersen and Gill 1982) Assume \(F_n(x) \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} f(x)\), for every \(x \in E\), an open convex subset of \({{\mathbb {R}}}^p\). Suppose f has a unique maximum at \(x_0 \in E\). Let \({\hat{X}}_n\) maximize \(F_n\). Then \({\hat{X}}_n \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} x_0\).

Newey and Powell (1987) use a similar theorem to prove the consistency of asymmetric least squares estimators without any compactness assumption on \(\varTheta \). We apply these results in our framework, where the parameter set \(\varTheta \) is supposed to be convex.

We use a convexity argument to derive the asymptotic distribution of the SGL estimator. Chernozhukov and Hong (2004) and Chernozhukov (2005) use this convexity argument to obtain the asymptotic distribution of quantile regression-type estimators. This argument relies on the convexity lemma, which is a key result for obtaining an asymptotic distribution when the objective function is not differentiable. It only requires the lower-semicontinuity and convexity of the empirical criterion. The convexity lemma, as in Chernozhukov (2005), proof of Theorem 4.1, can be stated as follows:

Lemma 1

(Chernozhukov 2005) Suppose

  1. (i)

    a sequence of convex lower-semicontinuous \({{\mathbb {F}}}_T: {{\mathbb {R}}}^d \rightarrow {\bar{{{\mathbb {R}}}}}\) marginally converges to \({{\mathbb {F}}}_{\infty }: {{\mathbb {R}}}^d \rightarrow {\bar{{{\mathbb {R}}}}}\) over a dense subset of \({{\mathbb {R}}}^d\);

  2. (ii)

    \({{\mathbb {F}}}_{\infty }\) is finite over a non-empty open set \(E \subset {{\mathbb {R}}}^d\);

  3. (iii)

    \({{\mathbb {F}}}_{\infty }\) is uniquely minimized at a random vector \(\varvec{u}_{\infty }\).

Then

$$\begin{aligned} \underset{\varvec{z}\in {{\mathbb {R}}}^d}{\arg \, \min } \, {{\mathbb {F}}}_T(\varvec{z}) \overset{d}{\longrightarrow } \underset{\varvec{z}\in {{\mathbb {R}}}^d}{\arg \, \min } \, {{\mathbb {F}}}_{\infty }(\varvec{z}), \, \text {that is} \; \varvec{u}_T \overset{d}{\longrightarrow } \varvec{u}_{\infty }. \end{aligned}$$

This is a key argument used in Theorem 3, Proposition 1 and Theorem 5.
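As a simple illustration of how Lemma 1 operates (a sketch in the spirit of Knight and Fu 2000, not part of the present proofs), consider the Lasso in the linear regression \(y_t = \varvec{x}_t' \varvec{\theta }_0 + \epsilon _t\) with i.i.d. square integrable errors, and set

$$\begin{aligned} {{\mathbb {F}}}_T(\varvec{u}) = \overset{T}{\underset{t=1}{\sum }} \left\{ \left( y_t - \varvec{x}_t'(\varvec{\theta }_0 + \varvec{u}/\sqrt{T})\right) ^2 - \left( y_t - \varvec{x}_t'\varvec{\theta }_0\right) ^2\right\} + \lambda _T \overset{d}{\underset{j=1}{\sum }} \left\{ |\theta _{0,j} + \varvec{u}_j/\sqrt{T}| - |\theta _{0,j}|\right\} . \end{aligned}$$

Each \({{\mathbb {F}}}_T\) is convex and, when \(\lambda _T/\sqrt{T} \rightarrow \lambda _0 \ge 0\), \(T^{-1} \sum _t \varvec{x}_t \varvec{x}_t'\) converges to a positive definite matrix and \(T^{-1/2} \sum _t \varvec{x}_t \epsilon _t\) satisfies a central limit theorem, \({{\mathbb {F}}}_T\) converges marginally to a convex limit with a unique minimizer. Lemma 1 then identifies the limit distribution of \(\sqrt{T}({\hat{\varvec{\theta }}} - \varvec{\theta }_0)\) as the \(\arg \, \min \) of this limit, which is exactly the Lasso asymptotics of Knight and Fu (2000).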

When we consider a diverging number of parameters, the empirical criterion can be viewed as a sequence of dependent triangular arrays, for which we need refined asymptotic results. Shiryaev (1991) proposed a version of the central limit theorem for such dependent arrays, provided the array is a square integrable martingale difference satisfying the so-called Lindeberg condition. A similar theorem can be found in Billingsley (1995, Theorem 35.12, p.476). We provide here the theorem of Shiryaev (see Theorem 4, p.543 of Shiryaev 1991) that we will use to derive the asymptotic distribution of the adaptive SGL estimator.

Theorem 10

(Shiryaev 1991) Let a sequence of square integrable martingale differences \(\xi ^n = (\xi _{nk},{{\mathcal {F}}}^n_k),n \ge 1\), with \({{\mathcal {F}}}^n_k = \sigma (\xi _{ns},s \le k)\), satisfy the Lindeberg condition for any \(0<t\le 1\) and every \(\epsilon > 0\), given by

$$\begin{aligned} \overset{\lfloor nt \rfloor }{\underset{k=0}{\sum }} {{\mathbb {E}}}\left[ \xi ^2_{nk} {\mathbf {1}}_{|\xi _{nk}| > \epsilon } | {{\mathcal {F}}}^n_{k-1}\right] \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} 0, \end{aligned}$$

If, in addition, \(\overset{\lfloor nt \rfloor }{\underset{k=0}{\sum }} {{\mathbb {E}}}[\xi ^2_{nk}| {{\mathcal {F}}}^n_{k-1} ] \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} \sigma ^2_t\), or \(\overset{\lfloor nt \rfloor }{\underset{k=0}{\sum }} \xi ^2_{nk} \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} \sigma ^2_t\), then \(\overset{\lfloor nt \rfloor }{\underset{k=0}{\sum }} \xi _{nk} \overset{d}{\longrightarrow } {{\mathcal {N}}}(0,\sigma ^2_t).\)
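As a sanity check (a remark added here, not part of the original statement), Theorem 10 contains Corollary 1 as a special case: take \(\xi _{nk} = x_k/\sqrt{n}\) with \((x_k,{{\mathcal {F}}}_k)\) as in Corollary 1 and \(t = 1\). Then

$$\begin{aligned} \overset{n}{\underset{k=1}{\sum }} {{\mathbb {E}}}[\xi ^2_{nk}| {{\mathcal {F}}}_{k-1}] = \frac{1}{n} \overset{n}{\underset{k=1}{\sum }} {{\mathbb {E}}}[x^2_{k}| {{\mathcal {F}}}_{k-1}] \overset{{{\mathbb {P}}}}{\underset{n \rightarrow \infty }{\longrightarrow }} \sigma ^2_x \end{aligned}$$

by the ergodic theorem, while the Lindeberg condition follows from \({{\mathbb {E}}}[x^2_1 {\mathbf {1}}_{|x_1| > \epsilon \sqrt{n}}] \underset{n \rightarrow \infty }{\longrightarrow } 0\) and the Markov inequality, so that \(n^{-1/2} \sum ^{n}_{k=1} x_k \overset{d}{\longrightarrow } {{\mathcal {N}}}(0,\sigma ^2_x)\).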

There exist central limit results relaxing the stationarity and martingale difference assumptions for sequences of arrays. Neumann (2013) proposed such a central limit theorem for weakly dependent sequences of arrays. Such sequences should also satisfy a Lindeberg condition and conditions on covariances. Equipped with these preliminary results, we now report the proofs of Sect. 4.

Proof of Theorem 1

By definition, \({\hat{\varvec{\theta }}} = \underset{\varvec{\theta }\in \varTheta }{\arg \, \min } \, \{{{\mathbb {G}}}_T \varphi (\varvec{\theta })\}\). In a first step, we prove the uniform convergence of \({{\mathbb {G}}}_T \varphi (.)\) to the limit quantity \({{\mathbb {G}}}_{\infty }\varphi (.)\) on any compact set \(\varvec{B}\subset \varTheta \), that is,

$$\begin{aligned} \underset{\varvec{x}\in \varvec{B}}{\sup } |{{\mathbb {G}}}_T \varphi (\varvec{x}) - {{\mathbb {G}}}_{\infty }\varphi (\varvec{x}) | \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} 0. \end{aligned}$$
(7)

We define \({{\mathcal {C}}}\subset \varTheta \) an open convex set and pick \(\varvec{x}\in {{\mathcal {C}}}\). Then by Assumption 1, the law of large numbers implies

$$\begin{aligned} {{\mathbb {G}}}_T l(\varvec{x}) \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} {{\mathbb {G}}}_{\infty } l(\varvec{x}). \end{aligned}$$

Consequently, if \(\lambda _T / T \rightarrow \lambda _0 \ge 0\) and \(\gamma _T / T \rightarrow \gamma _0 \ge 0\), we obtain the pointwise convergence

$$\begin{aligned} |{{\mathbb {G}}}_T \varphi (\varvec{x}) - {{\mathbb {G}}}_{\infty }\varphi (\varvec{x})| \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} 0. \end{aligned}$$

By Theorem 9 of Andersen and Gill (1982), \({{\mathbb {G}}}_{\infty } \varphi (.)\) is a convex function and we deduce the desired uniform convergence over any compact subset of \(\varTheta \), that is (7).

Now we would like to show that \(\arg \, \min \, \{{{\mathbb {G}}}_T \varphi (.)\} \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \arg \, \min \, \{{{\mathbb {G}}}_{\infty } \varphi (.)\}\). By Assumption 3, \(\varphi (.)\) is convex, which implies

$$\begin{aligned} |{{\mathbb {G}}}_T \varphi (\varvec{\theta })| \overset{{{\mathbb {P}}}}{\underset{\Vert \varvec{\theta }\Vert \rightarrow \infty }{\longrightarrow }} \infty . \end{aligned}$$

Consequently, \(\arg \, \min \{{{\mathbb {G}}}_T \varphi (\varvec{x})\} = O(1)\), such that \({\hat{\varvec{\theta }}} \in {{\mathcal {B}}}_o(\varvec{\theta }_0,C)\) with probability approaching one for C large enough, with \({{\mathcal {B}}}_o(\varvec{\theta }_0,C)\) an open ball centered at \(\varvec{\theta }_0\) and of radius C. Furthermore, as \({{\mathbb {G}}}_{\infty } \varphi (.)\) is convex and continuous, \(\underset{\varvec{x}\in B}{\arg \, \min } \, \{{{\mathbb {G}}}_{\infty } \varphi (\varvec{x})\}\) exists and is unique. Then by Corollary 2 of Andersen and Gill (1982), we obtain

$$\begin{aligned} \underset{\varvec{x}\in \varvec{B}}{\arg \, \min } \{{{\mathbb {G}}}_T \varphi (\varvec{x})\} \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \underset{\varvec{x}\in \varvec{B}}{\arg \, \min } \{{{\mathbb {G}}}_{\infty } \varphi (\varvec{x})\}, \; \; \text {that is} \; \; {\hat{\varvec{\theta }}} \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \varvec{\theta }^*_0. \end{aligned}$$

\(\square \)

Proof of Theorem 2

We denote \(\nu _T = T^{-1/2} + \lambda _T T^{-1} a + \gamma _T T^{-1} b\), with \(a = \text {card}({{\mathcal {A}}})(\underset{k}{\max } \; \alpha _k)\) and \(b = \text {card}({{\mathcal {A}}})(\underset{l}{\max } \; \xi _l)\). We would like to prove that for any \(\varvec{\epsilon }> 0\), there exists \(C_{\varvec{\epsilon }} > 0\) such that \({{\mathbb {P}}}(\nu ^{-1}_T\Vert {\hat{\varvec{\theta }}} - \varvec{\theta }_0\Vert > C_{\varvec{\epsilon }}) < \varvec{\epsilon }\). We have

$$\begin{aligned} {{\mathbb {P}}}(\nu ^{-1}_T \Vert {\hat{\varvec{\theta }}} - \varvec{\theta }_0\Vert > C_{\varvec{\epsilon }}) \le {{\mathbb {P}}}\left( \exists \varvec{u}\in {{\mathbb {R}}}^d, \Vert \varvec{u}\Vert _2 \ge C_{\varvec{\epsilon }}: {{\mathbb {G}}}_T \varphi (\varvec{\theta }_0 + \nu _T \varvec{u}) \le {{\mathbb {G}}}_T \varphi (\varvec{\theta }_0)\right) . \end{aligned}$$

\(\Vert \varvec{u}\Vert _2\) can potentially be large as it represents the discrepancy \({\hat{\varvec{\theta }}}-\varvec{\theta }_0\) normalized by \(\nu _T\). Now based on the convexity of the objective function, we have

$$\begin{aligned}&\left\{ \exists \varvec{u}^*, \Vert \varvec{u}^*\Vert _2 \ge C_{\varvec{\epsilon }}, {{\mathbb {G}}}_T \varphi (\varvec{\theta }_0 + \nu _T \varvec{u}^*) \le {{\mathbb {G}}}_T \varphi (\varvec{\theta }_0)\right\} \nonumber \\&\subset \big \{\exists {\bar{\varvec{u}}}, \Vert {\bar{\varvec{u}}}\Vert _2 = C_{\varvec{\epsilon }}, {{\mathbb {G}}}_T \varphi (\varvec{\theta }_0 + \nu _T {\bar{\varvec{u}}}) \le {{\mathbb {G}}}_T \varphi (\varvec{\theta }_0)\big \}, \end{aligned}$$
(8)

a relationship that allows us to work with a fixed \(\Vert \varvec{u}\Vert _2\). Let us define \(\varvec{\theta }_1 = \varvec{\theta }_0 + \nu _T \varvec{u}^*\) such that \({{\mathbb {G}}}_T \varphi (\varvec{\theta }_1) \le {{\mathbb {G}}}_T \varphi (\varvec{\theta }_0)\). Let \(\alpha \in (0,1)\) and \(\varvec{\theta }= \alpha \varvec{\theta }_1 + (1-\alpha ) \varvec{\theta }_0\). Then by convexity of \({{\mathbb {G}}}_T \varphi (.)\), we obtain

$$\begin{aligned} {{\mathbb {G}}}_T \varphi (\varvec{\theta }) \le \alpha {{\mathbb {G}}}_T \varphi (\varvec{\theta }_1) + (1-\alpha ) {{\mathbb {G}}}_T \varphi (\varvec{\theta }_0) \le {{\mathbb {G}}}_T \varphi (\varvec{\theta }_0). \end{aligned}$$

We pick \(\alpha \) such that \(\Vert {\bar{\varvec{u}}}\Vert = C_{\varvec{\epsilon }}\), where \({\bar{\varvec{u}}} := \alpha \varvec{u}^*\), so that \(\varvec{\theta }= \varvec{\theta }_0 + \nu _T {\bar{\varvec{u}}}\). Hence (8) holds, which implies

$$\begin{aligned} {{\mathbb {P}}}(\Vert {\hat{\varvec{\theta }}} - \varvec{\theta }_0 \Vert > C_{\varvec{\epsilon }} \nu _T)\le & {} {{\mathbb {P}}}(\exists \varvec{u}\in {{\mathbb {R}}}^d, \Vert \varvec{u}\Vert _2 \ge C_{\varvec{\epsilon }}: {{\mathbb {G}}}_T \varphi (\varvec{\theta }_0 + \nu _T \varvec{u}) \le {{\mathbb {G}}}_T \varphi (\varvec{\theta }_0)) \\\le & {} {{\mathbb {P}}}(\exists {\bar{\varvec{u}}}, \Vert {\bar{\varvec{u}}}\Vert _2 = C_{\varvec{\epsilon }}: {{\mathbb {G}}}_T \varphi (\varvec{\theta }_0 + \nu _T {\bar{\varvec{u}}}) \le {{\mathbb {G}}}_T \varphi (\varvec{\theta }_0)). \end{aligned}$$

Hence, we pick a \(\varvec{u}\) such that \(\Vert \varvec{u}\Vert _2 = C_{\varvec{\epsilon }}\). Using \(\varvec{p}_1(\lambda _T,\alpha ,0) = 0\) and \(\varvec{p}_2(\gamma _T,\xi ,0) = 0\), and applying a Taylor expansion to \({{\mathbb {G}}}_T l(\varvec{\theta }_0 + \nu _T \varvec{u})\), we obtain

$$\begin{aligned} {{\mathbb {G}}}_T \varphi (\varvec{\theta }_0 + \nu _T \varvec{u}) - {{\mathbb {G}}}_T \varphi (\varvec{\theta }_0)= & {} \nu _T {\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) \varvec{u}+ \frac{\nu ^2_T}{2} \varvec{u}' \ddot{{{\mathbb {G}}}}_T l(\varvec{\theta }_0) \varvec{u}\\&+\, \frac{\nu ^3_T}{6} \nabla '\{\varvec{u}' \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}}) \varvec{u}\} \varvec{u}+ \varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }_T)\\&-\,\varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }_0) + \varvec{p}_2(\gamma _T,\xi ,\varvec{\theta }_T)-\varvec{p}_2(\gamma _T,\xi ,\varvec{\theta }_0), \end{aligned}$$

where \({\bar{\varvec{\theta }}}\) satisfies \(\Vert {\bar{\varvec{\theta }}} - \varvec{\theta }_0\Vert \le \Vert \varvec{\theta }_T - \varvec{\theta }_0\Vert \). We want to prove

$$\begin{aligned} {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2= & {} C_{\varvec{\epsilon }}: {\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) \varvec{u}+ \frac{\nu _T}{2} {{\mathbb {E}}}[\varvec{u}' \ddot{{{\mathbb {G}}}}_T l(\varvec{\theta }_0) \varvec{u}] + \frac{\nu _T}{2} {{\mathcal {R}}}_T(\varvec{\theta }_0) \nonumber \\&+\, \frac{\nu ^2_T}{6} \nabla '\{\varvec{u}' \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}}) \varvec{u}\} \varvec{u}+\nu ^{-1}_T\{\varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }_T)-\varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }_0)\nonumber \\&+\, \varvec{p}_2(\gamma _T,\xi ,\varvec{\theta }_T)-\varvec{p}_2(\gamma _T,\xi ,\varvec{\theta }_0)\} \le 0) < \varvec{\epsilon }, \end{aligned}$$
(9)

where \({{\mathcal {R}}}_T(\varvec{\theta }_0) = \overset{d}{\underset{k,l=1}{\sum }} \varvec{u}_k \varvec{u}_l \{\partial ^2_{\theta _k \theta _l} {{\mathbb {G}}}_T l(\varvec{\theta }_0) - {{\mathbb {E}}}[\partial ^2_{\theta _k \theta _l} {{\mathbb {G}}}_T l(\varvec{\theta }_0)]\}\). By Assumption 1, \((\varvec{\epsilon }_t)\) is a non-anticipative stationary solution and is ergodic. Since the score is a square integrable martingale difference by Assumption 4,

$$\begin{aligned} \sqrt{T} {\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) \varvec{u}\overset{d}{\longrightarrow } {{\mathcal {N}}}(0, \varvec{u}' {{\mathbb {M}}}\varvec{u}), \end{aligned}$$

by the central limit theorem of Billingsley (1961), which implies \({\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) \varvec{u}= O_p(T^{-1/2} \varvec{u}' {{\mathbb {M}}}\varvec{u})\). By the ergodic theorem of Billingsley (1995), we have

$$\begin{aligned} \ddot{{{\mathbb {G}}}}_T l(\varvec{\theta }_0) \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} {{\mathbb {H}}}. \end{aligned}$$

This implies \({{\mathcal {R}}}_T(\varvec{\theta }_0) = o_p(1)\). Furthermore, by the Markov inequality, for \(b > 0\)

$$\begin{aligned} \begin{array}{llll} {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2 = C_{\varvec{\epsilon }}: \underset{{\bar{\varvec{\theta }}}:\Vert \varvec{\theta }-\varvec{\theta }_0\Vert _2 \le \nu _T C_{\varvec{\epsilon }}}{\sup }|\frac{\nu ^2_T}{6} \nabla '\{\varvec{u}' \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}}) \varvec{u}\} \varvec{u}| > b)\le & {} \frac{\nu ^4_T C^6_{\varvec{\epsilon }}}{36 b^2} \eta (C_{\varvec{\epsilon }}), \end{array} \end{aligned}$$

where \(\eta (C_{\varvec{\epsilon }})\) is defined in Assumption 6. We now focus on the penalty terms. As \(\varvec{p}_1(\lambda _T,\alpha ,0)=0\), for the \(l^1\) norm penalty, we have

$$\begin{aligned} \varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }_T) - \varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }_0)= & {} \lambda _T T^{-1} \underset{k \in {{\mathcal {S}}}}{\sum } \alpha _k \left\{ \Vert \varvec{\theta }^{(k)}_0 + \nu _T \varvec{u}^{(k)}\Vert _1 - \Vert \varvec{\theta }^{(k)}_0\Vert _1 \right\} , \nonumber \\ \text {and} \; |\varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }_T) - \varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }_0)|\le & {} \text {card}({{\mathcal {S}}}) \{ \underset{k \in {{\mathcal {S}}}}{\max } \; \alpha _k \} \lambda _T T^{-1} \nu _T \Vert \varvec{u}\Vert _1. \end{aligned}$$

As for the \(l^1/l^2\) norm, we obtain

$$\begin{aligned} \varvec{p}_2(\gamma _T,\xi ,\varvec{\theta }_T) - \varvec{p}_2(\gamma _T,\xi ,\varvec{\theta }_0)= & {} \gamma _T T^{-1} \underset{l \in {{\mathcal {S}}}}{\sum } \xi _l \left\{ \Vert \varvec{\theta }^{(l)}_T\Vert _2 - \Vert \varvec{\theta }^{(l)}_0\Vert _2\right\} , \nonumber \\ \text {and} \; |\varvec{p}_2(\gamma _T,\xi ,\varvec{\theta }_T) - \varvec{p}_2(\gamma _T,\xi ,\varvec{\theta }_0) |\le & {} \gamma _T T^{-1} \underset{l \in {{\mathcal {S}}}}{\sum } \xi _l \nu _T \Vert \varvec{u}^{(l)}\Vert _2 \\\le & {} \text {card}({{\mathcal {S}}}) \left\{ \underset{l\in {{\mathcal {S}}}}{\max } \; \xi _l \right\} \gamma _T T^{-1} \nu _T \Vert \varvec{u}\Vert _2. \nonumber \end{aligned}$$
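Both bounds above rely only on the reverse triangle inequality (a side remark added here); for instance, for the \(l^1\) part,

$$\begin{aligned} \big | \Vert \varvec{\theta }^{(k)}_0 + \nu _T \varvec{u}^{(k)}\Vert _1 - \Vert \varvec{\theta }^{(k)}_0\Vert _1 \big | \le \nu _T \Vert \varvec{u}^{(k)}\Vert _1 \le \nu _T \Vert \varvec{u}\Vert _1, \end{aligned}$$

and similarly \(\big | \Vert \varvec{\theta }^{(l)}_T\Vert _2 - \Vert \varvec{\theta }^{(l)}_0\Vert _2 \big | \le \nu _T \Vert \varvec{u}^{(l)}\Vert _2\) for the \(l^1/l^2\) part.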

Then, denoting \(\delta _T = \lambda _{\min }({{\mathbb {H}}}) C^2_{\varvec{\epsilon }} \nu _T/2\) and using \(\frac{\nu _T}{2} {{\mathbb {E}}}[\varvec{u}' \ddot{{{\mathbb {G}}}}_T l(\varvec{\theta }_0) \varvec{u}] \ge \delta _T\), we deduce that the probability in (9) can be bounded as

$$\begin{aligned} {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2= & {} C_{\varvec{\epsilon }}: {\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) \varvec{u}+ \frac{\nu _T}{2} \varvec{u}' \ddot{{{\mathbb {G}}}}_T l(\varvec{\theta }_0) \varvec{u}+ \frac{\nu ^2_T}{6} \nabla '\{\varvec{u}' \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}}) \varvec{u}\} \varvec{u}\\&+\, \nu ^{-1}_T \{\varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }_T)-\varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }_0) + \varvec{p}_2(\gamma _T,\xi ,\varvec{\theta }_T)\\&-\,\varvec{p}_2(\gamma _T,\xi ,\varvec{\theta }_0) \} \le 0) \\\le & {} {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2 = C_{\varvec{\epsilon }}: |{\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) \varvec{u}|> \delta _T/8) + {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2\\&= C_{\varvec{\epsilon }}: \frac{\nu _T}{2} |{{\mathcal {R}}}_T(\varvec{\theta }_0)|> \delta _T/8) \\&\quad +\, {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2 = C_{\varvec{\epsilon }}:| \frac{\nu ^2_T}{6} \nabla '\left\{ \varvec{u}' \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}}) \varvec{u}\right\} \varvec{u}|> \delta _T/8)\\&\quad +\, {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2 = C_{\varvec{\epsilon }}: |\varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }_T)-\varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }_0)|> \nu _T \delta _T/8) \\&\quad +\, {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2 = C_{\varvec{\epsilon }}: |\varvec{p}_2(\gamma _T,\xi ,\varvec{\theta }_T)-\varvec{p}_2(\gamma _T,\xi ,\varvec{\theta }_0)| > \nu _T \delta _T/8). \end{aligned}$$

We also have, for \(C_{\varvec{\epsilon }}\) and T large enough and using norm equivalences, that

$$\begin{aligned} {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2= & {} C_{\varvec{\epsilon }}: |\varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }_T)-\varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }_0)|> \nu _T \delta _T/8) \\\le & {} {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2 = C_{\varvec{\epsilon }}: \text {card}({{\mathcal {S}}}) \{ \underset{k \in {{\mathcal {S}}}}{\max } \; \alpha _k \} \lambda _T T^{-1} \nu _T \Vert \varvec{u}\Vert _1> \nu _T \delta _T/8)< \varvec{\epsilon }/5, \\&{{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2 = C_{\varvec{\epsilon }}: |\varvec{p}_2(\gamma _T,\xi ,\varvec{\theta }_T)-\varvec{p}_2(\gamma _T,\xi ,\varvec{\theta }_0)|> \nu _T \delta _T/8) \\\le & {} {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2 = C_{\varvec{\epsilon }}: \text {card}({{\mathcal {S}}}) \{ \underset{l\in {{\mathcal {S}}}}{\max } \; \xi _l \} \gamma _T T^{-1} \nu _T \Vert \varvec{u}\Vert _2 > \nu _T \delta _T/8) < \varvec{\epsilon }/5. \end{aligned}$$

Moreover, if \(\nu _T = T^{-1/2} + \lambda _T T^{-1} a + \gamma _T T^{-1} b\), then for \(C_{\varvec{\epsilon }}\) large enough

$$\begin{aligned} {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2 = C_{\varvec{\epsilon }}: |{\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) \varvec{u}| > \delta _T/8) \le \frac{C^2_{\varvec{\epsilon }} C_{st}}{T \delta ^2_T} \le \frac{C_{st}}{C^4_{\varvec{\epsilon }}} < \varvec{\epsilon }/5. \end{aligned}$$

Moreover

$$\begin{aligned} {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2= & {} C_{\varvec{\epsilon }}: \underset{{\bar{\varvec{\theta }}}:\Vert {\bar{\varvec{\theta }}}-\varvec{\theta }_0\Vert _2 < \nu _T C_{\varvec{\epsilon }}}{\sup }|\frac{\nu ^2_T}{6} \nabla '\{\varvec{u}' \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}}) \varvec{u}\} \varvec{u}|> \delta _T/8) \\\le & {} \frac{C_{st} \nu ^4_T \eta (C_{\varvec{\epsilon }})}{\delta ^2_T} \le C_{st} \nu ^2_T C^2_{\varvec{\epsilon }} \eta (C_{\varvec{\epsilon }}) \end{aligned}$$

where \(C_{st} > 0\) is a generic constant. We obtain

$$\begin{aligned} {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2= & {} C_{\varvec{\epsilon }}: |{\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) \varvec{u}|> \delta _T/8) + {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2 = C_{\varvec{\epsilon }}: \frac{\nu _T}{2} |{{\mathcal {R}}}_T(\varvec{\theta }_0)|> \delta _T/8) \\&+\, {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2 = C_{\varvec{\epsilon }}:| \frac{\nu ^2_T}{6} \nabla '\{\varvec{u}' \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}}) \varvec{u}\} \varvec{u}|> \delta _T/8)\\&+\, {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2 = C_{\varvec{\epsilon }}: |\varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }_0)-\varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }_T)|> \nu _T \delta _T/8) \\&+\, {{\mathbb {P}}}(\exists \varvec{u}, \Vert \varvec{u}\Vert _2 = C_{\varvec{\epsilon }}: |\varvec{p}_2(\gamma _T,\xi ,\varvec{\theta }_0)-\varvec{p}_2(\gamma _T,\xi ,\varvec{\theta }_T)| > \nu _T \delta _T/8) \\\le & {} \frac{C_{st}}{C^4_{\varvec{\epsilon }}} + \nu ^2_T C^2_{\varvec{\epsilon }} \eta (C_{\varvec{\epsilon }}) C_{st} + 3 \varvec{\epsilon }/5 \le \varvec{\epsilon }, \end{aligned}$$

for \(C_{\varvec{\epsilon }}\) and T large enough. We then deduce \(\Vert {\hat{\varvec{\theta }}} - \varvec{\theta }_0\Vert = O_p(\nu _T)\). \(\square \)

Proof of Theorem 3

Let \(\varvec{u}\in {{\mathbb {R}}}^d\) such that \(\varvec{\theta }= \varvec{\theta }_0 + \varvec{u}/T^{1/2}\) and define the empirical criterion \({{\mathbb {F}}}_T(\varvec{u}) = T {{\mathbb {G}}}_T (\varphi (\varvec{\theta }_0 + \varvec{u}/T^{1/2}) - \varphi (\varvec{\theta }_0))\). First, we are going to prove the finite-dimensional convergence in distribution of \({{\mathbb {F}}}_T\) to \({{\mathbb {F}}}_{\infty }\). Then we use the convexity of \({{\mathbb {F}}}_T(.)\) to obtain the convergence in distribution of the \(\arg \, \min \) of the empirical criterion to the \(\arg \, \min \) of the limit process. To do so, let \(\varvec{u}= \sqrt{T}(\varvec{\theta }- \varvec{\theta }_0)\). We have

$$\begin{aligned} {{\mathbb {F}}}_T(\varvec{u})= & {} T \left\{ {{\mathbb {G}}}_T (l(\varvec{\theta }) - l(\varvec{\theta }_0)) + \varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }) - \varvec{p}_1(\lambda _T,\alpha ,\varvec{\theta }_0) + \varvec{p}_2(\gamma _T,\xi ,\varvec{\theta })\right. \\&\left. -\,\varvec{p}_2(\gamma _T,\xi ,\varvec{\theta }_0) \right\} \\= & {} T {{\mathbb {G}}}_T (l(\varvec{\theta }_0 + \varvec{u}/T^{1/2}) - l(\varvec{\theta }_0)) + \lambda _T \overset{m}{\underset{k = 1}{\sum }} \alpha _k \left[ \Vert \varvec{\theta }^{(k)}_0 + \varvec{u}^{(k)}/\sqrt{T}\Vert _1 - \Vert \varvec{\theta }^{(k)}_0\Vert _1\right] \\&+\, \gamma _T \overset{m}{\underset{l = 1}{\sum }} \xi _l \left[ \Vert \varvec{\theta }^{(l)}_0 + \varvec{u}^{(l)}/\sqrt{T}\Vert _2 - \Vert \varvec{\theta }^{(l)}_0\Vert _2\right] , \end{aligned}$$

where \({{\mathbb {F}}}_T(.)\) is convex and \(C^0({{\mathbb {R}}}^d)\). We now prove the finite-dimensional convergence in distribution of \({{\mathbb {F}}}_T\) to \({{\mathbb {F}}}_{\infty }\) in order to apply Lemma 1. For the \(l^1\) penalty, for any group k, we have for T sufficiently large

$$\begin{aligned} \Vert \varvec{\theta }^{(k)}_0 + \varvec{u}^{(k)}/\sqrt{T}\Vert _1 - \Vert \varvec{\theta }^{(k)}_0\Vert _1 = T^{-1/2} \overset{\varvec{c}_k}{\underset{i = 1}{\sum }} \left\{ |\varvec{u}^{(k)}_i| {\mathbf {1}}_{\theta ^{(k)}_{0,i} = 0} + \varvec{u}^{(k)}_i \text {sgn}(\theta ^{(k)}_{0,i}){\mathbf {1}}_{\theta ^{(k)}_{0,i} \ne 0} \right\} , \end{aligned}$$

which implies that

$$\begin{aligned}&\lambda _T \overset{m}{\underset{k = 1}{\sum }} \alpha _k \left[ \Vert \varvec{\theta }^{(k)}_0 + \varvec{u}^{(k)}/\sqrt{T}\Vert _1 - \Vert \varvec{\theta }^{(k)}_0\Vert _1\right] \underset{T \rightarrow \infty }{\longrightarrow } \lambda _0 \overset{m}{\underset{k = 1}{\sum }} \alpha _k \overset{\varvec{c}_k}{\underset{i = 1}{\sum }} \left\{ |\varvec{u}^{(k)}_i| {\mathbf {1}}_{\theta ^{(k)}_{0,i} = 0}\right. \\&\left. \quad +\, \varvec{u}^{(k)}_i \text {sgn}(\theta ^{(k)}_{0,i}){\mathbf {1}}_{\theta ^{(k)}_{0,i} \ne 0} \right\} , \end{aligned}$$

under the condition that \(\lambda _T / \sqrt{T} \rightarrow \lambda _0\). As for the \(l^1/l^2\) quantity, for any group l, we have

$$\begin{aligned} \Vert \varvec{\theta }^{(l)}_0 + u^{(l)}/\sqrt{T}\Vert _2 - \Vert \varvec{\theta }^{(l)}_0\Vert _2 = T^{-1/2} \left\{ \Vert u^{(l)}\Vert _2 {\mathbf {1}}_{\varvec{\theta }^{(l)}_{0} = {\mathbf {0}}} + \frac{u^{(l)'} \varvec{\theta }^{(l)}_0}{\Vert \varvec{\theta }^{(l)}_0\Vert _2} {\mathbf {1}}_{\varvec{\theta }^{(l)}_{0} \ne {\mathbf {0}}}\right\} + o(T^{-1}). \end{aligned}$$

Consequently, if \(\gamma _T T^{-1/2} \rightarrow \gamma _0 \ge 0\), we obtain

$$\begin{aligned} \gamma _T \overset{m}{\underset{l = 1}{\sum }} \xi _l \left[ \Vert \varvec{\theta }^{(l)}_0 + u^{(l)}/\sqrt{T}\Vert _2 - \Vert \varvec{\theta }^{(l)}_0\Vert _2 \right]= & {} \gamma _0 \overset{m}{\underset{l = 1}{\sum }} \xi _l \left\{ \Vert u^{(l)}\Vert _2 {\mathbf {1}}_{\varvec{\theta }^{(l)}_{0} = {\mathbf {0}}}\right. \\&\left. + \frac{u^{(l)'} \varvec{\theta }^{(l)}_0}{\Vert \varvec{\theta }^{(l)}_0\Vert _2} {\mathbf {1}}_{\varvec{\theta }^{(l)}_0 \ne {\mathbf {0}}}\right\} + o(T^{-1}) \gamma _T. \end{aligned}$$
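The expansion of the \(l^2\) norm used above is a first-order Taylor expansion at a nonzero point (a short justification added here): for \(\varvec{\theta }^{(l)}_0 \ne {\mathbf {0}}\) and \(\varvec{h}= \varvec{u}^{(l)}/\sqrt{T}\),

$$\begin{aligned} \Vert \varvec{\theta }^{(l)}_0 + \varvec{h}\Vert _2 - \Vert \varvec{\theta }^{(l)}_0\Vert _2 = \frac{\varvec{h}' \varvec{\theta }^{(l)}_0}{\Vert \varvec{\theta }^{(l)}_0\Vert _2} + O(\Vert \varvec{h}\Vert ^2_2), \end{aligned}$$

since the Euclidean norm is differentiable away from the origin with gradient \(\varvec{\theta }^{(l)}_0/\Vert \varvec{\theta }^{(l)}_0\Vert _2\); at \(\varvec{\theta }^{(l)}_0 = {\mathbf {0}}\), the difference is exactly \(\Vert \varvec{h}\Vert _2\).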

Now for the unpenalized criterion \({{\mathbb {G}}}_T l(.)\), by a Taylor expansion, we have

$$\begin{aligned}&T {{\mathbb {G}}}_T (l(\varvec{\theta }_0 + \varvec{u}/T^{1/2}) - l(\varvec{\theta }_0)) = \varvec{u}' T^{1/2}{\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) + \frac{1}{2} \varvec{u}' \ddot{{{\mathbb {G}}}}_T l(\varvec{\theta }_0) \varvec{u}\\&\quad +\, \frac{1}{6 T^{1/2}}\nabla '\{\varvec{u}' \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}}) \varvec{u}\} \varvec{u}, \end{aligned}$$

where \({\bar{\varvec{\theta }}}\) satisfies \(\Vert {\bar{\varvec{\theta }}} - \varvec{\theta }_0\Vert \le \Vert \varvec{u}\Vert /\sqrt{T}\). Then, by Assumption 4 and the central limit theorem of Billingsley (1961),

\(\sqrt{T} {\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) \overset{d}{\longrightarrow } {{\mathcal {N}}}(0,{{\mathbb {M}}})\), and by the ergodic theorem \(\ddot{{{\mathbb {G}}}}_T l(\varvec{\theta }_0) \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} {{\mathbb {H}}}\). Furthermore, we have by Assumption 6

$$\begin{aligned}&|\nabla '\{\varvec{u}' \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}}) \varvec{u}\} \varvec{u}|^2 \\&\quad \le \frac{1}{T^2} \overset{T}{\underset{t,t'=1}{\sum }} \overset{d}{\underset{k_1,l_1,m_1}{\sum }}\overset{d}{\underset{k_2,l_2,m_2}{\sum }} \varvec{u}_{k_1} \varvec{u}_{l_1} \varvec{u}_{m_1} \varvec{u}_{k_2} \varvec{u}_{l_2} \varvec{u}_{m_2} |\partial ^3_{\theta _{k_1} \theta _{l_1} \theta _{m_1}} l(\varvec{\epsilon }_t;{\bar{\varvec{\theta }}}) . \partial ^3_{\theta _{k_2} \theta _{l_2} \theta _{m_2}} l(\varvec{\epsilon }_{t'};{\bar{\varvec{\theta }}}) | \\&\quad \le \frac{1}{T^2} \overset{T}{\underset{t,t'=1}{\sum }} \overset{d}{\underset{k_1,l_1,m_1}{\sum }}\overset{d}{\underset{k_2,l_2,m_2}{\sum }} \varvec{u}_{k_1} \varvec{u}_{l_1} \varvec{u}_{m_1} \varvec{u}_{k_2} \varvec{u}_{l_2} \varvec{u}_{m_2} \upsilon _t(C) \upsilon _{t'}(C), \end{aligned}$$

for C large enough, such that \(\upsilon _t(C) = \underset{k,l,m=1,\ldots ,d}{\sup } \{ \underset{\varvec{\theta }:\Vert \varvec{\theta }-\varvec{\theta }_0\Vert _2 \le \nu _T C}{\sup } |\partial ^3_{\theta _k \theta _l \theta _m} l(\varvec{\epsilon }_t;\varvec{\theta })|\}\) with \(\nu _T = T^{-1/2} + \lambda _T T^{-1} a_T + \gamma _T T^{-1} b_T\). We deduce \(\nabla '\{\varvec{u}' \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}}) \varvec{u}\} \varvec{u}= O_p(\Vert \varvec{u}\Vert ^3_2 \eta (C))\). We obtain

$$\begin{aligned} \frac{1}{6 T^{1/2}}\nabla '\{\varvec{u}' \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}}) \varvec{u}\} \varvec{u}\overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} 0. \end{aligned}$$

We have thus proved that \({{\mathbb {F}}}_T(\varvec{u}) \overset{d}{\longrightarrow } {{\mathbb {F}}}_{\infty }(\varvec{u})\) for any fixed \(\varvec{u}\). Let us observe that

$$\begin{aligned} \varvec{u}^*_T = \underset{\varvec{u}}{\arg \; \min } \, \{ {{\mathbb {F}}}_T(\varvec{u}) \}, \end{aligned}$$

and \({{\mathbb {F}}}_T(.)\) admits as a minimizer \(\varvec{u}^*_T = \sqrt{T}({\hat{\varvec{\theta }}} - \varvec{\theta }_0)\). As \({{\mathbb {F}}}_T\) is convex and \({{\mathbb {F}}}_{\infty }\) is continuous, convex and has a unique minimum by Assumption 5, the convexity lemma (Lemma 1) yields

$$\begin{aligned} \sqrt{T}({\hat{\varvec{\theta }}} - \varvec{\theta }_0) = \underset{\varvec{u}}{\arg \; \min } \{ {{\mathbb {F}}}_T \} \overset{d}{\longrightarrow } \underset{\varvec{u}}{\arg \; \min } \{ {{\mathbb {F}}}_{\infty } \}. \end{aligned}$$

\(\square \)

Proof of Proposition 1

In Theorem 3, we proved \(\sqrt{T}({\hat{\varvec{\theta }}} - \varvec{\theta }_0) := \underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min } \{{{\mathbb {F}}}_T\} \overset{d}{\longrightarrow } \underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min } \{{{\mathbb {F}}}_{\infty }\}\) for \(\lambda _T/\sqrt{T} \rightarrow \lambda _0\) and \(\gamma _T / \sqrt{T} \rightarrow \gamma _0\). The limit random function is

$$\begin{aligned} {{\mathbb {F}}}_{\infty }(\varvec{u})= & {} \frac{1}{2}\varvec{u}' {{\mathbb {H}}}\varvec{u}+ \varvec{u}' \varvec{Z}+ \lambda _0 \overset{m}{\underset{k = 1}{\sum }} \alpha _k \overset{\varvec{c}_k}{\underset{i = 1}{\sum }} \left\{ |\varvec{u}^{(k)}_i| {\mathbf {1}}_{\theta ^{(k)}_{0,i} = 0} + \varvec{u}^{(k)}_i \text {sgn}(\theta ^{(k)}_{0,i}){\mathbf {1}}_{\theta ^{(k)}_{0,i} \ne 0} \right\} \\&+\, \gamma _0 \overset{m}{\underset{l = 1}{\sum }} \xi _l \{\Vert \varvec{u}^{(l)}\Vert _2 {\mathbf {1}}_{\varvec{\theta }^{(l)}_{0} = {\mathbf {0}}} + \frac{\varvec{u}^{(l)'} \varvec{\theta }^{(l)}_0}{\Vert \varvec{\theta }^{(l)}_0\Vert _2} {\mathbf {1}}_{\varvec{\theta }^{(l)}_{0} \ne {\mathbf {0}}}\}. \end{aligned}$$

First, let us observe that

$$\begin{aligned} \{{\hat{{{\mathcal {A}}}}} {=} {{\mathcal {A}}}\}{=} \left\{ \forall k {=}1,\ldots ,m, \, i \!\in \! {{\mathcal {A}}}^c_k, {\hat{\theta }}^{(k)}_i {=} 0\right\} \cap \left\{ \forall k {=}1,\ldots ,m, \, i \!\in \! {\hat{{{\mathcal {A}}}}}^c_k, \theta ^{(k)}_{0,i} {=} 0\right\} . \end{aligned}$$

Both sets describing \(\{{\hat{{{\mathcal {A}}}}} = {{\mathcal {A}}}\}\) are symmetric, and thus we can focus on

$$\begin{aligned} \{{\hat{{{\mathcal {A}}}}} = {{\mathcal {A}}}\} \Rightarrow \left\{ \forall k =1,\ldots ,m, \, i \in {{\mathcal {A}}}^c_k, T^{1/2} {\hat{\theta }}^{(k)}_i = 0\right\} . \end{aligned}$$

Hence

$$\begin{aligned} {{\mathbb {P}}}( {\hat{{{\mathcal {A}}}}} = {{\mathcal {A}}}) \le {{\mathbb {P}}}\left( \forall k = 1,\ldots ,m, \forall i \in {{\mathcal {A}}}^c_k, T^{1/2}{\hat{\theta }}^{(k)}_i = 0\right) . \end{aligned}$$

Denoting by \( \varvec{u}^*:= \underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min } \{{{\mathbb {F}}}_{\infty }(\varvec{u})\}\), Theorem 3 corresponds to \(\sqrt{T}({\hat{\varvec{\theta }}}_{{{\mathcal {A}}}} - \varvec{\theta }_{0,{{\mathcal {A}}}}) \overset{d}{\longrightarrow } \varvec{u}^*_{{{\mathcal {A}}}}\). By the Portmanteau theorem (see Wellner and van der Vaart 1996), we have

$$\begin{aligned} \underset{T \rightarrow \infty }{\lim \sup } \; {{\mathbb {P}}}( \forall k = 1,\ldots ,m, \forall i \in {{\mathcal {A}}}^c_k, T^{1/2}{\hat{\theta }}^{(k)}_i = 0) \le {{\mathbb {P}}}(\forall k =1,\ldots ,m, \forall i \in {{\mathcal {A}}}^c_k, \varvec{u}^{(k) *}_i = 0), \end{aligned}$$

as \(\varvec{\theta }_{0,{{\mathcal {A}}}^c} = {\mathbf {0}}\). Consequently, we need to prove that the probability on the right-hand side is strictly smaller than 1; it is upper-bounded by

$$\begin{aligned}&{{\mathbb {P}}}(\forall k =1,\ldots ,m, \forall i \in {{\mathcal {A}}}^c_k, \varvec{u}^{(k) *}_i = 0) \le \nonumber \\&\min ({{\mathbb {P}}}(k \notin {{\mathcal {S}}}, \varvec{u}^{(k) *} = 0),{{\mathbb {P}}}(k \in {{\mathcal {S}}}, \forall i \in {{\mathcal {A}}}^c_k, \varvec{u}^{(k) *}_i = 0)). \end{aligned}$$
(10)

If \(\lambda _0 = \gamma _0 = 0\), then \(\varvec{u}^* = -{{\mathbb {H}}}^{-1} \varvec{Z}\), so that \({{\mathbb {P}}}_{\varvec{u}^*} = {{\mathcal {N}}}(0,{{\mathbb {H}}}^{-1} {{\mathbb {M}}}{{\mathbb {H}}}^{-1})\); in particular, any given component of \(\varvec{u}^*\) vanishes with probability zero. Hence \(c = 0\).

If \(\lambda _0 \ne 0\) or \(\gamma _0 \ne 0\), the necessary and sufficient optimality conditions for a group k tell us that \(\varvec{u}^*\) satisfies

$$\begin{aligned} \left\{ \begin{array}{llll} ({{\mathbb {H}}}\varvec{u}^* + \varvec{Z})_{(k)} + \lambda _0 \alpha _k \varvec{p}^{(k)} + \gamma _0 \xi _k \frac{\varvec{\theta }^{(k)}_0}{\Vert \varvec{\theta }^{(k)}_0\Vert _2} = 0, &{} &{} k \in {{\mathcal {S}}},\\ ({{\mathbb {H}}}\varvec{u}^* + \varvec{Z})_{(k)} + \lambda _0 \alpha _k \varvec{w}^{(k)} + \gamma _0 \xi _k \varvec{z}^{(k)} = 0, &{} &{} \, \text {otherwise}, \end{array}\right. \end{aligned}$$
(11)

where \(\varvec{w}^{(k)}\) and \(\varvec{z}^{(k)}\) are the subgradients of \(\Vert \varvec{u}^{(k)}\Vert _1\) and \(\Vert \varvec{u}^{(k)}\Vert _2\) given by

$$\begin{aligned} \varvec{w}^{(k)}_i {\left\{ \begin{array}{ll} = \text {sgn}(\varvec{u}^{(k)}_i) \, \text {if} \, \varvec{u}^{(k)}_i \ne 0,\\ \in \{\varvec{w}^{(k)}_i : |\varvec{w}^{(k)}_i| \le 1\} \, \text {if} \, \varvec{u}^{(k)}_i = 0, \end{array}\right. } \, \varvec{z}^{(k)} {\left\{ \begin{array}{ll} = \frac{\varvec{u}^{(k)}}{\Vert \varvec{u}^{(k)}\Vert _2} \, \text {if} \, \varvec{u}^{(k)} \ne 0,\\ \in \{\varvec{z}^{(k)} : \Vert \varvec{z}^{(k)}\Vert _2 \le 1\} \, \text {if} \, \varvec{u}^{(k)} = 0, \end{array}\right. } \end{aligned}$$

and \(\varvec{p}^{(k)}_i = \partial _{\varvec{u}_i} \{ |\varvec{u}^{(k)}_i| {\mathbf {1}}_{\theta ^{(k)}_{0,i} = 0} + \varvec{u}^{(k)}_i \text {sgn}(\theta ^{(k)}_{0,i}){\mathbf {1}}_{\theta ^{(k)}_{0,i} \ne 0} \}\).

If \(\varvec{u}^{(m) *} = 0, \forall m \notin {{\mathcal {S}}}\), then the optimality conditions (11) become

$$\begin{aligned} \left\{ \begin{array}{llll} {{\mathbb {H}}}_{{{\mathcal {S}}}{{\mathcal {S}}}} \varvec{u}^*_{{{\mathcal {S}}}} + \varvec{Z}_{{{\mathcal {S}}}} + \lambda _0 \tau _{{{\mathcal {S}}}} + \gamma _0 \zeta _{{{\mathcal {S}}}} = 0, &{} &{}\\ \Vert -{{\mathbb {H}}}_{(l) {{\mathcal {S}}}} \varvec{u}^*_{{{\mathcal {S}}}} - \varvec{Z}_{(l)} -\lambda _0 \alpha _l \varvec{w}^{(l)}\Vert _2 \le \gamma _0 \xi _l, \, \text {as} \, \Vert \varvec{z}^{(l)}\Vert _2 \le 1, \, l \in {{\mathcal {S}}}^c, &{} &{} \end{array}\right. \end{aligned}$$
(12)

with \(\tau _{{{\mathcal {S}}}} = \text {vec}(k \in {{\mathcal {S}}}, \alpha _k \varvec{p}^{(k)})\) and \(\zeta _{{{\mathcal {S}}}} = \text {vec}(k \in {{\mathcal {S}}}, \xi _k \frac{\varvec{\theta }^{(k)}_0}{\Vert \varvec{\theta }^{(k)}_0\Vert _2})\), which are vectors of \({{\mathbb {R}}}^{\text {card}({{\mathcal {S}}})}\).

For \(k \in {{\mathcal {S}}}\), that is, when the vector \(\varvec{\theta }^{(k)}_0\) has at least one nonzero component, we have

$$\begin{aligned} \left\{ \begin{array}{llll} ({{\mathbb {H}}}\varvec{u}^* + \varvec{Z})_i + \lambda _0 \alpha _k \text {sgn}(\theta ^{(k)}_{0,i}) + \gamma _0 \xi _k \frac{\theta ^{(k)}_{0,i}}{\Vert \varvec{\theta }^{(k)}_0\Vert _2} = 0, \, \text {if} \, k \in {{\mathcal {S}}}, i \in {{\mathcal {A}}}_k, &{} &{}\\ ({{\mathbb {H}}}\varvec{u}^* + \varvec{Z})_i + \lambda _0 \alpha _k \varvec{w}^{(k)}_i = 0, \, i \in {{\mathcal {A}}}^c_k. &{} &{} \end{array}\right. \end{aligned}$$
(13)

Consequently, if \(\varvec{u}^{(k) *}_i = 0, \forall i \in {{\mathcal {A}}}^c_k\), with \(k \in {{\mathcal {S}}}\), then the conditions (13) become

$$\begin{aligned}\left\{ \begin{array}{llll} {{\mathbb {H}}}_{{{\mathcal {A}}}_k {{\mathcal {A}}}_k} \varvec{u}^*_{{{\mathcal {A}}}_k} + \varvec{Z}_{{{\mathcal {A}}}_k} + \lambda _0 \alpha _k \text {sgn}(\varvec{\theta }_{0,{{\mathcal {A}}}_k}) + \gamma _0 \xi _k \frac{\varvec{\theta }_{0,{{\mathcal {A}}}_k}}{\Vert \varvec{\theta }_{0,{{\mathcal {A}}}_k}\Vert _2} = 0, &{} &{}\\ |-({{\mathbb {H}}}_{{{\mathcal {A}}}^c_k {{\mathcal {A}}}_k} \varvec{u}^*_{{{\mathcal {A}}}_k} + \varvec{Z}_{{{\mathcal {A}}}^c_k})_i| \le \lambda _0 \alpha _k. &{} &{} \end{array}\right. \end{aligned}$$

Combining relationships in (12), we obtain

$$\begin{aligned} \Vert {{\mathbb {H}}}_{(l) {{\mathcal {S}}}} {{\mathbb {H}}}^{-1}_{{{\mathcal {S}}}{{\mathcal {S}}}} (\varvec{Z}_{{{\mathcal {S}}}} + \lambda _0 \tau _{{{\mathcal {S}}}} + \gamma _0 \zeta _{{{\mathcal {S}}}}) - \varvec{Z}_{(l)} -\lambda _0 \alpha _l \varvec{w}^{(l)}\Vert _2 \le \gamma _0 \xi _l, l \in {{\mathcal {S}}}^c. \end{aligned}$$

The same reasoning applies for active groups with inactive components, so that combining relationships in (13), we obtain

$$\begin{aligned} |\left( {{\mathbb {H}}}_{{{\mathcal {A}}}^c_k {{\mathcal {A}}}_k} {{\mathbb {H}}}^{-1}_{{{\mathcal {A}}}_k {{\mathcal {A}}}_k} \left( \varvec{Z}_{{{\mathcal {A}}}_k} + \lambda _0 \alpha _k \text {sgn}(\varvec{\theta }_{0,{{\mathcal {A}}}_k}) + \gamma _0 \xi _k \frac{\varvec{\theta }_{0,{{\mathcal {A}}}_k}}{\Vert \varvec{\theta }_{0,{{\mathcal {A}}}_k}\Vert _2}\right) - \varvec{Z}_{{{\mathcal {A}}}^c_k}\right) _i| \le \lambda _0 \alpha _k. \end{aligned}$$

Hence we deduce

$$\begin{aligned}&{{\mathbb {P}}}(\forall k =1,\ldots ,m, \forall i \in {{\mathcal {A}}}^c_k, \varvec{u}^{(k) *}_i = 0) \le \\&\min ({{\mathbb {P}}}(k \notin {{\mathcal {S}}}, \varvec{u}^{(k) *} = 0),{{\mathbb {P}}}(k \in {{\mathcal {S}}}, \forall i \in {{\mathcal {A}}}^c_k, \varvec{u}^{(k) *}_i = 0)) := \min (a_1,a_2). \end{aligned}$$

Under the assumption that \(\lambda _0 < \infty \) and \(\gamma _0 < \infty \), we obtain

$$\begin{aligned} a_1= & {} {{\mathbb {P}}}(l \in {{\mathcal {S}}}^c, \Vert {{\mathbb {H}}}_{(l) {{\mathcal {S}}}} {{\mathbb {H}}}^{-1}_{{{\mathcal {S}}}{{\mathcal {S}}}} (\varvec{Z}_{{{\mathcal {S}}}} + \lambda _0 \tau _{{{\mathcal {S}}}} + \gamma _0 \zeta _{{{\mathcal {S}}}}) - \varvec{Z}_{(l)} -\lambda _0 \alpha _l \varvec{w}^{(l)}\Vert _2 \le \gamma _0 \xi _l)< 1, \\ a_2= & {} {{\mathbb {P}}}(k \in {{\mathcal {S}}}, i \in {{\mathcal {A}}}^c_k, |({{\mathbb {H}}}_{{{\mathcal {A}}}^c_k {{\mathcal {A}}}_k} {{\mathbb {H}}}^{-1}_{{{\mathcal {A}}}_k {{\mathcal {A}}}_k} (\varvec{Z}_{{{\mathcal {A}}}_k} + \lambda _0 \alpha _k \text {sgn}(\varvec{\theta }_{0,{{\mathcal {A}}}_k}) \\&+ \gamma _0 \xi _k \frac{\varvec{\theta }_{0,{{\mathcal {A}}}_k}}{\Vert \varvec{\theta }_{0,{{\mathcal {A}}}_k}\Vert _2}) - \varvec{Z}_{{{\mathcal {A}}}^c_k})_i| \le \lambda _0 \alpha _k) < 1. \end{aligned}$$

Thus \(c < 1\), which proves (10), that is, Proposition 1. \(\square \)

Proof of Theorem 4

The proof relies on the same steps as in the proof of Theorem 2.

\(\square \)

Proof of Theorem 5

We start with the asymptotic distribution and proceed as in the proof of Theorem 3, where we used Lemma 1. To do so, we prove the finite dimensional convergence in distribution of the empirical criterion \({{\mathbb {F}}}_T(\varvec{u})\) to \({{\mathbb {F}}}_{\infty }(\varvec{u})\) with \(\varvec{u}\in {{\mathbb {R}}}^d\), where these quantities are, respectively, defined as

$$\begin{aligned} {{\mathbb {F}}}_T(\varvec{u})= & {} T {{\mathbb {G}}}_T (\psi (\varvec{\theta }_0 + \varvec{u}/\sqrt{T}) - \psi (\varvec{\theta }_0)) \nonumber \\= & {} T {{\mathbb {G}}}_T (l(\varvec{\theta }_0 + \varvec{u}/\sqrt{T}) - l(\varvec{\theta }_0)) + \lambda _T \overset{m}{\underset{k=1}{\sum }} \overset{\varvec{c}_k}{\underset{i=1}{\sum }} \alpha ^{(k)}_{T,i} \left[ |\theta ^{(k)}_{0,i} + \varvec{u}^{(k)}_i/\sqrt{T}| - |\theta ^{(k)}_{0,i}|\right] \nonumber \\&+\, \gamma _T \overset{m}{\underset{l=1}{\sum }} \xi _{T,l} \left[ \Vert \varvec{\theta }^{(l)}_0 + \varvec{u}^{(l)}/\sqrt{T}\Vert _2 - \Vert \varvec{\theta }^{(l)}_0\Vert _2\right] , \end{aligned}$$

and

$$\begin{aligned} {{\mathbb {F}}}_{\infty }(\varvec{u}) = {\left\{ \begin{array}{ll} \frac{1}{2} \varvec{u}'_{{{\mathcal {A}}}} {{\mathbb {H}}}_{{{\mathcal {A}}}{{\mathcal {A}}}} \varvec{u}_{{{\mathcal {A}}}} + \varvec{u}_{{{\mathcal {A}}}}' \varvec{Z}_{{{\mathcal {A}}}} &{} \text {if} \; \varvec{u}_i = 0, \; \text {when} \; i \notin {{\mathcal {A}}}, \, \text {and} \\ \infty &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(14)

with \(\varvec{Z}_{{{\mathcal {A}}}} \sim {{\mathcal {N}}}(0,{{\mathbb {M}}}_{{{\mathcal {A}}}{{\mathcal {A}}}})\). By Lemma 1, the finite-dimensional convergence in distribution implies \(\underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min }\{{{\mathbb {F}}}_T(\varvec{u})\} \overset{d}{\longrightarrow } \underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min }\{{{\mathbb {F}}}_{\infty }(\varvec{u})\}\). We first consider the unpenalized part of \({{\mathbb {F}}}_T(.)\), which can be expanded as

$$\begin{aligned} T {{\mathbb {G}}}_T (l(\varvec{\theta }_0 + \varvec{u}/\sqrt{T}) - l(\varvec{\theta }_0))= & {} T^{1/2}{\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) \varvec{u}+ \frac{\varvec{u}'}{2} \ddot{{{\mathbb {G}}}}_T l(\varvec{\theta }_0) \varvec{u}\\&+\, \frac{1}{6 T^{1/2}} \nabla '\{\varvec{u}' \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}}) \varvec{u}\}\varvec{u}, \end{aligned}$$

where \({\bar{\varvec{\theta }}}\) lies between \(\varvec{\theta }_0\) and \(\varvec{\theta }_0 + \varvec{u}/\sqrt{T}\). First, using the same reasoning on the third-order term, we obtain \(\frac{1}{6 T^{1/2}} \nabla '\{\varvec{u}' \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}}) \varvec{u}\}\varvec{u}\overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} 0\). By the ergodic theorem, we deduce \(\ddot{{{\mathbb {G}}}}_T l(\varvec{\theta }_0) \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} {{\mathbb {H}}}\) and by Assumption 4, \(\sqrt{T}{\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0) \overset{d}{\longrightarrow } {{\mathcal {N}}}(0,{{\mathbb {M}}})\).

We now focus on the penalty terms of (4). We recall that \(\alpha ^{(k)}_{T,i} = |{\tilde{\theta }}^{(k)}_i|^{-\eta }\), so that for \(i \in {{\mathcal {A}}}_k, k \in {{\mathcal {S}}}\), \({\tilde{\theta }}^{(k)}_i \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \theta ^{(k)}_{0,i} \ne 0\). Note that

$$\begin{aligned} \sqrt{T}\left( |\theta ^{(k)}_{0,i} + \varvec{u}^{(k)}_i/\sqrt{T}| - |\theta ^{(k)}_{0,i}|\right) \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \varvec{u}^{(k)}_i \text {sgn}(\theta ^{(k)}_{0,i}){\mathbf {1}}_{\theta ^{(k)}_{0,i} \ne 0}. \end{aligned}$$

This implies that, for \(i \in {{\mathcal {A}}}_k\), \(k \in {{\mathcal {S}}}\), we have

$$\begin{aligned} \lambda _T T^{-1/2}\overset{\varvec{c}_k}{\underset{i=1}{\sum }} \alpha ^{(k)}_{T,i} \sqrt{T}(|\theta ^{(k)}_{0,i} + \varvec{u}^{(k)}_i/\sqrt{T}| - |\theta ^{(k)}_{0,i}|) \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} 0, \end{aligned}$$

under the condition \(\lambda _T T^{-1/2} \rightarrow 0\). For \(i \in {{\mathcal {A}}}^c_k\), we have \(\theta ^{(k)}_{0,i} = 0\), so that \(T^{\eta /2} |{\tilde{\theta }}^{(k)}_i|^{\eta } = O_p(1)\). Hence under the assumption \(\lambda _T T^{(\eta -1)/2} \rightarrow \infty \), we obtain

$$\begin{aligned}&\lambda _T T^{-1/2} \alpha ^{(k)}_{T,i} \sqrt{T}\left( |\theta ^{(k)}_{0,i} + \varvec{u}^{(k)}_i/\sqrt{T} | - |\theta ^{(k)}_{0,i}|\right) \nonumber \\&\quad = \lambda _T T^{-1/2} |\varvec{u}^{(k)}_i| \frac{T^{\eta /2}}{(T^{1/2}|{\tilde{\theta }}^{(k)}_i|)^{\eta }} \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \infty . \end{aligned}$$
(15)
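To see that the two rate conditions on \(\lambda _T\) are compatible (an illustrative example added here, not part of the original proof), take for instance \(\eta = 2\) and \(\lambda _T = T^{1/4}\):

$$\begin{aligned} \lambda _T T^{-1/2} = T^{-1/4} \underset{T \rightarrow \infty }{\longrightarrow } 0, \qquad \lambda _T T^{(\eta -1)/2} = T^{3/4} \underset{T \rightarrow \infty }{\longrightarrow } \infty . \end{aligned}$$

The same choice \(\mu = 2\) and \(\gamma _T = T^{1/4}\) works for the group penalty treated below.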

As for the \(l^1/l^2\) quantity, we recall that \(\xi _{T,l} = \Vert {\tilde{\varvec{\theta }}}^{(l)}\Vert ^{-\mu }_2\), so that for \(l \in {{\mathcal {S}}}\), \({\tilde{\varvec{\theta }}}^{(l)} \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \varvec{\theta }^{(l)}_0\), and in this case

$$\begin{aligned} \sqrt{T} \left\{ \Vert \varvec{\theta }^{(l)}_0 + \varvec{u}^{(l)}/\sqrt{T}\Vert _2 - \Vert \varvec{\theta }^{(l)}_0\Vert _2\right\} = \frac{\varvec{u}^{(l) '} \varvec{\theta }^{(l)}_0}{\Vert \varvec{\theta }^{(l)}_0\Vert _2} + o\left( T^{-1/2}\right) . \end{aligned}$$

Consequently, using \(\gamma _T T^{-1/2} \rightarrow 0\), and for \(l \in {{\mathcal {S}}}\), we obtain

$$\begin{aligned} \gamma _T T^{-1/2} \sqrt{T} \xi _{T,l} \left( \Vert \varvec{\theta }^{(l)}_0 + \varvec{u}^{(l)}/\sqrt{T}\Vert _2 - \Vert \varvec{\theta }^{(l)}_0\Vert _2 \right) \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} 0. \end{aligned}$$

Combining \(k \in {{\mathcal {S}}}\) with the fact that \(\varvec{\theta }^{(k)}_0\) is partially zero, that is \(i \in {{\mathcal {A}}}^c_k\), we obtain the divergence given in (15). Furthermore, if \(l \notin {{\mathcal {S}}}\), that is \(\varvec{\theta }^{(l)}_0 = {\mathbf {0}}\), then

$$\begin{aligned} \sqrt{T} \left\{ \Vert \varvec{\theta }^{(l)}_0 + \varvec{u}^{(l)}/\sqrt{T}\Vert _2 - \Vert \varvec{\theta }^{(l)}_0\Vert _2\right\} = \Vert \varvec{u}^{(l)}\Vert _2, \end{aligned}$$

and \(T^{\mu /2} (\Vert {\tilde{\varvec{\theta }}}^{(l)}\Vert _2)^{\mu } = O_p(1)\). Then by \(\gamma _T T^{(\mu -1)/2} \rightarrow \infty \) we have

$$\begin{aligned}&\gamma _T T^{-1/2} \xi _{T,l} \sqrt{T}\left[ \Vert \varvec{\theta }^{(l)}_0 + \varvec{u}^{(l)}/\sqrt{T}\Vert _2 - \Vert \varvec{\theta }^{(l)}_0\Vert _2 \right] \\&\quad = \gamma _T T^{-1/2} \Vert \varvec{u}^{(l)}\Vert _2 \frac{T^{\mu /2}}{(T^{1/2}\Vert {\tilde{\varvec{\theta }}}^{(l)}\Vert _2)^{\mu }} \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \infty . \end{aligned}$$

We deduce the pointwise convergence \({{\mathbb {F}}}_T (\varvec{u}) \overset{d}{\longrightarrow } {{\mathbb {F}}}_{\infty }(\varvec{u})\), where \({{\mathbb {F}}}_{\infty }(.)\) is given in (14). As \({{\mathbb {F}}}_T(.)\) is convex and \({{\mathbb {F}}}_{\infty }(.)\) is convex and has a unique minimum \(({{\mathbb {H}}}^{-1}_{{{\mathcal {A}}}{{\mathcal {A}}}} \varvec{Z}_{{{\mathcal {A}}}},{\mathbf {0}}_{{{\mathcal {A}}}^c})\) since \({{\mathbb {H}}}\) is positive definite, by Lemma 1, we obtain

$$\begin{aligned} \sqrt{T}({\hat{\varvec{\theta }}} - \varvec{\theta }_0) = \underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min }\{{{\mathbb {F}}}_T(\varvec{u})\} \overset{d}{\longrightarrow } \underset{\varvec{u}\in {{\mathbb {R}}}^d}{\arg \, \min }\{{{\mathbb {F}}}_{\infty }(\varvec{u})\}, \end{aligned}$$

that is to say \(\sqrt{T}({\hat{\varvec{\theta }}}_{{{\mathcal {A}}}} - \varvec{\theta }_{0,{{\mathcal {A}}}}) \overset{d}{\longrightarrow } {{\mathbb {H}}}^{-1}_{{{\mathcal {A}}}{{\mathcal {A}}}} \varvec{Z}_{{{\mathcal {A}}}}, \; \text {and} \; \sqrt{T}({\hat{\varvec{\theta }}}_{{{\mathcal {A}}}^c} - \varvec{\theta }_{0,{{\mathcal {A}}}^c}) \overset{d}{\longrightarrow } \mathbf {0}_{{{\mathcal {A}}}^c}\).

We now prove the model selection consistency. Let \(i \in {{\mathcal {A}}}_k\), then by the asymptotic normality result, \({\hat{\theta }}^{(k)}_i \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \theta ^{(k)}_{0,i} \ne 0\), which implies \({{\mathbb {P}}}(i \in {\hat{{{\mathcal {A}}}}}_k) \rightarrow 1\). It thus remains to prove

$$\begin{aligned} \forall k = 1,\ldots ,m, \forall i \in {{\mathcal {A}}}^c_k, {{\mathbb {P}}}(i \in {\hat{{{\mathcal {A}}}}}_k) \rightarrow 0. \end{aligned}$$

This problem can be split into two parts as

$$\begin{aligned} \forall k \notin {{\mathcal {S}}}, {{\mathbb {P}}}(k \in {\hat{{{\mathcal {S}}}}}) \rightarrow 0, \; \text {and} \; \forall k \in {{\mathcal {S}}}, \forall i \in {{\mathcal {A}}}^c_k, {{\mathbb {P}}}(i \in {\hat{{{\mathcal {A}}}}}_k) \rightarrow 0. \end{aligned}$$
(16)

Let us start with the case \(k \notin {{\mathcal {S}}}\). If \(k \in {\hat{{{\mathcal {S}}}}}\), by the optimality conditions given by the Karush–Kuhn–Tucker theorem applied to \({{\mathbb {G}}}_T \psi ({\hat{\varvec{\theta }}})\), we have

$$\begin{aligned} {\dot{{{\mathbb {G}}}}}_T l({\hat{\varvec{\theta }}})_{(k)} + \frac{\lambda _T}{T} \alpha ^{(k)}_T \odot {\hat{\varvec{w}}}^{(k)}+ \frac{\gamma _T}{T} \xi _{T,k} \frac{{\hat{\varvec{\theta }}}^{(k)}}{\Vert {\hat{\varvec{\theta }}}^{(k)}\Vert _2} = 0, \end{aligned}$$

where \(\odot \) denotes the element-by-element vector product, and

$$\begin{aligned} {\hat{\varvec{w}}}^{(k)}_i {\left\{ \begin{array}{ll} = \text {sgn}({\hat{\theta }}^{(k)}_i) \; \text {if} \; {\hat{\theta }}^{(k)}_i \ne 0,\\ \in \{{\hat{\varvec{w}}}^{(k)}_i : |{\hat{\varvec{w}}}^{(k)}_i| \le 1\} \; \text {if} \; {\hat{\theta }}^{(k)}_i = 0. \end{array}\right. } \end{aligned}$$

Multiplying the unpenalized part by \(T^{1/2}\), we have the expansion

$$\begin{aligned} T^{1/2} {\dot{{{\mathbb {G}}}}}_T l({\hat{\varvec{\theta }}})_{(k)}= & {} T^{1/2} {\dot{{{\mathbb {G}}}}}_T l(\varvec{\theta }_0)_{(k)} + T^{1/2} \ddot{{{\mathbb {G}}}}_T l(\varvec{\theta }_0)_{(k) (k)} ({\hat{\varvec{\theta }}}-\varvec{\theta }_0)_{(k)} \\&+\, T^{1/2} \nabla '\{({\hat{\varvec{\theta }}}-\varvec{\theta }_0)'_{(k)} \ddot{{{\mathbb {G}}}}_T l({\bar{\varvec{\theta }}})_{(k)(k)} ({\hat{\varvec{\theta }}}-\varvec{\theta }_0)_{(k)}\} , \end{aligned}$$

which is asymptotically normal by consistency, Assumption 6 regarding the bound on the third-order term, the Slutsky theorem and the central limit theorem of Billingsley (1961). Furthermore, we have

$$\begin{aligned} \gamma _T T^{-1/2}\xi _{T,k} \frac{{\hat{\varvec{\theta }}}^{(k)}}{\Vert {\hat{\varvec{\theta }}}^{(k)}\Vert _2} = \gamma _T T^{(\mu -1)/2} (T^{1/2} \Vert {\tilde{\varvec{\theta }}}^{(k)}\Vert _2 )^{-\mu }\frac{{\hat{\varvec{\theta }}}^{(k)}}{\Vert {\hat{\varvec{\theta }}}^{(k)}\Vert _2} \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \infty . \end{aligned}$$

Then using \(T^{(\mu -\eta )/2} \gamma _T \lambda ^{-1}_T \rightarrow \infty \), we have

$$\begin{aligned} \forall k \notin {{\mathcal {S}}}, {{\mathbb {P}}}(k \in {\hat{{{\mathcal {S}}}}}) \le {{\mathbb {P}}}\left( -{\dot{{{\mathbb {G}}}}}_T l({\hat{\varvec{\theta }}})_{(k)} = \frac{\lambda _T}{T} \alpha ^{(k)}_T \odot {\hat{\varvec{w}}}^{(k)} + \frac{\gamma _T}{T} \xi _{T,k} \frac{{\hat{\varvec{\theta }}}^{(k)}}{\Vert {\hat{\varvec{\theta }}}^{(k)}\Vert _2}\right) \rightarrow 0. \end{aligned}$$

We now pick \(k \in {{\mathcal {S}}}\) and \(i \in {{\mathcal {A}}}^c_k\), and consider the event \(\{i \in {\hat{{{\mathcal {A}}}}}_k\}\). Then the Karush–Kuhn–Tucker conditions for \({{\mathbb {G}}}_T \psi ({\hat{\varvec{\theta }}})\) are given by

$$\begin{aligned} ({\dot{{{\mathbb {G}}}}}_T l({\hat{\theta }}))_{(k),i} + \frac{\lambda _T}{T} \alpha ^{(k)}_{T,i} \text {sgn}({\hat{\theta }}^{(k)}_{T,i}) + \frac{\gamma _T}{T} \xi _{T,k} \frac{{\hat{\theta }}^{(k)}_i}{\Vert {\hat{\varvec{\theta }}}^{(k)}\Vert _2} = 0. \end{aligned}$$

By the same reasoning as before, \(T^{1/2}({\dot{{{\mathbb {G}}}}}_T l({\hat{\varvec{\theta }}}))_{(k),i}\) is also asymptotically normal, and \({\tilde{\varvec{\theta }}}^{(k)} \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \varvec{\theta }^{(k)}_0\) for \(k \in {{\mathcal {S}}}\); moreover,

$$\begin{aligned} \lambda _T T^{-1/2}\alpha ^{(k)}_{T,i} \text {sgn}({\hat{\theta }}^{(k)}_i) = \lambda _T \frac{T^{(\eta -1)/2}}{(T^{1/2}|{\tilde{\theta }}^{(k)}_i|)^{\eta }} \overset{{{\mathbb {P}}}}{\underset{T \rightarrow \infty }{\longrightarrow }} \infty , \end{aligned}$$

so that the same divergence holds after adding \(\gamma _T T^{-1/2}\xi _{T,k} \frac{{\hat{\theta }}^{(k)}_i}{\Vert {\hat{\varvec{\theta }}}^{(k)}\Vert _2}\). Therefore, we have for any \(k \in {{\mathcal {S}}}\) and \(i \notin {{\mathcal {A}}}_k\)

$$\begin{aligned} {{\mathbb {P}}}(i \in {\hat{{{\mathcal {A}}}}}_k) \le {{\mathbb {P}}}\left( -({\dot{{{\mathbb {G}}}}}_T l({\hat{\varvec{\theta }}}))_{(k),i} = \frac{\lambda _T}{T} \alpha ^{(k)}_{T,i} \text {sgn}({\hat{\theta }}^{(k)}_i) + \frac{\gamma _T}{T} \xi _{T,k} \frac{{\hat{\theta }}^{(k)}_i}{\Vert {\hat{\varvec{\theta }}}^{(k)}\Vert _2}\right) \rightarrow 0. \end{aligned}$$

We have proved (16). \(\square \)

About this article


Cite this article

Poignard, B. Asymptotic theory of the adaptive Sparse Group Lasso. Ann Inst Stat Math 72, 297–328 (2020). https://doi.org/10.1007/s10463-018-0692-7

