
Simultaneous subgroup identification and variable selection for high dimensional data

  • Original Paper
  • Published in Computational Statistics

Abstract

The high dimensionality of genetic data poses many challenges for subgroup identification, both computationally and theoretically. This paper proposes a double-penalized regression model for subgroup analysis and variable selection for heterogeneous high-dimensional data. The proposed approach can automatically identify the underlying subgroups, recover the sparsity, and simultaneously estimate all regression coefficients without prior knowledge of grouping structure or sparsity construction within variables. We optimize the objective function using the alternating direction method of multipliers with a proximal gradient algorithm and demonstrate the convergence of the proposed procedure. We show that the proposed estimator enjoys the oracle property. Simulation studies demonstrate the effectiveness of the novel method with finite samples, and a real data example is provided for illustration.
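The double-penalized objective described above combines a squared-error loss with a pairwise fusion penalty (which drives subgroup identification) and a weighted \(\ell _1\) penalty (which drives sparsity). The sketch below is a minimal illustration of this type of objective, not the authors' implementation; the adaptive weights `w` and `zeta` and the subject-specific model `y_i = x_i^T beta_i + eps_i` are illustrative assumptions.

```python
import numpy as np

def double_penalized_objective(beta, X, y, lam1, lam2, w, zeta):
    """Illustrative double-penalized objective:
    squared loss + lam1 * pairwise fusion + lam2 * weighted l1.
    beta is n x p (one coefficient row per subject);
    w (n x n) and zeta (n x p) are placeholder adaptive weights."""
    n, p = beta.shape
    # subject-specific fits: x_i^T beta_i for each i
    loss = np.sum((y - np.einsum("ij,ij->i", X, beta)) ** 2)
    # pairwise fusion penalty: shrinks rows of beta toward each other
    fusion = sum(
        w[i, j] * np.linalg.norm(beta[i] - beta[j])
        for i in range(n) for j in range(i + 1, n)
    )
    # weighted l1 penalty: shrinks individual coefficients to zero
    sparsity = np.sum(zeta * np.abs(beta))
    return loss + lam1 * fusion + lam2 * sparsity
```

In the paper this objective is minimized by ADMM with proximal gradient steps; the function above only evaluates it.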



Acknowledgements

The authors sincerely appreciate the insightful feedback provided by the Editor, Associate Editor, and two anonymous reviewers, which has greatly improved the quality of this manuscript. This research is supported by the NSF of China under Grant Numbers 12171450 and 71921001.

Author information


Corresponding author

Correspondence to Weiping Zhang.


Appendix: Proofs of theorems

Proof of Theorem 1

Since the objective function is convex, any local minimizer is also the global minimizer, so it suffices to study the minimizer in a neighborhood of the true value. We first prove the asymptotic normality part. To this end, we constrain \(Q(\varvec{\psi })\) to the S-dimensional subspace \(\left\{ \mathbf {\psi } \in {\mathbb {R}}^{Kp}:\mathbf {\psi }_{{\mathcal {K}}2} = {\textbf{0}}\right\} \) of \({\mathbb {R}}^{Kp}\). The constrained penalized least-squares objective function is given by

$$\begin{aligned} {\bar{Q}}(\varvec{\varphi }) = \left\| {\textbf{Y}} - \check{{\textbf{X}}}_1\varvec{\varphi } \right\| ^2 + \lambda _2 \sum _{s=1}^{S} {\tilde{\zeta }}_s |\varvec{\varphi }_s|. \end{aligned}$$
(14)

where \(\varvec{\varphi } = \left( \varphi _1, \ldots , \varphi _S \right) ^T\) and S is the dimension of \(\varvec{\psi }_{0{\mathcal {K}}1}\). We then show that there exists a local minimizer \(\tilde{\varvec{\varphi }}\) of \({\bar{Q}}(\varvec{\varphi })\) such that \(\left\| \tilde{\varvec{\varphi }} - \varvec{\psi }_{0{\mathcal {K}}1} \right\| = O_p(\sqrt{S/\left| {\mathcal {G}}_{\min }\right| })\).
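The constrained objective (14) is a weighted lasso, and the proximal gradient algorithm mentioned in the abstract reduces here to iterative soft-thresholding. The sketch below is a minimal illustration under assumed choices (fixed step size \(1/L\); the array `zeta` stands in for the adaptive weights \({\tilde{\zeta }}_s\)), not the authors' solver.

```python
import numpy as np

def weighted_lasso_ista(X, y, lam2, zeta, n_iter=1000):
    """Proximal-gradient (ISTA) minimization of
    ||y - X phi||^2 + lam2 * sum_s zeta[s] * |phi[s]|,
    mirroring the form of the constrained objective (14)."""
    # Lipschitz constant of the gradient of the squared loss
    L = 2.0 * np.linalg.eigvalsh(X.T @ X).max()
    phi = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ phi - y)
        z = phi - grad / L                    # gradient step
        thr = lam2 * zeta / L                 # per-coordinate threshold
        # soft-thresholding: the proximal operator of the weighted l1 penalty
        phi = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)
    return phi
```

With `lam2 = 0` the iteration reduces to plain gradient descent and recovers the least-squares solution; large weights drive the corresponding coordinates exactly to zero, which is the mechanism behind the sparsity recovery discussed here.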

Write \(\tilde{\varvec{\varphi }} = \varvec{\psi }_{0{\mathcal {K}}1} + \varvec{\mu }/\sqrt{\left| {\mathcal {G}}_{\min }\right| }\) with \(\varvec{\mu } = \left( \mu _1,\ldots ,\mu _{S}\right) ^T \in {\mathbb {R}}^{S}\); then \({\bar{Q}}\) can be expressed as a function of \(\varvec{\mu }\):

$$\begin{aligned} {\bar{Q}}(\varvec{\mu }) = \left\| {\textbf{Y}} - \check{{\textbf{X}}}_1 (\varvec{\psi }_{0{\mathcal {K}}1} + \frac{\varvec{\mu }}{\sqrt{\left| {\mathcal {G}}_{\min }\right| }}) \right\| ^2 + \lambda _2 \sum _{s=1}^{S} {\tilde{\zeta }}_s \left| \varvec{\psi }_{0{\mathcal {K}}1s} + \frac{\mu _s}{\sqrt{\left| {\mathcal {G}}_{\min }\right| }}\right| . \end{aligned}$$
(15)

Consider \(V_n(\varvec{\mu }) = {\bar{Q}}(\varvec{\mu }) - {\bar{Q}}({\textbf{0}})\). A routine calculation yields the decomposition:

$$\begin{aligned} V_n(\varvec{\mu })&= \frac{\varvec{\mu }^T \check{{\textbf{X}}}_1^T \check{{\textbf{X}}}_1 \varvec{\mu }}{\left| {\mathcal {G}}_{\min }\right| } - \frac{2 \varvec{\epsilon }^T \check{{\textbf{X}}}_1 \varvec{\mu }}{\sqrt{\left| {\mathcal {G}}_{\min }\right| }} + \lambda _2 \sum _{s=1}^{S} \left\{ {\tilde{\zeta }}_s \left( \left| \varvec{\psi }_{0{\mathcal {K}}1s} + \frac{\mu _s}{\sqrt{\left| {\mathcal {G}}_{\min }\right| }}\right| - \left| \varvec{\psi }_{0{\mathcal {K}}1s} \right| \right) \right\} . \end{aligned}$$
(16)

If \(\varvec{\psi }_{s} = 0\), then by condition (C3) (ii):

$$\begin{aligned} \lambda _2 \tilde{\zeta _s} \left( \left| \varvec{\psi }_{0{\mathcal {K}}1s} + \frac{\mu _s}{\sqrt{\left| {\mathcal {G}}_{\min }\right| }}\right| - \left| \varvec{\psi }_{0{\mathcal {K}}1s} \right| \right) = \sum _{i \in {\mathcal {G}}_k}\lambda _2 \frac{\zeta _{ij} \left| \mu _s\right| }{\sqrt{\left| {\mathcal {G}}_{\min }\right| }} \rightarrow \infty . \end{aligned}$$

If \(\varvec{\psi }_{s} \ne 0\), we have

$$\begin{aligned} \lambda _2 {\tilde{\zeta }}_s \left( \left| \varvec{\psi }_{0{\mathcal {K}}1s} + \frac{\mu _s}{\sqrt{\left| {\mathcal {G}}_{\min }\right| }}\right| - \left| \varvec{\psi }_{0{\mathcal {K}}1s} \right| \right) \le \frac{\lambda _2 {\tilde{\zeta }}_s |\mu _s|}{\sqrt{\left| {\mathcal {G}}_{\min }\right| }}. \end{aligned}$$

By condition (C3) (i):

$$\begin{aligned} \lambda _2 {\tilde{\zeta }}_s = \sum _{i \in {\mathcal {G}}_k}\lambda _2 \zeta _{ij} = o(\left| {\mathcal {G}}_{\min } \right| ^{-\frac{1}{2}} \left| {\mathcal {G}}_{\max } \right| ) = o(\left| {\mathcal {G}}_{\min } \right| ^{\frac{1}{2}}). \end{aligned}$$

Therefore

$$\begin{aligned} \frac{\lambda _2 {\tilde{\zeta }}_s |\mu _s|}{\sqrt{\left| {\mathcal {G}}_{\min }\right| }} = \frac{o(\left| {\mathcal {G}}_{\min } \right| ^{\frac{1}{2}})|\mu _s|}{\sqrt{\left| {\mathcal {G}}_{\min }\right| }} \rightarrow 0. \end{aligned}$$

So we have:

$$\begin{aligned} V_n(\varvec{\mu }) \rightarrow \varvec{\mu }^T \varvec{\Omega } \varvec{\mu } - \frac{2}{\sqrt{\left| {\mathcal {G}}_{\min }\right| }} \varvec{\epsilon }^T \check{{\textbf{X}}}_1 \varvec{\mu }, \end{aligned}$$

where condition (C4) gives \(\check{{\textbf{X}}}_1^T \check{{\textbf{X}}}_1 /\left| {\mathcal {G}}_{\min }\right| \rightarrow \varvec{\Omega }\) as \(n \rightarrow \infty \). By the Lindeberg central limit theorem, \(\check{{\textbf{X}}}_1^T \varvec{\epsilon } /\sqrt{\left| {\mathcal {G}}_{\min }\right| }\) converges in distribution to \({\mathcal {N}}({\textbf{0}},\sigma ^2 \varvec{\Omega })\).

We thus obtain

$$\begin{aligned} \sqrt{\left| {\mathcal {G}}_{\min }\right| }\left( \tilde{\varvec{\varphi }} - \varvec{\psi }_{0{\mathcal {K}}1} \right) \rightarrow {\text {arg}} \min _{\varvec{\mu }} \left\{ \varvec{\mu }^T \varvec{\Omega } \varvec{\mu } - \frac{2}{\sqrt{\left| {\mathcal {G}}_{\min }\right| }} \varvec{\epsilon }^T \check{{\textbf{X}}}_1 \varvec{\mu } \right\} = \varvec{\Omega }^{-1}W, \end{aligned}$$

where \(W \sim {\mathcal {N}}({\textbf{0}},\sigma ^2 \varvec{\Omega })\). This proves the asymptotic normality part (ii), and consequently \(\left\| \tilde{\varvec{\psi }}_{{\mathcal {K}}1} - \varvec{\psi }_{0 {\mathcal {K}}1} \right\| = O_p(\sqrt{S/\left| {\mathcal {G}}_{\min }\right| })\).
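As an illustrative numerical check of this normality claim (purely a sketch: standard Gaussian errors and a single design column, normalized so that \(\varvec{\Omega }\) reduces to 1, are assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 2000, 5000
x = rng.normal(size=n)            # one fixed design column
x /= np.sqrt(np.mean(x ** 2))     # normalize so x^T x / n = 1 (Omega = 1)
# empirical distribution of x^T eps / sqrt(n) over Monte Carlo replications
stats = np.array([x @ rng.normal(size=n) / np.sqrt(n) for _ in range(reps)])
# mean and variance should be near 0 and sigma^2 * Omega = 1
print(stats.mean(), stats.var())
```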

Next, we prove the consistency part (i). Suppose that, for sufficiently large n, there exists some \(j \in J_0^c\) such that \(\varvec{\psi }_{j} \ne 0\) with positive probability. Then from the sub-gradient equation (12), we have

$$\begin{aligned} \check{{\textbf{X}}}_{\cdot s}^T({\textbf{Y}} - \check{{\textbf{X}}}\tilde{\varvec{\psi }}) = \lambda _2 {\tilde{\zeta }}_s ST(\tilde{\psi _s}). \end{aligned}$$

On the one hand, by condition (C3) (ii), the right-hand side diverges with positive probability, namely,

$$\begin{aligned} \left| \lambda _2 {\tilde{\zeta }}_s \frac{\tilde{\psi _s}}{|\tilde{\psi _s}|}\right| = \left| \sum _{i \in {\mathcal {G}}_k} \lambda _2 \zeta _{ij} \right| \rightarrow \infty . \end{aligned}$$

On the other hand, the left-hand side is bounded in probability, since

$$\begin{aligned} \check{{\textbf{X}}}_{\cdot s}^T({\textbf{Y}} - \check{{\textbf{X}}}\tilde{\varvec{\psi }}) =\frac{1}{\sqrt{\left| {\mathcal {G}}_{\min }\right| }} \check{{\textbf{X}}}_{\cdot s}^T \check{{\textbf{X}}}(\sqrt{\left| {\mathcal {G}}_{\min }\right| }(\tilde{\varvec{\psi }} - \varvec{\psi }_0)) = O_p(1). \end{aligned}$$

This is a contradiction: the two sides have different orders as n tends to infinity. Hence

$$\begin{aligned} P(\tilde{\varvec{\psi }}_{J_0^c} = {\textbf{0}}) \rightarrow 1 \quad \text {as} \quad n \rightarrow \infty . \end{aligned}$$

\(\square \)

Proof of Theorem 2

First, for a matrix \({\textbf{A}}\), let \(\left\| {\textbf{A}} \right\| _{\infty }\) denote the maximum absolute value of the elements of \({\textbf{A}}\). Let \(T: {\mathcal {M}}_{{\mathcal {G}}} \rightarrow {\mathbb {R}}^{K \times p}\) be the mapping such that the k-th row of \(T(\varvec{\beta })\) equals the common value of \(\varvec{\beta }_{i \cdot }\) for \(i \in {\mathcal {G}}_k, k \in \{1,\ldots ,K\}\). Let \(T^{*}: {\mathbb {R}}^{n \times p} \rightarrow {\mathbb {R}}^{K \times p}\) be the mapping such that \(T^{*}(\varvec{\beta }) = \left\{ |{\mathcal {G}}_k|^{-1} \sum _{i \in {\mathcal {G}}_k} \varvec{\beta }_{i \cdot },k \in \{1,\ldots ,K\} \right\} \). Then, arranging the nonzero parameters ahead of the zero ones, the vector form of \(\varvec{\beta }\) can be written as \(\left( \varvec{\beta }_{{\mathcal {K}}1}^T,\varvec{\beta }_{{\mathcal {K}}2}^T\right) ^T\), with true value \(\left( \varvec{\beta }_{0{\mathcal {K}}1}^T,{\textbf{0}}\right) ^T\).

Consider the neighborhood of \(\varvec{\beta }_0\):

$$\begin{aligned} \Theta =\left\{ \varvec{\beta } \in {\mathbb {R}}^{n \times p}: \varvec{\beta }_{{\mathcal {K}} 2}=0,\left\| \varvec{\beta }_{{\mathcal {K}} 1}-\varvec{\beta }_{0 {\mathcal {K}} 1}\right\| _{\infty } \le \tau /\sqrt{\left| {\mathcal {G}}_{\min }\right| }\right\} , \end{aligned}$$

where \(\tau > 0\) is a constant. For any \(\varvec{\beta } \in {\mathbb {R}}^{n \times p}\), let \(\varvec{\beta }^{*} = T^{-1} (T^{*}(\varvec{\beta }))\). By the result of Theorem 1, there is an event \(E_1\) on which

$$\begin{aligned} \tilde{\varvec{\beta }}_{{\mathcal {K}}2} = {\textbf{0}}, \quad \left\| \tilde{\varvec{\beta }}_{{\mathcal {K}}1} - \varvec{\beta }_{0{\mathcal {K}}1} \right\| _{\infty } \le \tau /\sqrt{\left| {\mathcal {G}}_{\min }\right| }. \end{aligned}$$

Hence \(\tilde{\varvec{\beta }} \in \Theta \) on \(E_1\), where \(\tilde{\varvec{\beta }}\) is the parameter estimate corresponding to \(\tilde{\varvec{\psi }}\). Below we construct an event \(E_2\) with \(P(E_2^c) \le 2/n\), so that \(P(E_1 \cap E_2) \ge 1 - 2/n\), and a neighborhood \(\Theta _n\) of \(\tilde{\varvec{\beta }}\) on \(E_1 \cap E_2\). We show that \(\tilde{\varvec{\beta }}\) is a local minimizer of the objective function with probability approaching 1 by verifying: (1) \(\ell _p(\varvec{\beta }^{*}) > \ell _p(\tilde{\varvec{\beta }})\) for any \(\varvec{\beta }^{*} \in \Theta \) with \(\varvec{\beta }^{*} \ne \tilde{\varvec{\beta }}\); (2) \(\ell _p(\varvec{\beta }) > \ell _p(\varvec{\beta }^{*})\) for any \(\varvec{\beta } \in \Theta _n \cap \Theta \), for sufficiently large n and \(\left| {\mathcal {G}}_{\min }\right| \).

Step 1 We need only consider the term \(\sum _{i<j} w_{ij} \left\| \tilde{\varvec{\beta }}_{i \cdot } - \tilde{\varvec{\beta }}_{j \cdot } \right\| _2\). When \(i,j \in {\mathcal {G}}_k\) for some \(k \in \{1,\ldots ,K\}\), we have \(w_{ij} \left\| \tilde{\varvec{\beta }}_{i \cdot } - \tilde{\varvec{\beta }}_{j \cdot } \right\| _2 = 0\); when \(i \in {\mathcal {G}}_k, j \in {\mathcal {G}}_{k^{\prime }}\) with \(k \ne k^{\prime }\), we have

$$\begin{aligned} \left\| \tilde{\varvec{\beta }}_{i \cdot } - \tilde{\varvec{\beta }}_{j \cdot } \right\|&\le \left\| \tilde{\varvec{\beta }}_{i \cdot } - \varvec{\beta }_{0i \cdot } + \varvec{\beta }_{0i \cdot } - \varvec{\beta }_{0j \cdot } + \varvec{\beta }_{0j \cdot } - \tilde{\varvec{\beta }}_{j \cdot } \right\| \\&\le \left\| \tilde{\varvec{\beta }}_{i \cdot } - \varvec{\beta }_{0i \cdot } \right\| + \left\| \varvec{\beta }_{0i \cdot } - \varvec{\beta }_{0j \cdot } \right\| + \left\| \varvec{\beta }_{0j \cdot } - \tilde{\varvec{\beta }}_{j \cdot } \right\| \\&\le b_n + 2\tau /\sqrt{\left| {\mathcal {G}}_{\min } \right| }. \end{aligned}$$

By condition (C5) (ii):

$$\begin{aligned}&\sum _{k< k^{\prime }}\sum _{i \in {\mathcal {G}}_k, j \in {\mathcal {G}}_{k^{\prime }}} \lambda _1 w_{ij} \left\| \tilde{\varvec{\beta }}_{i \cdot } - \tilde{\varvec{\beta }}_{j \cdot } \right\| \\ \le&\sum _{k < k^{\prime }} \sum _{i \in {\mathcal {G}}_k, j \in {\mathcal {G}}_{k^{\prime }}} \lambda _1 \max _{i,j} \left\{ w_{ij}|i \in {\mathcal {G}}_k, j \in {\mathcal {G}}_{k^{\prime }}\right\} \left( b_n + 2\tau /\sqrt{\left| {\mathcal {G}}_{\min } \right| } \right) \rightarrow 0. \end{aligned}$$

By the result of Theorem 1, (1) is proved.
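The fusion terms \(w_{ij}\left\| \varvec{\beta }_{i \cdot } - \varvec{\beta }_{j \cdot } \right\| _2\) treated in Step 1 are the same terms an ADMM solver handles through their proximal operator, block (group) soft-thresholding. A minimal sketch of that operator (an illustration of the general technique, not the authors' implementation):

```python
import numpy as np

def group_soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_2: shrinks the whole vector
    toward zero, and sets it exactly to zero when ||v|| <= tau --
    this exact zeroing is what fuses pairs of rows into subgroups."""
    norm = np.linalg.norm(v)
    if norm <= tau:
        return np.zeros_like(v)
    return (1.0 - tau / norm) * v
```

Applied to a pairwise difference \(v = \varvec{\beta }_{i \cdot } - \varvec{\beta }_{j \cdot }\), a sufficiently large threshold returns the zero vector, declaring subjects i and j members of the same subgroup.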

Step 2 For a positive sequence \(t_n\), let \(\Theta _n = \left\{ \varvec{\beta }: \left\| \varvec{\beta } - \tilde{\varvec{\beta }}\right\| _{\infty } \le t_n \right\} \). For \(\varvec{\beta } \in \Theta _n \cap \Theta \), by Taylor’s expansion we have

$$\begin{aligned} \ell _p(\varvec{\beta }) - \ell _p(\varvec{\beta }^{*}) = \Gamma _1 + \Gamma _2 + \Gamma _3, \end{aligned}$$

where

$$\begin{aligned} \Gamma _1&= -2\sum _{i=1}^n \left( \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i\cdot }^{*}\right) ^T {\textbf{x}}_{i\cdot } (y_i - {\textbf{x}}_{i\cdot }^T \varvec{\beta }_{i \cdot }^m), \\ \Gamma _2&= \lambda _1 \sum _{i=1}^n \left( \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i\cdot }^{*}\right) ^T \left( \partial \sum _{i < i^{\prime }} w_{i i^{\prime }} \left\| \varvec{\beta }_{i \cdot } - \varvec{\beta }_{i^{\prime } \cdot } \right\| _2 / \partial \varvec{\beta }_{i \cdot }\right) |_{\varvec{\beta } = \varvec{\beta }^m},\\ \Gamma _3&= \lambda _2 \sum _{i=1}^n \sum _{j=1}^J \zeta _{ij} {\text {ST}}(\beta _{ij}^m) (\beta _{ij} - \beta _{ij}^{*}) I(\beta _{0ij} \ne 0), \end{aligned}$$

in which \(\varvec{\beta }^m = \eta \varvec{\beta } + (1-\eta )\varvec{\beta }^{*}\) for some \(\eta \in (0,1)\). Moreover,

$$\begin{aligned} \Gamma _{3}&= \lambda _2 \sum _{i=1}^n \sum _{j=1}^J \zeta _{ij} {\text {ST}}(\beta _{ij}^m) (\beta _{ij} - \beta _{ij}^{*}) I(\beta _{0ij} \ne 0) \\&= \lambda _2 \sum _{k=1}^K \sum _{\{i,i^{\prime } \in {\mathcal {G}}_k\}} \sum _{j=1}^p \zeta _{ij} {\text {ST}}(\beta _{ij}^m) \frac{(\beta _{ij} - \beta _{i^{\prime }j})}{\left| {\mathcal {G}}_k \right| } I(\beta _{0ij} \ne 0 \, \text {or} \, \beta _{0i^{\prime }j} \ne 0) \\&\ge - \lambda _2 \sum _{k=1}^K \sum _{\{i,i^{\prime } \in {\mathcal {G}}_k\}} \frac{\left\| \varvec{\beta }_{i \cdot } - \varvec{\beta }_{i^{\prime } \cdot }\right\| }{|{\mathcal {G}}_k|} \left\| \left( \varvec{\zeta }_{i1 \cdot }I(\beta _{0i1} \ne 0), \ldots , \varvec{\zeta }_{ip \cdot }I(\beta _{0ip} \ne 0) \right) \right\| \\&= -2 \sum _{k=1}^K \sum _{\{i,i^{\prime } \in {\mathcal {G}}_k,i<i^{\prime }\}} \frac{\left\| \varvec{\beta }_{i \cdot } - \varvec{\beta }_{i^{\prime } \cdot }\right\| }{|{\mathcal {G}}_k|} \lambda _2 \left\| \left( \varvec{\zeta }_{i1 \cdot }I(\beta _{0i1} \ne 0), \ldots , \varvec{\zeta }_{ip \cdot }I(\beta _{0ip} \ne 0) \right) \right\| , \end{aligned}$$

where \(\varvec{\zeta }_{i \cdot } = \left( \zeta _{i1}, \ldots , \zeta _{ip} \right) ^T\), and we have:

$$\begin{aligned} \lambda _2 \left\| \left( \varvec{\zeta }_{i1 \cdot }I(\beta _{0i1} \ne 0), \ldots , \varvec{\zeta }_{ip \cdot }I(\beta _{0ip} \ne 0) \right) \right\| = \sqrt{\sum _{j=1}^p(\lambda _2 \zeta _{ij} I(\beta _{0ij} \ne 0))^2} = o\left( \sqrt{\frac{q_{\max }}{\left| {\mathcal {G}}_{\min } \right| }}\right) . \end{aligned}$$

Furthermore,

$$\begin{aligned} \Gamma _2&= \lambda _1 \sum _{i=1}^n \left\{ \sum _{i< i^{\prime }} w_{i i^{\prime }} \left( \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i\cdot }^{*}\right) ^T \frac{\varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m}{\left\| \varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m \right\| _2} + \sum _{i > i^{\prime }} w_{i i^{\prime }} \left( \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i\cdot }^{*}\right) ^T \frac{\varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m}{\left\| \varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m \right\| _2} \right\} \\&= \lambda _1 \sum _{i=1}^n \left\{ \sum _{i< i^{\prime }} w_{i i^{\prime }} \left( \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i\cdot }^{*}\right) ^T \frac{\varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m}{\left\| \varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m \right\| _2} + \sum _{i< i^{\prime }} w_{i i^{\prime }} \left( \varvec{\beta }_{i^{\prime }\cdot } - \varvec{\beta }_{i^{\prime }\cdot }^{*}\right) ^T \frac{\varvec{\beta }_{i^{\prime } \cdot }^m - \varvec{\beta }_{i \cdot }^m}{\left\| \varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m \right\| _2} \right\} \\&= \lambda _1 \sum _{k=1}^K \sum _{\{i,i^{\prime } \in {\mathcal {G}}_k,i<i^{\prime }\}} \left\{ \left( \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i^{\prime }\cdot }\right) ^T \frac{w_{i i^{\prime }}(\varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m)}{\left\| \varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m \right\| _2} \right\} \\&\qquad + \lambda _1 \sum _{k < k^{\prime }} \sum _{i \in {\mathcal {G}}_k,i^{\prime } \in {\mathcal {G}}_{k^{\prime }}}w_{i i^{\prime }} \left\{ \left( \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i\cdot }^{*}\right) ^T \frac{\varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m}{\left\| \varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m \right\| _2} - \left( \varvec{\beta }_{i^{\prime }\cdot } - \varvec{\beta }_{i^{\prime }\cdot }^{*}\right) ^T \frac{\varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m}{\left\| \varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m \right\| _2} \right\} , \end{aligned}$$

and, since \(\varvec{\beta }_{i \cdot }^{*} = \varvec{\beta }_{i^{\prime } \cdot }^{*}\) for \(i, i^{\prime }\) in the same group, we have \(\varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m = \eta \varvec{\beta }_{i \cdot } + (1-\eta )\varvec{\beta }_{i \cdot }^{*} - \eta \varvec{\beta }_{i^{\prime } \cdot } - (1-\eta )\varvec{\beta }_{i^{\prime } \cdot }^{*} = \eta (\varvec{\beta }_{i \cdot } - \varvec{\beta }_{i^{\prime } \cdot })\).

For \(k \ne k^{\prime },i \in {\mathcal {G}}_k,i^{\prime } \in {\mathcal {G}}_{k^{\prime }}\), according to condition (C5)

$$\begin{aligned}&\lambda _1 \sum _{k< k^{\prime }} \sum _{i \in {\mathcal {G}}_k,i^{\prime } \in {\mathcal {G}}_{k^{\prime }}} w_{i i^{\prime }} \left( \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i\cdot }^{*}\right) ^T \frac{\varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m}{\left\| \varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m \right\| _2}\\ \le&\sum _{k < k^{\prime }} \sum _{i \in {\mathcal {G}}_k,i^{\prime } \in {\mathcal {G}}_{k^{\prime }}} b_n w_{i i^{\prime }} \left( \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i\cdot }^{*}\right) ^T \frac{\varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m}{\left\| \varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m \right\| _2} \rightarrow 0. \end{aligned}$$

As a result,

$$\begin{aligned} \Gamma _2&= \lambda _1 \sum _{k=1}^K \sum _{\{i,i^{\prime } \in {\mathcal {G}}_k,i<i^{\prime }\}} \left\{ \left( \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i^{\prime }\cdot }\right) ^T \frac{w_{i i^{\prime }}(\varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m)}{\left\| \varvec{\beta }_{i \cdot }^m - \varvec{\beta }_{i^{\prime } \cdot }^m \right\| _2} \right\} \\&= \sum _{k=1}^K \sum _{\{i,i^{\prime } \in {\mathcal {G}}_k,i<i^{\prime }\}} \left\{ \lambda _1 w_{i i^{\prime }}\left\| \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i^{\prime }\cdot }\right\| \right\} . \end{aligned}$$

We have

$$\begin{aligned} \sup _{k} \left\| {\textbf{B}}_{k\cdot } - {\textbf{B}}_{0k\cdot } \right\| _{\infty }&= \sup _{k} \left\| \left| {\mathcal {G}}_k\right| ^{-1} \sum _{i \in {\mathcal {G}}_k} \varvec{\beta }_{i} - {\textbf{B}}_{0k\cdot } \right\| _{\infty } = \sup _{k} \left\| \left| {\mathcal {G}}_k\right| ^{-1} \sum _{i \in {\mathcal {G}}_k} (\varvec{\beta }_{i} - \varvec{\beta }_{0i}) \right\| _{\infty } \\&\le \sup _{k} \left| {\mathcal {G}}_k\right| ^{-1} \sum _{i \in {\mathcal {G}}_k}\left\| (\varvec{\beta }_{i} - \varvec{\beta }_{0i}) \right\| _{\infty } \le \sup _{i} \left\| \varvec{\beta }_{i} - \varvec{\beta }_{0i} \right\| _{\infty } \le \tau /\sqrt{\left| {\mathcal {G}}_{\min }\right| }. \end{aligned}$$

Moreover,

$$\begin{aligned} \sup _{i} \left\| \varvec{\beta }_{i}^{*} - \varvec{\beta }_{0i} \right\| _{\infty } = \sup _{k} \left\| {\textbf{B}}_{k\cdot } - {\textbf{B}}_{0k\cdot } \right\| _{\infty } \le \tau /\sqrt{\left| {\mathcal {G}}_{\min }\right| }. \end{aligned}$$

Since \(\varvec{\beta }^m\) is between \(\varvec{\beta }\) and \(\varvec{\beta }^{*}\),

$$\begin{aligned} \sup _{i} \left\| \varvec{\beta }_{i}^m - \varvec{\beta }_{0i} \right\| _{\infty }&\le \eta \left\| \varvec{\beta }_{i} - \varvec{\beta }_{0i} \right\| _{\infty } + (1-\eta ) \left\| \varvec{\beta }_{i}^{*} - \varvec{\beta }_{0i} \right\| _{\infty } \\&\le \eta \tau /\sqrt{\left| {\mathcal {G}}_{\min }\right| } + (1 - \eta ) \tau /\sqrt{\left| {\mathcal {G}}_{\min }\right| } = \tau /\sqrt{\left| {\mathcal {G}}_{\min }\right| }. \end{aligned}$$

Then

$$\begin{aligned} \Gamma _1&= -2\sum _{i=1}^n \left( \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i\cdot }^{*}\right) ^T {\textbf{x}}_{i\cdot }( y_i - {\textbf{x}}_{i\cdot }^T \varvec{\beta }_{i \cdot }^m) \\&= -2 \sum _{k=1}^K \sum _{\{i,i^{\prime } \in {\mathcal {G}}_k\}} \frac{\left( \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i^{\prime }\cdot }\right) ^T}{|{\mathcal {G}}_k|} {\textbf{x}}_{i\cdot }( y_i - {\textbf{x}}_{i\cdot }^T \varvec{\beta }_{i \cdot }^m) \\&= - \sum _{k=1}^K \sum _{\{i,i^{\prime } \in {\mathcal {G}}_k\}} \frac{\left( \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i^{\prime }\cdot }\right) ^T}{|{\mathcal {G}}_k|} {\textbf{x}}_{i\cdot }( y_i - {\textbf{x}}_{i\cdot }^T \varvec{\beta }_{i \cdot }^m) - \sum _{k=1}^K \sum _{\{i,i^{\prime } \in {\mathcal {G}}_k\}} \frac{\left( \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i^{\prime }\cdot }\right) ^T}{|{\mathcal {G}}_k|} {\textbf{x}}_{i\cdot }( y_i - {\textbf{x}}_{i\cdot }^T \varvec{\beta }_{i \cdot }^m) \\&= - \sum _{k=1}^K \sum _{\{i,i^{\prime } \in {\mathcal {G}}_k\}} \frac{\left( \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i^{\prime }\cdot }\right) ^T}{|{\mathcal {G}}_k|} \left[ {\textbf{x}}_{i\cdot }( y_i - {\textbf{x}}_{i\cdot }^T \varvec{\beta }_{i \cdot }^m) - {\textbf{x}}_{i^{\prime }\cdot }( y_{i^{\prime }} - {\textbf{x}}_{i^{\prime }\cdot }^T \varvec{\beta }_{i^{\prime } \cdot }^m) \right] \\&= -2 \sum _{k=1}^K \sum _{\{i,i^{\prime } \in {\mathcal {G}}_k, i<i^{\prime }\}} \frac{\left( \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i^{\prime }\cdot }\right) ^T}{|{\mathcal {G}}_k|} \left[ {\textbf{x}}_{i\cdot }( y_i - {\textbf{x}}_{i\cdot }^T \varvec{\beta }_{i \cdot }^m) - {\textbf{x}}_{i^{\prime }\cdot }( y_{i^{\prime }} - {\textbf{x}}_{i^{\prime }\cdot }^T \varvec{\beta }_{i^{\prime } \cdot }^m) \right] \\&= -2 \sum _{k=1}^K \sum _{\{i,i^{\prime } \in {\mathcal {G}}_k, i<i^{\prime }\}} \frac{\left( \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i^{\prime }\cdot }\right) ^T}{|{\mathcal {G}}_k|} \left[ {\textbf{Q}}_i - {\textbf{Q}}_{i^{\prime }} \right] , \end{aligned}$$

where \({\textbf{Q}}_i = {\textbf{x}}_{i\cdot }( y_i - {\textbf{x}}_{i\cdot }^T \varvec{\beta }_{i \cdot }^m)\) and

$$\begin{aligned} {\textbf{Q}}_i = {\textbf{x}}_{i\cdot }( y_i - {\textbf{x}}_{i\cdot }^T \varvec{\beta }_{i \cdot }^m) = {\textbf{x}}_{i\cdot }( \epsilon _i + {\textbf{x}}_{i\cdot }^T \left( \varvec{\beta }_{0i \cdot }-\varvec{\beta }_{i \cdot }^m\right) ), \end{aligned}$$

then

$$\begin{aligned} \sup _{i} \left\| {\textbf{Q}}_i \right\| \le \sup _{i} \left\{ \left\| {\textbf{x}}_{i\cdot }\right\| \left( \left\| \varvec{\epsilon } \right\| _{\infty } + \left\| {\textbf{x}}_{i\cdot }\right\| \left\| \varvec{\beta }_{0i \cdot }-\varvec{\beta }_{i \cdot }^m\right\| \right) \right\} . \end{aligned}$$

By condition (C2) (iii), \(\sup _i \left\| {\textbf{x}}_{i\cdot }\right\| \le c_2 \sqrt{q_{\max }}\), and since \(\sup _i \left\| \varvec{\beta }_{0i \cdot }-\varvec{\beta }_{i \cdot }^m\right\| \le \tau \sqrt{q_{\max }/\left| {\mathcal {G}}_{\min }\right| }\), we have

$$\begin{aligned} \sup _{i} \left\| {\textbf{Q}}_i \right\| \le c_2 \sqrt{q_{\max }} \left( \left\| \varvec{\epsilon } \right\| _{\infty } + \tau c_2 q_{\max } / \sqrt{\left| {\mathcal {G}}_{\min }\right| } \right) . \end{aligned}$$

By Condition (C1), for \(i \in \{1,\ldots ,n\}\), and \(\{j: \varvec{\beta }_{0ij} \ne 0\}\),

$$\begin{aligned} P(\left\| \varvec{\epsilon } \right\| _{\infty }> \sqrt{2/c_1} \sqrt{\ln n}) \le \sum _{i=1}^n P(\left| \epsilon _i \right| > \sqrt{2/c_1} \sqrt{\ln n}) \le 2/n. \end{aligned}$$

Thus there is an event \(E_2\) such that \(P(E_2^c) \le 2/n\), and over the event \(E_2\),

$$\begin{aligned} \sup _{i} \left\| {\textbf{Q}}_i \right\| \le c_2 \sqrt{q_{\max }} \left( \sqrt{2/c_1} \sqrt{\ln n} + \tau c_2 q_{\max } / \sqrt{\left| {\mathcal {G}}_{\min }\right| } \right) . \end{aligned}$$

Then

$$\begin{aligned}&\left| \left( \varvec{\beta }_{i\cdot } -\varvec{\beta }_{i^{\prime }\cdot }\right) ^T \left( {\textbf{Q}}_i - {\textbf{Q}}_{i^{\prime }} \right) \right| \le \left\| \varvec{\beta }_{i\cdot } -\varvec{\beta }_{i^{\prime }\cdot } \right\| \left\| {\textbf{Q}}_i - {\textbf{Q}}_{i^{\prime }} \right\| \le 2 \sup _{i} \left\| {\textbf{Q}}_i \right\| \left\| \varvec{\beta }_{i\cdot } -\varvec{\beta }_{i^{\prime }\cdot } \right\| \\&\le 2 c_2 \sqrt{q_{\max }} \left( \sqrt{2/c_1} \sqrt{\ln n} + \tau c_2 q_{\max } / \sqrt{\left| {\mathcal {G}}_{\min }\right| } \right) \left\| \varvec{\beta }_{i\cdot } -\varvec{\beta }_{i^{\prime }\cdot } \right\| . \end{aligned}$$

Therefore,

$$\begin{aligned}&\ell _{p}(\varvec{\beta })-\ell _{p}\left( \varvec{\beta }^{*}\right) = \Gamma _{1}+\Gamma _{2}+\Gamma _{3} \\&\quad \ge \sum _{k=1}^K \sum _{\{i,i^{\prime } \in {\mathcal {G}}_k,i<i^{\prime }\}} \left\{ \frac{\lambda _1 w_{i i^{\prime }} |{\mathcal {G}}_k|}{\sqrt{n}} -2 \left( \frac{c_2\sqrt{q_{\max }}\sqrt{2\ln n/c_1}}{\sqrt{n}} + \frac{\tau c_2^2 q_{\max }^{3/2}}{\sqrt{n\left| {\mathcal {G}}_{\min }\right| }}\right) \right. \\&\qquad \left. -2 \frac{\lambda _2 \left\| \left( \varvec{\zeta }_{i1 \cdot }I(\beta _{0i1} \ne 0), \ldots , \varvec{\zeta }_{ip \cdot }I(\beta _{0ip} \ne 0) \right) \right\| }{\sqrt{n}} \right\} \frac{\sqrt{n}\left\| \varvec{\beta }_{i\cdot } - \varvec{\beta }_{i^{\prime }\cdot }\right\| }{|{\mathcal {G}}_k|}. \end{aligned}$$

Let \(t_n = o(1)\). Since \(\lambda _2 \left\| \left( \varvec{\zeta }_{i1 \cdot }I(\beta _{0i1} \ne 0), \ldots , \varvec{\zeta }_{ip \cdot }I(\beta _{0ip} \ne 0) \right) \right\| = o\left( \sqrt{\frac{q_{\max }}{\left| {\mathcal {G}}_{\min }\right| }}\right) \), \(\sqrt{\frac{\ln {n}}{n}} \rightarrow 0\), and \(\frac{q_{\max }^{3/2}}{\sqrt{n\left| {\mathcal {G}}_{\min }\right| }} \rightarrow 0\) as \(n \rightarrow \infty \), condition (C5) (i) implies \(\ell _{p}(\varvec{\beta })-\ell _{p}\left( \varvec{\beta }^{*}\right) \ge 0\) for sufficiently large n. This completes the proof. \(\square \)


Cite this article

Yu, H., Wu, J. & Zhang, W. Simultaneous subgroup identification and variable selection for high dimensional data. Comput Stat (2023). https://doi.org/10.1007/s00180-023-01436-3
