1 Introduction

Statistical learning models have been extensively used for data interpretation and prediction in many fields that collect large amounts of data, such as computer vision and computational biology. Given a dataset, a statistical learning model is fitted by estimating unknown model coefficients that minimize a certain loss function, such as the mean squared error or the cross-entropy [17]. Among a wide range of features, often only a small subset actually corresponds to nonzero model coefficients; as a result, fitting a model by blindly incorporating all the available features may yield a misleading interpretation of the contributing features and inaccurate predictions. To select meaningful features automatically, feature selection techniques [4, 16, 21] have been developed to produce a sparsely fitted model with many zero coefficients. For example, regularization methods, which append a penalty function to the loss function, discard insignificant features by setting small model coefficients to zero. A well-known regularization term is the \(\ell _1\)-norm of the model coefficients, known as the least absolute shrinkage and selection operator (LASSO) [53]. LASSO has been widely used in diverse settings to tackle practical problems such as forecasting corporate bankruptcy and assessing drug efficacy [52, 54]. However, the convex \(\ell _1\) penalty may mistakenly suppress large coefficients while shrinking small coefficients to zero, which is referred to as biasedness [14]. To alleviate this issue, nonconvex regularization functions have been proposed, including the smoothly clipped absolute deviation (SCAD) [14], the minimax concave penalty (MCP) [68], the transformed \(\ell _1\) (TL1) [37, 70], the \(\ell _{1/2}\) penalty [26, 65], the scale-invariant \(\ell _1\) [45, 56], and the logarithm penalty [9]. These nonconvex regularizers aim to retain large model coefficients while shrinking small coefficients in a manner similar to the convex \(\ell _1\)-norm.

Group selection is a variant of feature selection that arises when the features have a group structure; specifically, each feature belongs to one or more groups. A group structure is non-overlapping if each feature is assigned to exactly one group. In non-overlapping group selection, the coefficients of the features in the same group are required to be either all nonzero or all zero, which is called group sparsity. A group structure among features arises naturally in many applications. For example, a categorical variable should be converted to a group of several binary dummy variables prior to model fitting. Group sparsity can be enforced in a model by extending regularization techniques to the group setting. Group LASSO [35, 67] achieves group sparsity with the convex \(\ell _{2,1}\) norm, i.e., the sum of the \(\ell _2\) norms of the coefficients from the same group. Exploiting the convexity, many algorithms have been developed and applied to solve the group LASSO problem, including gradient descent [72], coordinate descent [25, 46], the second-order cone program [28], the semismooth Newton’s method [71], the subspace acceleration method [11], and the alternating direction method of multipliers (ADMM) [6, 12]. However, group LASSO may inherently suffer from the same bias as LASSO [21]. Consequently, nonconvex regularizers such as the \(\ell _{0,2}\) penalty [22], group capped-\(\ell _1\) [38, 44], group MCP [21], group SCAD [57], and group LOG [23] were introduced. A variety of numerical algorithms have been considered to solve the resulting nonconvex optimization problems, including the primal dual active set algorithm [22], the smoothing penalty algorithm [38], the difference-of-convex algorithm [44], coordinate descent algorithms [42, 60], the group descent algorithm [8], the coordinate majorization descent algorithm [59], and the iterative reweighted algorithm [23].

We propose a generalized formulation for fitting a statistical learning model with a non-overlapping group structure. In particular, we establish the connection of the loss functions under our setting to generalized linear models (GLMs) [34]. Many existing regularizations can be regarded as special cases of our generalized framework; we consider the group variants of the \(\ell _1\) penalty, SCAD, MCP, and TL1 as case studies. Our optimization problem is nonconvex if we use nonconvex regularization terms. Instead of a global optimum that might be hard to obtain, our analysis is based on the directional stationary solution, which can be provably achieved by our algorithm. Directional stationary solutions have been studied in nonconvex programming analyses, especially for difference-of-convex (DC) problems [33, 39, 55]. In the DC literature, it has been shown that such stationarity is a necessary condition for local optimality [39]. Furthermore, the directional stationary solution can achieve local and even global optimality [3] under suitable conditions. In this paper, we identify such conditions based on a restricted strong convexity (RSC) assumption [18]. RSC was originally analyzed for the convex LASSO problem, where the global minimizer is compared to the ground-truth. We relax the requirement of having the ground-truth vector by introducing a reference point that can be tied to the ground-truth through a probabilistic interpretation. We further provide a bound on the distance from the stationary solution obtained by our approach to this reference point.

To solve our proposed model, we design an iterative scheme based on the ADMM framework, which is an efficient method for solving large-scale optimization problems in statistics, machine learning, and related fields [10, 15, 49, 63, 73]. Although ADMM was originally designed for convex optimization, it can be extended to nonconvex problems, and its global convergence can be proved under certain conditions [19, 40, 58]. In our problem formulation, ADMM provides an iterative scheme that involves two subproblems that are minimized sequentially. One subproblem, involving the loss function, is assumed to be convex and hence can be solved efficiently, either in closed form or by iterative methods. The other subproblem is related to the nonconvex regularization terms. We rely on proximal operators [41] to produce a stationary solution and characterize conditions for the global optimality of the directional stationary solutions. Theoretically, we prove the subsequence convergence of the ADMM framework; in other words, the sequence generated by the ADMM has a subsequence converging to a directional stationary point of our proposed model.

We conduct in-depth experiments on various datasets under different learning settings. We use synthetic datasets for linear regression and Poisson regression, and one real dataset for logistic regression. With hyperparameters tuned by cross-validation, our framework with various group regularization terms is applied to those datasets. An overall evaluation indicates that nonconvex penalty functions consistently outperform convex penalty functions.

We summarize our major contributions as follows:

  1. We introduce a generalized formulation for a non-overlapping group sparsity problem that includes many existing models as special cases;

  2. We investigate properties of the directional stationary solutions of the problem, establishing conditions for their global optimality as well as a bound on the distance from such a solution to a reference point that is closely related to the optima of the loss function;

  3. We prove the subsequence convergence of ADMM iterates to a directional stationary point of our proposed model;

  4. Our numerical experiments on various models under different group structure settings evaluate the performance of existing group sparsity functions under the proposed formulation, showing the advantages of nonconvex penalties over convex ones.

The rest of the paper is organized as follows. In Sect. 2, we introduce a generalized formulation for non-overlapping group selection. Section 3 analyzes the conditions for global optimality of directional stationary solutions to our optimization problem and their properties under the RSC assumption on the loss function. In Sect. 4, we present the ADMM framework for minimizing the proposed model with convergence analysis. In Sect. 5, we present in-depth numerical experiments on both synthetic and real datasets for linear, Poisson, and logistic regressions. Section 6 concludes the paper.

2 The Proposed Framework

We consider a statistical learning problem defined by a coefficient vector \(\textbf{x} \in \mathbb {R}^d\) with a pre-determined group structure. We restrict our attention to non-overlapping group selection by assuming that the components of the variable \(\textbf{x}\) come from m non-overlapping groups, denoted by \(\mathcal {G}_k\) for \(k \in \{1, \dots , m\}.\) Specifically, \(\mathcal {G}_k\) is the set of indices of \(\textbf{x}\) that belong to the kth group. The setting of non-overlapping groups implies that all the \(\mathcal {G}_k\)’s are mutually exclusive. The notation \(| \mathcal {G}_k |\) represents the cardinality of the set \(\mathcal {G}_k\), and \(\textbf{x}_{\mathcal {G}_k}\in \mathbb {R}^{|\mathcal {G}_k|}\) is defined as the subvector of \(\textbf{x}\) that only consists of the coefficients in the group \(\mathcal {G}_k\). For the rest of the paper, we denote \(\mathcal {G}^{\max }\triangleq \max \limits _{1 \le k \le m} \sqrt{| \mathcal {G}_k|}\).
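For concreteness, the following minimal sketch (ours, not from the paper) shows one way to encode such a non-overlapping group structure, extract the subvectors \(\textbf{x}_{\mathcal {G}_k}\), and compute \(\mathcal {G}^{\max }\); the group indices and vector values are arbitrary illustrations.

```python
import numpy as np

# Illustrative only: a non-overlapping group structure stored as a list of
# index arrays, the subvectors x_{G_k}, their l2 norms, and
# G^max = max_k sqrt(|G_k|) used throughout the paper.
groups = [np.array([0, 1, 2]), np.array([3, 4]), np.array([5])]  # m = 3 groups of a d = 6 vector
x = np.array([0.5, -1.0, 2.0, 0.0, 0.0, 3.0])

subvectors = [x[g] for g in groups]                   # x_{G_k}
group_norms = [np.linalg.norm(x[g]) for g in groups]  # ||x_{G_k}||_2
G_max = max(np.sqrt(len(g)) for g in groups)          # max_k sqrt(|G_k|)
```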

We aim to penalize the complexity of the group structure while minimizing the loss of the model simultaneously. To this end, we propose a general framework

$$\begin{aligned} \begin{aligned} \min _{\textbf{x}} \hspace{0.5pc}&F_\lambda (\textbf{x}) \triangleq L(\textbf{x}) + \lambda \, \displaystyle {\sum \limits _{k=1}^m} \displaystyle \sqrt{|\mathcal {G}_k| } \ \underbrace{p (\Vert \textbf{x}_{\mathcal {G}_k}\Vert _2)}_{\triangleq P_k(\textbf{x}_{\mathcal {G}_k})}, \end{aligned} \end{aligned}$$
(1)

where the loss function \(L(\cdot )\) measures the fit of the model to the observed data, the regularizer \(P_k(\cdot )\) determines the group complexity for the kth group of the model coefficients, and a hyperparameter \(\lambda > 0\) balances between the model fitting and the group complexity. Note that \(P_k:\mathbb {R}^{|\mathcal {G}_k|} \rightarrow \mathbb {R}\) is a composite function consisting of a univariate sparsity function \(p: \mathbb {R} \rightarrow \mathbb {R}\) and the norm of the coefficients corresponding to \(\mathcal {G}_k\). The following set of assumptions is considered on \(p(\cdot )\):

  (P1) p is symmetric about zero on \(\mathbb {R}\), i.e., \(p(t) = p(-t),\) concave and non-decreasing on \([0,\infty )\) with \(p(0) = 0.\)

  (P2) For any \(t > 0\), the derivative \(p'(t)\) is well-defined, finite, and nonnegative; we denote \(u \triangleq \sup \limits _{t > 0} p'(t)\).

  (P3) There exists a constant \(\textrm{Lip}_p > 0 \) such that

    $$\begin{aligned} |p'(t_1) - p'(t_2) | \le \mathrm{Lip_p} | t_1-t_2 | \quad \forall \ t_1, t_2 > 0. \end{aligned}$$

It is straightforward to verify that many popular sparsity-promoting functions, including SCAD [14], MCP [68], transformed \(\ell _1\) [37, 70], and logarithm penalty [9], satisfy all the assumptions. We assume (P1) throughout this paper and impose (P2) or (P3) wherever needed.
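As a concrete check (a sketch under a common parameterization, not code from the paper), the transformed \(\ell _1\) penalty can be written as \(p_a(t) = (a+1)|t|/(a+|t|)\) for a parameter \(a>0\): it satisfies (P1), its derivative \(p'(t) = a(a+1)/(a+t)^2\) for \(t>0\) is nonnegative with \(u = (a+1)/a\) as in (P2), and \(|p''(t)| \le 2(a+1)/a^2\) gives a Lipschitz constant for (P3).

```python
import numpy as np

# Illustrative sketch (not the authors' code): the transformed l1 (TL1) penalty
# p_a(t) = (a + 1)|t| / (a + |t|) and a numerical check of the bound in (P2).
def tl1(t, a=1.0):
    return (a + 1.0) * np.abs(t) / (a + np.abs(t))

def tl1_prime(t, a=1.0):
    # derivative for t > 0: a(a + 1) / (a + t)^2, non-increasing in t
    return a * (a + 1.0) / (a + t) ** 2

t_grid = np.linspace(1e-6, 10.0, 1000)
u_numeric = tl1_prime(t_grid).max()   # close to (a + 1)/a = 2 for a = 1
```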

The assumptions on the loss function are summarized as follows.

  (A1) \(L(\cdot )\) is lower-bounded.

  (A2) There exists a constant \(\textrm{Lip}_L > 0\) such that

    $$\begin{aligned} \Vert \nabla L(\textbf{x}_1) - \nabla L(\textbf{x}_2) \Vert _2 \le \mathrm{Lip_L} \Vert \textbf{x}_1- \textbf{x}_2 \Vert _2 \quad \forall \textbf{x}_1, \, \textbf{x}_2. \end{aligned}$$

  (A3) \(L(\cdot )\) is convex with modulus \(\sigma ,\) i.e., there exists a constant \(\sigma \ge 0\) such that

    $$\begin{aligned} L(\textbf{x}_1) - L(\textbf{x}_2) - \langle \nabla L(\textbf{x}_2), \textbf{x}_1- \textbf{x}_2\rangle \ge \displaystyle {\frac{\sigma }{2}} \Vert \textbf{x}_1- \textbf{x}_2 \Vert _2^2 \quad \forall \textbf{x}_1, \, \textbf{x}_2. \end{aligned}$$

    If \(\sigma \) is strictly positive, \(L(\cdot )\) is strongly convex.

Some sparsity functions in the literature have been written as DC functions [3, 32, 66, 70]. Using a unified DC representation [3] of p with a convex loss, the problem (1) can be written as a DC formulation, i.e., \(F_\lambda (\textbf{x}) = g(\textbf{x}) - h(\textbf{x})\) where both g and h are convex functions. Subsequently, existing algorithms [24, 27, 29, 43] can be applied to compute a critical point, e.g., \(\bar{\textbf{x}}\) with \(\textbf{0} \in \partial F_\lambda (\bar{\textbf{x}})\). Our approach bypasses the use of a DC representation of (1) and introduces a numerical method that computes a directional stationary point.

The loss function used in our optimization problem (1) can be specified by generalized linear models (GLMs), which are widely used for supervised learning where input features explain a response [34]. GLMs extend ordinary linear regression beyond Gaussian response variables: the responses \(b_i\), \(i=1,\ldots ,n\), follow an exponential family distribution with a parameter \(\theta _i\). Specifically, the probability density function of \(b_i\) takes the following canonical form:

$$\begin{aligned} f(b_i; \theta _i)=\phi (b_i) \textrm{exp} \bigg \{b_i \theta _i - \psi (\theta _i)\bigg \}, \end{aligned}$$
(2)

where \(\psi \) is a cumulant function and \(\phi \) is a function that is independent of \(\theta _i\) [34, 64]. Both \(\psi \) and \(\phi \) are given functions by the exponential family distribution in consideration, and the first-order derivative of the cumulant function \(\psi '(\theta _i)\) is the expected value of \(b_i\) [30, 34]. Taking a Gaussian distribution with mean \(\theta _i\) and variance 1, for example, one sets \( \psi (\theta _i)=\frac{\theta _i^2}{2}\) and \(\phi (b_i)= \frac{\exp (-{b_i^2}/{2})}{\sqrt{2\pi }}\).

Denote the ith observation of the input features by a row vector \(A_i \in \mathbb {R}^d\) of the matrix \(A \in \mathbb {R}^{n \times d}\). A GLM associates \(\psi '(\theta _i)\) with a linear function of the features \(A_i \textbf{x}\) such that \(\theta _i=A_i \textbf{x}\). Omitting \(\phi (b_i)\), which is free of \(\theta _i\), the loss function L for the GLM is obtained by summing the negative of the exponent in (2) over all the observations \((A_i, b_i)\), \(i=1,\ldots ,n\), i.e.,

$$\begin{aligned} \displaystyle \frac{1}{n}\displaystyle \sum _{i=1}^n \{\psi (\theta _i)-b_i \theta _i \} = \displaystyle \frac{1}{n}\displaystyle \sum _{i=1}^n \{\psi (A_i \textbf{x}) - b_i A_i\textbf{x}\}. \end{aligned}$$

For example, the logistic regression models binary responses \(b_i \in \{0, 1\}\) by

$$\begin{aligned} \psi (\theta _i)=\log (1+\exp (\theta _i))=\log \{1+\exp (A_i \textbf{x})\}. \end{aligned}$$

The Poisson regression for count data \(b_i \in \{0, 1, 2, \ldots \}\) has \(\psi (\theta _i)=\exp (\theta _i)=\exp (A_i\textbf{x})\). Under the assumptions that the second-order derivative \(\psi ''\) is continuous, positive, and bounded above by a constant, we establish the Lipschitz continuity of the gradient of the loss function in (A2) and the convexity in (A3). The boundedness assumption on \(\psi ''\) holds for loss functions of most GLMs such as ordinary linear regression, logistic regression, and multinomial regression [30], but not Poisson regression.
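As an illustration of the GLM losses above, the following sketch (our own, not code from the paper) evaluates \(L(\textbf{x}) = \frac{1}{n}\sum _i \{\psi (A_i\textbf{x}) - b_i A_i\textbf{x}\}\) and its gradient \(\nabla L(\textbf{x}) = \frac{1}{n}A^\top (\psi '(A\textbf{x}) - \textbf{b})\) for the Gaussian, logistic, and Poisson cases; the function names are ours.

```python
import numpy as np

# Minimal sketch of the GLM loss L(x) = (1/n) sum_i { psi(A_i x) - b_i A_i x }
# and its gradient (1/n) A^T (psi'(Ax) - b) for the cumulant functions above.
def glm_loss_and_grad(A, b, x, psi, dpsi):
    n = A.shape[0]
    theta = A @ x                          # linear predictor theta_i = A_i x
    loss = np.mean(psi(theta) - b * theta)
    grad = A.T @ (dpsi(theta) - b) / n
    return loss, grad

# Gaussian (ordinary linear regression): psi(t) = t^2/2, psi'(t) = t
gaussian = (lambda t: 0.5 * t**2, lambda t: t)
# Logistic regression: psi(t) = log(1 + exp(t)), psi'(t) = sigmoid(t)
logistic = (lambda t: np.logaddexp(0.0, t), lambda t: 1.0 / (1.0 + np.exp(-t)))
# Poisson regression: psi(t) = exp(t), psi'(t) = exp(t)
poisson = (lambda t: np.exp(t), lambda t: np.exp(t))
```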

3 Theoretical Analysis

We present two theoretical results for the proposed model (1). Specifically, Sect. 3.1 establishes the global optimality of a directional stationary solution of (1) when the loss function is strongly convex. In Sect. 3.2, we define restricted strong convexity (RSC), under which the loss function is strongly convex over a subset of the feasible space, and then provide an upper bound on the distance from the directional stationary solution to a reference point defined therein.

3.1 Optimality of Directional Stationary Solutions

Notations: For a function \(f: \mathbb {R} \rightarrow \mathbb {R}\), its derivative at a point t is denoted as \(f'(t)\). The notation \(f'(t^+)\) represents the right-side derivative at t, i.e.,

$$f'(t^+)\triangleq \lim _{h \rightarrow 0^+} \frac{f(t+h) - f(t) }{h}.$$

The notation \(F'(\textbf{x}; \textbf{d})\), for \(F: \mathbb {R}^d \rightarrow \mathbb {R}\), is the directional derivative of F at a point \(\textbf{x}\) along the direction \(\textbf{d},\) which is formally defined in Definition 1.

Definition 1

Given a function \(F: \mathbb {R}^d \rightarrow \mathbb {R}\), the directional derivative of F at point \(\textbf{x} \in \mathbb {R}^d\) along direction \(\textbf{d} \in \mathbb {R}^d\) is denoted as \(F'(\textbf{x}; \textbf{d})\) and defined by

$$\begin{aligned} F'(\textbf{x}; \textbf{d}) \triangleq \lim _{h \rightarrow 0^+}\frac{F(\textbf{x} + h \textbf{d})- F(\textbf{x})}{h}. \end{aligned}$$

Next, we provide the definition of a directional stationary point used in this paper.

Definition 2

A point \(\textbf{x}^*\) is a directional stationary point of an unconstrained optimization problem with an objective function \(F:\mathbb {R}^d \rightarrow \mathbb {R}\) if \(F'(\textbf{x}^*;\textbf{x} - \textbf{x}^*) \ge 0, \ \forall \textbf{x} \in \mathbb {R}^d\).

Our analysis is built on the directional stationary solutions. Hereafter, we simply denote a directional stationary solution as a stationary solution. It has been shown that directional stationarity is a necessary condition for a point to be a local minimum for certain DC programs [39].

We establish in Theorem 2 that a stationary solution of (1) is a global minimizer under certain conditions. The proof of Theorem 2 requires Lemma 1.

Lemma 1

If p satisfies Assumption (P3), we have

$$\begin{aligned} p(\Vert \textbf{y} \Vert _2) - p(\Vert \textbf{x} \Vert _2) \ge P'(\textbf{x}; \textbf{y} -\textbf{x}) - \displaystyle \frac{\mathrm{Lip_p}}{2}\Vert \textbf{x} - \textbf{y}\Vert _2^2 \quad \forall \textbf{x}, \; \textbf{y} \in \mathbb R^n, \end{aligned}$$
(3)

where \(P(\textbf{x}) \triangleq p(\Vert \textbf{x}\Vert _2)\) and n is the ambient dimension of \(\textbf{x}\) and \(\textbf{y}\).

Proof

Let \(G(\textbf{x}) \triangleq p(\Vert \textbf{x} \Vert _2) + \displaystyle \frac{\mathrm{Lip_p}}{2}\Vert \textbf{x}\Vert _2^2\). Observe that \(G(\textbf{x}) = G_1 \circ G_2 (\textbf{x})\) where \(G_1(t) \triangleq p (t) + \displaystyle \frac{\mathrm{Lip_p}}{2} t^2\) and \(G_2(\textbf{x}) \triangleq \Vert \textbf{x}\Vert _2\). Then for any \(0 < t_1 \le t_2\), we have

$$\begin{aligned} G'_1(t_1) - G'_1(t_2)&= p'(t_1) - p'(t_2) + {\mathrm{Lip_{p}}} (t_1 - t_2) \\&\le |p'(t_1) - p'(t_2)| - {\mathrm{Lip_{p}}} |t_2 - t_1| \quad \text { since } t_1 \le t_2 \\&\le {\mathrm{Lip_{p}}} |t_1-t_2 | - {\mathrm{Lip_{p}}} |t_1-t_2| = 0 \quad \text { by }(P3), \end{aligned}$$

which implies that \(G'_1(t)\) is monotonically non-decreasing on \((0,\infty ).\) By choosing sufficiently small \(0 < t_1 \le t_2\) and taking the limit of \(t_1\) to 0, we obtain \(|p'(0^+) - p'(t_2)| \le \mathrm{Lip_p} \, t_2\) from (P3). This implies \(G_1'(0^+) = p'(0^+) \le p'(t_2) + {\mathrm{Lip_{p}}} \, t_2 = G_1'(t_2)\). Therefore, \(G_1(t)\) is a convex function on the interval \([0,\infty )\). Since \(G_1, G_2\) are both convex and \(G_1\) is non-decreasing on \([0,\infty )\) by (P1), we conclude that their composite function \(G(\textbf{x})\) is also a convex function. Using the first order condition, i.e., \(G(\textbf{y}) \ge G(\textbf{x}) + G'(\textbf{x}; \textbf{y} - \textbf{x})\), we have

$$\begin{aligned} p(\Vert \textbf{y} \Vert _2) + \displaystyle \frac{\mathrm{Lip_{p}}}{2}\Vert \textbf{y}\Vert _2^2 \ge p(\Vert \textbf{x} \Vert _2) + \displaystyle \frac{\mathrm{Lip_{p}}}{2}\Vert \textbf{x}\Vert _2^2 + \mathrm{Lip_{p}} \langle \textbf{x} ,\textbf{y} - \textbf{x} \rangle + P'(\textbf{x}; \textbf{y} -\textbf{x}). \end{aligned}$$

After simple manipulations, we deduce the desired inequality (3). \(\square \)

Theorem 2

Let Assumptions (P3) and (A3) hold with \(\sigma > 0\). If \(\sigma \ge \lambda \textrm{Lip}_p \mathcal {G}^{\max }\), then any stationary solution of (1) is a global minimizer.

Proof

Denote \(\textbf{x}^*\) as a stationary solution of (1). By Assumption (A3) and applying Lemma 1, we have

$$\begin{aligned} F_\lambda (\textbf{x}) - F_\lambda (\textbf{x}^*) =&\ L(\textbf{x}) + \lambda \, \displaystyle {\sum \limits _{k=1}^m} \displaystyle \sqrt{|\mathcal {G}_k| } \, p( \Vert \textbf{x}_{\mathcal {G}_k}\Vert _2) - L(\textbf{x}^*) - \lambda \, \displaystyle {\sum \limits _{k=1}^m} \displaystyle \sqrt{|\mathcal {G}_k| } \, p( \Vert \textbf{x}_{\mathcal {G}_k}^* \Vert _2 ) \quad \forall \textbf{x}\in \mathbb {R}^d \nonumber \\ \ge&\ \displaystyle {\frac{\sigma }{2}} \, \Vert \textbf{x}- \textbf{x}^* \Vert _2^2 + \langle \nabla L(\textbf{x}^*), \textbf{x} - \textbf{x}^*\rangle + \lambda \, \displaystyle {\sum \limits _{k=1}^m} \displaystyle \sqrt{|\mathcal {G}_k| } \, \Big \{ p( \Vert \textbf{x}_{\mathcal {G}_k}\Vert _2) - p(\Vert \textbf{x}_{\mathcal {G}_k}^*\Vert _2)\Big \} \nonumber \\ \ge&\ \displaystyle {\frac{\sigma }{2}} \, \Vert \textbf{x} - \textbf{x}^* \Vert _2^2 + \langle \nabla L(\textbf{x}^*), \textbf{x} - \textbf{x}^*\rangle \nonumber \\&\hspace{2.5pc} + \lambda \, \displaystyle {\sum \limits _{k=1}^m} \displaystyle \sqrt{|\mathcal {G}_k| } \, \Big \{ P_k' (\textbf{x}_{\mathcal {G}_k}^*; \textbf{x}_{\mathcal {G}_k}- \textbf{x}_{\mathcal {G}_k}^*) - \displaystyle {\frac{\textrm{Lip}_p}{2}} \Vert \textbf{x}_{\mathcal {G}_k}- \textbf{x}_{\mathcal {G}_k}^* \Vert _2^2 \Big \}. \end{aligned}$$
(4)

Due to the stationarity (Definition 2), \(\textbf{x}^*\) satisfies

$$\begin{aligned} \langle \nabla L(\textbf{x}^*), \textbf{x} - \textbf{x}^* \rangle + \lambda \, \displaystyle {\sum \limits _{k=1}^m} \displaystyle \sqrt{|\mathcal {G}_k| } \, P_k' (\textbf{x}_{\mathcal {G}_k}^*; \textbf{x}_{\mathcal {G}_k}- \textbf{x}_{\mathcal {G}_k}^*) \ge 0. \end{aligned}$$

Consequently, the inequality (4) can be simplified as

$$\begin{aligned} F_\lambda (\textbf{x}) - F_\lambda (\textbf{x}^*)&\ge \displaystyle {\frac{\sigma }{2}} \, \Vert \textbf{x} - \textbf{x}^* \Vert _2^2 - \displaystyle {\frac{\lambda \, \textrm{Lip}_p}{2}} \, \displaystyle {\sum \limits _{k=1}^m} \displaystyle \sqrt{|\mathcal {G}_k| } \, \Vert \textbf{x}_{\mathcal {G}_k}- \textbf{x}_{\mathcal {G}_k}^* \Vert _2^2 \\&\ge \displaystyle {\frac{\sigma }{2}} \, \Vert \textbf{x} - \textbf{x}^* \Vert _2^2 - \displaystyle {\frac{\lambda \, \textrm{Lip}_p}{2}} \, \mathcal {G}^{\max }\, \Vert \textbf{x} - \textbf{x}^*\Vert _2^2 = \frac{\zeta }{2}\Vert \textbf{x} -\textbf{x}^*\Vert _2^2, \end{aligned}$$

where we define \(\zeta \triangleq \sigma - \lambda \textrm{Lip}_p \mathcal {G}^{\max }.\) If \(\sigma -\lambda \textrm{Lip}_p \mathcal {G}^{\max }\ge 0,\) then \(\zeta \ge 0\) and hence \(F_\lambda (\textbf{x})\ge F_\lambda (\textbf{x}^*)\ \forall \textbf{x},\) which implies that any stationary solution \(\textbf{x}^*\) is a global minimizer. \(\square \)

Our result identifies the condition on the hyperparameter \(\lambda \) which guarantees the global optimality of any stationary solution. This is a generalization of [23, Theorem 2.2], which covers the special case of the group-LOG regularizer.

3.2 Stationarity Under Restricted Strong Convexity

Before discussing the concept of restricted strong convexity, we first consider two related problems of (1). In the first problem, we assume there exist data points \(\tilde{\varvec{\xi }}\) that are generated by an unknown distribution \(\mathcal {D}\). The following problem minimizes the loss of the model associated with the model coefficient \(\textbf{x}\) with respect to the data distribution:

$$\begin{aligned} \min \limits _{\textbf{x}} \ \mathrm{I\!E}_{\tilde{\varvec{\xi }} \sim \mathcal {D}} \ L(\textbf{x}; \tilde{\varvec{\xi }}) \triangleq \tilde{L}(\textbf{x}). \end{aligned}$$
(5)

With a set of observed data points \(\{ \varvec{\xi }_i \}_{i=1}^{n}\), we can also minimize the loss over the observed data points by solving the following sample average approximation problem:

$$\begin{aligned} \min \limits _{\textbf{x}} \frac{1}{n} \sum _{i=1}^{n} L(\textbf{x}; \varvec{\xi }_i) \triangleq \widehat{L}_n(\textbf{x}). \end{aligned}$$
(6)

Some works investigate the relationships between (5) and (6). For example, [48, Theorem 7.77] states that under suitable conditions, for any \(\epsilon ' > 0\), there exist positive constants \(\alpha \) and \(\beta \) independent of n such that

$$\begin{aligned} \text {P} \Big ( \, \sup _{\Vert \textbf{d} \Vert = 1} \, \left| \, \langle \nabla \widehat{L}_n (\bar{\textbf{x}}), \textbf{d} \rangle - \langle \nabla \tilde{L} (\bar{\textbf{x}}), \textbf{d} \rangle \, \right| \le \epsilon ' \, \Big ) \, > \, 1 - \alpha \, e^{- n \, \beta }, \end{aligned}$$

for any given vector \(\bar{\textbf{x}}\). The problem (1) can be viewed as a sample average approximation problem that employs regularizers to help recover the group-wise sparsity structure in the variables. The connection between the sample average approximation and the population risk motivates our analysis. If the ground-truth is the unique solution of (5), then the gradient of \(\tilde{L}\) at that point is zero, and the above result yields a high-probability bound on the gradient of the sample average approximation function \(\widehat{L}_n\) at the ground-truth. However, since the ground-truth is unlikely to be attained in practice, we consider a reference point, denoted as \(\textbf{x}^\epsilon \), which satisfies

$$\begin{aligned} \Vert \nabla L(\textbf{x}^\epsilon )\Vert _2 \le \epsilon \end{aligned}$$
(7)

for a given \(\epsilon >0\). Our analysis is based on the reference point and we compare it to the stationary solutions of (1).

The assumption of strong convexity may not hold for some loss functions, e.g., ordinary linear regression with an under-determined system of linear equations. As a remedy, we consider a setting where strong convexity only holds over a smaller region rather than the entire domain of the loss function. Since the scope of the strong convexity is limited, such an assumption is referred to as restricted strong convexity (RSC) [18]. In statistics, the RSC condition is imposed on a subset of possible vector differences between the ground-truth and an estimator [36].

We first derive a region for RSC to hold, then provide a bound of the distance from a stationary solution to a reference point. Let S be the group-wise support set of the reference point \(\textbf{x}^\epsilon \), i.e., \(S \triangleq \{\,k \in \{ 1,\dots , m \} \ | \ \Vert \textbf{x}_{\mathcal {G}_k}^\epsilon \Vert _2 \ne 0\}\). Given \(\delta >0\), define the set \(\mathcal {V}_\delta (S)\) as

$$\begin{aligned} \mathcal {V}_\delta (S) \triangleq \left\{ \, \varvec{\nu } \, \left| \, \sum _{k \notin S}\Vert \varvec{\nu }_{\mathcal {G}_k}\Vert _2 \le \delta \sum _{k \in S} \Vert \varvec{\nu }_{\mathcal {G}_k}\Vert _2\right. \right\} . \end{aligned}$$

It is not difficult to verify that \(\mathcal {V}_\delta (S)\) is a cone; i.e., if \(\textbf{x} \in \mathcal {V}_\delta (S)\) then \(\alpha \textbf{x} \in \mathcal {V}_\delta (S)\) for any \(\alpha \ge 0\). Furthermore, the set is a nonconvex cone. For example, let \(\delta = 0.5, \mathcal {G}_1 = \{1,2\},\mathcal {G}_2 = \{3,4\}\) and \(S=\{2\}\). For two points \(\nu _1 = (1,1,1,3)^T\) and \(\nu _2 =(1,2,4,2)^T\), we have \(\nu _1, \nu _2 \in \mathcal {V}_\delta (S)\) but \(0.5\nu _1+0.5\nu _2 \notin \mathcal {V}_\delta (S)\).
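The nonconvexity in this example can also be verified numerically; the following sketch (our own illustration) checks membership in \(\mathcal {V}_\delta (S)\) for the two points above and their midpoint.

```python
import numpy as np

# Check membership in V_delta(S) = { v : sum_{k not in S} ||v_{G_k}||_2
#                                        <= delta * sum_{k in S} ||v_{G_k}||_2 }
# for the worked example above (G_1 = {1,2}, G_2 = {3,4}, S = {2}, delta = 0.5).
def in_cone(v, groups, S, delta, tol=1e-12):
    lhs = sum(np.linalg.norm(v[groups[k]]) for k in range(len(groups)) if k not in S)
    rhs = delta * sum(np.linalg.norm(v[groups[k]]) for k in S)
    return lhs <= rhs + tol   # small tolerance for floating-point ties

groups = [np.array([0, 1]), np.array([2, 3])]   # 0-indexed version of G_1, G_2
S, delta = {1}, 0.5                             # S = {2} in the paper's 1-indexed notation
v1 = np.array([1.0, 1.0, 1.0, 3.0])
v2 = np.array([1.0, 2.0, 4.0, 2.0])

print(in_cone(v1, groups, S, delta))               # True
print(in_cone(v2, groups, S, delta))               # True
print(in_cone(0.5 * (v1 + v2), groups, S, delta))  # False: the cone is nonconvex
```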

Given a reference point \(\textbf{x}^\epsilon \), there exists a region that includes the vector differences between the stationary solutions of (1) and \(\textbf{x}^\epsilon \) under certain conditions; refer to the following lemma.

Lemma 3

Let Assumptions (A3) and (P2) hold. Let \(\textbf{x}^\epsilon \) be a vector such that, for a given \(\epsilon > 0\), \(\Vert \nabla L(\textbf{x}^\epsilon )\Vert _2 \le \epsilon \). If \(\textbf{x}^*\) is a stationary solution of (1) with

$$\begin{aligned} \epsilon < \underset{\ k \notin S}{\min }\sqrt{| \mathcal {G}_k|} \, \lambda \, p'(\Vert \textbf{x}_{\mathcal {G}_k}^* \Vert _2), \end{aligned}$$
(8)

then there exists \(\delta ^* >0\) such that \((\textbf{x}^\epsilon - \textbf{x}^*) \in \mathcal {V}_{\delta ^*}(S)\).

Proof

For any stationary solution \(\textbf{x}^*\), one has

$$\begin{aligned} \langle \nabla L(\textbf{x}^*), \textbf{x}^\epsilon - \textbf{x}^* \rangle + \lambda \, \displaystyle {\sum \limits _{k=1}^m} \displaystyle \sqrt{|\mathcal {G}_k| } \, P_k' (\textbf{x}_{\mathcal {G}_k}^*; \textbf{x}_{\mathcal {G}_k}^\epsilon - \textbf{x}_{\mathcal {G}_k}^*) \ge 0. \end{aligned}$$
(9)

Since the support of \(\textbf{x}^\epsilon \) may not be the same as that of \(\textbf{x}^*\), we consider the four possible cases for each summand of (9), determined by whether \(k \in S\) and whether \(\textbf{x}_{\mathcal {G}_k}^* = \textbf{0}\). If \(\textbf{x}_{\mathcal {G}_k}^* \ne \textbf{0}\), we use (P2) to obtain

$$\begin{aligned}&P_k' (\textbf{x}_{\mathcal {G}_k}^*; \textbf{x}_{\mathcal {G}_k}^\epsilon - \textbf{x}_{\mathcal {G}_k}^*) \nonumber \\ =&\Big \langle p'(\Vert \textbf{x}_{\mathcal {G}_k}^*\Vert _2) \frac{\textbf{x}_{\mathcal {G}_k}^*}{\Vert \textbf{x}_{\mathcal {G}_k}^*\Vert _2}, \textbf{x}_{\mathcal {G}_k}^\epsilon - \textbf{x}_{\mathcal {G}_k}^* \Big \rangle {\left\{ \begin{array}{ll} \le p'(\Vert \textbf{x}_{\mathcal {G}_k}^*\Vert _2)\Vert \textbf{x}_{\mathcal {G}_k}^\epsilon - \textbf{x}_{\mathcal {G}_k}^*\Vert _2, &{} \text {if } k \in S \\ = - p'(\Vert \textbf{x}_{\mathcal {G}_k}^*\Vert _2)\Vert \textbf{x}_{\mathcal {G}_k}^*\Vert _2 , &{} \text {if } k \notin S. \end{array}\right. } \end{aligned}$$
(10)

If \(\textbf{x}_{\mathcal {G}_k}^* = \textbf{0}\), it follows from the definition of the directional derivative that

$$\begin{aligned} P_k' (\textbf{0}; \textbf{x}_{\mathcal {G}_k}^\epsilon )&= p'(0^+)\Vert \textbf{x}_{\mathcal {G}_k}^\epsilon \Vert _2 . \end{aligned}$$
(11)

The above covers both cases \(k \in S\) and \(k \notin S\); in the latter case \(\textbf{x}_{\mathcal {G}_k}^\epsilon = \textbf{0}\), so (11) vanishes. By (7) and the convexity of L, we obtain

$$\begin{aligned} \langle \nabla L(\textbf{x}^*), \textbf{x}^\epsilon - \textbf{x}^* \rangle&\le \langle \nabla L(\textbf{x}^\epsilon ), \textbf{x}^\epsilon - \textbf{x}^* \rangle \nonumber \\&\le \Vert \nabla L(\textbf{x}^\epsilon )\Vert _2 \Vert \textbf{x}^\epsilon - \textbf{x}^*\Vert _2 \le \epsilon \sum _{k=1}^m \Vert \textbf{x}_{\mathcal {G}_k}^\epsilon - \textbf{x}_{\mathcal {G}_k}^*\Vert _2 . \end{aligned}$$
(12)

Substituting (10)-(12) into (9) yields

$$\begin{aligned} \sum \limits _{k \notin S}&\sqrt{|\mathcal {G}_k| } p'(\Vert \textbf{x}_{\mathcal {G}_k}^*\Vert _2)\Vert \textbf{x}_{\mathcal {G}_k}^*\Vert _2 - \frac{\epsilon }{\lambda } \sum _{k=1}^m \Vert \textbf{x}_{\mathcal {G}_k}^\epsilon - \textbf{x}_{\mathcal {G}_k}^*\Vert _2\\&\le \sum \limits _{\begin{array}{c} k \in S, \\ \textbf{x}_{\mathcal {G}_k}^* \ne \textbf{0} \end{array} } \sqrt{|\mathcal {G}_k| } p'(\Vert \textbf{x}_{\mathcal {G}_k}^*\Vert _2)\Vert \textbf{x}_{\mathcal {G}_k}^\epsilon - \textbf{x}_{\mathcal {G}_k}^*\Vert _2 +\sum \limits _{\begin{array}{c} k \in S, \\ \textbf{x}_{\mathcal {G}_k}^* = \textbf{0} \end{array} } \sqrt{|\mathcal {G}_k| } p'(0^+) \Vert \textbf{x}_{\mathcal {G}_k}^\epsilon \Vert _2. \end{aligned}$$

By the assumption (P2) where u is defined, we deduce

$$\begin{aligned} \sum _{k \notin S}\left( \sqrt{|\mathcal {G}_k| } \, p'(\Vert \textbf{x}_{\mathcal {G}_k}^*\Vert _2) - \frac{\epsilon }{\lambda } \right) \Vert \textbf{x}_{\mathcal {G}_k}^\epsilon - \textbf{x}_{\mathcal {G}_k}^*\Vert _2 \le \sum _{k \in S}\left( \frac{\epsilon }{\lambda } + u \, \sqrt{|\mathcal {G}_k| } \right) \Vert \textbf{x}_{\mathcal {G}_k}^\epsilon - \textbf{x}_{\mathcal {G}_k}^*\Vert _2. \end{aligned}$$

The condition (8) guarantees that the coefficients on the left-hand side of the inequality are positive, which validates the existence of \(\delta ^*\). \(\square \)

Lemma 3 can be interpreted as follows: if there is a stationary solution that is sufficiently close to \(\textbf{x}^\epsilon \), then there exists a nonconvex cone which includes the direction \(\textbf{x}^\epsilon - \textbf{x}^*\). This is because a stationary solution \(\textbf{x}^*\) is more likely to meet the condition (8) if \(\Vert \textbf{x}_{\mathcal {G}_k}^* \Vert _2\) is near the origin whenever the corresponding subvector \(\textbf{x}_{\mathcal {G}_k}^\epsilon \) is zero. We note that the scale of \(\delta ^*\) in the analysis is related to the model parameters and the stationary solution, which may lead to a strong assumption on the RSC.

We define the RSC assumption of L over the set \(\mathcal {V}_{\delta ^*}(S)\):

  (A4) There exists \(\sigma > 0\) such that \(L(\textbf{x}_1) - L(\textbf{x}_2) - \langle \nabla L(\textbf{x}_2), \textbf{x}_1- \textbf{x}_2\rangle \ge \displaystyle {\frac{\sigma }{2}} \Vert \textbf{x}_1- \textbf{x}_2 \Vert _2^2 \quad \forall (\textbf{x}_1 - \textbf{x}_2) \in \mathcal {V}_{\delta ^*}(S)\).

Under Assumption (A4), we provide a bound of the distance between a stationary solution of (1) and the reference point \(\textbf{x}^\epsilon \).

Theorem 4

Let Assumptions (A4) and (P2) hold. Let \(\textbf{x}^\epsilon \) be a vector such that \(\Vert \nabla L(\textbf{x}^\epsilon )\Vert _2 \le \epsilon \) for a given \(0< \epsilon < \lambda u \min \limits _{1\le k\le m} \sqrt{|\mathcal {G}_k|}\). Let \(\textbf{x}^*\) be a stationary solution of (1) with

$$\begin{aligned} \epsilon < \underset{\ k \notin S}{\min }\sqrt{| \mathcal {G}_k|} \, \lambda \, p'(\Vert \textbf{x}_{\mathcal {G}_k}^* \Vert _2). \end{aligned}$$
(13)

Suppose (A4) holds over the set \(\mathcal {V}_{\delta ^*}(S) \ni (\textbf{x}^\epsilon - \textbf{x}^*)\). We have a bound

$$\begin{aligned} \Vert \textbf{x}^\epsilon -\textbf{x}^*\Vert _2 \le \frac{4 \lambda u}{\sigma } \, \max _{k \in S}\sqrt{|\mathcal {G}_k| } \, \sqrt{\sum _{k \in S}|\mathcal {G}_k|}. \end{aligned}$$

Proof

By the RSC assumption (A4), there exists \(\sigma >0\) such that

$$\begin{aligned} \frac{\sigma }{2}\Vert \textbf{x}^\epsilon -\textbf{x}^*\Vert _2^2 \le&\ L(\textbf{x}^\epsilon ) - L(\textbf{x}^*) - \langle \nabla L(\textbf{x}^*), \textbf{x}^\epsilon -\textbf{x}^*\rangle \\ \le&\ L(\textbf{x}^\epsilon ) - L(\textbf{x}^*) + \lambda \displaystyle {\sum \limits _{k=1}^m} \displaystyle \sqrt{|\mathcal {G}_k| } \, P_k' (\textbf{x}_{\mathcal {G}_k}^*; \textbf{x}_{\mathcal {G}_k}^\epsilon - \textbf{x}_{\mathcal {G}_k}^*) \text { by the stationarity of } \textbf{x}^* \\ \le&\ \langle \nabla L(\textbf{x}^\epsilon ), \textbf{x}^\epsilon -\textbf{x}^*\rangle + \lambda \displaystyle {\sum \limits _{k=1}^m} \displaystyle \sqrt{|\mathcal {G}_k| } \, P_k' (\textbf{x}_{\mathcal {G}_k}^*; \textbf{x}_{\mathcal {G}_k}^\epsilon - \textbf{x}_{\mathcal {G}_k}^*) \text { by convexity of } L \\ \le&\sum _{k \notin S} \Big (\epsilon \!-\! \lambda \sqrt{|\mathcal {G}_k| } p'(\Vert \textbf{x}_{\mathcal {G}_k}^* \Vert _2) \Big )\Vert \textbf{x}_{\mathcal {G}_k}^*\Vert _2\! +\! \sum _{k \in S}(\epsilon \!+\!\lambda u\sqrt{|\mathcal {G}_k| })\Vert \textbf{x}_{\mathcal {G}_k}^\epsilon -\textbf{x}_{\mathcal {G}_k}^*\Vert _2. \hspace{-4pc} \end{aligned}$$

The last inequality is obtained by applying (10)-(12). Due to (13), the first term on the right-hand side of the inequality is nonpositive, and the assumption on \(\epsilon \) gives \(\epsilon < \lambda u\sqrt{|\mathcal {G}_k| }\) for every k, leading to

$$\begin{aligned} \frac{\sigma }{2}\Vert \textbf{x}^\epsilon -\textbf{x}^*\Vert _2^2&\le 2 \, \lambda u \max _{k \in S} \sqrt{|\mathcal {G}_k| } \sum _{k \in S} \Vert \textbf{x}_{\mathcal {G}_k}^\epsilon -\textbf{x}_{\mathcal {G}_k}^*\Vert _2 \\&\le 2 \, \lambda u \max _{k \in S}\sqrt{|\mathcal {G}_k| } \Vert \textbf{x}_{\mathcal {G}_S}^\epsilon - \textbf{x}_{\mathcal {G}_S}^*\Vert _1 \text { where } \mathcal {G}_S \triangleq \bigcup _{k \in S} \mathcal {G}_k \\&\le 2 \, \lambda u \max _{k \in S}\sqrt{|\mathcal {G}_k| } \sqrt{|\mathcal {G}_S|} \Vert \textbf{x}_{\mathcal {G}_S}^\epsilon - \textbf{x}_{\mathcal {G}_S}^*\Vert _2 \\&\le 2 \, \lambda u \max _{k \in S}\sqrt{|\mathcal {G}_k| } \sqrt{|\mathcal {G}_S|} \Vert \textbf{x}^\epsilon - \textbf{x}^*\Vert _2. \end{aligned}$$

By dividing both sides by \(\Vert \textbf{x}^\epsilon -\textbf{x}^*\Vert _2\) and substituting \(| \mathcal {G}_S | = \sum \limits _{k \in S} | \mathcal {G}_k|\), we complete the proof. \(\square \)

Theorem 4 is a generalization of existing bounds shown for individual sparsity problems (without group structure) [2, 18], for which every group is a singleton such that \(\mathcal {G}_1 = \{ 1 \}, \dots , \mathcal {G}_d = \{ d \}\). For example, if we let \(p(t) = |t|\), the problem (1) becomes LASSO regularization, and Theorem 4 exactly recovers the bound on the distance between the optimal solution of LASSO and the ground-truth shown in [18, Theorem 11.1]. The result also extends the bound derived in [2, Theorem 1] for nonconvex sparsity functions such as SCAD, MCP, and transformed \(\ell _1\).

4 Our Algorithm

We adopt the alternating direction method of multipliers (ADMM) [6] to solve the problem (1). Specifically, we introduce an auxiliary variable \(\textbf{z}\) and rewrite (1) equivalently as

$$\begin{aligned} \min \limits _{\textbf{x}, \textbf{z}} \hspace{0.5pc}&\displaystyle L(\textbf{z}) + \, \lambda \displaystyle {\sum \limits _{k=1}^m} \displaystyle { \sqrt{|\mathcal {G}_k|} } \,p(\Vert \textbf{x}_{\mathcal {G}_k}\Vert _2) \qquad \text{ s.t. } \qquad \textbf{x} = \textbf{z}. \end{aligned}$$
(14)

The corresponding augmented Lagrangian function is

$$\begin{aligned} \mathcal L(\textbf{x}, \textbf{z}; \textbf{v}) \triangleq \displaystyle L(\textbf{z}) + \, \lambda \displaystyle {\sum \limits _{k=1}^m} \displaystyle { \sqrt{|\mathcal {G}_k|} } \,p(\Vert \textbf{x}_{\mathcal {G}_k}\Vert _2 ) + \rho \langle \textbf{v}, \textbf{x}-\textbf{z}\rangle + \frac{\rho }{2} \Vert \textbf{x}-\textbf{z}\Vert _2^2, \end{aligned}$$
(15)

where \(\textbf{v}\) is a Lagrangian multiplier (or dual variable) and \(\rho \) is a positive parameter. We use the scaled form [6] in (15), obtained by multiplying the linear term \(\langle \textbf{v}, \textbf{x}-\textbf{z}\rangle \) by \(\rho .\) Consequently, the ADMM iterations proceed as follows:

$$\begin{aligned} \left\{ \begin{array}{l} \textbf{x}^{\tau +1} \in {\text {argmin}}_{\textbf{x}} \mathcal {L}(\textbf{x}, \textbf{z}^\tau ; \textbf{v}^\tau ) \\ \textbf{z}^{\tau +1} \in {\text {argmin}}_{\textbf{z}} \mathcal {L}(\textbf{x}^{\tau +1}, \textbf{z}; \textbf{v}^{\tau }) \\ \textbf{v}^{\tau +1} = \textbf{v}^\tau + \textbf{x}^{\tau +1} - \textbf{z}^{\tau +1}, \end{array}\right. \end{aligned}$$
(16)

where \(\tau \) indexes the iteration number. The \(\textbf{z}\)-subproblem is written as

$$\begin{aligned} \textbf{z}^{\tau +1} = {\text {argmin}}_{\textbf{z}} L(\textbf{z}) + \frac{\rho }{2}\Vert \textbf{x}^{\tau +1} - \textbf{z} + \textbf{v}^\tau \Vert _2^2, \end{aligned}$$

which is convex under Assumption (A3) and hence can be solved efficiently by existing convex programming algorithms. For example, a closed-form solution can be derived if the loss function is the least-squares loss for linear regression. In Appendix A, we provide details on how the \(\textbf{z}\)-subproblem is solved for the various GLM loss functions considered in the numerical study in Sect. 5. In Sect. 4.1, we elaborate on how to solve the \(\textbf{x}\)-subproblem in (16), and the convergence analysis of the ADMM scheme (16) is conducted in Sect. 4.2.
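For instance, assuming the least-squares loss \(L(\textbf{z}) = \frac{1}{2n}\Vert A\textbf{z} - \textbf{b}\Vert _2^2\) (a sketch of the kind of derivation deferred to Appendix A), setting the gradient of the \(\textbf{z}\)-subproblem to zero gives the linear system

$$\begin{aligned} \frac{1}{n}A^\top (A\textbf{z}^{\tau +1} - \textbf{b}) + \rho \big (\textbf{z}^{\tau +1} - \textbf{x}^{\tau +1} - \textbf{v}^\tau \big ) = \textbf{0} \quad \Longleftrightarrow \quad \Big (\frac{1}{n}A^\top A + \rho I\Big )\textbf{z}^{\tau +1} = \frac{1}{n}A^\top \textbf{b} + \rho \, (\textbf{x}^{\tau +1} + \textbf{v}^\tau ), \end{aligned}$$

whose coefficient matrix is fixed across iterations and can be factorized once.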

4.1 \(\textbf{x}\)-subproblem

The \(\textbf{x}\)-subproblem can be decomposed into groups such that for each \(k\in \{1,\cdots , m\}\),

$$\begin{aligned} \textbf{x}_{\mathcal {G}_k}^{\tau + 1} \in {\text {argmin}}_{\textbf{x} \in \mathbb {R}^{|\mathcal {G}_k|}} \lambda \sqrt{|\mathcal {G}_k|} p(\Vert \textbf{x} \Vert _2) + \frac{\rho }{2} \Vert \textbf{x}- \textbf{z}_{\mathcal {G}_k}^\tau +\textbf{v}_{\mathcal {G}_k}^\tau \Vert _2^2 \, \triangleq \, H_{\lambda , \rho }(\textbf{x}). \end{aligned}$$
(17)

The problem (17) is nonconvex due to Assumption (P1), so a (global) optimal solution may be difficult to obtain. Corollary 5 characterizes conditions under which a stationary solution of (17) achieves global optimality.

Corollary 5

Let Assumption (P3) hold. If \(\rho \ge \lambda \textrm{Lip}_p \mathcal {G}^{\max }\), then any stationary solution of (17) is a global minimizer.

Proof

Since \(\rho \) is the strong convexity modulus of the second term in (17), the statement follows from Theorem 2. \(\square \)

To solve (17), we introduce a general update scheme based on a proximal operator [5, Chapter 6], then discuss the stationarity of the obtained solution. The proximal operator of a function f is defined by

$$\begin{aligned} \textbf{prox}_f (y; \mu ) \in {\text {argmin}}_{x} \Big (\mu f(x)+\frac{1}{2} (x-y)^2\Big ), \end{aligned}$$
(18)

where f is a univariate function and \(\mu \) is a positive parameter. The notation argmin we use in (17) and (18) denotes a set of ordinary stationary solutions instead of the (global) optimal solutions. Here, we define \(\bar{x}\) as an ordinary stationary solution to (18) if there exists \(\bar{v} \in \partial f(\bar{x}) = \{ v \, | \, f(z) \ge f(\bar{x}) + v(z-\bar{x}), \ \forall z \ \}\) such that \(\mu \bar{v} + (\bar{x}-y) = 0.\) It has been shown that (18) yields a global solution for certain sparsity functions such as the transformed \(\ell _1\) function [69, Theorem 3.1], whereas multiple stationary solutions are possible for the \(\ell _1\)-\(\ell _2\) function [31].

We aim to find a closed-form solution to (17). By applying the change of variable \(\textbf{x} = c(\textbf{z}_{\mathcal {G}_k}^\tau -\textbf{v}_{\mathcal {G}_k}^\tau )\), we rewrite the problem (17) as the following univariate problem:

$$\begin{aligned} \min \limits _{c} \lambda \sqrt{|\mathcal {G}_k|} p(c\Vert \textbf{z}_{\mathcal {G}_k}^\tau -\textbf{v}_{\mathcal {G}_k}^\tau \Vert _2) + \frac{\rho (c-1)^2}{2}\Vert \textbf{z}_{\mathcal {G}_k}^\tau -\textbf{v}_{\mathcal {G}_k}^\tau \Vert _2^2. \end{aligned}$$
(19)

The problem (19) can be solved by a proximal operator, i.e.,

$$\begin{aligned} c^*=\frac{1}{\Vert \textbf{z}_{\mathcal {G}_k}^\tau -\textbf{v}_{\mathcal {G}_k}^\tau \Vert _2}\textbf{prox}_p \left( \Vert \textbf{z}_{\mathcal {G}_k}^\tau -\textbf{v}_{\mathcal {G}_k}^\tau \Vert _2; \frac{\lambda \sqrt{|\mathcal {G}_k|}}{\rho } \right) , \end{aligned}$$
(20)

where \(c^*\) is an ordinary stationary solution of (19). Applying the above to the \(\textbf{x}\)-subproblem yields our \(\textbf{x}\)-update:

$$\begin{aligned} \textbf{x}_{\mathcal {G}_k}^{\tau +1} = \frac{\textbf{z}_{\mathcal {G}_k}^\tau -\textbf{v}_{\mathcal {G}_k}^\tau }{\Vert \textbf{z}_{\mathcal {G}_k}^\tau -\textbf{v}_{\mathcal {G}_k}^\tau \Vert _2}\textbf{prox}_p \left( \Vert \textbf{z}_{\mathcal {G}_k}^\tau -\textbf{v}_{\mathcal {G}_k}^\tau \Vert _2; \frac{\lambda \sqrt{|\mathcal {G}_k|}}{\rho } \right) . \end{aligned}$$
(21)
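The following sketch (our own, not the authors' code) implements the group-wise update (21) for the convex special case \(p(t) = |t|\), whose proximal operator is soft-thresholding; proximal operators of other penalties satisfying (P1)-(P3) can be substituted for the hypothetical `prox_abs` below.

```python
import numpy as np

# Sketch of the x-update (21), illustrated with p(t) = |t| so that prox_p is
# soft-thresholding; for this choice the update reduces to block soft-thresholding.
def prox_abs(y, mu):
    # prox of mu * |.| evaluated at the scalar y
    return np.sign(y) * max(abs(y) - mu, 0.0)

def x_update(z, v, groups, lam, rho, prox=prox_abs):
    x = np.zeros_like(z)
    for g in groups:                       # g holds the indices of one group G_k
        w = z[g] - v[g]
        norm_w = np.linalg.norm(w)
        if norm_w > 0.0:
            x[g] = (w / norm_w) * prox(norm_w, lam * np.sqrt(len(g)) / rho)
        # if norm_w == 0, the group stays at zero
    return x
```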

In the proposition below, we show that the solution (21) is a stationary solution of (17).

Proposition 6

The solution \(\textbf{x}_{\mathcal {G}_k}^{\tau +1}\) in (21) is a stationary solution of the \(\textbf{x}\)-subproblem (17).

Proof

We discuss the cases where \(\textbf{x}_{\mathcal {G}_k}^{\tau +1}\) is zero and nonzero separately. Let \(c^*\) be a solution given by (20). If \(c^*\ne 0,\) then it satisfies

$$\begin{aligned} \lambda \sqrt{|\mathcal {G}_k|} p'(c^*\Vert \textbf{z}_{\mathcal {G}_k}^\tau -\textbf{v}_{\mathcal {G}_k}^\tau \Vert _2) = \rho (1-c^*) \Vert \textbf{z}_{\mathcal {G}_k}^\tau -\textbf{v}_{\mathcal {G}_k}^\tau \Vert _2, \end{aligned}$$

which implies that \(0< c^* < 1\). The corresponding \(\textbf{x}^* = c^*(\textbf{z}_{\mathcal {G}_k}^\tau -\textbf{v}_{\mathcal {G}_k}^\tau )\) satisfies the stationary condition of (17):

$$\begin{aligned} \lambda \sqrt{|\mathcal {G}_k|} p'(\Vert \textbf{x}^*\Vert _2) \frac{\textbf{x}^*}{\Vert \textbf{x}^* \Vert _2} + \rho (\textbf{x}^* - \textbf{z}_{\mathcal {G}_k}^\tau +\textbf{v}_{\mathcal {G}_k}^\tau ) =\textbf{0}. \end{aligned}$$

In the case of \(c^* = 0\), there exists \(\bar{u} \in \partial p(0)\) such that \(\lambda \sqrt{|\mathcal {G}_k|} \bar{u} - \rho \Vert \textbf{z}_{\mathcal {G}_k}^\tau -\textbf{v}_{\mathcal {G}_k}^\tau \Vert _2 = 0\). Here, the set \(\partial p(0)\) is nonempty with 0 belonging to the set. To see this, recall the definition of the subgradient: \(v \in \partial p(0)\) if \(p(y) \ge p(0) + v(y-0)\) for all y, which is equivalent to \(p(y) \ge v \, y\). Since \(p(y) \ge 0\) for any y by (P1), we have \(0 \in \partial p(0)\). Next, we observe the directional derivative of \(H_{\lambda , \rho }\) in (17) at \(\textbf{x}^*\) in the direction \(\textbf{x} - \textbf{x}^*\) for any \(\textbf{x},\) i.e.,

$$\begin{aligned}&H_{\lambda , \rho }'(\textbf{x}^*; \textbf{x} - \textbf{x}^*) = \lambda \sqrt{|\mathcal {G}_k|} P_k'(\textbf{x}^*; \textbf{x} - \textbf{x}^*) + \rho \langle \textbf{x}^* - \textbf{z}_{\mathcal {G}_k}^\tau + \textbf{v}_{\mathcal {G}_k}^\tau , \textbf{x} - \textbf{x}^* \rangle \\&\quad \ge \lambda \sqrt{|\mathcal {G}_k|} \lim \limits _{h \rightarrow 0^+} \frac{p(\Vert \textbf{x}^* + h( \textbf{x} - \textbf{x}^*) \Vert _2) - p(\Vert \textbf{x}^* \Vert _2)}{h} -\rho \Vert \textbf{x}^* - \textbf{z}_{\mathcal {G}_k}^\tau +\textbf{v}_{\mathcal {G}_k}^\tau \Vert _2 \, \Vert \textbf{x} - \textbf{x}^*\Vert _2. \end{aligned}$$

Letting \(\textbf{x}^* = \textbf{0}\) with \(p(0)=0\), we have

$$\begin{aligned} H_{\lambda , \rho }'(\textbf{0}; \textbf{x})&\ge \lambda \sqrt{|\mathcal {G}_k|} \lim \limits _{h \rightarrow 0^+} \frac{p( h \Vert \textbf{x} \Vert _2 )}{h} -\rho \Vert \textbf{z}_{\mathcal {G}_k}^\tau - \textbf{v}_{\mathcal {G}_k}^\tau \Vert _2 \, \Vert \textbf{x} \Vert _2 \\&= \lambda \sqrt{|\mathcal {G}_k|} p'(0^+) \Vert \textbf{x} \Vert _2 -\rho \Vert \textbf{z}_{\mathcal {G}_k}^\tau -\textbf{v}_{\mathcal {G}_k}^\tau \Vert _2 \, \Vert \textbf{x} \Vert _2 \quad \text {by L'h}\hat{o}\text {pital's rule.} \nonumber \end{aligned}$$
(22)

Combining (22) with the stationary condition of \(c^*\) provides

$$\begin{aligned} H_{\lambda , \rho }'(\textbf{0}; \textbf{x})&\ge \lambda \sqrt{|\mathcal {G}_k|} \Vert \textbf{x} \Vert _2 \Big ( p'(0^+) - \bar{u} \Big ). \end{aligned}$$

It remains to show that \(p'(0^+) \ge u\) for all \(u \in \partial p(0)\). Suppose there exists \(\hat{u} \in \partial p(0)\) such that \(p'(0^+) < \hat{u}\). From the property of the subgradient, we have \(p(t) \ge p(0) + \hat{u} (t-0)\) for all \(t \in \mathbb {R}\). If we choose a strictly positive \(\hat{t}\), then we must have \(p(\hat{t}) \ge p(0) + \hat{u}(\hat{t}-0) > p(0) + p'(0^+)(\hat{t}-0)\). This contradicts the concavity of p on the domain \([0,\infty )\). Hence we conclude that if \(c^* = 0\) is a stationary solution to (19), then \(H_{\lambda , \rho }'(\textbf{0}; \textbf{x}) \ge 0\) for any \(\textbf{x}\). \(\square \)

Proposition 6 and Corollary 5 indicate that if Assumption (P3) holds with \(\rho \ge \lambda \textrm{Lip}_p \mathcal {G}^{\max },\) then \(\textbf{x}_{\mathcal {G}_k}^{\tau +1}\) is a global minimizer to the problem (17).

4.2 Convergence Analysis

Algorithm 1: Group ADMM Framework

The ADMM framework for minimizing (14), which involves both the \(\textbf{x}\)- and \(\textbf{z}\)-subproblems, is described in Algorithm 1, where \(\tau _{\max }\) is the maximal number of iterations. In this section, we present its convergence analysis. We first show that each \(\textbf{x}\)- and \(\textbf{z}\)-update decreases its objective value, followed by the monotonic decrease of \(\{\mathcal {L}(\textbf{x}^\tau ,\textbf{z}^\tau ;\textbf{v}^\tau )\}\); refer to Lemmas 7, 8, and 10, respectively.
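For reference, here is a compact sketch of the iterations (16) under two concrete choices that are ours rather than fixed by the text: the least-squares loss (so the \(\textbf{z}\)-update is the linear solve sketched after (16)) and the penalty \(p(t)=|t|\) in the \(\textbf{x}\)-update (21); the function and variable names are illustrative.

```python
import numpy as np

# Sketch of Algorithm 1 (scaled-form ADMM for (14)) with a least-squares loss
# L(z) = (1/(2n)) ||Az - b||^2 and the group soft-thresholding x-update.
def group_admm(A, b, groups, lam, rho, tau_max=500, tol=1e-6):
    n, d = A.shape
    x, z, v = np.zeros(d), np.zeros(d), np.zeros(d)
    M = A.T @ A / n + rho * np.eye(d)     # fixed matrix of the z-update linear system
    for _ in range(tau_max):
        # x-update: group-wise proximal step (21) with p(t) = |t|
        x_new = np.zeros(d)
        for g in groups:
            w = z[g] - v[g]
            nw = np.linalg.norm(w)
            if nw > 0.0:
                x_new[g] = (w / nw) * max(nw - lam * np.sqrt(len(g)) / rho, 0.0)
        # z-update: argmin_z L(z) + (rho/2) ||x_new - z + v||^2
        z_new = np.linalg.solve(M, A.T @ b / n + rho * (x_new + v))
        # dual update
        v = v + x_new - z_new
        converged = (np.linalg.norm(x_new - x) < tol and np.linalg.norm(z_new - z) < tol)
        x, z = x_new, z_new
        if converged:
            break
    return x
```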

Lemma 7

Let Assumption (P3) hold. If \(\rho >\lambda \mathrm{Lip_{p}} \mathcal {G}^{\max },\) then for any \(\textbf{x}^{\tau +1}\) given by (16), there exists \(\bar{c}_1 > 0\) independent of \(\tau \) such that

$$\begin{aligned} \mathcal {L}(\textbf{x}^{\tau +1},\textbf{z}^\tau ;\textbf{v}^\tau ) - \mathcal {L}(\textbf{x}^{\tau },\textbf{z}^\tau ;\textbf{v}^\tau ) \le -\frac{\bar{c}_1}{2} \Vert \textbf{x}^{\tau +1}- \textbf{x}^\tau \Vert _2^2. \end{aligned}$$
(23)

Proof

By Proposition 6, \(\textbf{x}^{\tau +1}\) is a stationary solution to (16) such that

$$\begin{aligned} \lambda \displaystyle {\sum \limits _{k=1}^m} \displaystyle { \sqrt{|\mathcal {G}_k|} } \,P'( \textbf{x}_{\mathcal {G}_k}^{\tau +1}; \textbf{x}_{\mathcal {G}_k}^\tau - \textbf{x}_{\mathcal {G}_k}^{\tau +1} ) + \rho \langle \textbf{x}^{\tau +1}-\textbf{z}^\tau + \textbf{v}^\tau ,\textbf{x}^\tau - \textbf{x}^{\tau +1}\rangle \ge 0. \end{aligned}$$
(24)

It follows from Lemma 1 that

$$\begin{aligned} p(\Vert \textbf{x}_{\mathcal {G}_k}^{\tau +1} \Vert _2 ) - p(\Vert \textbf{x}_{\mathcal {G}_k}^\tau \Vert _2 ) \le \frac{\mathrm{Lip_{p}}}{2}\Vert \textbf{x}_{\mathcal {G}_k}^{\tau +1}-\textbf{x}_{\mathcal {G}_k}^\tau \Vert _2^2 - P'( \textbf{x}_{\mathcal {G}_k}^{\tau +1}; \textbf{x}_{\mathcal {G}_k}^\tau - \textbf{x}_{\mathcal {G}_k}^{\tau +1} ). \end{aligned}$$
(25)

Simple calculations lead to

$$\begin{aligned}&\mathcal {L}(\textbf{x}^{\tau +1},\textbf{z}^\tau ;\textbf{v}^\tau ) - \mathcal {L}(\textbf{x}^\tau ,\textbf{z}^\tau ;\textbf{v}^\tau ) \\&\quad = \lambda \displaystyle {\sum \limits _{k=1}^m} \displaystyle { \sqrt{|\mathcal {G}_k|} } \,\bigg (p(\Vert \textbf{x}_{\mathcal {G}_k}^{\tau +1} \Vert _2 ) - p(\Vert \textbf{x}_{\mathcal {G}_k}^\tau \Vert _2 )\bigg )\\&\hspace{10pc}+ \frac{\rho }{2} \bigg (\Vert \textbf{x}^{\tau +1}-\textbf{z}^\tau + \textbf{v}^\tau \Vert _2^2 - \Vert \textbf{x}^\tau -\textbf{z}^\tau + \textbf{v}^\tau \Vert _2^2\bigg ) \\&\quad \le \frac{\lambda \mathrm{Lip_{p}}}{2}\displaystyle {\sum \limits _{k=1}^m} \displaystyle { \sqrt{|\mathcal {G}_k|} } \Vert \textbf{x}_{\mathcal {G}_k}^{\tau +1}-\textbf{x}_{\mathcal {G}_k}^\tau \Vert _2^2 -\lambda \displaystyle {\sum \limits _{k=1}^m} \displaystyle { \sqrt{|\mathcal {G}_k|} } \,P'( \textbf{x}_{\mathcal {G}_k}^{\tau +1}; \textbf{x}_{\mathcal {G}_k}^\tau - \textbf{x}_{\mathcal {G}_k}^{\tau +1} ) \\&\hspace{10pc} + \frac{\rho }{2} \langle \textbf{x}^{\tau +1}+\textbf{x}^\tau - 2 \textbf{z}^\tau + 2 \textbf{v}^\tau , \textbf{x}^{\tau +1} - \textbf{x}^\tau \rangle \text { by } (25) \\&\quad \le \frac{\lambda \mathrm{Lip_{p}}}{2} \mathcal {G}^{\max }\Vert \textbf{x}^{\tau +1}- \textbf{x}^\tau \Vert _2^2 + \frac{\rho }{2} \langle \textbf{x}^\tau - \textbf{x}^{\tau +1},\textbf{x}^{\tau +1} - \textbf{x}^\tau \rangle \text { by } (24) \\&\quad = -\frac{\displaystyle \rho -\lambda \mathrm{Lip_{p}} \mathcal {G}^{\max }}{2}\Vert \textbf{x}^{\tau +1}- \textbf{x}^\tau \Vert _2^2. \end{aligned}$$

If \(\rho >\lambda \mathrm{Lip_{p}} \mathcal {G}^{\max },\) we choose \(\bar{c}_1 = \frac{1}{2} (\rho -\lambda \mathrm{Lip_{p}}\mathcal {G}^{\max })>0\) such that (23) holds. \(\square \)

Lemma 8

Let Assumption (A3) hold. There exists \(\bar{c}_2 > 0\) independent of \(\tau \) such that

$$\begin{aligned} \mathcal {L}(\textbf{x}^{\tau +1},\textbf{z}^{\tau +1};\textbf{v}^\tau ) - \mathcal {L}(\textbf{x}^{\tau +1},\textbf{z}^\tau ;\textbf{v}^\tau ) \le -\frac{\bar{c}_2}{2} \Vert \textbf{z}^{\tau +1}- \textbf{z}^\tau \Vert _2^2. \end{aligned}$$

Proof

The optimality condition of \(\textbf{z}^{\tau +1}\) is

$$\begin{aligned} \nabla L(\textbf{z}^{\tau +1}) - \rho (\textbf{x}^{\tau +1} - \textbf{z}^{\tau +1} + \textbf{v}^\tau ) = \textbf{0}. \end{aligned}$$
(26)

It follows from Assumption (A3) that

$$\begin{aligned}&\mathcal {L}(\textbf{x}^{\tau +1},\textbf{z}^{\tau +1};\textbf{v}^\tau ) - \mathcal {L}(\textbf{x}^{\tau +1},\textbf{z}^\tau ;\textbf{v}^\tau ) \\&\quad = L(\textbf{z}^{\tau +1}) - L(\textbf{z}^\tau ) + \frac{\rho }{2}\bigg (\Vert \textbf{x}^{\tau +1} - \textbf{z}^{\tau +1} + \textbf{v}^\tau \Vert _2^2- \Vert \textbf{x}^{\tau +1} - \textbf{z}^\tau + \textbf{v}^\tau \Vert _2^2\bigg ) \\&\quad \le -\langle \nabla L(\textbf{z}^{\tau +1}),\textbf{z}^\tau - \textbf{z}^{\tau +1}\rangle -\frac{\sigma }{2}\Vert \textbf{z}^{\tau +1}-\textbf{z}^\tau \Vert _2^2 \\&\hspace{10pc}+ \frac{\rho }{2} \langle 2 \textbf{x}^{\tau +1} - \textbf{z}^{\tau +1}- \textbf{z}^{\tau } + 2 \textbf{v}^\tau , \textbf{z}^\tau -\textbf{z}^{\tau +1}\rangle \text { by } (A3) \\&\quad = -\frac{\sigma }{2}\Vert \textbf{z}^{\tau +1}-\textbf{z}^\tau \Vert _2^2 + \frac{\rho }{2}\langle \textbf{z}^{\tau +1}-\textbf{z}^\tau , \textbf{z}^\tau - \textbf{z}^{\tau +1}\rangle \text { by }(26) \\&\quad = - \frac{\rho +\sigma }{2}\Vert \textbf{z}^{\tau +1}-\textbf{z}^{\tau }\Vert _2^2. \end{aligned}$$

As \(\rho >0\) and \(\sigma \ge 0,\) we choose \(\bar{c}_2=\rho +\sigma >0\), which completes the proof. \(\square \)

Lemma 9

Let Assumption (A2) hold. We have

$$\begin{aligned} \Vert \textbf{v}^{\tau +1} - \textbf{v}^\tau \Vert _2^2 \le \frac{\mathrm{Lip_L}^2}{\rho ^2}\Vert \textbf{z}^{\tau +1} - \textbf{z}^\tau \Vert _2^2 . \end{aligned}$$

Proof

Based on (26) and the \(\textbf{v}\)-update formula \(\textbf{v}^{\tau +1} = \textbf{v}^\tau + \textbf{x}^{\tau +1}- \textbf{z}^{\tau +1}\), we obtain

$$\begin{aligned} \nabla L(\textbf{z}^{\tau +1}) - \rho \textbf{v}^{\tau +1} = 0, \end{aligned}$$
(27)

or equivalently \( \textbf{v}^{\tau +1} = \frac{\nabla L(\textbf{z}^{\tau +1})}{\rho }.\) Similarly, we have \(\textbf{v}^\tau = \frac{\nabla L(\textbf{z}^\tau )}{\rho }\). It follows from Assumption (A2) that

$$\begin{aligned} \Vert \textbf{v}^{\tau +1} - \textbf{v}^\tau \Vert _2 = \frac{1}{\rho }\Vert \nabla L(\textbf{z}^{\tau +1})-\nabla L(\textbf{z}^\tau )\Vert _2 \le \frac{\mathrm{Lip_L}}{\rho }\Vert \textbf{z}^{\tau +1} - \textbf{z}^\tau \Vert _2. \end{aligned}$$
(28)

\(\square \)

Combining Lemmas 7-9, we show in Lemma 10 that every triplet \((\textbf{x}^{\tau +1},\textbf{z}^{\tau +1};\textbf{v}^{\tau +1})\) produced by (16) sufficiently decreases the value of the augmented Lagrangian (15).

Lemma 10

(sufficient descent) Let Assumptions (P3), (A2) and (A3) hold. If \(\rho \) in (15) satisfies \(\rho > \max \left\{ \lambda {\textrm{Lip}}_p {\mathcal {G}^{\max }}, \frac{\sqrt{\sigma ^2+8\mathrm{Lip_L}^2}-\sigma }{2} \right\} \), then there exist two constants \(c_1>0\) and \(c_2>0\) such that

$$\begin{aligned} \mathcal {L}(\textbf{x}^{\tau +1},\textbf{z}^{\tau +1};\textbf{v}^{\tau +1}) - \mathcal {L}(\textbf{x}^\tau ,\textbf{z}^\tau ;\textbf{v}^\tau ) \le - c_1\Vert \textbf{x}^{\tau +1}-\textbf{x}^\tau \Vert _2^2- c_2\Vert \textbf{z}^{\tau +1}-\textbf{z}^\tau \Vert _2^2. \end{aligned}$$
(29)

Proof

It follows from the \(\textbf{v}\)-update formula \(\textbf{v}^{\tau +1}-\textbf{v}^\tau = \textbf{x}^{\tau +1}- \textbf{z}^{\tau +1}\) that

$$\begin{aligned} \mathcal {L}(\textbf{x}^{\tau +1},\textbf{z}^{\tau +1};\textbf{v}^{\tau +1}) - \mathcal {L}(\textbf{x}^{\tau +1},\textbf{z}^{\tau +1};\textbf{v}^\tau ) =&\langle \rho \textbf{v}^{\tau +1}, \textbf{x}^{\tau +1}-\textbf{z}^{\tau +1}\rangle -\langle \rho \textbf{v}^\tau , \textbf{x}^{\tau +1}-\textbf{z}^{\tau +1}\rangle \\ =&\rho \langle \textbf{v}^{\tau +1}-\textbf{v}^\tau , \textbf{x}^{\tau +1}-\textbf{z}^{\tau +1}\rangle \\ =&\rho \Vert \textbf{v}^{\tau +1} - \textbf{v}^\tau \Vert _2^2 \le \frac{\mathrm{Lip_L}^2}{\rho }\Vert \textbf{z}^{\tau +1} - \textbf{z}^\tau \Vert _2^2, \end{aligned}$$

where the last inequality holds by Lemma 9. By applying Lemmas 7-8, we achieve the desired inequality (29) with \(c_1 = \frac{\bar{c}_1}{2}\) and \(c_2 = \frac{\bar{c}_2}{2} - \frac{\mathrm{Lip_L}^2}{\rho }\), where \(\bar{c}_1\) and \(\bar{c}_2\) are defined in Lemma 7 and Lemma 8, respectively. The condition \(\rho > \lambda {\textrm{Lip}}_p {\mathcal {G}^{\max }}\) is required for Lemma 7. Additionally we require \(\rho > \frac{\sqrt{\sigma ^2+8\mathrm{Lip_L}^2}-\sigma }{2}\) such that \(c_2 >0\). \(\square \)

Theorem 11 establishes the subsequence convergence of the iterates under an assumption that the objective function \(F_\lambda (\cdot )\) is coercive; refer to Definition 3. If either the regularization function P or the loss function L is coercive, then the objective function is coercive, which guarantees the boundedness of the minimizing sequence.

Definition 3

A function \(f(\cdot )\) is coercive if \(f(\textbf{x}) \rightarrow \infty \) as \(\Vert \textbf{x}\Vert _2 \rightarrow \infty \).

Theorem 11

(convergence) Let Assumptions (P3), (A1)-(A3) hold. If either P or L is coercive, and \(\rho \) in (15) satisfies \(\rho > \max \left\{ \lambda {\textrm{Lip}}_p {\mathcal {G}^{\max }}, \frac{\sqrt{\sigma ^2+8\mathrm{Lip_L}^2}-\sigma }{2} \right\} \), then the sequence \(\{(\textbf{x}^\tau , \textbf{z}^\tau ,\textbf{v}^\tau )\}_{\tau =1}^\infty \) generated by (16) has a convergent subsequence. Moreover, its limit point is a stationary solution of the problem (1).

Proof

We first show that the sequence \(\{(\textbf{x}^\tau , \textbf{z}^\tau ,\textbf{v}^\tau )\}_{\tau =1}^\infty \) is bounded. By telescoping summation of (29) from \(\tau =0\) to T, we have

$$\begin{aligned} \mathcal {L}(\textbf{x}^{T+1},\textbf{z}^{T+1};\textbf{v}^{T+1}) \le&\mathcal {L}(\textbf{x}^0,\textbf{z}^0;\textbf{v}^0) - c_1\sum _{\tau =0}^T\Vert \textbf{x}^{\tau +1}-\textbf{x}^\tau \Vert _2^2- c_2\sum _{\tau =0}^T\Vert \textbf{z}^{\tau +1}-\textbf{z}^\tau \Vert _2^2 \nonumber \\ \le&\mathcal {L}(\textbf{x}^0,\textbf{z}^0;\textbf{v}^0), \forall \, T = 0,1,\dots \end{aligned}$$
(30)

which implies that the sequence \(\{\mathcal {L}(\textbf{x}^\tau ,\textbf{z}^\tau ;\textbf{v}^\tau )\}_{\tau = 0}^\infty \) is upper-bounded. On the other hand, we can estimate a lower bound for all \(T = 0,1,\dots \)

$$\begin{aligned}&\mathcal L(\textbf{x}^{T+1}, \textbf{z}^{T+1}; \textbf{v}^{T+1}) \nonumber \\&\quad = \displaystyle L(\textbf{z}^{T+1}) + F_\lambda (\textbf{x}^{T+1}) - L(\textbf{x}^{T+1})\, + \rho \langle \textbf{v}^{T+1}, \textbf{x}^{T+1}-\textbf{z}^{T+1}\rangle + \frac{\rho }{2} \Vert \textbf{x}^{T+1}-\textbf{z}^{T+1}\Vert _2^2 \nonumber \\&\quad \ge F_\lambda (\textbf{x}^{T+1}) + \langle \rho \textbf{v}^{T+1}-\nabla L(\textbf{z}^{T+1}), \textbf{x}^{T+1}-\textbf{z}^{T+1}\rangle + \frac{\rho -\mathrm{Lip_L}}{2} \Vert \textbf{x}^{T+1}-\textbf{z}^{T+1}\Vert _2^2 \nonumber \\&\quad = F_\lambda (\textbf{x}^{T+1}) + \frac{\rho -\mathrm{Lip_L}}{2} \Vert \textbf{x}^{T+1}-\textbf{z}^{T+1}\Vert _2^2 , \end{aligned}$$
(31)

where we use (27) and Assumption (A2). It follows from (31) that the sequence {\(F_\lambda (\textbf{x}^\tau )\}_{\tau = 0}^\infty \) is upper-bounded if \(\rho > \mathrm{Lip_L}\) and thus {\(F_\lambda (\textbf{x}^\tau )\}_{\tau = 0}^\infty \) is bounded by Assumption (A1). Note that \(\mathrm{Lip_L} \le \frac{\sqrt{\sigma ^2+8\mathrm{Lip_L}^2}-\sigma }{2}\), a constant stated in Lemma 10. Since \(F_\lambda (\cdot )\) is coercive by the coerciveness of P or L, \(\{\textbf{x}^\tau \}_{\tau = 1}^\infty \) is bounded, so is \(\{\textbf{z}^\tau \}_{\tau = 1}^\infty \) by (31). To show the boundedness of \(\{\textbf{v}^\tau \}_{\tau = 1}^\infty \), we consider

$$\begin{aligned} \Vert \textbf{v}^{T+1} - \textbf{v}^0\Vert _2 \le \frac{\mathrm{Lip_L}}{\rho }\Vert \textbf{z}^{T+1} - \textbf{z}^0\Vert _2, \; \forall \, T = 0,1,\dots \end{aligned}$$

which can be obtained similarly to (28). This implies that

$$\begin{aligned} \Vert \textbf{v}^{T+1}\Vert _2\le \Vert \textbf{v}^0\Vert _2 + \frac{\mathrm{Lip_L}}{\rho } (\Vert \textbf{z}^{T+1}\Vert _2+\Vert \textbf{z}^0\Vert _2), \; \forall \, T = 0,1,\dots \end{aligned}$$
(32)

Together with the boundedness of \(\{\textbf{z}^\tau \}_{\tau = 1}^\infty ,\) this shows that \(\{\textbf{v}^\tau \}_{\tau = 1}^\infty \) is bounded. By the Bolzano-Weierstrass theorem, the bounded sequence \(\{(\textbf{x}^\tau , \textbf{z}^\tau , \textbf{v}^\tau )\}_{\tau =1}^\infty \) has a convergent subsequence, denoted by \((\textbf{x}^{\tau _j}, \textbf{z}^{\tau _j}, \textbf{v}^{\tau _j}) \rightarrow (\textbf{x}^*, \textbf{z}^*, \textbf{v}^*)\) as \(\tau _j \rightarrow \infty \).

Since \(\mathcal {L}(\textbf{x}^\tau ,\textbf{z}^\tau ;\textbf{v}^\tau )\) is monotonically decreasing by Lemma 10 and bounded below by (31) and Assumption (A1), it converges. Letting \(T\rightarrow \infty \) in (30), we see that \(\sum _{\tau = 0}^\infty \Vert \textbf{x}^{\tau +1}-\textbf{x}^\tau \Vert _2^2\) and \(\sum _{\tau = 0}^\infty \Vert \textbf{z}^{\tau +1}-\textbf{z}^\tau \Vert _2^2\) are finite. Therefore, \(\textbf{x}^{\tau +1}-\textbf{x}^\tau \rightarrow 0\) and \(\textbf{z}^{\tau +1}-\textbf{z}^\tau \rightarrow 0\) as \(\tau \rightarrow \infty \). It further follows from Lemma 9 that \(\textbf{v}^{\tau +1}-\textbf{v}^\tau \rightarrow 0\) as well. As \((\textbf{x}^{\tau _j}, \textbf{z}^{\tau _j}, \textbf{v}^{\tau _j}) \rightarrow (\textbf{x}^*, \textbf{z}^*, \textbf{v}^*)\), we also have \((\textbf{x}^{\tau _j+1}, \textbf{z}^{\tau _j+1}, \textbf{v}^{\tau _j+1}) \rightarrow (\textbf{x}^*, \textbf{z}^*, \textbf{v}^*)\), and \(\textbf{x}^*=\textbf{z}^*\) due to the \(\textbf{v}\)-update.

We next show that \((\textbf{x}^*, \textbf{z}^*, \textbf{v}^*)\) is a stationary solution of (1). By the iterative scheme (16), we have

$$\begin{aligned} \mathcal {L}(\textbf{x}^{\tau _j+1},\textbf{z}^{\tau _j},\textbf{v}^{\tau _j}) \le \mathcal {L}(\textbf{x},\textbf{z}^{\tau _j},\textbf{v}^{\tau _j}) \quad \forall \textbf{x} \\ \mathcal {L}(\textbf{x}^{\tau _j+1},\textbf{z}^{\tau _j+1},\textbf{v}^{\tau _j}) \le \mathcal {L}(\textbf{x}^{\tau _j+1},\textbf{z},\textbf{v}^{\tau _j}) \quad \forall \textbf{z} \end{aligned}$$

Letting \(\tau _j \rightarrow \infty \), we have \(\mathcal {L}(\textbf{x}^*,\textbf{z}^*,\textbf{v}^*) \le \mathcal {L}(\textbf{x},\textbf{z}^*,\textbf{v}^*),\, \forall \textbf{x}\) and \(\mathcal {L}(\textbf{x}^*,\textbf{z}^*,\textbf{v}^*) \le \mathcal {L}(\textbf{x}^*,\textbf{z},\textbf{v}^*),\, \forall \textbf{z}\), which implies that

$$\begin{aligned} \lambda \displaystyle {\sum \limits _{k=1}^m} \displaystyle { \sqrt{|\mathcal {G}_k|} } \,p(\Vert \textbf{x}_{\mathcal {G}_k}^* \Vert _2 ) + \frac{\rho }{2}\Vert \textbf{v}^*\Vert _2^2 \le \lambda \displaystyle {\sum \limits _{k=1}^m} \displaystyle { \sqrt{|\mathcal {G}_k|} } \,p(\Vert \textbf{x}_{\mathcal {G}_k}\Vert _2 ) + \frac{\rho }{2} \Vert \textbf{x}-\textbf{z}^*+ \textbf{v}^*\Vert _2^2, \forall \textbf{x} \end{aligned}$$
(33)

and

$$\begin{aligned} L(\textbf{z}^*) + \frac{\rho }{2}\Vert \textbf{v}^*\Vert _2^2 \le L(\textbf{z}) + \frac{\rho }{2} \Vert \textbf{x}^*-\textbf{z}+ \textbf{v}^*\Vert _2^2, \forall \textbf{z}. \end{aligned}$$
(34)

Now fix an arbitrary \(\textbf{x}\), take the same \(\textbf{x}\) in (33), and set \(\textbf{z} = \textbf{x}\) in (34). Since \(\textbf{x}^*=\textbf{z}^*\), adding (33) and (34) yields

$$\begin{aligned}&L(\textbf{x}^*) + \lambda \displaystyle {\sum \limits _{k=1}^m} \displaystyle { \sqrt{|\mathcal {G}_k|} } \,p(\Vert \textbf{x}_{\mathcal {G}_k}^* \Vert _2 ) + \rho \Vert \textbf{v}^*\Vert _2^2 \nonumber \\&\hspace{1pc} \le L(\textbf{x}) + \lambda \displaystyle {\sum \limits _{k=1}^m} \displaystyle { \sqrt{|\mathcal {G}_k|} } \,p(\Vert \textbf{x}_{\mathcal {G}_k}\Vert _2 ) + \frac{\rho }{2}(\Vert \textbf{x}-\textbf{x}^*+ \textbf{v}^*\Vert _2^2+\Vert \textbf{x}^*-\textbf{x}+ \textbf{v}^*\Vert _2^2), \end{aligned}$$
(35)

for any \(\textbf{x}\). Define \(\widehat{F}_{\lambda }(\textbf{x}) \triangleq F_\lambda (\textbf{x}) + \rho \Vert \textbf{x} - \textbf{x}^*\Vert _2^2\). It follows from (35), together with the identity \(\Vert \textbf{x}-\textbf{x}^*+ \textbf{v}^*\Vert _2^2+\Vert \textbf{x}^*-\textbf{x}+ \textbf{v}^*\Vert _2^2 = 2\Vert \textbf{x}-\textbf{x}^*\Vert _2^2+2\Vert \textbf{v}^*\Vert _2^2\), that

$$\begin{aligned} \widehat{F}_\lambda (\textbf{x}^*) = F_\lambda (\textbf{x}^*) \le F_\lambda (\textbf{x}) + \rho \Vert \textbf{x} - \textbf{x}^*\Vert _2^2 = \widehat{F}_\lambda (\textbf{x}), \, \forall \textbf{x}, \end{aligned}$$

implying that \(\textbf{x}^*\) is a global minimizer of \(\widehat{F}_{\lambda }(\textbf{x})\) and hence a stationary point. We show that \(F_\lambda \) and \(\widehat{F}_{\lambda }\) have the same directional derivatives at the point \(\textbf{x} = \textbf{x}^*\) by the following calculation,

$$\begin{aligned} \widehat{F}_{\lambda }^{\, \prime }(\textbf{x}^*; \textbf{d})&= \lim _{h\rightarrow 0^+}\dfrac{\widehat{F}_{\lambda }(\textbf{x}^* + h \textbf{d})-\widehat{F}_{\lambda }(\textbf{x}^*) }{h} \\&= \lim _{h\rightarrow 0^+}\dfrac{F_{\lambda }(\textbf{x}^* + h \textbf{d}) + \rho \Vert \textbf{x}^* + h\textbf{d} - \textbf{x}^*\Vert _2^2 -F_{\lambda }(\textbf{x}^*) - \rho \Vert \textbf{x}^* - \textbf{x}^*\Vert _2^2}{h} \\&= \lim _{h\rightarrow 0^+}\dfrac{F_{\lambda }(\textbf{x}^* + h \textbf{d}) -F_{\lambda }(\textbf{x}^*) + \rho h^2 \Vert \textbf{d}\Vert _2^2}{h}\\&= \lim _{h\rightarrow 0^+}\dfrac{F_{\lambda }(\textbf{x}^* + h \textbf{d}) -F_{\lambda }(\textbf{x}^*) }{h} = F'_{\lambda }(\textbf{x}^*; \textbf{d}), \quad \forall \textbf{d}. \end{aligned}$$

Since \(\textbf{x}^*\) is a stationary point of \(\widehat{F}_{\lambda }(\textbf{x}),\) it is also a stationary point of \(F_\lambda (\textbf{x})\). \(\square \)

If neither L nor P is coercive, then we instead assume that the sequence generated by the ADMM framework is bounded. In other words, any generated sequence either is unbounded or has a subsequence convergent to a stationary point; see Theorem 12.

Theorem 12

(convergence without coerciveness) Let Assumptions (P3), (A1)-(A3) hold. Let \(\rho \) satisfy the condition given in Theorem 11. If the sequence \(\{(\textbf{x}^\tau , \textbf{z}^\tau , \textbf{v}^\tau )\}_{\tau =1}^\infty \) is bounded, then it has a subsequence convergent to a stationary point of problem (1).

Proof

If \(\{\textbf{x}^\tau \}_{\tau =1}^\infty \) is bounded, then \(\{\textbf{z}^\tau \}_{\tau =1}^\infty \) and \(\{\textbf{v}^\tau \}_{\tau =1}^\infty \) are bounded due to (31) and (32), respectively. The rest of the proof follows the same lines as that of Theorem 11 and is thus omitted. \(\square \)

Note that the Poisson loss does not have a globally Lipschitz gradient, since \(\psi ''(\theta _i)=\exp (\theta _i)\) is unbounded, and hence Assumption (A2) does not hold. As a result, Theorems 11-12 are not applicable to Poisson regression. Fortunately, we observe in the empirical studies of Sect. 5 that the ADMM with the Poisson loss function generally converges.

5 Numerical Experiments

In the numerical experiments, we examine group regularization methods that are widely used, including group LOG, group MCP, group SCAD, group transformed \(\ell _1\), and group LASSO. The definitions of the corresponding univariate regularization functions and the references for their proximal operators are listed in Table 1.

Table 1 A list of sparsity-promoting regularization functions and their proximal operators

LASSO is included as a baseline method for feature selection without the group structure. We consider three loss functions, for linear regression, Poisson regression, and logistic regression, in Sects. 5.1-5.3, respectively. We generate synthetic data for linear regression and Poisson regression, and we apply logistic regression to a real dataset of prostate cancer gene expression levels.

5.1 Synthetic Data for Linear Regression

We generate 50 triplets \((A,\textbf{x}^*, \textbf{b})\) of datasets for linear regression, each with 200 features and 200 observations, where \(A \in \mathbb {R}^{200 \times 200}\) is the feature matrix, \(\textbf{x}^* \in \mathbb {R}^{200}\) is the ground-truth vector, and \(\textbf{b} \in \mathbb {R}^{200}\) is the response vector. Each row of the feature matrix A, denoted by \(A_{i } \in \mathbb {R}^{200}\), is independently drawn from a multivariate Gaussian distribution with zero mean \(\textbf{0} \in \mathbb {R}^{200}\) and covariance matrix \(\varSigma \in \mathbb {R}^{200 \times 200}\), i.e., \(A_{i} {\mathop {\sim }\limits ^{iid}} \mathcal {N}(\textbf{0}, \varSigma )\). We set the variance of every feature to 1, i.e., \(\varSigma _{jj} = 1\), \(\forall j\). The off-diagonal covariances \(\varSigma _{jj'}\), \(j\ne j'\), are set according to one of the following cases:

  • Case 1. \(\varSigma _{jj'}=0\).

  • Case 2. \(\varSigma _{jj'}=0.2\) when \(j,j' \in \mathcal {G}_k\), \(\varSigma _{jj'}=0\) otherwise.

  • Case 3. \(\varSigma _{jj'}=0.5\) when \(j,j' \in \mathcal {G}_k\), \(\varSigma _{jj'}=0.2\) otherwise.

The three cases for the covariance matrix \(\varSigma \) represent various levels of correlation among the features. Specifically, Case 1 considers features that are completely independent of each other, while Cases 2 and 3 introduce positive correlations among some or all of the features. In Case 2, positive correlations are introduced only within each group. In Case 3, all feature pairs are positively correlated, with within-group correlations stronger than across-group correlations.

The ground-truth \(\textbf{x}^*\in \mathbb {R}^{d}\) (\(d=200\)) consists of 40 equal-size groups of 5 coefficients each, which are simultaneously zero or nonzero. We assume without loss of generality that \(x^*_1, \ldots , x^*_s\) are the nonzero coefficients \((s<d)\), whose indices \(1, \ldots , s\) are grouped into the first \(m'\) distinct groups \(\mathcal {G}_1, \ldots ,\mathcal {G}_{m'}\) \((m'<m)\). The indices in \(\mathcal {G}_{m'+1}, \ldots ,\mathcal {G}_{m}\) correspond to zero coefficients. We set the number of nonzero groups to one, three, or five for each of Cases 1-3. The coefficients in the nonzero group(s) are independently drawn from a uniform distribution between \(-5\) and 5, i.e., \(\textbf{x}^*_j {\mathop {\sim }\limits ^{iid}} U[-5,5]\). The response vector \(\textbf{b}\) is generated by the linear regression model

$$\begin{aligned} \textbf{b} = A \textbf{x}^* + \textbf{e}, \end{aligned}$$
(36)

where the noise \(\textbf{e} \in \mathbb {R}^{200}\) follows a Gaussian distribution, \(e_i {\mathop {\sim }\limits ^{iid}} \mathcal {N}(0,\tilde{\sigma }^2)\). Here, \(\tilde{\sigma }^2\) is the empirical counterpart of \(\text {Var}(A \textbf{x}^*)\), namely \(\tilde{\sigma }^2 =\sum _i( A_{i}\textbf{x}^*-\bar{A}\textbf{x}^*)^2/199\), where \(\bar{A}=\sum _i A_{i}/200\).
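As a non-authoritative illustration of this data-generating process, a minimal sketch follows; the helper names are ours, the groups are assumed to be contiguous blocks of five features, and Case 2 with three nonzero groups is shown as an example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, group_size = 200, 200, 5
m = d // group_size                      # 40 groups of 5 features each

def make_covariance(case):
    """Covariance matrix for Case 1, 2, or 3 as described above."""
    within = {1: 0.0, 2: 0.2, 3: 0.5}[case]
    across = {1: 0.0, 2: 0.0, 3: 0.2}[case]
    Sigma = np.full((d, d), across)
    for k in range(m):                   # within-group correlations
        idx = slice(k * group_size, (k + 1) * group_size)
        Sigma[idx, idx] = within
    np.fill_diagonal(Sigma, 1.0)         # unit variances
    return Sigma

def make_linear_data(case=2, nonzero_groups=3):
    Sigma = make_covariance(case)
    A = rng.multivariate_normal(np.zeros(d), Sigma, size=n)   # rows A_i ~ N(0, Sigma)
    x_star = np.zeros(d)
    s = nonzero_groups * group_size
    x_star[:s] = rng.uniform(-5, 5, size=s)                   # first groups are nonzero
    signal = A @ x_star
    sigma2 = np.sum((signal - signal.mean()) ** 2) / (n - 1)  # empirical Var(A x*)
    b = signal + rng.normal(0.0, np.sqrt(sigma2), size=n)     # response, cf. (36)
    return A, x_star, b

A, x_star, b = make_linear_data()
```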

Let \(\hat{\textbf{x}} \in \mathbb {R}^d\) be a solution reconstructed by any method, with support \(\hat{\mathcal {S}}=\{j: \hat{x}_j \ne 0\}\) and complement \(\hat{\mathcal {S}}^c=\{j: \hat{x}_j = 0\}\). Let \(\hat{\mathcal {M}}=\{k: \hat{x}_j \ne 0, \forall j \in \mathcal {G}_k \}\) and \(\hat{\mathcal {N}}=\{k: \hat{x}_j = 0, \forall j \in \mathcal {G}_k \}\) denote the index sets of groups whose coefficients are all estimated as nonzero and all estimated as zero, respectively. To quantitatively evaluate the performance of each regularization method, we consider the following standard metrics (a computational sketch is given after the list):

  1. Relative error of \(\hat{\textbf{x}} \triangleq \displaystyle \frac{\Vert \hat{\textbf{x}}-\textbf{x}^*\Vert _2}{\Vert \textbf{x}^*\Vert _2}\).

  2. Precision of \(\hat{\textbf{x}}\triangleq \displaystyle \frac{|\hat{\mathcal {S}} \cap \{1,\ldots ,s\}| }{|\hat{\mathcal {S}}|}\).

  3. Recall of \(\hat{\textbf{x}}\triangleq \displaystyle \frac{|\hat{\mathcal {S}} \cap \{1,\ldots , s\}| }{s}\).

  4. Element accuracy of \(\hat{\textbf{x}} \triangleq \displaystyle \frac{|\hat{\mathcal {S}} \cap \{1,\ldots ,s\}|+|\hat{\mathcal {S}}^c \cap \{s+1,\ldots ,d\}|}{d}\).

  5. Group accuracy of \(\hat{\textbf{x}} \triangleq \displaystyle \frac{|\hat{\mathcal {M}} \cap \{1,\ldots ,m'\}|+|\hat{\mathcal {N}} \cap \{m'+1,\ldots ,m\}|}{m}\).
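A minimal sketch of these metrics, assuming the groups are contiguous blocks of five coefficients and the first \(m'\) groups of \(\textbf{x}^*\) are the nonzero ones (the function and variable names are ours):

```python
import numpy as np

def evaluation_metrics(x_hat, x_star, group_size=5, n_nonzero_groups=3):
    """Relative error, precision, recall, element accuracy, and group accuracy
    as defined above; exact zeros are assumed for the estimated support."""
    d = x_star.size
    m = d // group_size
    s = n_nonzero_groups * group_size          # indices 0..s-1 are the true nonzeros
    S_hat = np.flatnonzero(x_hat != 0)         # estimated support
    Sc_hat = np.flatnonzero(x_hat == 0)

    rel_err = np.linalg.norm(x_hat - x_star) / np.linalg.norm(x_star)
    precision = np.sum(S_hat < s) / max(len(S_hat), 1)   # guard against empty support
    recall = np.sum(S_hat < s) / s
    element_acc = (np.sum(S_hat < s) + np.sum(Sc_hat >= s)) / d

    groups = x_hat.reshape(m, group_size)      # group k is rows k*5..k*5+4
    all_nonzero = np.all(groups != 0, axis=1)  # k in M_hat
    all_zero = np.all(groups == 0, axis=1)     # k in N_hat
    group_acc = (np.sum(all_nonzero[:n_nonzero_groups])
                 + np.sum(all_zero[n_nonzero_groups:])) / m
    return rel_err, precision, recall, element_acc, group_acc
```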

For tuning the hyperparameter \(\lambda \), we split the dataset \((A, \textbf{b})\) into two equal-size datasets: a training dataset \((A_{tr}, \textbf{b}_{tr}) \in \mathbb {R}^{100 \times 200} \times \mathbb {R}^{100}\) and a validation dataset \((A_v, \textbf{b}_v) \in \mathbb {R}^{100 \times 200} \times \mathbb {R}^{100}\). We solve the following optimization problem with the training dataset for different penalty functions \(p(\cdot )\):

$$\begin{aligned} \min _{\textbf{x}, x_0} \frac{1}{100}\Vert A_{tr} \textbf{x} + x_0 \textbf{1} - \textbf{b}_{tr}\Vert _2^2 + \lambda \, \displaystyle \sum \limits _{k=1}^{40} \displaystyle \sqrt{|\mathcal {G}_k| } p(\Vert \textbf{x}_{\mathcal {G}_k}\Vert _2). \end{aligned}$$
(37)

For the linear regression problem (37), the \(\textbf{z}\)-subproblem has a closed-form solution, as detailed in Appendix A. The hyperparameter settings for the different penalty functions are summarized in Table 2. Specifically, for group LOG, group transformed \(\ell _1\), group LASSO, and LASSO, we tune the hyperparameter \(\lambda \) over 50 logarithmically spaced values (generated by the Matlab function logspace) from \(10^{-4}\) to 10. For group MCP and group SCAD, we tune the log-spaced hyperparameter \(\tilde{\lambda }\) from \(10^{-4}\) to 10 with \(\lambda = 1\); fixing \(\lambda =1\) conforms to their standard formulations in [14, 68]. Although the data generated from (36) do not involve an intercept \(x_0\), we include it in the estimation to mimic a realistic setting with no prior information about the intercept. The numerical experiments are performed without regularization on the intercept so that the estimated solution admits a straightforward interpretation. The intercept could be regularized by treating it as a standalone group, for which the convergence theory extends easily. Given a specific value of the hyperparameters, we obtain an estimated vector \(\hat{\textbf{x}}\) and an intercept \(\hat{x}_0\), from which we compute the mean squared error (MSE) on the validation set, \(\text {MSE}_{\text {lin}} \triangleq \Vert A_{v} \hat{\textbf{x}} + \hat{x}_0 \textbf{1} - \textbf{b}_{v}\Vert _2^2/100\). The hyperparameter value yielding the smallest MSE over the preset range is selected. We do not tune the algorithmic parameter \(\rho \), as it only affects the convergence speed, not the performance. Generally, a larger value of \(\rho \) results in a longer computation time, whereas a very small value of \(\rho \) may cause our algorithm to diverge, as the \(\textbf{x}\)-subproblem is then nonconvex and thus ill-posed. In practice, we choose \(\rho = 1\) for group LASSO, and \(\rho = 2\lambda \mathrm{Lip_{p}} \mathcal {G}^{\max }\) for group LOG, MCP, SCAD, and TL1, which ensures that the \(\textbf{x}\)-subproblem is convex.
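The tuning procedure amounts to a grid search over the preset \(\lambda \) values using the validation MSE. A sketch follows; `admm_group` is a hypothetical solver for (37) returning \((\hat{\textbf{x}}, \hat{x}_0)\), a placeholder rather than the paper's implementation.

```python
import numpy as np

def tune_lambda(A_tr, b_tr, A_v, b_v, groups, penalty="group_log"):
    """Grid search over 50 log-spaced lambdas; keep the one with the smallest
    validation MSE, as described above. `admm_group` is a hypothetical solver."""
    lambdas = np.logspace(-4, 1, 50)            # 10^-4, ..., 10, as in the text
    best = (np.inf, None, None, None)
    for lam in lambdas:
        # rho = 2 * lam * Lip_p * G_max would mirror the choice stated in the text
        x_hat, x0_hat = admm_group(A_tr, b_tr, groups, lam=lam, penalty=penalty)
        resid = A_v @ x_hat + x0_hat - b_v
        mse = np.dot(resid, resid) / len(b_v)   # MSE_lin on the validation set
        if mse < best[0]:
            best = (mse, lam, x_hat, x0_hat)
    return best                                 # (mse, lambda, x_hat, x0_hat)
```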

Table 2 Hyperparameter settings for different penalty functions
Table 3 Results for synthetic linear regression datasets with Case 1 covariance matrix

Table 3 summarizes the results of all eight methods (TL1 stands for transformed \(\ell _1\)) for the datasets with Case 1 covariance. Note that the relative comparison among the various models does not depend on the number of nonzero groups. We additionally consider the oracle method, which sets the zero coefficients of \(\textbf{x}^*\) to zero a priori and obtains the solution by least-squares minimization without any regularization. The nonconvex regularizers outperform the convex regularizers, group LASSO and LASSO. The average relative errors of the nonconvex regularizers are very close to those of the oracle method. In particular, group LOG successfully recovers the sparsity and group structure and has smaller average relative errors than the oracle method. Although the sparsity and group structure of the oracle method perfectly align with the ground-truth, group LOG outperforms the oracle solution, which is degraded by the substantial noise level. Both LASSO and group LASSO have low average precision but high average recall, indicating that both methods tend to estimate coefficients that are zero in the ground-truth as nonzero. LASSO does not recover the group structure, as the group structure is not incorporated in its formulation. Note that the union of the group index sets \(\hat{\mathcal {M}}\) and \(\hat{\mathcal {N}}\) is the whole group index set \(\{1,\ldots , m\}\) for the group regularizers and the oracle method, while LASSO may leave some group indices in neither of the two sets. Estimation with a larger number of nonzero groups is a more challenging task, as indicated by the increase in the relative errors and the decrease in the precision, accuracy, and group accuracy metrics. Similar patterns are observed in the results with Case 2 and Case 3 covariances; refer to Tables 6 and 7 in Appendix C for more details.

Fig. 1 Convergence comparison of the ADMM for the group LOG penalty to the BSUM algorithm [20] and an iterative algorithm [23] in terms of the objective value (left) and the relative error to the ground-truth (right) with respect to the computation time

We compare the ADMM to a block successive upper-bound minimization (BSUM) method [20] and an iteratively reweighted algorithm [23, Algorithm 1] for the group LOG penalty. The latter algorithm is a special case of the successive upper-bound minimization algorithm [47] that solves a sequence of convex subproblems whose weights depend on the current iterate; we refer to it as the iteratively reweighted algorithm. We note that the iterates of these algorithms are shown to converge to the same type of stationary solution. We run the algorithms in an identical computational environment with the Case 1 covariance matrix (\(\varSigma _{jj'}=0\) for \(j\ne j'\)) and three nonzero groups in the ground-truth. Figure 1 plots the decrease in the objective values and the relative errors of the iterates with respect to computation time. Although the methods reach the same level of objective value and relative error in the end, Fig. 1 indicates that the ADMM converges faster than the iteratively reweighted algorithm when both subproblems in (16) have closed-form solutions. Both the ADMM and the iteratively reweighted algorithm are much faster than BSUM.

5.2 Synthetic Data for Poisson Regression

Poisson regression is a generalized linear model that takes count data as its response. We generate 50 triplets \((A,\textbf{x}^*, \textbf{b})\), where \(\textbf{x}^* \in \mathbb {R}^{200}\) is the ground-truth vector and \(\textbf{b} \in \mathbb {R}^{200}\) is the response vector generated from a Poisson distribution, i.e.,

$$\begin{aligned} b_i {\mathop {\sim }\limits ^{iid}} \text {Poisson}\big (\exp (A_i \textbf{x}^*)\big ). \end{aligned}$$

The feature matrix \(A \in \mathbb {R}^{200 \times 200}\) is generated in the same manner as in Sect. 5.1 under the Case 1, 2, and 3 covariance settings. One, three, or five groups of 5 coefficients in the ground-truth \(\textbf{x}^*\) are nonzero, and the remaining coefficients are zero. The nonzero coefficients are independently drawn from a uniform distribution between \(-0.5\) and 0.5, i.e., \(\textbf{x}^*_j {\mathop {\sim }\limits ^{iid}} U[-0.5,0.5]\). As in Sect. 5.1, we randomly split the dataset \((A, \textbf{b})\) into training and validation sets. For the training dataset \((A_{tr}, \textbf{b}_{tr})\), we consider the following optimization problem with different penalty functions \(p(\cdot )\):

$$\begin{aligned}&\min _{\textbf{x}, x_0} \frac{1}{100}\sum _{i=1}^{100} \left[ \exp \Big ((A_{tr})_{i} \textbf{x} + x_0 \Big ) - (b_{tr})_{i} \Big ( ( A_{tr})_{i} \textbf{x} + x_0 \Big ) \right] + \lambda \, \displaystyle {\sum \limits _{k=1}^{40}} \displaystyle \sqrt{|\mathcal {G}_k| } p(\Vert \textbf{x}_{\mathcal {G}_k}\Vert _2). \end{aligned}$$

We adopt Newton’s method, described in Appendix A, to solve the \(\textbf{z}\)-subproblem. Newton’s method terminates when the distance between the current and previous iterates is less than \(10^{-3}\). Our empirical study shows that the stopping tolerance of the subproblem does not have a significant influence on the final results. We use 50 hyperparameter values from \(10^{-4}\) to 10 (\(\lambda \) for group LOG, group transformed \(\ell _1\), group LASSO, and LASSO, and \(\tilde{\lambda }\) for group MCP and group SCAD). Given a specific hyperparameter value, we obtain the estimated coefficients \(\hat{\textbf{x}}\) and the intercept \(\hat{x}_0\). For hyperparameter selection, we use the mean squared error (MSE) on the validation set \((A_v, \textbf{b}_v)\), defined for Poisson regression as

$$\begin{aligned} \text {MSE}_{\text {pois}} \triangleq \Vert \exp (A_{v} \hat{\textbf{x}} + \hat{x}_0 \textbf{1}) - \textbf{b}_{v}\Vert _2^2, \end{aligned}$$

where \(\exp \) of a vector is applied componentwise. The hyperparameter value giving the smallest MSE is chosen. Note that there are alternatives to MSE for Poisson regression, such as the deviance and the mean squared error of the log response, which alleviate the sensitivity to large predicted values. Since the ground-truth is known, we use the same evaluation metrics as in linear regression.
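For the Poisson loss, the \(\textbf{z}\)-subproblem takes the form \(\min _{\textbf{z}} L(\textbf{z}) + \frac{\rho }{2}\Vert \textbf{x}^{\tau +1} - \textbf{z} + \textbf{v}^{\tau }\Vert _2^2\) (cf. (34)). A minimal Newton sketch is shown below; it omits the intercept for brevity and does not necessarily match the exact update in Appendix A.

```python
import numpy as np

def poisson_z_subproblem(A, b, x_plus_v, rho, z0=None, tol=1e-3, max_iter=50):
    """Newton's method for min_z L(z) + rho/2 * ||x_plus_v - z||^2, where
    L(z) = (1/n) * sum_i [exp(a_i @ z) - b_i * (a_i @ z)] is the Poisson loss.
    Illustrative sketch only; the intercept term is omitted."""
    n, d = A.shape
    z = np.zeros(d) if z0 is None else z0.copy()
    for _ in range(max_iter):
        mu = np.exp(A @ z)                                # predicted Poisson means
        grad = A.T @ (mu - b) / n - rho * (x_plus_v - z)  # gradient of the subproblem
        hess = A.T @ (A * mu[:, None]) / n + rho * np.eye(d)
        z_new = z - np.linalg.solve(hess, grad)           # Newton step
        if np.linalg.norm(z_new - z) < tol:               # stopping rule from the text
            return z_new
        z = z_new
    return z
```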

Table 4 Results for synthetic Poisson regression datasets with Case 1 covariance matrix

Table 4 summarizes the results of the eight approaches under the Case 1 covariance setting. The five nonconvex regularizers perform well, among which group LOG attains the smallest relative error. Group LASSO and LASSO perform poorly in terms of relative error, precision, and group accuracy. Tables 8 and 9 in Appendix C present the results for the Case 2 and Case 3 covariance settings and lead to similar conclusions.

5.3 Prostate Cancer Gene Expression Dataset for Logistic Regression

Prostate cancer is the most commonly diagnosed non-skin cancer and the second leading cause of cancer death among men in the United States [7]. In order to identify prostate cancer risk genes, we analyze gene expression levels of prostate samples collected from 102 study participants [50]. Expression levels of 6033 genes were measured from 52 prostate cancer patients and 50 normal controls. For our model fitting, we choose the 50 genes with the largest median absolute deviation in the participants’ gene expression levels. The response vector consists of binary values, with 0 for a cancer patient and 1 for a normal control. We generate a feature matrix of dimension \(102 \times 150\) from a cubic B-spline basis expansion of the 50 genes, where each row corresponds to one participant and each block of three columns corresponds to one gene.

To evaluate the performance of the seven regularization methods for the logistic regression model, we perform 50 random splits of the dataset into a training set \((A_{tr}, \textbf{b}_{tr} ) \in \mathbb {R}^{82 \times 150} \times \mathbb {R}^{82}\) and a test set \((A_{test}, \textbf{b}_{test} ) \in \mathbb {R}^{20 \times 150} \times \mathbb {R}^{20}\). The model coefficients \( \hat{\textbf{x}}\) and \(\hat{x}_0\) are estimated by minimizing the regularized logistic loss, i.e.,

$$\begin{aligned}&\min _{\textbf{x}, x_0} \frac{1}{82}\sum _{i=1}^{82} \left[ \log \left\{ 1+ \exp \Big ( (A_{tr})_i \textbf{x} + x_0 \Big )\right\} - (b_{tr})_i\Big ((A_{tr})_i \textbf{x} + x_0 \Big )\right] \\&\quad + \lambda \, \displaystyle {\sum \limits _{k=1}^{50}} \displaystyle \sqrt{|\mathcal {G}_k| } p(\Vert \textbf{x}_{\mathcal {G}_k}\Vert _2). \end{aligned}$$

As the Hessian matrix of this problem is nearly singular, we apply standard gradient descent to solve the \(\textbf{z}\)-subproblem. The gradient descent terminates when the distance between the current and previous iterates is less than \(10^{-3}\). The oracle method is not available for this experiment since the ground-truth is not known. The performance metrics of each method are given below (a computational sketch follows the list):

  1. Prediction error \(\triangleq \displaystyle { \frac{1}{20} \sum _{i=1}^{20} \Big | \textbf{1}\left( (\hat{b}_{test})_i > 0.5\right) - (b_{test})_{i}\Big |}\), where

    $$\begin{aligned} (\hat{b}_{test})_i = \frac{1}{1+\exp (-\hat{x}_0- ( A_{test})_{i} \hat{\textbf{x}})} \end{aligned}$$

    and \(\textbf{1}(\varOmega )\) is the indicator function that returns 1 if condition \(\varOmega \) holds and 0 otherwise.

  2. Area under the receiver operating characteristic curve (AUC) \(\displaystyle \triangleq \int _{0}^{1} TPR(t) \, dt \), where

    $$\begin{aligned}&TPR(t) \triangleq \frac{TP(t)}{TP(t)+FN(t)},\\&TP(t) \triangleq \sum _{i=1}^{20} \textbf{1}\left( (\hat{b}_{test})_i > t\right) (b_{test})_i, \text { and }\\&FN(t) \triangleq \sum _{i=1}^{20} \textbf{1}\left( (\hat{b}_{test})_i \le t\right) (b_{test})_i. \end{aligned}$$

  3. Coefficient selection rate \(\triangleq \displaystyle {\frac{1}{d} \sum \limits _{j=1}^{d} \textbf{1}(\hat{x}_j \ne 0)}\).

  4. Group selection rate \(\triangleq \displaystyle {\frac{1}{m} \sum \limits _{k=1}^{m} \textbf{1}(\hat{\textbf{x}}_{\mathcal {G}_k} \ne \varvec{0})}\), where \(\hat{\textbf{x}}_{\mathcal {G}_k}\) is the coefficient subvector corresponding to the k-th group.
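A minimal sketch of these metrics follows; the function names are ours, the AUC integral is evaluated on a fine grid of thresholds following the definition above, and the gene groups are assumed to be contiguous blocks of three columns.

```python
import numpy as np

def logistic_metrics(x_hat, x0_hat, A_test, b_test, group_size=3):
    """Prediction error, AUC (as defined above), and selection rates."""
    b_prob = 1.0 / (1.0 + np.exp(-(x0_hat + A_test @ x_hat)))  # predicted probabilities
    pred_err = np.mean(np.abs((b_prob > 0.5).astype(float) - b_test))

    # AUC per the definition above: integrate TPR(t) over thresholds t in [0, 1]
    ts = np.linspace(0.0, 1.0, 1001)
    tpr = np.array([np.sum((b_prob > t) * b_test) / np.sum(b_test) for t in ts])
    auc = np.sum(0.5 * (tpr[1:] + tpr[:-1]) * np.diff(ts))      # trapezoidal rule

    d = x_hat.size
    m = d // group_size
    coef_rate = np.mean(x_hat != 0)
    group_rate = np.mean(np.any(x_hat.reshape(m, group_size) != 0, axis=1))
    return pred_err, auc, coef_rate, group_rate
```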

For all methods, we perform 5-fold cross-validation on the 50 training sets to tune the hyperparameters (\(\lambda \) for group LOG, group transformed \(\ell _1\), group LASSO, and LASSO, and \(\tilde{\lambda }\) for group MCP and group SCAD) by maximizing AUC. The other parameters are fixed as in Table 2, except that \(\varepsilon = \texttt {1e-4}\) for group LOG. For group LOG, we tune the log-spaced hyperparameter \(\lambda \) from \(10^{-4}\) to \(10^{-2}\). For group transformed \(\ell _1\) and group LASSO, we tune the log-spaced hyperparameter \(\lambda \) from \(10^{-3}\) to \(10^{-1}\). For group MCP and group SCAD, we tune the log-spaced hyperparameter \(\tilde{\lambda }\) from \(10^{-3}\) to \(10^{-1}\) while fixing the hyperparameter \(\lambda = 1\). With the hyperparameter selected by cross-validation, the model coefficients \( \hat{\textbf{x}}\) and \(\hat{x}_0\) are estimated on the whole training set, and the performance is evaluated on the corresponding test set.

Table 5 Results for prostate cancer gene dataset

Table 5 exhibits the performance of the seven regularization methods with logistic regression. The nonconvex group penalties, except for group transformed \(\ell _1\), have smaller prediction errors and higher AUCs than group LASSO. As seen in the coefficient and group selection rates, both group MCP and group SCAD produce non-sparse coefficients while attaining marginally higher AUC values. LASSO has, on average, the smallest coefficient selection rate, but by construction it does not recover the group structure. Taking both prediction accuracy and group sparsity recovery into consideration, group LOG shows satisfactory results compared to the other methods.

6 Conclusion

We studied a generalized formulation of the non-overlapping group selection problem, which encompasses many existing works through specific choices of loss functions and sparsity-promoting functions. We analyzed the properties of a directional stationary solution to our proposed model, demonstrating its global optimality under certain conditions and providing a probabilistic bound on its distance to a reference point that serves as a proxy for the ground-truth. We applied the ADMM framework with a proximal operator to iteratively minimize the generalized formulation, which is commonly nonconvex and nonsmooth, and proved the subsequence convergence of the algorithm to a stationary point. Global convergence and inexact ADMM [61, 62] (in which the subproblems are solved approximately), which require additional conditions on the loss function, are left for future work. In the numerical experiments, we tested our algorithm on synthetic datasets with linear and Poisson regression, showing that nonconvex group regularization methods often outperform the convex approaches in recovering the ground-truth. The analysis of prostate cancer gene expression data confirmed that our proposed model successfully produces solutions with group sparsity, with nonconvex group regularization methods outperforming group LASSO.