Discrete optimization methods for group model selection in compressed sensing

Abstract

In this article we study the problem of signal recovery for group models. More precisely for a given set of groups, each containing a small subset of indices, and for given linear sketches of the true signal vector which is known to be group-sparse in the sense that its support is contained in the union of a small number of these groups, we study algorithms which successfully recover the true signal just by the knowledge of its linear sketches. We derive model projection complexity results and algorithms for more general group models than the state-of-the-art. We consider two versions of the classical iterative hard thresholding algorithm (IHT). The classical version iteratively calculates the exact projection of a vector onto the group model, while the approximate version (AM-IHT) uses a head- and a tail-approximation iteratively. We apply both variants to group models and analyse the two cases where the sensing matrix is a Gaussian matrix and a model expander matrix. To solve the exact projection problem on the group model, which is known to be equivalent to the maximum weight coverage problem, we use discrete optimization methods based on dynamic programming and Benders’ decomposition. The head- and tail-approximations are derived by a classical greedy-method and LP-rounding, respectively.

Introduction

In many applications involving sensors or sensing systems an unknown sparse signal has to be recovered from a relatively small number of measurements. The reconstruction problem in standard compressed sensing attempts to recover an unknown k-sparse signal \({\mathbf {x}}\in {\mathbb {R}}^N\), i.e. it has at most k non-zero entries, from its (potentially noisy) linear measurements \({\mathbf {y}}= {\mathbf {A}}{\mathbf {x}}+ {\mathbf {e}}\). Here, \({\mathbf {A}}\in {\mathbb {R}}^{m\times N}\) for \(m\ll N\), \({\mathbf {y}}\in {\mathbb {R}}^m\) and \({\mathbf {e}}\in {\mathbb {R}}^m\) is a noise vector, typically with a bounded noise level \(\Vert {\mathbf {e}}\Vert _2 \le \eta \); see [13, 14, 19]. A well-known result is that, if \({\mathbf {A}}\) is a random Gaussian matrix, the number of measurements required for most of the classical algorithms like \(\ell _1\)-minimization or Iterative Hard Thresholding (IHT) to successfully recover the true signal is \(m={\mathcal {O}}\left( k\log \left( N/k\right) \right) \) [10, 15].

In model-based compressed sensing we exploit second-order structures beyond the first-order sparsity and compressibility structures of a signal to encode the signal more efficiently and decode it more accurately. Efficient encoding means taking fewer measurements than in the standard compressed sensing setting, while accurate decoding includes not only a smaller recovery error but also better interpretability of the recovered solution than in the standard compressed sensing setting. The second-order structures of the signal are usually referred to as the structured sparsity of the signal. The idea is to take into account, besides standard sparsity, more complicated structures of the signal [5]. Nevertheless, most of the classical results and algorithms for standard compressed sensing can be adapted to the model-based framework [5].

Numerous applications of model-based compressed sensing exist in practice. Key amongst these applications is the multiple measurement vector (MMV) problem, which can be modelled as a block-sparse recovery problem [21]. The tree-sparse model has been well-exploited in a number of wavelet-based signal processing applications [5]. In the sparse matrix setting (see Sect. 2) model-based compressed sensing was used to solve the Earth Mover Distance (EMD) problem. The EMD problem, introduced in [53], is motivated by the task of reconstructing time sequences of spatially sparse signals, e.g. seismic measurements; see also [26]. In addition, there are many more potential applications in linear sketching, including data streaming [47], graph sketching [1, 25], breaking privacy of databases via aggregate queries [20], and sparse regression code or sparse superposition code (SPARC) decoding [38, 54], which is also an MMV problem.

Structured sparsity models include tree-sparse, block-sparse, and group-sparse models. For instance, for tree-sparse models with dense Gaussian sensing matrices it has been established in [5] that the number of required measurements to ensure recovery is \(m={\mathcal {O}}(k)\) as opposed to \(m={\mathcal {O}}\left( k\log \left( N/k\right) \right) \) in the standard compressed sensing setting. Furthermore, in the sparse matrix setting, precisely for adjacency matrices of model expander graphs (also known as model expander matrices), the tree-sparse model only requires \(m = {\mathcal {O}}\left( \log _k\left( N/k\right) \right) \) measurements [3, 34], which is smaller than the standard compressed sensing sampling rate stated above. Moreover, all proposed algorithms that perform an exact model projection, i.e. that find the closest vector in the model space for a given signal, guarantee recovery of a solution belonging to the model space, which is then more interpretable than the output of off-the-shelf standard compressed sensing algorithms [3].

As the exact model projection problem used in many of the classical algorithms may become theoretically and computationally hard for specific sparsity models, approximation variants of some well-known algorithms like the Model-IHT have been introduced in [27]. Instead of iteratively solving the exact model projection problem, this algorithm, called AM-IHT, uses a head- and a tail-approximation to recover the signal which is computationally less demanding in general. The latter computational benefit comes along with a larger number of measurements required to obtain successful recovery with weaker recovery guarantees: the typical speed versus accuracy trade-off.

A special class of structured sparsity models are group models, where the support of the signal is known to be contained in the union of a small number of groups of indices. Group models have already been studied extensively in the compressed sensing literature; see [6, 36, 49]. Their use is motivated by several applications, e.g. in image processing; see [44, 50, 51]. As shown in [4], the exact projection problem for group models is NP-hard in general but can be solved in polynomial time by dynamic programming if the intersection graph of the groups has no cycles. The latter case is quite restrictive, since as a consequence each element is contained in at most two groups. In this work we extend existing results for the Model-IHT algorithm and its approximation variant (AM-IHT) derived in [27] to group models and model expander matrices. We focus on deriving discrete optimization methods to solve the exact projection problem and the head- and tail-approximations for much more general classes of group models than the state-of-the-art.

In Sect. 2 we present the main preliminary results regarding compressed sensing for structured sparsity models and group models. In Sect. 3 we study recovery algorithms using exact projection oracles. We first show that for group models with low treewidth, the projection problem can be solved in polynomial time by dynamic programming which is a generalization of the result in [4]. We then adapt known theoretical results for model expander matrices to these more general group models. To solve the exact projection problem for general group models we apply a Benders’ decomposition procedure. It can be even used for the more general assumption that we seek a signal which is group-sparse and additionally sparse in the classical sense. In Sect. 4 we study recovery algorithms using approximation projection oracles, namely head- and tail-approximations. We apply the known results in [27] to group models of low frequency and show that the required head- and tail-approximations for group models can be solved by a classical greedy-method and LP rounding, respectively. In Sect. 5 we test all algorithms, including Model-IHT, AM-IHT, MEIHT and AM-EIHT, on overlapping block-groups and compare the number of required measurements, iterations and the run-time.

Summary of our contributions

  • We study the Model-Expander IHT (MEIHT) algorithm, which was analysed in [3] for tree-sparse and loopless group-sparse signals, and extend the existing results to general group models, proving convergence of the algorithm.

  • We extend the results in [4] by proving that the projection problem can be solved in polynomial time if the incidence graph of the underlying group model has bounded treewidth. This includes the case when the intersection graph has bounded treewidth, which generalizes the result for acyclic graphs derived in [4]. We complement the latter result with a hardness result that we use to justify the bounded treewidth approach.

  • We derive a Benders’ decomposition procedure to solve the projection problem for arbitrary group models, assuming no restriction on the frequency or the structure of the groups. The latter procedure even works for the more general model combining group-sparsity with classical sparsity. We integrate the latter procedure into the Model-IHT and MEIHT algorithm.

  • We apply the Approximate-Model IHT (AM-IHT) derived in [26, 28] to Gaussian and expander matrices and to the case of group models with bounded frequency, which is the maximal number of groups an element is contained in. In the expander case we call the algorithm AM-EIHT. To this end we derive both head- and tail-approximations of arbitrary precision using a classical greedy method and LP-rounding. Using the AM-IHT and the results in [26, 28], this implies compressive sensing \(\ell _2/\ell _2\) recovery guarantees for group-sparse signals. We show that the number of measurements needed to guarantee a successful recovery exceeds the number needed by the usual model-based compressed sensing bound [5, 9] only by a constant factor.

  • We test the algorithms Model-IHT, MEIHT, AM-IHT and AM-EIHT on groups given by overlapping blocks for random signals and measurement matrices. We analyse and compare the minimal number of measurements needed for recovery, the run-time and the number of iterations of the algorithm.

Preliminaries

Notation

In most of this work scalars are denoted by ordinary letters (e.g. x, N), vectors and matrices by boldface letters (e.g. \(\mathbf{x}\), \(\mathbf{A}\)), and sets by calligraphic capital letters (e.g., \(\mathcal {S}\)).

The cardinality of a set \(\mathcal {S}\) is denoted by \(|\mathcal {S}|\) and we define \([N] := \{1, \ldots , N\}\). Given \(\mathcal {S} \subseteq [N]\), its complement is denoted by \(\mathcal {S}^c := [N] {\setminus } \mathcal {S}\) and \({\mathbf {x}}_\mathcal {S}\) is the restriction of \({\mathbf {x}}\in {\mathbb {R}}^N\) to \(\mathcal {S}\), i.e.

$$\begin{aligned} ({\mathbf {x}}_\mathcal {S})_i = {\left\{ \begin{array}{ll} x_i, &{}\quad \text{ if } ~i \in \mathcal {S},\\ 0 &{}\quad \text{ otherwise }.\end{array}\right. } \end{aligned}$$

The support of a vector \({\mathbf {x}}\in {\mathbb {R}}^N\) is defined by \({\text {supp}}({\mathbf {x}})=\left\{ i\in [N] \ : \ x_i\ne 0\right\} \). For a given \(k\in {\mathbb {N}}\) we say a vector \({\mathbf {x}}\in {\mathbb {R}}^N\) is k-sparse if \(|{\text {supp}}({\mathbf {x}})|\le k\). For a matrix \({\mathbf {A}}\), the matrix \({\mathbf {A}}_{\mathcal {S}}\) denotes a sub-matrix of \({\mathbf {A}}\) with columns indexed by \({\mathcal {S}}\). For a graph \(G=(V,E)\) and \({\mathcal {S}}\subseteq V\), \(\varGamma ({\mathcal {S}})\) denotes the set of neighbours of \({\mathcal {S}}\), that is the set of nodes that are connected by an edge to the nodes in \({\mathcal {S}}\). We denote by \(e_{ij} = (i,j)\) an edge connecting node i to node j. The set \({\mathcal {G}}_i\) denotes a group of size \(g_i\) and a group model is any subset of \(\mathfrak {G} = \{{\mathcal {G}}_1,\ldots ,{\mathcal {G}}_M\}\); while a group model of order \(G\in [N]\) is denoted by \({\mathfrak {G}}_G\), which is a collection of any G groups of \({\mathfrak {G}}\). For a subset of groups \({\mathcal {S}}\subset {\mathfrak {G}}\) we sometimes write

$$\begin{aligned} \bigcup {\mathcal {S}}:=\bigcup _{S\in {\mathcal {S}}}S. \end{aligned}$$

The \(\ell _p\) norm of a vector \(\mathbf{x} \in {\mathbb {R}}^N\) is defined as

$$\begin{aligned}\Vert \mathbf{x}\Vert _p := \left( \sum _{i=1}^N |x_i|^p \right) ^{1/p}.\end{aligned}$$
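The notation introduced so far can be made concrete with a small sketch. The helper names `restrict`, `support` and `lp_norm` are illustrative assumptions (they do not appear in the paper), vectors are plain Python lists, and indices are 0-based rather than 1-based. The norm takes absolute values of the entries, as required for odd p and negative entries.

```python
def restrict(x, S):
    """Return x_S: entries indexed by S are kept, all others are set to zero."""
    S = set(S)
    return [xi if i in S else 0.0 for i, xi in enumerate(x)]


def support(x):
    """Return supp(x), the set of indices of the non-zero entries."""
    return {i for i, xi in enumerate(x) if xi != 0}


def lp_norm(x, p):
    """Return the l_p norm (sum_i |x_i|^p)^(1/p) for p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


x = [3.0, 0.0, -4.0, 1.0]
x_S = restrict(x, {0, 2})        # zero outside S = {0, 2}
is_3_sparse = len(support(x)) <= 3
```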

Compressed sensing

Recall that the reconstruction problem in standard compressed sensing [13, 14, 19] attempts to recover an unknown k-sparse signal \({\mathbf {x}}\in {\mathbb {R}}^N\), from its (potentially noisy) linear measurements \({\mathbf {y}}= {\mathbf {A}}{\mathbf {x}}+ {\mathbf {e}}\), where \({\mathbf {A}}\in {\mathbb {R}}^{m\times N}\), \({\mathbf {y}}\in {\mathbb {R}}^m\) for \(m\ll N\) and \({\mathbf {e}}\in {\mathbb {R}}^m\) is a noise vector, typically with a bounded noise level \(\Vert {\mathbf {e}}\Vert _2 \le \eta \). The reconstruction problem can be formulated as the optimization problem

$$\begin{aligned} \min _{{\mathbf {x}}\in {\mathbb {R}}^N} ~\Vert {\mathbf {x}}\Vert _0 \quad \text{ subject } \text{ to } \quad \Vert {\mathbf {A}}{\mathbf {x}}- {\mathbf {y}}\Vert _2 \le \eta , \end{aligned}$$
(1)

where \(\Vert {\mathbf {x}}\Vert _0\) is the number of non-zero components of \({\mathbf {x}}\). Problem (1) is usually relaxed to an \(\ell _1\)-minimization problem by replacing \(\Vert {\cdot } \Vert _0\) with the \(\ell _1\)-norm. It has been established that the solution minimizing the \(\ell _1\)-norm coincides with the optimal solution of (1) under certain conditions [14]. Besides the latter approach the compressed sensing problem can be solved by a class of greedy algorithms, including the IHT [10]. A detailed discussion on compressed sensing algorithms can be found in [22].

The idea behind the IHT can be explained by considering the problem

$$\begin{aligned} \min _{{\mathbf {x}}\in {\mathbb {R}}^N} ~\Vert {\mathbf {A}}{\mathbf {x}}- {\mathbf {y}}\Vert ^2_2 \quad \text{ subject } \text{ to } \quad \Vert {\mathbf {x}}\Vert _0 \le k. \end{aligned}$$
(2)

Under certain choices of \(\eta \) and k the latter problem is equivalent to (1) [10]. Based on the idea of gradient descent methods (2) can be solved by iteratively taking a gradient descent step, followed by a hard thresholding operation, which sets all components to zero except the largest k in magnitude. Starting with an initial guess \({\mathbf {x}}^{(0)} = {\varvec{0}}\), the \((n+1)\)-th IHT update is given by

$$\begin{aligned} {\mathbf {x}}^{(n+1)} = \mathcal {H}_k\left[ {\mathbf {x}}^{(n)} + {\mathbf {A}}^*\left( {\mathbf {y}}- {\mathbf {A}}{\mathbf {x}}^{(n)}\right) \right] , \end{aligned}$$
(3)

where \(\mathcal {H}_k:{\mathbb {R}}^N\rightarrow {\mathbb {R}}^N\) is the hard threshold operator and \({\mathbf {A}}^*\) is the adjoint matrix of \({\mathbf {A}}\).
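A minimal sketch of the IHT iteration follows, assuming a real matrix stored as a list of rows (so the adjoint is the transpose), a unit step size, and a fixed iteration count as the halting rule. Each step moves along the negative gradient of \(\Vert {\mathbf {A}}{\mathbf {x}}- {\mathbf {y}}\Vert _2^2\), i.e. adds \({\mathbf {A}}^*({\mathbf {y}}- {\mathbf {A}}{\mathbf {x}}^{(n)})\), and then thresholds. The function names are illustrative and not taken from [10]; convergence depends on the properties of \({\mathbf {A}}\) discussed later in this section.

```python
def matvec(A, x):
    """Compute A x for a matrix given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Compute A^T r (the adjoint of a real matrix applied to r)."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def hard_threshold(z, k):
    """H_k: keep the k entries of largest magnitude, set the rest to zero."""
    keep = set(sorted(range(len(z)), key=lambda i: abs(z[i]), reverse=True)[:k])
    return [zi if i in keep else 0.0 for i, zi in enumerate(z)]


def iht(A, y, k, iterations=50):
    """Run a fixed number of IHT updates starting from the zero vector."""
    x = [0.0] * len(A[0])
    for _ in range(iterations):
        residual = [yi - ai for yi, ai in zip(y, matvec(A, x))]
        step = rmatvec(A, residual)          # A^T (y - A x)
        x = hard_threshold([xi + si for xi, si in zip(x, step)], k)
    return x
```

With the identity matrix as a trivial sensing matrix, a single update already returns \(\mathcal {H}_k({\mathbf {y}})\), which is a convenient sanity check.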

Recovery guarantees of algorithms are typically given in terms of what is referred to as the \(\ell _p/\ell _q\) instance optimality [14]. Precisely, an algorithm has \(\ell _p/\ell _q\) instance optimality if for a given signal \({\mathbf {x}}\) it always returns a signal \(\widehat{{\mathbf {x}}}\) with the following error bound

$$\begin{aligned} \Vert {\mathbf {x}}- \widehat{{\mathbf {x}}}\Vert _p \le c_1(k,p,q) \sigma _k({\mathbf {x}})_q + c_2(k,p,q) \eta , \end{aligned}$$
(4)

where \(1\le q \le p \le 2\), \(c_1(k,p,q), c_2(k,p,q)\) are constants independent of the dimension of the signal and

$$\begin{aligned} \displaystyle \sigma _k({\mathbf {x}})_q = \min _{{\mathbf {z}}:\Vert {\mathbf {z}}\Vert _0\le k}\Vert {\mathbf {x}}-{\mathbf {z}}\Vert _q \end{aligned}$$

is the error of the best k-term approximation of the signal (in the \(\ell _q\)-norm).
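Since the minimizing \({\mathbf {z}}\) keeps the k entries of \({\mathbf {x}}\) of largest magnitude, \(\sigma _k({\mathbf {x}})_q\) is simply the \(\ell _q\)-norm of the remaining entries. A short sketch, with the hypothetical helper name `best_k_term_error`:

```python
def best_k_term_error(x, k, q):
    """Return sigma_k(x)_q, the l_q norm of all but the k largest-magnitude entries."""
    tail = sorted((abs(xi) for xi in x), reverse=True)[k:]
    return sum(t ** q for t in tail) ** (1.0 / q)
```

For a k-sparse signal the tail is empty and the error is zero, so the bound (4) then depends on the noise level only.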

Ideally, we would like to have \(\ell _2/\ell _1\) instance optimality [14]. It turns out that the instance optimality of the known algorithms depends mainly on the sensing matrix \({\mathbf {A}}\). Key amongst the tools used to analyse the suitability of \({\mathbf {A}}\) is the restricted isometry property, which is defined in the following.

Definition 1

(RIP) A matrix \({\mathbf {A}}\in {\mathbb {R}}^{m\times N}\) satisfies the \(\ell _p\)-norm restricted isometry property (RIP-p) of order k, with restricted isometry constant (RIC) \(\delta _k < 1\), if for all k-sparse vectors \({\mathbf {x}}\)

$$\begin{aligned} \left( 1-\delta _k\right) \Vert {\mathbf {x}}\Vert _p^p \le \Vert {\mathbf {A}}{\mathbf {x}}\Vert _p^p \le \left( 1+\delta _k\right) \Vert {\mathbf {x}}\Vert _p^p. \end{aligned}$$
(5)

Typically RIP without the subscript p refers to the case \(p=2\). We use this general definition here because we will study the case \(p=1\) later. The RIP is a sufficient condition on \({\mathbf {A}}\) that guarantees optimal recovery of \({\mathbf {x}}\) for most of the known algorithms. If the entries of \({\mathbf {A}}\) are drawn i.i.d. from a sub-Gaussian distribution and \(m = {\mathcal {O}}\left( k\log (N/k)\right) \), then \({\mathbf {A}}\) has RIP-2 with high probability and leads to the ideal \(\ell _2/\ell _1\) instance optimality for most algorithms; see [15]. Note that the bound \({\mathcal {O}}\left( k\log (N/k)\right) \) is asymptotically tight. On the other hand, deterministic constructions of \({\mathbf {A}}\) or random \({\mathbf {A}}\) with binary entries with non-zero mean do not achieve this optimal m, and are faced with the so-called square root bottleneck where \(m=\varOmega \left( k^2\right) \); see [16, 18].

Sparse sensing matrices from expander graphs

The computational benefits of sparse sensing matrices necessitated finding a way to circumvent the square root bottleneck for non-zero mean binary matrices. One such class of binary matrices is the class of adjacency matrices of expander graphs (henceforth referred to as expander matrices), which satisfy the weaker RIP-1. Expander graphs are objects of interest in pure mathematics and theoretical computer science; for a detailed discourse on this subject see [31]. We define an expander graph as follows:

Definition 2

(Expander graph) Let \(H=\left( [N],[m],{\mathcal {E}}\right) \) be a left-regular bipartite graph with N left vertices, m right vertices, a set of edges \({\mathcal {E}}\) and left degree d. If, for a given \(\epsilon \in (0,1/2)\), every \({\mathcal {S}}\subset [N]\) of size \(|{\mathcal {S}}|\le k\) satisfies \(|\varGamma ({\mathcal {S}})| \ge (1-\epsilon )d|{\mathcal {S}}|\), then H is referred to as a \((k,d,\epsilon )\)-expander graph.
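For very small graphs the expansion condition can be checked by brute force. The sketch below assumes the graph is given as a list `neighbours`, where `neighbours[i]` is the set of right vertices adjacent to left vertex i, and tests \(|\varGamma ({\mathcal {S}})| \ge (1-\epsilon )d|{\mathcal {S}}|\) for every non-empty \({\mathcal {S}}\) with at most k elements. The function name is illustrative, and the enumeration is exponential in k, so this is only a way to experiment with the definition.

```python
from itertools import combinations


def is_expander(neighbours, k, eps):
    """Brute-force test of the (k, d, eps)-expansion condition on a left-regular graph."""
    d = len(neighbours[0])
    assert all(len(nb) == d for nb in neighbours), "graph must be left-regular"
    for size in range(1, k + 1):
        for S in combinations(range(len(neighbours)), size):
            gamma = set().union(*(neighbours[i] for i in S))   # neighbourhood of S
            if len(gamma) < (1 - eps) * d * size:
                return False
    return True


# Four left vertices of degree d = 2, six right vertices.
neighbours = [{0, 1}, {2, 3}, {4, 5}, {0, 2}]
```

In this example left vertices 0 and 3 share one neighbour, so their joint neighbourhood has 3 instead of 4 elements; the graph is a (2, 2, 0.25)-expander but not a (2, 2, 0.1)-expander.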

An expander matrix is the adjacency matrix of an expander graph. For \(m = {\mathcal {O}}\left( k\log (N/k)\right) \), a random bipartite graph \(H=\left( [N],[m],{\mathcal {E}}\right) \) with left degree \(d={\mathcal {O}}\left( \frac{1}{\epsilon }\log (\frac{N}{k})\right) \) is a \((k,d,\epsilon )\)-expander graph with high probability [22]. Furthermore, expander matrices achieve the sub-optimal \(\ell _1/\ell _1\) instance optimality [8]. For completeness we state the lemma in [35] deriving the RIC for such matrices.

Lemma 1

(RIP-1 for expander matrices [35]) Let \({\mathbf {A}}\) be the adjacency matrix of a \((k,d,\epsilon )\)-expander graph H, then for any k-sparse vector \({\mathbf {x}}\), we have

$$\begin{aligned} \left( 1-2\epsilon \right) d\Vert {\mathbf {x}}\Vert _1 \le \Vert {\mathbf {A}}{\mathbf {x}}\Vert _1 \le d\Vert {\mathbf {x}}\Vert _1. \end{aligned}$$
(6)
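The upper bound in (6) does not rely on expansion; it follows from the triangle inequality and left-regularity alone. As a short derivation, note that the j-th entry of \({\mathbf {A}}{\mathbf {x}}\) sums those \(x_i\) whose left vertex i is adjacent to the right vertex j, so that

$$\begin{aligned} \Vert {\mathbf {A}}{\mathbf {x}}\Vert _1 = \sum _{j\in [m]} \Big | \sum _{i \, : \, j\in \varGamma (i)} x_i \Big | \le \sum _{j\in [m]} \sum _{i \, : \, j\in \varGamma (i)} |x_i| = \sum _{i\in [N]} |\varGamma (i)| \, |x_i| = d\Vert {\mathbf {x}}\Vert _1. \end{aligned}$$

The lower bound is where the expansion property and the k-sparsity of \({\mathbf {x}}\) enter; for that part we refer to [35].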

The most relevant algorithm that exploits the structure of the expander matrices is the Expander-IHT (EIHT) proposed in [22]. Similar to the IHT algorithm it performs updates

$$\begin{aligned} {\mathbf {x}}^{(n+1)} = \mathcal {H}_k\left[ {\mathbf {x}}^{(n)} + {\mathcal {M}}\left( {\mathbf {y}}- {\mathbf {A}}{\mathbf {x}}^{(n)}\right) \right] , \end{aligned}$$
(7)

where \({\mathcal {M}}: {\mathbb {R}}^m \rightarrow {\mathbb {R}}^N\) is the median operator and \([{\mathcal {M}}({\mathbf {z}})]_i = \text{ median }\left( \{z_j\}_{j\in \varGamma (i)}\right) \) for each \({\mathbf {z}}\in {\mathbb {R}}^m\). For expander matrices the EIHT achieves \(\ell _1/\ell _1\) instance optimality [22].
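The median operator and a single EIHT update can be sketched as follows, assuming the expander is stored as a list `neighbours` with `neighbours[i]` listing the right vertices \(\varGamma (i)\), and that the median is applied to the residual \({\mathbf {y}}- {\mathbf {A}}{\mathbf {x}}^{(n)}\). All function names are illustrative and not taken from [22].

```python
from statistics import median


def hard_threshold(z, k):
    """H_k: keep the k entries of largest magnitude, set the rest to zero."""
    keep = set(sorted(range(len(z)), key=lambda i: abs(z[i]), reverse=True)[:k])
    return [zi if i in keep else 0.0 for i, zi in enumerate(z)]


def adjacency_matvec(x, neighbours, m):
    """Compute A x, where A is the m x N adjacency matrix of the bipartite graph."""
    out = [0.0] * m
    for i, nbrs in enumerate(neighbours):
        for j in nbrs:
            out[j] += x[i]
    return out


def median_operator(z, neighbours):
    """[M(z)]_i is the median of the entries z_j over j in Gamma(i)."""
    return [median(z[j] for j in nbrs) for nbrs in neighbours]


def eiht_step(x, y, neighbours, k):
    """One EIHT update: threshold x + M(y - A x)."""
    Ax = adjacency_matvec(x, neighbours, len(y))
    residual = [yi - ai for yi, ai in zip(y, Ax)]
    update = median_operator(residual, neighbours)
    return hard_threshold([xi + ui for xi, ui in zip(x, update)], k)


# Three left vertices of degree 3, six right vertices.
neighbours = [[0, 1, 2], [2, 3, 4], [4, 5, 0]]
x_true = [0.0, 5.0, 0.0]
y = adjacency_matvec(x_true, neighbours, 6)
x_next = eiht_step([0.0, 0.0, 0.0], y, neighbours, 1)
```

In this noiseless toy example each left vertex shares at most one measurement with the active vertex, so the median discards the contaminated measurement and one update recovers the signal.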

Model-based compressed sensing

Besides sparsity (and compressibility) signals do exhibit more complicated structures. When compressed sensing takes into account these more complicated structures (or models) in addition to sparsity, it is usually referred to as model-based compressed sensing or structured sparse recovery [5]. A precise definition is given in the following:

Definition 3

(Structured sparsity model [5]) A structured sparsity model is a collection of sets, \({\mathfrak {M}}=\left\{ {\mathcal {S}}_1,\ldots , {\mathcal {S}}_M\right\} \) with \(|{\mathfrak {M}}| = M\), of allowed structured supports \({\mathcal {S}}_i\subseteq [N]\).

Note that the classical k-sparsity studied in Sect. 2.2 is a special case of a structured sparsity model where all supports of size at most k are allowed. Popular structured sparsity models include tree-sparse, block-sparse, and group-sparse models [5]. In this work we study group-sparse models which we will introduce in Sect. 2.4.

Similar to the classical sparsity case the RIP property is defined for structured sparsity models.

Definition 4

(Model-RIP [5]) A matrix \({\mathbf {A}}\in {\mathbb {R}}^{m\times N}\) satisfies the \(\ell _p\)-norm model restricted isometry property (\(\mathfrak {M}\)-RIP-p) with model restricted isometry constant (\(\mathfrak {M}\)-RIC) \(\delta _{{\mathfrak {M}}} < 1\), if for all vectors \({\mathbf {x}}\) with \({\text {supp}}({\mathbf {x}})\in {\mathfrak {M}}\) it holds

$$\begin{aligned} \left( 1-\delta _{{\mathfrak {M}}}\right) \Vert {\mathbf {x}}\Vert _p^p \le \Vert {\mathbf {A}}{\mathbf {x}}\Vert _p^p \le \left( 1+\delta _{{\mathfrak {M}}}\right) \Vert {\mathbf {x}}\Vert _p^p. \end{aligned}$$
(8)

In [5] it was shown that for a matrix \({\mathbf {A}}\in {\mathbb {R}}^{m\times N}\) to have the Model-RIP with high probability the required number of measurements is \(m = {\mathcal {O}}(k)\) for tree-sparse signals and \(m = {\mathcal {O}}\left( kg + \log \left( N/(kg)\right) \right) \) for block-sparse signals with block size g, when the sensing matrices are dense (typically sub-Gaussian). In general for a given structured sparsity model \({\mathfrak {M}}\) for sub-Gaussian random matrices the number of required measurements is \(m={\mathcal {O}}\left( \delta _{{\mathfrak {M}}}^{-2}g\log (\delta _{{\mathfrak {M}}}^{-1}) + \log (|{\mathfrak {M}} |)\right) \) where g is the cardinality of the largest support in \({\mathfrak {M}}\).

Furthermore the authors in [5] show that classical algorithms like the IHT can be modified for structured sparsity models to achieve instance optimality. To this end the hard thresholding operator \(\mathcal {H}_k\) used in the classical IHT is replaced by a model-projection oracle which for a given signal \({\mathbf {x}}\in {\mathbb {R}}^N\) returns the closest signal over all signals having support in \({\mathfrak {M}}\). We define the model-projection oracle in the following.

Definition 5

(Model-projection oracle [5]) Given \(p\ge 1\), a model-projection oracle is a function \({\mathcal {P}}_{{\mathfrak {M}}}:{\mathbb {R}}^N \rightarrow {\mathbb {R}}^N\) such that for all \({\mathbf {x}}\in {\mathbb {R}}^N\) we have \({\text {supp}}({\mathcal {P}}_{{\mathfrak {M}}}({\mathbf {x}}))\in {\mathfrak {M}}\) and it holds

$$\begin{aligned} \Vert {\mathbf {x}}- {\mathcal {P}}_{{\mathfrak {M}}}({\mathbf {x}})\Vert _p = \min _{{\mathcal {S}}\in {\mathfrak {M}}} \Vert {\mathbf {x}}- {\mathbf {x}}_{{\mathcal {S}}}\Vert _p. \end{aligned}$$

From the definition it directly follows that \({\mathcal {P}}_{{\mathfrak {M}}}({\mathbf {x}})_i = x_i\) if \(i\in {\text {supp}}({\mathcal {P}}_{{\mathfrak {M}}}({\mathbf {x}}))\) and 0 otherwise. Note that in the case of classical k-sparsity the model-projection oracle is given by the hard thresholding operator \(\mathcal {H}_k\). In contrast to this case, calculating the optimal model projection \({\mathcal {P}}_{{\mathfrak {M}}} ({\mathbf {x}})\) for a given signal \({\mathbf {x}}\in {\mathbb {R}}^N\) and a given structured sparsity model \({\mathfrak {M}}\) may be computationally hard. Depending on the model \({\mathfrak {M}}\), finding the optimal model projection vector may even be NP-hard; see Sect. 2.4. The modified version of the IHT derived in [5] is presented in Algorithm 1.

[Algorithm 1: Model-IHT (algorithm listing not reproduced)]

Note that common halting criteria are a maximum number of iterations or a bound on the iteration error \(\Vert {\mathbf {x}}^{(n+1)}-{\mathbf {x}}^{(n)}\Vert _p\).
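The following is a sketch of a model-based IHT loop, under the assumption that it follows the IHT update (3) with \(\mathcal {H}_k\) replaced by a model-projection oracle and that both halting criteria mentioned above are used. The oracle is passed in as a function. The example oracle `block_projection` handles only the easy case of disjoint blocks of equal size, where keeping the G blocks of largest energy is an exact projection. All names are illustrative.

```python
def matvec(A, x):
    """Compute A x for a matrix given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Compute A^T r (the adjoint of a real matrix applied to r)."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def block_projection(z, block_size, G):
    """Exact projection for disjoint equal-sized blocks: keep the G blocks of largest energy."""
    blocks = [range(s, min(s + block_size, len(z))) for s in range(0, len(z), block_size)]
    ranked = sorted(blocks, key=lambda b: sum(z[i] ** 2 for i in b), reverse=True)
    keep = {i for b in ranked[:G] for i in b}
    return [zi if i in keep else 0.0 for i, zi in enumerate(z)]


def model_iht(A, y, project, max_iter=100, tol=1e-10):
    """Iterate x <- project(x + A^T (y - A x)) until the l_2 change is at most tol."""
    x = [0.0] * len(A[0])
    for _ in range(max_iter):
        residual = [yi - ai for yi, ai in zip(y, matvec(A, x))]
        step = rmatvec(A, residual)
        x_new = project([xi + si for xi, si in zip(x, step)])
        change = sum((a - b) ** 2 for a, b in zip(x_new, x)) ** 0.5
        x = x_new
        if change <= tol:
            break
    return x


identity = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
y = [1.0, 1.0, 0.0, 0.0, 3.0, 0.0]
x_hat = model_iht(identity, y, lambda z: block_projection(z, block_size=2, G=1))
```

Passing the oracle as a parameter keeps the loop unchanged when an exact projection is swapped for a different model or for an approximate one.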

Model-sparse sensing matrices from expander graphs

In the sparse matrix setting the sensing matrices we consider are called model expander matrices, which are adjacency matrices of model expander graphs, defined as follows.

Definition 6

(Model expander graph) Let \(H=\left( [N],[m],{\mathcal {E}}\right) \) be a left-regular bipartite graph with N left vertices, m right vertices, a set of edges \({\mathcal {E}}\) and left degree d. Given a model \({\mathfrak {M}}\), if for any \(\epsilon _{{\mathfrak {M}}} \in (0,1/2)\) and any \({\mathcal {S}}= \cup _{{\mathcal {S}}_i \in {\mathcal {K}}}{\mathcal {S}}_i\), with \({\mathcal {K}} \subset {\mathfrak {M}}\) and \(|{\mathcal {S}}| \le s\), we have \(|\varGamma ({\mathcal {S}})| \ge (1-\epsilon _{{\mathfrak {M}}})d|{\mathcal {S}}|\), then H is referred to as an \((s,d,\epsilon _{{\mathfrak {M}}})\)-model expander graph.

In this setting the known results are sub-optimal. Using model expander matrices for tree-sparse models the required number of measurements to obtain instance optimality is \(m = k\log \left( N/k\right) /\log \log \left( N/k\right) \) which was shown in [3, 34].

A key ingredient in the analysis for the afore-mentioned sample complexity results for model expanders is the model-RIP-1, which is just RIP-1 for model expander matrices (hence they are also called model-RIP-1 matrices [34]). Consequently, Lemma 1 also holds for these model-RIP-1 matrices [34].

First, in [3] the Model Expander IHT (MEIHT) was studied for loopless overlapping groups and D-ary tree models. Similar to Algorithm 1 the MEIHT is a modification of EIHT where the hard threshold operator \(\mathcal {H}_k\) is replaced by the projection oracle \(\mathcal {P}_{{\mathfrak {M}}}\) onto the model \({\mathfrak {M}}\). Thus the update of the MEIHT in each iteration is given by

$$\begin{aligned} {\mathbf {x}}^{(n+1)} = \mathcal {P}_{{\mathfrak {M}}}\left[ {\mathbf {x}}^{(n)} + {\mathcal {M}}\left( {\mathbf {y}}- {\mathbf {A}}{\mathbf {x}}^{(n)}\right) \right] . \end{aligned}$$
(9)

In [3] the authors show that this algorithm always returns a solution in the model, which is highly desirable for some applications. The running time is given in the proposition below.

Proposition 1

([3, Proposition 3.1]) The runtime of MEIHT is \({\mathcal {O}}\left( kN\bar{n}\right) \) and \({\mathcal {O}}\left( M^2G\bar{n} + N\bar{n}\right) \) for D-ary tree models and loopless overlapping group models respectively, where k is the sparsity of the tree model and G is the number of active groups (i.e. group sparsity of the model), \(\bar{n}\) is the number of iterations, M is the number of groups and N is the dimension of the signal.

Group models

The models of interest in this paper are group models. A group model is a collection \(\mathfrak {G} = \{{\mathcal {G}}_1,\ldots ,{\mathcal {G}}_M\}\) of groups of indices, i.e. \({\mathcal {G}}_i\subset [N]\), together with a budget \(G \in [M]\). We denote \({\mathfrak {G}}_G\) as the structured sparsity model (i.e. group-sparse model) which contains all supports contained in the union of at most G groups in \({\mathfrak {G}}\), i.e.

$$\begin{aligned} {\mathfrak {G}}_G:=\left\{ {\mathcal {S}}\subseteq [N]\ : \ {\mathcal {S}}\subseteq \bigcup _{i\in {\mathcal {I}}}{\mathcal {G}}_i, ~|{\mathcal {I}}|\le G \right\} . \end{aligned}$$
(10)

We will always tacitly assume that \(\bigcup _{i=1}^{M} {\mathcal {G}}_i = [N]\). We say that a signal \({\mathbf {x}} \in {\mathbb {R}}^N\) is G-group-sparse if the support of \({\mathbf {x}}\) belongs to \({\mathfrak {G}}_G\). If G is clear from the context, we simply say that \({\mathbf {x}}\) is group-sparse. Let \(g_i = |{\mathcal {G}}_i|\) and denote by \(g_{\text {max}} = \max _{i\in [M]} g_i\) the size of the largest group in \({\mathfrak {G}}\). The intersection graph of a group model is the graph which has a node for each group \({\mathcal {G}}_i\in {\mathfrak {G}}\) and has an edge between \({\mathcal {G}}_i\) and \({\mathcal {G}}_j\) if the groups overlap, i.e. if \({\mathcal {G}}_i\cap {\mathcal {G}}_j \ne \emptyset \); see [4]. We call a group model loopless if the intersection graph of the group model has no cycles. We call a group model a block model if all groups have equal size and are pairwise disjoint. In this case the groups are sometimes called blocks. We define the frequency f of a group model as the maximum number of groups an element is contained in, i.e.

$$\begin{aligned} f:=\max _{i\in [N]} \left| \left\{ j\in [M] \ : \ i\in {\mathcal {G}}_j \right\} \right| . \end{aligned}$$
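The frequency, the intersection graph and the loopless property can be computed directly from the list of groups. The sketch below assumes groups are Python sets of 0-based indices and uses a union-find structure to detect cycles in the intersection graph. The function names are illustrative.

```python
from itertools import combinations


def frequency(groups, N):
    """Maximum number of groups that any index in {0, ..., N-1} belongs to."""
    return max(sum(1 for g in groups if i in g) for i in range(N))


def intersection_edges(groups):
    """Edges (a, b) of the intersection graph: groups a and b overlap."""
    return [(a, b) for a, b in combinations(range(len(groups)), 2) if groups[a] & groups[b]]


def is_loopless(groups):
    """True if the intersection graph has no cycles (checked with union-find)."""
    parent = list(range(len(groups)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for a, b in intersection_edges(groups):
        ra, rb = find(a), find(b)
        if ra == rb:                        # edge closes a cycle
            return False
        parent[ra] = rb
    return True


chain = [{0, 1, 2}, {2, 3}, {3, 4, 5}]      # overlapping groups forming a path
ring = chain + [{5, 0}]                     # one more group closes a cycle
```

Both examples have frequency 2, yet only the first is loopless, which illustrates that bounded frequency is a weaker requirement than looplessness.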

In [4] besides the latter group models, the more general models are considered where an additional sparsity in the classical sense is required on the signal. More precisely for a given budget \(G\in [M]\) and a sparsity \(K\in [N]\) they study the structured sparsity model

$$\begin{aligned} {\mathfrak {G}}_{G,K}:=\left\{ {\mathcal {S}}\subseteq [N]\ : \ {\mathcal {S}}\subseteq \bigcup _{i\in {\mathcal {I}}}{\mathcal {G}}_i, ~|{\mathcal {I}}|\le G, \ |{\mathcal {S}}|\le K \right\} . \end{aligned}$$
(11)

Note that for \(K=N\) we obtain a standard group model defined as above.

Both variants of group models defined above clearly are special cases of a structured sparsity model defined in Sect. 2.3. Therefore all results for structured sparsity models can be used for group-sparse models. To adapt Algorithm 1 a model projection oracle \({\mathcal {P}}_{{\mathfrak {G}}_G}\) (or \({\mathcal {P}}_{{\mathfrak {G}}_{G,K}}\)) has to be provided. Note that for several applications we are not only interested in the optimal support of the latter projection but we want to find at most G groups covering this support. The main work of this paper is to analyse the complexity of the latter problem for group models and to provide efficient algorithms to solve it exactly or approximately. Given a signal \({\mathbf {x}}\in {\mathbb {R}}^N\), the group model projection problem, sometimes called the signal approximation problem, is then to find a support \({\mathcal {S}}\in {\mathfrak {G}}_{G,K}\) together with at most G groups covering this support such that \(\Vert {\mathbf {x}}-{\mathbf {x}}_{{\mathcal {S}}}\Vert _p\) is minimal, i.e. we want to solve the problem

$$\begin{aligned} \min _{\begin{array}{c} {\mathcal {G}}_1,\ldots ,{\mathcal {G}}_G\in {\mathfrak {G}} \\ {\mathcal {S}}\subset \bigcup _{i=1}^{G}{\mathcal {G}}_i \\ |{\mathcal {S}}|\le K \end{array}} \Vert {\mathbf {x}}- {\mathbf {x}}_{{\mathcal {S}}}\Vert _p. \end{aligned}$$

If the parameter K is not mentioned we assume \(K=N\).

Baldassarre et al. [4] observed the following close relation to the NP-hard Maximum \(G\)-Coverage problem. Given a signal \({\mathbf {x}} \in {\mathbb {R}}^N\), a group-sparse vector \(\hat{{\mathbf {x}}}\) for which \(\Vert {\mathbf {x}}-\hat{{\mathbf {x}}}\Vert _2^2\) is minimum satisfies \(\hat{x}_i \in \{0,x_i\}\) for all \(i \in [N]\). For a vector with the latter property,

$$\begin{aligned} \Vert {\mathbf {x}}-\hat{{\mathbf {x}}}\Vert _2^2 = \sum _{i=1}^N x_i^2 - \sum _{i=1}^N \hat{x}_i^2 \end{aligned}$$

holds and so minimizing \(\Vert {\mathbf {x}}-\hat{{\mathbf {x}}}\Vert _2^2\) is equivalent to maximizing \(\sum _{i=1}^N \hat{x}_i^2\). Consequently, the group model projection problem with \(K=N\) is equivalent to the problem of finding an index set \({\mathcal {I}}\subset [M]\) of at most G groups, i.e. \(|{\mathcal {I}}| \le G\), maximizing \(\sum _{i\in \bigcup _{j\in {\mathcal {I}}} {\mathcal {G}}_j} x_i^2\). This problem is called Maximum \(G\)-Coverage in the literature [30]. Despite the prominence of the latter problem, we will stick to the group model notation, since it is closer to the applications we have in mind and we will leave the regime of Maximum \(G\)-Coverage by introducing more constraints later.
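The identity used above can be verified entrywise. Since \(\hat{x}_i \in \{0,x_i\}\), only two cases occur for each \(i\in [N]\):

$$\begin{aligned} (x_i-\hat{x}_i)^2 = {\left\{ \begin{array}{ll} x_i^2 = x_i^2 - \hat{x}_i^2, &{}\quad \text{ if } ~\hat{x}_i = 0,\\ 0 = x_i^2 - \hat{x}_i^2, &{}\quad \text{ if } ~\hat{x}_i = x_i. \end{array}\right. } \end{aligned}$$

Summing over all i gives \(\Vert {\mathbf {x}}-\hat{{\mathbf {x}}}\Vert _2^2 = \sum _{i=1}^N x_i^2 - \sum _{i=1}^N \hat{x}_i^2\). The first sum does not depend on \(\hat{{\mathbf {x}}}\), which is why minimizing the error amounts to maximizing the captured energy \(\sum _{i=1}^N \hat{x}_i^2\).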

We simplify the notation by defining \(w_i = x_i^2\) for all \(i \in [N]\). Using this notation, the group model projection problem is equivalent to finding an optimal solution of the following integer program:

$$\begin{aligned} \begin{aligned} \max&{\mathbf {w}}^\top {\mathbf {u}}\\ s.t. \quad&\sum _{i \in [M]} v_i \le G \\&u_j \le \sum _{i:j \in {\mathcal {G}}_i} v_i \quad \text{ for } \text{ all } j \in [N] \\&{\mathbf {u}}\in \{0,1\}^N, \ {\mathbf {v}}\in \{0,1\}^M \end{aligned} \end{aligned}$$
(12)

Here, the variable \(u_i\) is one if and only if the i-th index is contained in the support of the signal approximation, and \(v_i\) is one if and only if the group \({\mathcal {G}}_i\) is chosen.
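To make the equivalence with Maximum \(G\)-Coverage concrete, the following minimal sketch solves the projection for \(K=N\) by enumerating all choices of at most G groups. It is an illustration only, not one of the algorithms studied in this paper: the function name, the 0-based indexing and the toy data are assumptions, and the run time grows like \(M^G\).

```python
from itertools import combinations

def project_onto_group_model(x, groups, G):
    """Exhaustive group model projection for K = N (illustrative only).

    x      : list of signal entries, indexed from 0 (an assumption of this sketch)
    groups : list of sets of indices
    G      : maximum number of groups that may be selected
    Returns (indices of chosen groups, covered support, covered weight).
    """
    w = [xi * xi for xi in x]  # w_i = x_i^2, as in the integer program (12)
    best = ((), set(), 0.0)
    for size in range(1, min(G, len(groups)) + 1):
        for chosen in combinations(range(len(groups)), size):
            support = set().union(*(groups[j] for j in chosen))
            value = sum(w[i] for i in support)
            if value > best[2]:  # strict improvement keeps the first optimum found
                best = (chosen, support, value)
    return best

# Toy instance (hypothetical data): four entries, four overlapping groups.
x = [3.0, -1.0, 0.5, 2.0]
groups = [{0, 1}, {0, 1, 2}, {1, 3}, {2, 3}]
chosen, support, value = project_onto_group_model(x, groups, G=2)
# The projection keeps x_i on the support and sets the other entries to zero.
x_hat = [xi if i in support else 0.0 for i, xi in enumerate(x)]
```

Here the vector `u` of (12) corresponds to the indicator of `support` and `v` to the indicator of `chosen`.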

Besides proving NP-hardness for the general case, the authors of [4] show that the group model projection problem can be solved in polynomial time by dynamic programming for the special case of loopless groups. Furthermore, they show that if the intersection graph is bipartite, the projection problem can be solved in polynomial time by relaxing problem (12). Similar results are obtained for the more general problem where, in addition to group-sparsity, the classical K-sparsity is assumed, i.e. the additional constraint

$$\begin{aligned}\sum _{i\in [N]}u_i\le K\end{aligned}$$

is added to problem (12).

As stated in Sect. 2.3, the authors of [5] first studied a special case of group models, namely block models, where the groups are non-overlapping and all of equal size. The sample complexity they derived in that work for sub-Gaussian measurement matrices is \(m = {\mathcal {O}}\left( Gg + \log \left( N/(Gg)\right) \right) \), where g is the fixed block size. In [3], the authors studied group models in the sparse matrix setting; among other results, they proposed the MEIHT algorithm for tree and group models. The result most relevant to this work shows that for loopless overlapping group-sparse models with maximum group size \(g_{\text {max}}\), using model expander measurement matrices, the number of measurements required for successful recovery is \(m = Gg_{\text {max}}\log \left( N/(Gg_{\text {max}})\right) /\log \left( Gg_{\text {max}}\right) \); see [3]. This result holds for general groups; the “looplessness” condition is only necessary for the polynomial time reconstruction using the MEIHT algorithm. Therefore, this sample complexity result also holds for the general group models we consider in this manuscript.

Group lasso

The classical Lasso approach for k-sparse signals seeks to minimize a quadratic error penalized by the \(\ell _1\)-norm [22]. More precisely, for a given \(\lambda >0\) we want to find an optimal solution of the problem

$$\begin{aligned} \min _{{\mathbf {x}}} \Vert A{\mathbf {x}}-{\mathbf {y}}\Vert _2^2 + \lambda \Vert {\mathbf {x}}\Vert _1. \end{aligned}$$
(13)

It is well known that using the latter approach for appropriate choices of \(\lambda \) leads to sparse solutions.

The Lasso approach was already extended to group models in [55] and afterwards studied in several works for non-overlapping groups; see [33, 40, 45, 48]. The idea is again to minimize a loss function, e.g. the quadratic loss, and to penalize the objective value for each group by a norm of the weights of the recovered vector restricted to the items in each group. An extension which can also handle overlapping groups was studied in [37, 56]. In [49] the authors study what they call the latent group Lasso. To this end they consider a loss function \(L:{\mathbb {R}}^N\rightarrow {\mathbb {R}}\) and propose to solve the \(\ell _1 /\ell _2\)-penalized problem

$$\begin{aligned} \begin{aligned} \min&\quad L({\mathbf {x}}) + \lambda \sum _{{\mathcal {G}}\in {\mathfrak {G}}} d_{{\mathcal {G}}} \Vert {\mathbf {w}}^{{\mathcal {G}}}\Vert _2 \\ s.t. \quad&\quad {\mathbf {x}}=\sum _{{\mathcal {G}}\in {\mathfrak {G}}}{\mathbf {w}}^{{\mathcal {G}}}\\&\quad {\text {supp}}\left( {\mathbf {w}}^{{\mathcal {G}}}\right) \subset {\mathcal {G}} \ \forall {\mathcal {G}}\in {\mathfrak {G}}\\&\quad {\mathbf {x}}\in {\mathbb {R}}^N, {\mathbf {w}}^{{\mathcal {G}}}\in {\mathbb {R}}^N \ \forall {\mathcal {G}}\in {\mathfrak {G}} \end{aligned} \end{aligned}$$
(14)

for given weights \(\lambda >0\) and \(d_{{\mathcal {G}}}\ge 0\) for each \({\mathcal {G}}\in {\mathfrak {G}}\). The idea is that for ideal choices of the latter weights a solution of Problem (14) will be sparse and its support is likely to be a union of groups. Nevertheless it is not guaranteed that the number of selected groups is optimal, as is the case for the iterative methods in the previous sections. Note that equivalently we can replace each norm \(\Vert {\mathbf {w}}^{{\mathcal {G}}}\Vert _2\) by a variable \(z^{{\mathcal {G}}}\) in the objective function and add the quadratic constraint \(\Vert {\mathbf {w}}^{{\mathcal {G}}}\Vert _2^2\le (z^{{\mathcal {G}}})^2\). Hence Problem (14) can be modelled as a quadratic problem and can be solved by standard solvers like CPLEX.
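Written out, the reformulation with the auxiliary variables \(z^{{\mathcal {G}}}\) might look as follows. This is a sketch in our own notation; the explicit sign constraint \(z^{{\mathcal {G}}}\ge 0\) is an assumption we add so that the squared constraint is equivalent to \(\Vert {\mathbf {w}}^{{\mathcal {G}}}\Vert _2\le z^{{\mathcal {G}}}\).

$$\begin{aligned} \begin{aligned} \min&\quad L({\mathbf {x}}) + \lambda \sum _{{\mathcal {G}}\in {\mathfrak {G}}} d_{{\mathcal {G}}}\, z^{{\mathcal {G}}} \\ s.t. \quad&\quad {\mathbf {x}}=\sum _{{\mathcal {G}}\in {\mathfrak {G}}}{\mathbf {w}}^{{\mathcal {G}}}\\&\quad {\text {supp}}\left( {\mathbf {w}}^{{\mathcal {G}}}\right) \subset {\mathcal {G}} \ \forall {\mathcal {G}}\in {\mathfrak {G}}\\&\quad \Vert {\mathbf {w}}^{{\mathcal {G}}}\Vert _2^2\le (z^{{\mathcal {G}}})^2, \ z^{{\mathcal {G}}}\ge 0 \ \forall {\mathcal {G}}\in {\mathfrak {G}}\\&\quad {\mathbf {x}}\in {\mathbb {R}}^N, {\mathbf {w}}^{{\mathcal {G}}}\in {\mathbb {R}}^N, z^{{\mathcal {G}}}\in {\mathbb {R}} \ \forall {\mathcal {G}}\in {\mathfrak {G}} \end{aligned} \end{aligned}$$

Since \(d_{{\mathcal {G}}}\ge 0\) and the objective is minimized, an optimal solution can always be chosen with \(z^{{\mathcal {G}}}=\Vert {\mathbf {w}}^{{\mathcal {G}}}\Vert _2\), so the optimal value agrees with that of Problem (14).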

The \(\ell _0\) counterpart of Problem (14) was considered in [32] under the name block coding and can be formulated as

$$\begin{aligned} \begin{aligned} \min&\quad L({\mathbf {x}}) + \lambda \sum _{{\mathcal {G}}\in \tilde{{\mathfrak {G}}}} d_{{\mathcal {G}}} \Vert {\mathbf {w}}^{{\mathcal {G}}}\Vert _2 \\ s.t. \quad&\quad {\mathbf {x}}=\sum _{{\mathcal {G}}\in \tilde{{\mathfrak {G}}}}{\mathbf {w}}^{{\mathcal {G}}}\\&\quad {\text {supp}}\left( {\mathbf {w}}^{{\mathcal {G}}}\right) \subset {\mathcal {G}} \ \forall {\mathcal {G}}\in \tilde{{\mathfrak {G}}}\\&\quad \tilde{{\mathfrak {G}}}\subset {\mathfrak {G}}\\&\quad {\mathbf {x}}\in {\mathbb {R}}^N, {\mathbf {w}}^{{\mathcal {G}}}\in {\mathbb {R}}^N \ \forall {\mathcal {G}}\in \tilde{{\mathfrak {G}}}. \end{aligned} \end{aligned}$$
(15)

Note that in contrast to Problem (14) an easy reformulation of Problem (15) into a continuous quadratic problem is not possible. Nevertheless we can reformulate it using the mixed-integer programming formulation

$$\begin{aligned} \begin{aligned} \min&\quad L({\mathbf {x}}) + \lambda \sum _{{\mathcal {G}}\in {\mathfrak {G}}} d_{{\mathcal {G}}} {\mathbf {v}}^{{\mathcal {G}}} \\ s.t. \quad&\quad {\mathbf {x}}=\sum _{{\mathcal {G}}\in {\mathfrak {G}}}{\mathbf {w}}^{{\mathcal {G}}}\\&\quad {\text {supp}}\left( {\mathbf {w}}^{{\mathcal {G}}}\right) \subset {\mathcal {G}} \ \forall {\mathcal {G}}\in {\mathfrak {G}}\\&\quad w_i^{{\mathcal {G}}}\le M_i{\mathbf {v}}^{{\mathcal {G}}} \ \forall i\in [N], {\mathcal {G}}\in {\mathfrak {G}}\\&\quad -M_i{\mathbf {v}}^{{\mathcal {G}}}\le {\mathbf {w}}_i^{{\mathcal {G}}}\ \forall i\in [N], {\mathcal {G}}\in {\mathfrak {G}}\\&\quad {\mathbf {x}}\in {\mathbb {R}}^N, {\mathbf {w}}^{{\mathcal {G}}}\in {\mathbb {R}}^N, {\mathbf {v}}^{{\mathcal {G}}}\in \left\{ 0,1\right\} \ \forall {\mathcal {G}}\in {\mathfrak {G}} \end{aligned} \end{aligned}$$
(16)

where \(M_i\in {\mathbb {R}}\) can be chosen larger than or equal to the absolute value \(|x_i|\) of the corresponding entry of the true signal for each \(i\in [N]\). The variables \({\mathbf {v}}^{{\mathcal {G}}}\in \left\{ 0,1\right\} \) have value 1 if and only if group \({\mathcal {G}}\) is selected for the support of \({\mathbf {x}}\). As for the \(\ell _1/\ell _2\) variant, it is not guaranteed that the number of selected groups is optimal. Note that the latter problem is a mixed-integer problem and therefore in general hard to solve in large dimensions. Furthermore, the efficiency of classical methods such as the branch and bound algorithm depends on the quality of the calculated lower bound, which in turn depends on the values \(M_i\). Hence in practical applications, where the true signal is not known, good estimates of the \(M_i\) values are crucial for the success of the latter method. Another drawback is that the best values for \(\lambda \) and the weights \(d_{{\mathcal {G}}}\) are not known in advance and have to be chosen appropriately for each application.

We study Problems (14) and (15) computationally in Sect. 5.

Approximation algorithms for model-based compressed sensing

As mentioned in the last section, solving the projection problem given in Definition 5 may be computationally hard. To overcome this problem the authors in [27, 28] present algorithms based on the idea of IHT (and CoSaMP) which, instead of solving the projection problems exactly, use two approximation procedures called head- and tail-approximation. In this section we briefly describe the concept and the results in [27, 28]. Note that we only present results related to IHT, although similar results for CoSaMP were derived as well in [27, 28].

Given two structured sparsity models \({\mathfrak {M}}, {\mathfrak {M}}_H\) and a vector \({\mathbf {x}}\), let \({\mathcal {H}}\) be an algorithm that computes a vector \({\mathcal {H}}({\mathbf {x}})\) with support in \({\mathfrak {M}}_H\). Then, given some \(\alpha \in {\mathbb {R}}\) (typically \(\alpha < 1\)) we say that \({\mathcal {H}}\) is an \((\alpha ,{\mathfrak {M}}, {\mathfrak {M}}_H,p)\)-head approximation  if

$$\begin{aligned} \Vert {\mathcal {H}}({\mathbf {x}})\Vert _p \ge \alpha \cdot \Vert {\mathbf {x}}_{{\mathcal {S}}}\Vert _p \text{ for } \text{ all } {\mathcal {S}} \in {\mathfrak {M}}. \end{aligned}$$
(17)

Note that the support of the vector calculated by \({\mathcal {H}}\) is contained in \({\mathfrak {M}}_H\) while the approximation guarantee must be fulfilled for all supports in \({\mathfrak {M}}\).

Moreover given two structured sparsity models \({\mathfrak {M}}, {\mathfrak {M}}_T\) let \({\mathcal {T}}\) be an algorithm which computes a vector \({\mathcal {T}}({\mathbf {x}})\) with support in \({\mathfrak {M}}_T\). Given some \(\beta \in {\mathbb {R}}\) (typically \(\beta > 1\)) we say that \({\mathcal {T}}\) is a \((\beta ,{\mathfrak {M}}, {\mathfrak {M}}_T,p)\)-tail approximation if

$$\begin{aligned} \Vert {\mathbf {x}}- {\mathcal {T}}({\mathbf {x}}) \Vert _p \le \beta \cdot \Vert {\mathbf {x}}- {\mathbf {x}}_{{\mathcal {S}}}\Vert _p \text{ for } \text{ all } {\mathcal {S}} \in {\mathfrak {M}}. \end{aligned}$$
(18)

Note that in general a head approximation does not need to be a tail approximation and vice versa.
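The two definitions can be checked numerically for a single vector. The sketch below tests inequalities (17) and (18) against an explicitly listed family of supports; all names and the toy data are assumptions made for illustration. Because both norms are monotone in the support, it is enough to list the inclusion-maximal supports of the model.

```python
def p_norm(v, p):
    """The l_p norm of a list of numbers."""
    return sum(abs(t) ** p for t in v) ** (1.0 / p)

def restrict(x, support):
    """x_S: keep the entries of x on the support and zero out the rest."""
    return [xi if i in support else 0.0 for i, xi in enumerate(x)]

def is_head_approx(hx, x, supports, alpha, p=2):
    """Check inequality (17) for one vector x and every listed support."""
    lhs = p_norm(hx, p)
    return all(lhs >= alpha * p_norm(restrict(x, S), p) for S in supports)

def is_tail_approx(tx, x, supports, beta, p=2):
    """Check inequality (18) for one vector x and every listed support."""
    lhs = p_norm([a - b for a, b in zip(x, tx)], p)
    return all(
        lhs <= beta * p_norm([a - b for a, b in zip(x, restrict(x, S))], p)
        for S in supports
    )

# Toy model (hypothetical): supports given by single groups, i.e. G = 1.
x = [3.0, -1.0, 0.5, 2.0]
supports = [{0, 1}, {0, 1, 2}, {1, 3}, {2, 3}]
candidate = restrict(x, {0, 1})  # a sub-optimal choice of support

head_ok = is_head_approx(candidate, x, supports, alpha=0.95)
tail_ok = is_tail_approx(candidate, x, supports, beta=1.05)
```

In this instance the candidate keeps about 98.8% of the best attainable head norm and its residual is about 3.1% larger than the best attainable one, so it passes both checks for the chosen constants but fails them for \(\alpha =\beta =1\).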

The cases studied in [27] are \(p=1\) and \(p=2\). For the case \(p=2\) the authors propose an algorithm called Approximate Model-IHT (AM-IHT), shown in Algorithm 2.

Algorithm 2 Approximate Model-IHT (AM-IHT)

Assume that \({\mathcal {T}}\) is a \((\beta ,{\mathfrak {M}}, {\mathfrak {M}}_T,2)\)-tail approximation and \({\mathcal {H}}\) is a \((\alpha ,{\mathfrak {M}}_T\oplus {\mathfrak {M}}, {\mathfrak {M}}_H,2)\)-head approximation where \({\mathfrak {M}}_T\oplus {\mathfrak {M}}\) is the Minkowski sum of \({\mathfrak {M}}_T\) and \({\mathfrak {M}}\). Furthermore we assume the condition

$$\begin{aligned} \alpha ^2 > 1-(1+\beta )^{-2} \end{aligned}$$
(19)

holds. The authors in [27] prove that for a signal \({\mathbf {x}}\in {\mathbb {R}}^N\) with \({\text {supp}}({\mathbf {x}})\in {\mathfrak {M}}\), noisy measurements \({\mathbf {y}}={\mathbf {A}}{\mathbf {x}}+{\mathbf {e}}\) where \({\mathbf {A}}\) has \({\mathfrak {M}} \oplus {\mathfrak {M}}_T\oplus {\mathfrak {M}}_H\)-model RIP with RIC \(\delta \), Algorithm 2 calculates a signal estimate \(\hat{{\mathbf {x}}}\) satisfying

$$\begin{aligned} \Vert {\mathbf {x}}-\hat{{\mathbf {x}}} \Vert _2 \le \tau \Vert {\mathbf {e}}\Vert _2 \end{aligned}$$

where \(\tau \) depends on \(\delta , ~\alpha \) and \(\beta \). Note that the condition (19) holds e.g. for approximation accuracies \(\alpha >0.9\) and \(\beta <1.1\).
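Condition (19) is easy to evaluate for concrete accuracies. The helper names in the following sketch are ours; it checks the condition and computes the smallest head accuracy that a given tail accuracy allows.

```python
def satisfies_condition_19(alpha, beta):
    """True if alpha^2 > 1 - (1 + beta)^(-2), i.e. condition (19) holds."""
    return alpha ** 2 > 1.0 - (1.0 + beta) ** (-2)

def alpha_threshold(beta):
    """Infimum of the head accuracies alpha > 0 that satisfy (19) for a given beta."""
    return (1.0 - (1.0 + beta) ** (-2)) ** 0.5

ok = satisfies_condition_19(0.9, 1.1)    # the regime mentioned in the text
too_weak = satisfies_condition_19(0.85, 1.1)
threshold = alpha_threshold(1.1)          # roughly 0.879
```

The threshold increases with \(\beta \) and tends to 1 as \(\beta \rightarrow \infty \), so a weaker tail approximation has to be compensated by a stronger head approximation.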

For the case \(p=1\) the authors replace Step 3 in Algorithm 2 by the update

$$\begin{aligned} {\mathbf {x}}^{(n+1)} \leftarrow {\mathcal {T}}\left( {\mathbf {x}}^{(n)} + {\mathcal {H}}\left( {\mathcal {M}}\left( {\mathbf {y}}-{\mathbf {A}}{\mathbf {x}}^{(n)}\right) \right) \right) \end{aligned}$$

where \({\mathcal {M}}\) is the median operator defined as in Sect. 2.2. Under the same assumptions as above, but considering \(p=1\) for the head- and tail-approximations and \({\mathbf {A}}\) having the \({\mathfrak {M}}\)-RIP-1, the authors in [27] show convergence of the adapted algorithm.

Comparison to related works

Tables 1 and 2 below show where our results stand in relation to other results. In Table 1 we show the studied models together with the derived sample complexity and the studied class of measurement matrices. In Table 2 we present the names of the studied algorithms, the class of the model projection, the class of the algorithm used to solve the model projection, the runtime complexity of the projection problem and the class of instance optimality.

Table 1 Comparison of model-based compressive sensing results
Table 2 Comparison of model-based compressive sensing results

Algorithms with exact projection oracles

In this section we study the exact group model projection problem which has to be solved iteratively in the Model-IHT and the MEIHT. We extend existing results for group-sparse models and pass from loopless overlapping group models (which was the most general prior to this work) to overlapping group models of bounded treewidth and to general group models without any restriction on the structure. The graph representing a loopless overlapping group model has a treewidth of 1.

We start by showing that it is possible to perform exact projections onto overlapping groups with bounded treewidth using dynamic programming; see Sect. 3.1. While this procedure has a polynomial run-time bound, it is restricted to the class of group models with bounded treewidth. In contrast, we prove that the exact projection problem is NP-hard if the incidence graph is a grid, which is the most basic graph structure without bounded treewidth. For the sake of completeness we solve the exact projection problem for all instances of group models by a method based on Benders’ decomposition in Sect. 3.6. Since the problem is NP-hard, this method does not yield a polynomial run-time guarantee, but it works well in practice as shown in [17]. In Sect. 3.5 we present an appropriately modified algorithm (MEIHT) with exact projection oracles for the recovery of signals from structured sparsity models. We derive corollaries for the general group-model case from existing works about run-time and convergence of this modified algorithm.

Recall the following notation: \({\mathcal {G}}_i\) denotes a group of size \(g_i\), \(i\in [M]\), and a group model is a collection \(\mathfrak {G} = \{{\mathcal {G}}_1,\ldots ,{\mathcal {G}}_M\}\). The group-sparse model of order G is denoted by \({\mathfrak {G}}_G\), which contains all supports \({\mathcal {S}}\) which are contained in the union of at most G groups of \({\mathfrak {G}}\), i.e. \({\mathcal {S}}\subseteq \bigcup _{j\in \mathcal {I}} {\mathcal {G}}_j, ~ \mathcal {I}\subseteq [M]\) and \(|\mathcal {I}|\le G\); see (10). We will interchangeably say that \({\mathbf {x}}\) or \({\mathcal {S}}\) is \({\mathfrak {G}}_G\)-sparse. Clearly group models are a special case of structured sparsity models. Assume \(s_{\text {max}}\) is the size of the maximal support which is possible by selecting G groups out of \({\mathfrak {G}}\). For \(g_{\text {max}}\) denoting the maximal size of a single group in \({\mathfrak {G}}\), i.e.,

$$\begin{aligned} g_{\text {max}} = \max _{i\in [M]} |{\mathcal {G}}_i|, \end{aligned}$$

we have \(s_{\text {max}}\in {\mathcal {O}}\left( G g_{\text {max}}\right) \). Furthermore the number of possible supports is in \({\mathcal {O}}(M^G)\). Therefore applying the result from Sect. 2.3 we obtain

$$\begin{aligned} m={\mathcal {O}}\left( \delta ^{-2}G g_{\text {max}} \log (\delta ^{-1}) + G\log (M) \right) \end{aligned}$$
(20)

as the number of required measurements for a sub-Gaussian matrix to obtain the group-model-RIP with RIC \(\delta \) with high probability, which induces the convergence of Algorithm 1 for small enough \(\delta \).

Group models of low treewidth

One approach to overcome the hardness of the group model projection problem is to restrict the structure of the group models considered. To this end we follow Baldassarre et al. [4] and consider two graphs associated to a group model \({\mathfrak {G}}\).

The intersection graph of \({\mathfrak {G}}\), \(I({\mathfrak {G}})\), is given by the vertex set \(V(I({\mathfrak {G}})) = {\mathfrak {G}}\), and the edge set

$$\begin{aligned}E(I({\mathfrak {G}})) = \{ RS : R,S \in {\mathfrak {G}}, R\ne S \text{ and } R \cap S \ne \emptyset \}.\end{aligned}$$

The incidence graph of \({\mathfrak {G}}\), \(B({\mathfrak {G}})\), is given by the vertex set \(V(B({\mathfrak {G}})) = [N] \cup {\mathfrak {G}}\), and the edge set

$$\begin{aligned}E(B({\mathfrak {G}})) = \{ eS : e\in [N], S \in {\mathfrak {G}} \text{ and } e \in S \}.\end{aligned}$$

Note that the incidence graph is bipartite since every edge joins an element e and a group S. See Fig. 1 for a simple illustration of the two constructions. Baldassarre et al. [4] prove that there is a polynomial time algorithm to solve the group model projection problem in the case that the intersection graph is an acyclic graph. Their algorithm uses dynamic programming on the acyclic structure of the intersection graph.

Fig. 1 Intersection graph and incidence graph of the group model \({\mathfrak {G}}=\{A=\{1,2\},B=\{1,2,3\},C=\{2,4\}, D=\{3,4\}\}\)
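Both graphs can be built directly from the definitions. The following sketch does so for the group model of Fig. 1; the function names and the edge representations are our own choices.

```python
from itertools import combinations

def intersection_graph(groups):
    """Edges RS between distinct groups R, S that share at least one element."""
    return {
        frozenset((r, s))
        for r, s in combinations(groups, 2)
        if groups[r] & groups[s]
    }

def incidence_graph(groups):
    """Edges (e, S) between an element e and every group S that contains it."""
    return {(e, name) for name, members in groups.items() for e in members}

# The group model of Fig. 1.
groups = {"A": {1, 2}, "B": {1, 2, 3}, "C": {2, 4}, "D": {3, 4}}
I_edges = intersection_graph(groups)  # every pair of groups except A and D
B_edges = incidence_graph(groups)     # one edge per (element, group) membership
```

For this model the intersection graph has five edges (only A and D are disjoint), and the incidence graph has nine edges, one for each membership of an element in a group.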

We generalize this approach and show that the same problem can be solved in polynomial time if the treewidth of the incidence graph is bounded. By Proposition 2 below, this implies that the group model projection problem can be solved in polynomial time if the treewidth of the intersection graph is bounded. We proceed by formally introducing the relevant concepts.

Tree decomposition

Let \(\bar{G}=(V,E)\) be a graph. A tree decomposition of \(\bar{G}\) is a tree T where each node \(x \in V(T)\) of T has a bag \(B_x \subseteq V\) of vertices of \(\bar{G}\) such that the following properties hold:

  1. \(\bigcup _{x \in V(T)} B_x = V\).

  2. If \(B_x\) and \(B_y\) both contain a vertex \(v \in V\), then the bags of all nodes of T on the path between x and y contain v as well. Equivalently, the tree nodes containing the vertex v form a connected subtree of T.

  3. For every edge vw in E there is some bag that contains both v and w. That is, vertices in V can be adjacent only if the corresponding subtrees in T have a node in common.

The width of a tree decomposition is the size of its largest bag minus one, i.e. \(\max _{x \in V(T)} |B_x|-1\). The treewidth of \(\bar{G}\), \(\text{ tw }(\bar{G})\), is the minimum width among all possible tree decompositions of \(\bar{G}\).

Intuitively, the treewidth measures how ‘treelike’ a graph is: the smaller the treewidth is, the more treelike is the graph. The graphs of treewidth one are the acyclic graphs. Figure 2 shows a graph of treewidth 2, together with a tree decomposition.

Fig. 2 A graph and a tree decomposition of width 2
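The three properties can be verified mechanically. The sketch below checks them for a candidate decomposition and computes its width. The function names and the example (a cycle on four vertices, not necessarily the graph of Fig. 2) are assumptions, and the input `tree_edges` is assumed to describe a tree.

```python
def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check properties 1-3 for bags (dict: tree node -> set of graph vertices)."""
    # Property 1: every vertex of the graph appears in some bag.
    if set().union(*bags.values()) != set(vertices):
        return False
    # Property 3: every edge of the graph is contained in some bag.
    for v, w in edges:
        if not any(v in b and w in b for b in bags.values()):
            return False
    # Property 2: the tree nodes whose bags contain v form a connected subtree.
    adj = {node: set() for node in bags}
    for a, b in tree_edges:
        adj[a].add(b)
        adj[b].add(a)
    for v in vertices:
        nodes = {node for node, b in bags.items() if v in b}
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            for nxt in adj[stack.pop()] & nodes:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if seen != nodes:
            return False
    return True

def width(bags):
    """Size of the largest bag minus one."""
    return max(len(b) for b in bags.values()) - 1

# A cycle on four vertices and a decomposition with two bags.
cycle_vertices = [1, 2, 3, 4]
cycle_edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
good_bags = {"x": {1, 2, 3}, "y": {1, 3, 4}}
good = is_tree_decomposition(cycle_vertices, cycle_edges, good_bags, [("x", "y")])

# Putting each edge in its own bag along a path violates property 2 for vertex 1.
bad_bags = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}, "d": {4, 1}}
bad = is_tree_decomposition(
    cycle_vertices, cycle_edges, bad_bags, [("a", "b"), ("b", "c"), ("c", "d")]
)
```

The valid decomposition has width 2, matching the fact that a cycle is not acyclic and hence has treewidth larger than one.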

Before stating any algorithms, we discuss the relation of the treewidth of the intersection and the incidence graphs of a given group model.

When bounding the treewidth of the graphs associated to a group model, it makes sense to consider the incidence graph rather than the intersection graph. This is due to the following simple observation.

Proposition 2

For any group model \({\mathfrak {G}}\) it holds that \(\text{ tw }(B({\mathfrak {G}})) \le \text{ tw }(I({\mathfrak {G}}))+1\). However, for every t there exists a group model \({\mathfrak {G}}\) such that \(\text{ tw }(I({\mathfrak {G}}))-\text{ tw }(B({\mathfrak {G}})) = t\).

This statement is not necessarily new, but we quickly prove it in our language in order to be self-contained.

Proof of Proposition 2

To see the first assertion, let \({\mathfrak {G}}\) be a group model. Let T be a tree decomposition of \(I({\mathfrak {G}})\) of width \(\text{ tw }(I({\mathfrak {G}}))\). In the following, we attach leaves to T, one for each element in [N], and obtain a tree decomposition of \(B({\mathfrak {G}})\). Each new leaf will contain at most \(\text{ tw }(I({\mathfrak {G}}))+1\) many elements of \({\mathfrak {G}}\) and exactly one element of [N], so its bag has size at most \(\text{ tw }(I({\mathfrak {G}}))+2\). Hence we get \(\text{ tw }(B({\mathfrak {G}})) \le \text{ tw }(I({\mathfrak {G}}))+1\).

To construct the tree decomposition of \(B({\mathfrak {G}})\) pick any \(i \in [N]\), and let \({\mathfrak {G}}_i\) be the set of groups in \({\mathfrak {G}}\) containing i. Since all groups in \({\mathfrak {G}}_i\) contain i, the set \({\mathfrak {G}}_i\) is a clique in \(I({\mathfrak {G}})\) (see Footnote 1). Moreover, since T is a tree decomposition of \(I({\mathfrak {G}})\), the subtrees of the groups in \({\mathfrak {G}}_i\) mutually share a node. As subtrees of a tree have the Helly property, there is at least one node x of T such that \({\mathfrak {G}}_i \subseteq B_x\). We now add a new node \(x_i\) with bag \(B_{x_i} = {\mathfrak {G}}_i \cup \{i\}\) and an edge between \(x_i\) and x in T. Doing this for all \(i\in [N]\) simultaneously, it is easy to see that we arrive at a tree decomposition \(T'\) of \(B({\mathfrak {G}})\) of width at most \(\text{ tw }(I({\mathfrak {G}}))+1\) which proves the first assertion.

To prove the second assertion consider, for any t, the group model \({\mathfrak {G}}\) where

$$\begin{aligned}{\mathfrak {G}}= \{{\mathcal {G}}_i : i\in [t+2]\} \text{ with } {\mathcal {G}}_i = \{i,t+3\}.\end{aligned}$$

Note that \(B({\mathfrak {G}})\) is a tree, hence \(\text{ tw }(B({\mathfrak {G}}))=1\). In \(I({\mathfrak {G}})\), however, the set \({\mathfrak {G}}\) is a clique of size \(t+2\). Thus, \(\text{ tw }(I({\mathfrak {G}}))=t+1\) which implies \(\text{ tw }(I({\mathfrak {G}}))-\text{ tw }(B({\mathfrak {G}})) = t\). \(\square \)
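The construction used for the second assertion can be checked for small t. The sketch below builds the group model and verifies that its incidence graph is a tree while its intersection graph is complete; the function names are ours. That a clique on \(t+2\) vertices has treewidth \(t+1\) is a standard fact and is not computed here.

```python
from itertools import combinations

def prop2_group_model(t):
    """Groups G_i = {i, t+3} for i in [t+2], as in the second assertion."""
    return {i: {i, t + 3} for i in range(1, t + 3)}

def incidence_graph_is_tree(groups):
    # Tag vertices so that element k and group k are distinct vertices.
    edges = [(("elem", e), ("group", g)) for g, members in groups.items() for e in members]
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    # A connected graph with |V| - 1 edges is a tree.
    return len(seen) == len(adj) and len(edges) == len(adj) - 1

def intersection_graph_is_clique(groups):
    """True if every two distinct groups share an element."""
    return all(groups[r] & groups[s] for r, s in combinations(groups, 2))

t = 3
model = prop2_group_model(t)
tree_ok = incidence_graph_is_tree(model)
clique_ok = intersection_graph_is_clique(model)
```

For comparison, the same check applied to the group model of Fig. 1 reports that its incidence graph is not a tree, since it has nine edges on eight vertices.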

Consider now a tree decomposition T of a graph \(\bar{G}\). We say that T is a nice tree decomposition if every node x is of one of the following types.

  • Leaf: x has no children and \(B_x=\emptyset \).

  • Introduce: x has one child, say y, and there is a vertex \(v \notin B_y\) of \(\bar{G}\) with \(B_x = B_y \cup \{v\}\).

  • Forget: x has one child, say y, and there is a vertex \(v \notin B_x\) of \(\bar{G}\) with \(B_y = B_x \cup \{v\}\).

  • Join: x has two children y and z such that \(B_x=B_y=B_z\).

This kind of decomposition limits the structure of the difference of two adjacent nodes in the decomposition. A folklore statement (explained in detail in the classic survey by Kloks [39]) says that such a nice decomposition is easily computed given any tree decomposition of \(\bar{G}\) without increasing the width.
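The four node types translate into a simple local test on a bag and the bags of its children. The following sketch classifies a node accordingly; the function name and the use of Python sets for bags are assumptions.

```python
def node_type(bag, child_bags):
    """Classify a node of a rooted tree decomposition.

    Returns 'leaf', 'introduce', 'forget' or 'join', or None if the node
    matches none of the four types of a nice tree decomposition.
    """
    if not child_bags:
        return "leaf" if not bag else None
    if len(child_bags) == 1:
        (child,) = child_bags
        # Introduce: the bag is the child's bag plus exactly one new vertex.
        if len(bag) == len(child) + 1 and child < bag:
            return "introduce"
        # Forget: the bag is the child's bag minus exactly one vertex.
        if len(child) == len(bag) + 1 and bag < child:
            return "forget"
        return None
    if len(child_bags) == 2 and all(c == bag for c in child_bags):
        return "join"
    return None

leaf = node_type(set(), [])
introduce = node_type({1}, [set()])
forget = node_type({1}, [{1, 2}])
join = node_type({1, 2}, [{1, 2}, {1, 2}])
invalid = node_type({1, 2}, [{3}])
```

A rooted decomposition is nice exactly when this test returns one of the four type names for every node.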

Dynamic programming

In this section we derive a polynomial time algorithm for the group model projection problem for fixed treewidth. As in Sect. 2.4 we assume we have a given signal \(x\in {\mathbb {R}}^N\) and define \(w\in {\mathbb {R}}^N\) with \(w_i=x_i^2\). In the following we use the notation

$$\begin{aligned} w({\mathcal {S}} ):=\sum _{i\in {\mathcal {S}}}w_i \end{aligned}$$

for a subset \({\mathcal {S}}\subseteq [N]\). The algorithm is presented using a nice tree decomposition of the incidence graph \(B({\mathfrak {G}})\) and uses the following concept. Fix a node x of the decomposition tree of \(B({\mathfrak {G}})\), a number i with \(0 \le i \le G\) and a map \(c:B_x\rightarrow \{0,1,1_?\}\). We say that c is a colouring of \(B_x\). We consider solutions to the problem group model projection \((x,i,c)\), which is defined as follows.

A set \({\mathcal {S}}\subseteq {\mathfrak {G}}\) is a feasible solution of group model projection \((x,i,c)\) if

  (a) \({\mathcal {S}}\) contains only groups that appear in some bag of a node in the subtree rooted at node x,

  (b) \(|{\mathcal {S}}| = i\),

  (c) \({\mathcal {S}}\cap B_x\) contains exactly those group-vertices of \(B_x\) that are in \(c^{-1}(1)\), that is,

    $$\begin{aligned} {\mathcal {S}}\cap c^{-1}(1) = {\mathfrak {G}}\cap c^{-1}(1), \text { and} \end{aligned}$$

  (d) of the elements in \(B_x\), \({\mathcal {S}}\) covers exactly those that are in \(c^{-1}(1)\). Formally,

    $$\begin{aligned}\left( \bigcup {\mathcal {S}}\right) \cap B_x = [N] \cap c^{-1}(1).\end{aligned}$$

The objective value of the set \({\mathcal {S}}\) is given by \(w(\bigcup {\mathcal {S}}) + w(c^{-1}(1_?))\). Intuitively, a feasible solution to group model projection \((x,i,c)\) does not cover elements labelled 0 or \(1_?\), but covers all elements labelled 1. The elements labelled \(1_?\) are assumed to be covered by groups not yet visited in the tree decomposition.

If group model projection \((x,i,c)\) does not admit a feasible solution, we say that group model projection \((x,i,c)\) is infeasible. The maximum objective value attained by a feasible solution to group model projection \((x,i,c)\), if feasible, we denote by \(\text{ OPT }(x,i,c)\). If group model projection \((x,i,c)\) is infeasible, we set \(\text{ OPT }(x,i,c) = - \infty \).

Assertion (d) implies that group model projection \((x,i,c)\) is infeasible if the groups in \(c^{-1}(1)\) cover elements in \(c^{-1}(0)\) or \(c^{-1}(1_?)\). That is, group model projection \((x,i,c)\) is infeasible if \(\bigcup \left( {\mathfrak {G}} \cap c^{-1}(1)\right) \not \subseteq [N] \cap c^{-1}(1)\). To deal with this exceptional case we call c consistent if \(\bigcup \left( {\mathfrak {G}} \cap c^{-1}(1)\right) \subseteq [N] \cap c^{-1}(1)\), and inconsistent otherwise. Note that consistency of c is necessary to ensure feasibility of group model projection \((x,i,c)\), but not sufficient.
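The consistency test is a purely local check on a bag. The sketch below implements it for colourings given as dictionaries. It compares only elements that are present in the bag, which is how we read the sentence describing the infeasible case; this reading, the encoding of the label \(1_?\) as the string `"1?"` and all names are assumptions.

```python
ONE_Q = "1?"  # stands for the label 1_? (an encoding chosen for this sketch)

def is_consistent(colouring, groups):
    """Check that no group coloured 1 covers a bag element coloured 0 or 1_?.

    colouring : dict mapping the vertices of a bag (elements or group names)
                to 0, 1 or ONE_Q
    groups    : dict mapping group names to sets of elements
    """
    chosen = [v for v, col in colouring.items() if col == 1 and v in groups]
    for g in chosen:
        for e in groups[g]:
            # Elements outside the bag are ignored (assumption, see lead-in).
            if e in colouring and colouring[e] != 1:
                return False
    return True

# Bags taken from the group model of Fig. 1.
groups = {"A": {1, 2}, "B": {1, 2, 3}, "C": {2, 4}, "D": {3, 4}}
consistent = is_consistent({"A": 1, "C": 0, 1: 1, 2: 1}, groups)
covers_zero = is_consistent({"A": 1, "C": 0, 1: 1, 2: 0}, groups)
covers_open = is_consistent({"A": 1, 1: 1, 2: ONE_Q}, groups)
```

In the second and third colourings the chosen group A covers element 2, which is labelled 0 or \(1_?\), so both are rejected.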

Our algorithm processes the nodes of a nice tree decomposition in a bottom-up fashion. Fix a node x, a number i with \(0 \le i \le G\) and a map \(c:B_x\rightarrow \{0,1,1_?\}\). We use dynamic programming to compute the value \(\text{ OPT }(x,i,c)\), assuming we know all possible values \(\text{ OPT }(y,j,c')\) for all children y of x, all j with \(0 \le j \le G\), and all \(c': B_y\rightarrow \{0,1,1_?\}\).

In the following, for a subset \(S\subset B_x\) the function \(c|_S:S\rightarrow \{0,1,1_?\}\) is the restriction of c to S. We use \(\varGamma (v)\) to denote the neighborhood of v in \(B({\mathfrak {G}})\).

If c is not consistent, we may set \(\text{ OPT }(x,i,c) = - \infty \) right away. We thus assume that c is consistent and distinguish the type of node x as follows.

  • Leaf: set \(\text{ OPT }(x,0,c) = 0\) and \(\text{ OPT }(x,i,c) = -\infty \) for all \(i \in [G]\).

  • Introduce: let y be the unique child of x and let \(v \notin B_y\) such that \(B_x = B_y \cup \{v\}\).

    If \(v \in [N]\), we set

    $$\begin{aligned} \text{ OPT }(x,i,c) = {\left\{ \begin{array}{ll} \text{ OPT }(y,i,c|_{B_y}), &{}\quad \text { if } c(v)=0\\ \text{ OPT }(y,i,c|_{B_y}) + w(v), &{}\quad \text { if } c(v)=1 \text { and } i > 0\\ \text{ OPT }(y,i,c|_{B_y}) + w(v),&{}\quad \text { if } c(v)=1_?\\ -\infty , &{}\quad \text { otherwise} \end{array}\right. } \end{aligned}$$

    If \(v \in {\mathfrak {G}}\), we set

    $$\begin{aligned} \text{ OPT }(x,i,c) = {\left\{ \begin{array}{ll} \text{ OPT }(y,i,c|_{B_y}), &{}\quad \text { if } c(v)=0\\ \max \{ \text{ OPT }(y,i-1,c') : (y,c') \text { is compatible to } (x,c) \}, &{}\quad \text { if } c(v)=1 \text { and } c^{-1}(1) \cap \varGamma (v) \ne \emptyset \\ -\,\infty , &{}\quad \text { otherwise} \end{array}\right. } \end{aligned}$$
    (21)

    where \((y,c')\) is compatible to \((x,c)\) if

    • \(c':B_y \rightarrow \{0,1,1_?\}\) is a consistent colouring of \(B_y\),

    • \(c^{-1}(0) = c'^{-1}(0)\), and

    • \(c^{-1}(1) = c'^{-1}(1) \cup (c'^{-1}(1_?) \cap \varGamma (v))\).

  • Forget: let y be the unique child of x and let \(v \notin B_x\) such that \(B_y = B_x \cup \{v\}\). We set

    $$\begin{aligned} \text{ OPT }(x,i,c) = \max \{\text{ OPT }(y,i,c') : ~ c':B_y \rightarrow \{0,1,1_?\} \text{, } c = c'|_{B_x} \text {, and } c'(v) \ne 1_?\} \end{aligned}$$
    (22)
  • Join: we set

    $$\begin{aligned} \text{ OPT }(x,i,c)= & {} \max \{\text{ OPT }(y,i_1,c') + \text{ OPT }(z,i_2,c'') \nonumber \\&- w(((B_x \cap [N]) \cup \bigcup (B_x \cap {\mathfrak {G}})) {\setminus } c^{-1}(0)) : i_1\nonumber \\&+ i_2 - |c'^{-1}(1) \cap c''^{-1}(1) \cap {\mathfrak {G}}| = i\}, \end{aligned}$$
    (23)

    where y and z are the two children of x. The maximum is taken over all consistent colourings \(c',c'':B_x \rightarrow \{0,1,1_?\}\) with \(c^{-1}(0)=c'^{-1}(0)=c''^{-1}(0)\) and \(c^{-1}(1) = c'^{-1}(1) \cup c''^{-1}(1)\).

  • Root: first we compute \(\text{ OPT }(x,G,c)\) for all relevant choices of c, depending on the type of node x. The algorithm then terminates with the output

    $$\begin{aligned} \text{ OPT } = \max \left\{ \text{ OPT }(x,G,c) : ~ c : B_x \rightarrow \{0,1,1_?\}\right\} . \end{aligned}$$

Lemma 2

The output OPT is the objective value of an optimal solution of the group model projection problem.

Proof

The proof follows the individual steps of the dynamic programming algorithm. Consider the problem group model projection \((x,i,c)\). If c is not consistent, we correctly set \(\text{ OPT }(x,i,c)=-\,\infty \). We thus proceed to the case when c is consistent. Fix an optimal solution \({\mathcal {S}}\) of group model projection \((x,i,c)\) if existent.

The leaf-node case is clear, so we proceed to the case of x being an introduce-node. Let y be the unique child of x and let \(v \notin B_y\) such that \(B_x = B_y \cup \{v\}\). First assume that \(v \in [N]\).

Assume that \(c(v)=0\). Since c is consistent, \(c^{-1}(1) \cap \varGamma (v) = \emptyset \) holds, and so \({\mathcal {S}}\) is an optimal solution to group model projection \((y,i,c|_{B_y})\). We may thus set \(\text{ OPT }(x,i,c)=\text{ OPT }(y,i,c|_{B_y})\).

If \(c(v)=1\), \({\mathcal {S}}\) covers v, so we need to make sure some vertex \(u \in \varGamma (v) \cap B_x\) is contained in the solution in order for group model projection \((x,i,c)\) to be feasible. Hence we have \(\text{ OPT }(x,i,c) = \text{ OPT }(y,i,c|_{B_y}) + w(v)\) if \(c^{-1}(1) \cap \varGamma (v) \ne \emptyset \), and \(\text{ OPT }(x,i,c)=-\,\infty \) otherwise.

Next we assume that \(v \in {\mathfrak {G}}\). If \(c(v)=0\), we may simply put \(\text{ OPT }(x,i,c) = \text{ OPT }(y,i,c|_{B_y})\). So, assume that \(c(v)=1\). If group model projection \((x,i,c)\) is feasible and thus \({\mathcal {S}}\) exists, define \({\mathcal {S}}' = {\mathcal {S}}{\setminus } \{v\}\). Now \({\mathcal {S}}'\) is a solution to group model projection \((y,i-1,c')\) for some \(c' : B_y \rightarrow \{0,1,1_?\}\) with \(c^{-1}(0) = c'^{-1}(0)\) and \(c^{-1}(1) = c'^{-1}(1) \cup (c'^{-1}(1_?) \cap \varGamma (v))\). Note that \((y,c')\) is compatible to \((x,c)\). Consequently, \(\text{ OPT }(x,i,c)\) is upper bounded by the right hand side of (21).

To see that \(\text{ OPT }(x,i,c)\) is at least the right hand side of (21), let \((y,c'')\) be compatible to \((x,c)\) and let \({\mathcal {S}}''\) be a solution to group model projection \((y,i-1,c'')\) of objective value \(\lambda \in {\mathbb {R}}\). Then \({\mathcal {S}}'' \cup \{v\}\) is a solution to group model projection \((x,i,c)\) of objective value \(\lambda \). Consequently,

$$\begin{aligned}\text{ OPT }(x,i,c) = \max \{ \text{ OPT }(y,i-1,c') : (y,c') \text { is compatible to } (x,c) \}.\end{aligned}$$

If x is a forget-node, let y be the unique child of x and let \(v \notin B_x\) such that \(B_y = B_x \cup \{v\}\). If \(v \in {\mathcal {S}}\), we have

$$\begin{aligned} \text{ OPT }(x,i,c) \le \text{ OPT }(y,i,c') \text{ where } c':B_y \rightarrow \{0,1,1_?\}, \ \ c = c'|_{B_x} \text { and } c'(v) =1. \end{aligned}$$

Otherwise, if \(v \notin {\mathcal {S}}\), we have

$$\begin{aligned} \text{ OPT }(x,i,c) \le \text{ OPT }(y,i,c') \text{ where } c':B_y \rightarrow \{0,1,1_?\}, \ \ c = c'|_{B_x} \text { and } c'(v) =0. \end{aligned}$$

Moreover, any solution of group model projection \((y,i,c')\), where \(c':B_y \rightarrow \{0,1,1_?\}, \ \ c = c'|_{B_x}\) and \(c'(v) =1\) is a solution of group model projection \((x,i,c)\). This proves (22).

If x is a join-node, let y and z be the two children of x and recall that \(B_x=B_y=B_z\).

Let \({\mathcal {S}}'\) be the collection of groups in \({\mathcal {S}}\) contained in the subtree rooted at y, and let \({\mathcal {S}}''\) be the collection of groups in \({\mathcal {S}}\) contained in the subtree rooted at z. Since T is a tree decomposition, \({\mathcal {S}}' \cap {\mathcal {S}}'' = {\mathcal {S}}\cap B_x\).

Note that \({\mathcal {S}}'\) is a solution of group model projection \((y,i_1,c')\) and \({\mathcal {S}}''\) is a solution of group model projection \((z,i_2,c'')\) for some \(c',c'':B_x \rightarrow \{0,1,1_?\}\) and \(i_1,i_2\) with \(i_1 + i_2 - |c'^{-1}(1) \cap c''^{-1}(1) \cap {\mathfrak {G}}| = i\). It is easy to see that \(c^{-1}(0)=c'^{-1}(0)=c''^{-1}(0)\) and \(c^{-1}(1) = c'^{-1}(1) \cup c''^{-1}(1)\).

The objective value of \({\mathcal {S}}\) equals

$$\begin{aligned} w\left( \bigcup {\mathcal {S}}\right) + w\left( c^{-1}\left( 1_?\right) \right)= & {} w\left( \bigcup {\mathcal {S}}'\right) + w\left( c'^{-1}\left( 1_?\right) \right) + w\left( \bigcup {\mathcal {S}}''\right) + w\left( c''^{-1}\left( 1_?\right) \right) \\&- w\left( \left( \left( B_x \cap [N]\right) \cup \bigcup \left( B_x \cap {\mathfrak {G}}\right) \right) {\setminus } c^{-1}\left( 0\right) \right) . \end{aligned}$$

This shows that \(\text{ OPT }(x,i,c)\) is at most the right hand side of (23).

Now let \(\tilde{{\mathcal {S}}}\) be an optimal solution of group model projection \((y,j_1,\tilde{c})\) and let \(\hat{{\mathcal {S}}}\) be an optimal solution of group model projection \((z,j_2,\hat{c})\) where

  • \(\tilde{c}, \hat{c}:B_x \rightarrow \{0,1,1_?\}\) are both consistent,

  • \(c^{-1}(0)=\tilde{c}^{-1}(0)=\hat{c}^{-1}(0)\) and \(c^{-1}(1) = \tilde{c}^{-1}(1) \cup \hat{c}^{-1}(1)\), and

  • \(j_1 + j_2 - |\tilde{c}^{-1}(1) \cap \hat{c}^{-1}(1) \cap {\mathfrak {G}}| = i\).

Note that \(\tilde{{\mathcal {S}}}\) and \(\hat{{\mathcal {S}}}\) exist since, as we have shown earlier, the colourings \(c'\) and \(c''\) satisfy the above assertions. Then \(\hat{{\mathcal {S}}} \cup \tilde{{\mathcal {S}}}\) is a solution of group model projection \((x,i,c)\) with objective value

$$\begin{aligned} \max \left\{ \text{ OPT }(y,j_1,\tilde{c}) + \text{ OPT }(z,j_2,\hat{c}) - w(((B_x \cap [N]) \cup \bigcup (B_x \cap {\mathfrak {G}})) {\setminus } c^{-1}(0))\right\} . \end{aligned}$$

Consequently, \(\text{ OPT }(x,i,c)\) is at least the right hand side of (23) and thus (23) holds. \(\square \)

By storing the best current solution alongside the \(\text{ OPT }(x,i,c)\)-values we can compute an optimal solution together with OPT.

Runtime of the algorithm

The computational complexities of the individual steps are as follows.

  1. (a)

    Given the incidence graph \(B({\mathfrak {G}})\) on \(n=M+N\) vertices of treewidth \(w_T\), one can compute a tree decomposition of width \(w_T\) in time \({\mathcal {O}}(2^{{\mathcal {O}}(w_T^3)}n)\) using Bodlaender’s algorithm [11]. The number of nodes of the decomposition is in \({\mathcal {O}}(n)\).

  2. (b)

    Given a tree decomposition of width \(w_T\) with t nodes, one can compute a nice tree decomposition of width \(w_T\) on \({\mathcal {O}}(w_Tt)\) nodes in \({\mathcal {O}}(w_T^2t)\) time in a straightforward way [39].

The running time of the dynamic programming is bounded as follows.

Theorem 1

The dynamic programming algorithm can be implemented to run in \({\mathcal {O}}(w_T 5^{w_T} G^2 N t)\) time given a nice tree decomposition of \(B({\mathfrak {G}})\) of width \(w_T\) on t nodes.

Note that we can assume that \(t = {\mathcal {O}}(w_T n)\) with \(n=M+N\). Together with the running time of the construction of the nice tree decomposition we can solve the exact projection problem on graphs with treewidth \(w_T\) in \({\mathcal {O}}((N+M)(w_T^2 5^{w_T} G^2 N + 2^{{\mathcal {O}}(w_T^3)}+w_T^2))\).

Proof of Theorem 1

Since the join-nodes are clearly the bottleneck of the algorithm, we discuss how to organize the computation in a way that the desired running time bound of \({\mathcal {O}}(w_T 5^{w_T} G^2 N)\) holds in a node of this type.

So, let x be a join-node and assume that y and z are the two children of x. We want to compute \(\text{ OPT }(x,i,c)\), for all colourings \(c:B_x \rightarrow \{0,1,1_?\}\) and all i with \(0 \le i \le G\). Recall that we need to compute this value according to (23), that is,

$$\begin{aligned} \text{ OPT }(x,i,c)&= \max \left\{ \text{ OPT }(y,i_1,c') + \text{ OPT }(z,i_2,c'') \right. \nonumber \\&\quad - \left. \,w(((B_x \cap [N]) \cup \bigcup (B_x \cap {\mathfrak {G}})) {\setminus } c^{-1}(0)) : i_1 + i_2 - |c'^{-1}(1) \cap c''^{-1}(1) \cap {\mathfrak {G}}| = i\right\} , \end{aligned}$$

where the maximum is taken over all consistent colourings \(c',c'':B_x \rightarrow \{0,1,1_?\}\) with \(c^{-1}(0)=c'^{-1}(0)=c''^{-1}(0)\) and \(c^{-1}(1) = c'^{-1}(1) \cup c''^{-1}(1)\).

We enumerate all \(5^{w_T+1}\) colourings \(C: B_x \rightarrow \{(0,0),(1,1),(1,1_?),(1_?,1),(1_?,1_?)\}\) and derive c, \(c'\), and \(c''\). We put

$$\begin{aligned} c(v)&= {\left\{ \begin{array}{ll} 0, &{}\quad \text { if } C(v)=(0,0)\\ 1, &{}\quad \text { if } C(v) \in \{(1,1),(1,1_?),(1_?,1)\}\\ 1_?, &{}\quad \text { if } C(v)=(1_?,1_?)\\ \end{array}\right. }\\ c'(v)&= {\left\{ \begin{array}{ll} 0, &{}\quad \text { if } C(v)=(0,0)\\ 1, &{}\quad \text { if } C(v) \in \{(1,1),(1,1_?)\}\\ 1_?, &{}\quad \text { if } C(v) \in \{(1_?,1),(1_?,1_?)\}\\ \end{array}\right. }\\ c''(v)&= {\left\{ \begin{array}{ll} 0, &{}\quad \text { if } C(v)=(0,0)\\ 1, &{}\quad \text { if } C(v) \in \{(1,1),(1_?,1)\}\\ 1_?, &{}\quad \text { if } C(v) \in \{(1,1_?),(1_?,1_?)\}\\ \end{array}\right. } \end{aligned}$$

If any of c, \(c'\), or \(c''\) is inconsistent, we discard this choice of C. In this way we capture every consistent colouring \(c:B_x \rightarrow \{0,1,1_?\}\) and all consistent choices of \(c'\) and \(c''\) satisfying \(c^{-1}(0)=c'^{-1}(0)=c''^{-1}(0)\) and \(c^{-1}(1) = c'^{-1}(1) \cup c''^{-1}(1)\).
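The enumeration of the pair colourings C can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bag is a plain list of vertex labels, the colour \(1_?\) is encoded as the string `"1?"`, the first and second component of \(C(v)\) are read as \(c'(v)\) and \(c''(v)\), and the consistency test of the colourings is omitted. The names `merged_colour` and `join_colourings` are hypothetical.

```python
from itertools import product

# Pair colours (c'(v), c''(v)) allowed at a join-node; "1?" encodes the colour 1_?.
PAIRS = [("0", "0"), ("1", "1"), ("1", "1?"), ("1?", "1"), ("1?", "1?")]

def merged_colour(pair):
    """Colour c(v) of the join-node derived from the pair (c'(v), c''(v))."""
    if pair == ("0", "0"):
        return "0"
    if pair == ("1?", "1?"):
        return "1?"
    return "1"  # at least one of the two children assigns colour 1 to v

def join_colourings(bag):
    """Yield (c, c1, c2) for every pair colouring C of the bag.

    Assumption: the consistency check from the text is NOT performed here.
    """
    for choice in product(PAIRS, repeat=len(bag)):
        c = {v: merged_colour(p) for v, p in zip(bag, choice)}
        c1 = {v: p[0] for v, p in zip(bag, choice)}
        c2 = {v: p[1] for v, p in zip(bag, choice)}
        yield c, c1, c2

triples = list(join_colourings(["a", "b", "c"]))
print(len(triples))  # 125, i.e. 5 ** 3
```

For a bag with \(w_T+1\) vertices the generator produces \(5^{w_T+1}\) triples, which matches the count used in the running time bound.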

It remains to discuss the computation of the value \(w((c^{-1}(1) \cap [N]) \cup \bigcup (c^{-1}(1) \cap {\mathfrak {G}}))\). This value can be computed in \({\mathcal {O}}(w_TN)\) time, since we are computing differences and unions of at most \(w_T\) groups of size at most N each. Combining this with the \(5^{w_T+1}\) choices of C and the \({\mathcal {O}}(G^2)\) choices of \(i_1\) and \(i_2\), we arrive at a total running time in \({\mathcal {O}}(w_T 5^{w_T} G^2 N)\). \(\square \)

Remark 1

The dynamic programming algorithm can be extended to include a sparsity restriction on the support of the signal approximation itself. So, we can compute an optimal K-sparse G-group-sparse signal approximation if the treewidth of the bipartite incidence graph of the studied group model is bounded. The running time of the algorithm increases by a factor of \({\mathcal {O}}(K)\).

Hardness on grid-like group structures

An \(r \times r\)-grid is a graph with vertex set \([r]\times [r]\), and two vertices \((a,b),(c,d) \in [r]\times [r]\) are adjacent if and only if \(|a-c|=1\) and \(|b-d|=0\), or if \(|a-c|=0\) and \(|b-d|=1\). We also say that r is the size of the grid. Figure 3 shows a \(6 \times 6\)-grid.

Fig. 3
figure 3

A \(6 \times 6\)-grid
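As a small illustration of the grid definition, the following Python sketch builds the vertex and edge sets of an \(r \times r\)-grid. The function name `grid_edges` and the tuple representation of vertices are assumptions made for this example only.

```python
def grid_edges(r):
    """Vertices and edges of the r x r grid on the vertex set [r] x [r] (1-based)."""
    vertices = [(a, b) for a in range(1, r + 1) for b in range(1, r + 1)]
    edges = set()
    for (a, b) in vertices:
        if a < r:
            edges.add(((a, b), (a + 1, b)))  # |a - c| = 1 and |b - d| = 0
        if b < r:
            edges.add(((a, b), (a, b + 1)))  # |a - c| = 0 and |b - d| = 1
    return vertices, edges

vertices, edges = grid_edges(6)
print(len(vertices), len(edges))  # 36 60
```

Every edge joins two vertices whose coordinate sums have different parity, so the grid is bipartite. The hardness proof below relies on such a bipartition to split the grid into group-vertices and element-vertices.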

Recall that the group model projection problem can be solved efficiently when the treewidth of the incidence graph of the group structure is bounded, as shown in Sect. 3.1.

Definition 7

(Graph minor) Let \(G_1\) and \(G_2\) be two graphs. The graph \(G_2\) is called a minor of \(G_1\) if \(G_2\) can be obtained from \(G_1\) by deleting edges and/or vertices and by contracting edges.

A classical theorem by Robertson and Seymour [52] says that in a graph class \({\mathcal {C}}\) closed under taking minors either the treewidth is bounded or \({\mathcal {C}}\) contains all grid graphs.

Consequently, if \({\mathcal {C}}\) is a class of graphs that does not have a bounded treewidth, it contains all grids. Our next theorem shows that group model projection is NP-hard on group models \({\mathfrak {G}}\) for which \(B({\mathfrak {G}})\) is a grid, thus complementing Theorem 1.

Theorem 2

The group model projection problem is NP-hard even if restricted to instances \({\mathfrak {G}}\) for which \(B({\mathfrak {G}})\) is a grid graph and the weight of any element is either 0 or 1.

Consider the following problem: given an \(n \times n\)-pixel black-and-white image, pick k \(2 \times 2\)-pixel windows to cover as many black pixels as possible. This problem can be modeled as the group model projection problem on a grid graph where the weight of any element is either 0 or 1. See Fig. 4 for an illustration. Note that we added artificial pixel-nodes with weight 0 to the boundary of the graph to obtain the desired grid-structure. This group model is of frequency at most 4, and so we can do an approximate model projection and signal recovery using the result of Sect. 4.

Fig. 4
figure 4

Selecting black pixels by \(2\times 2\) pixel windows/group model projection in a grid graph with binary weights. Squares are group-vertices, circles are element-vertices
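The window selection problem can be made concrete with a brute-force Python sketch. It enumerates all choices of k windows, so it is only usable for tiny images, in line with the NP-hardness stated in Theorem 2. The function name `best_window_cover` and the encoding of the image as a list of 0/1 rows (1 for a black pixel) are assumptions for this example.

```python
from itertools import combinations

def best_window_cover(image, k):
    """Choose k 2x2 windows covering the most black pixels (1 = black) by enumeration."""
    n = len(image)
    # A window is identified by the position of its top-left pixel.
    windows = [(r, c) for r in range(n - 1) for c in range(n - 1)]

    def covered(selection):
        cells = {(r + dr, c + dc)
                 for (r, c) in selection for dr in (0, 1) for dc in (0, 1)}
        return sum(image[r][c] for (r, c) in cells)

    best = max(combinations(windows, k), key=covered)
    return covered(best), best

image = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
value, chosen = best_window_cover(image, 2)
print(value)  # 6
```

Each pixel lies in at most four windows, which is the frequency bound mentioned in the text.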

Our proof is a reduction from the Vertex Cover problem. Recall that for a graph \(\bar{G}\), a vertex cover is a subset of the vertices of \(\bar{G}\) such that any edge of \(\bar{G}\) has at least one endpoint in this subset. The size of the smallest vertex cover of \(\bar{G}\), the vertex cover number, is denoted \(\tau (\bar{G})\).

Given a graph \(\bar{G}\) and a number k as input, the task in the Vertex Cover problem is to decide whether \(\bar{G}\) admits a vertex cover of size at most k, that is, whether \(\tau (\bar{G}) \le k\). This problem is NP-complete even if restricted to cubic planar graphs [23] (see Footnote 2).

We use the following simple lemma in our proof.

Lemma 3

Let \(\bar{G}\) be a graph and let \(G'\) be the graph obtained by subdividing some edge of \(\bar{G}\) twice. Then \(\tau (\bar{G}) = \tau (G')-1\).
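Lemma 3 can be checked on a small instance with the following Python sketch. It computes the vertex cover number by brute force and subdivides one edge twice. The helper names `vertex_cover_number` and `subdivide_twice` are hypothetical, and the check is an illustration on one graph, not a proof.

```python
from itertools import combinations

def vertex_cover_number(vertices, edges):
    """Smallest size of a vertex set touching every edge (brute force)."""
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            if all(u in chosen or v in chosen for (u, v) in edges):
                return size

def subdivide_twice(vertices, edges, edge):
    """Replace the edge uv by the path u - s1 - s2 - v with two new vertices."""
    u, v = edge
    s1, s2 = ("sub", u, v, 1), ("sub", u, v, 2)
    new_edges = [e for e in edges if e != edge] + [(u, s1), (s1, s2), (s2, v)]
    return vertices + [s1, s2], new_edges

V = [0, 1, 2]
E = [(0, 1), (1, 2), (0, 2)]             # a triangle
V2, E2 = subdivide_twice(V, E, (0, 1))   # a cycle on five vertices
print(vertex_cover_number(V, E), vertex_cover_number(V2, E2))  # 2 3
```

The triangle has vertex cover number 2 and the resulting 5-cycle has vertex cover number 3, in agreement with \(\tau (\bar{G}) = \tau (G')-1\).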

We can now prove our theorem.

Proof of Theorem 2

The reduction is from Vertex Cover on planar cubic graphs. Consider \(\bar{G}=(V,E)\) to be a planar cubic graph and let k be some number. Our aim is to compute an instance \(({\mathfrak {G}},{\mathbf {w}},k')\) of the group model projection problem where \(B({\mathfrak {G}})\) is a grid such that \(\bar{G}\) has a vertex cover of size k if and only if \({\mathfrak {G}}\) admits a selection of \(k'\) groups that together cover elements of a total weight at least some threshold t.

First we embed the graph \(\bar{G}\) in some grid H of polynomial size, meaning the vertices of \(\bar{G}\) get mapped to the vertices of the grid and edges get mapped to mutually internally disjoint paths in the grid connecting their endvertices. This can be done in polynomial time using an algorithm for orthogonal planar embedding [2]. We denote the mapping by \(\pi \), hence \(\pi (u)\) is some vertex of H and \(\pi (vw)\) is a path from \(\pi (v)\) to \(\pi (w)\) in H for all \(u \in V\) and \(vw \in E\).

Next we subdivide each edge of the grid 9 times, so that a vertical/horizontal edge of H becomes a vertical/horizontal path of length 10 in some new, larger grid \(H'\). We choose \(H'\) such that the corners of H are mapped to the corners of \(H'\). In particular, \(|V(H')| \le 100 |V(H)|\). Let us denote the obtained embedded subdivision of \(\bar{G}\) in \(H'\) by \(G'\), and let \(\pi '\) denote the embedding. Moreover, let \(\phi \) be the corresponding embedding of H into the subdivided grid \(H'\). Note that \(\text{ im }~\pi '|_{V} \subseteq \text{ im }~\phi |_{V}\).

Let \((A,B)\) be a bipartition of \(H'\). We may assume that \(\pi '(u)\) is in A for all \(u \in V\). We consider \(H'\) to be the incidence graph \(B({\mathfrak {G}})\) of a group model \({\mathfrak {G}}\) where the vertices in B correspond to the elements and the vertices in A correspond to the groups of \({\mathfrak {G}}\). We refer to the vertices in A as group-vertices and to vertices in B as element-vertices. Slightly abusing notation, we identify each group with its group-vertex and each element with its element-vertex and write \({\mathfrak {G}}=A\).

We observe that

  1. (a)

    \(G'\) is an induced subgraph of \(H'\),

  2. (b)

    every vertex \(\pi '(u)\), \(u \in V\), has degree 3 in \(G'\) and is a group-vertex,

  3. (c)

    every other vertex has degree 2 in \(G'\), and

  4. (d)

    for every group-vertex \(x \in V(H') {\setminus } V(G')\) there is some group-vertex \(u \in V(G')\) with

    $$\begin{aligned} \varGamma _{H'}(x) \cap V(G') \subseteq \varGamma _{H'}(u) \cap V(G'). \end{aligned}$$

Next we tweak the embedding of \(\bar{G}\) slightly to remove paths \(\pi (uv)\) with the wrong parity. We do so in a way that preserves the properties (a)–(d). Let \({\mathcal {P}}_0 \subseteq \{\pi '(uv) : uv \in E(H)\}\) be the set of all paths of length 0 (mod 4), and let \({\mathcal {P}}_2 = \{\pi '(uv) : uv \in E(H)\} {\setminus } {\mathcal {P}}_0\). We want to substitute each path in \({\mathcal {P}}_0\) by a path of length 2 (mod 4). For this, let \(u'\) be the neighbour of u in the path \(\pi (uv)\). Note that the path \(\pi '(uu')\) in \(H'\) starts with a vertical or horizontal path P from \(\pi '(u)\) to \(\pi '(u')\) of length 10. We bypass the middle vertex of this path (an element-vertex) by going over two new element-vertices and one group-vertex instead. See Fig. 5 for an illustration.

To keep the notation easy we denote the newly obtained path by \(\pi ''(uv)\). Note that, after adding the bypass, the new path \(\pi ''(uv)\) is two edges longer and thus has length 2 (mod 4). We complete \(\pi ''\) to an embedding of \(\bar{G}\) by putting \(\pi ''(u) = \pi '(u)\) and \(\pi ''(vu') = \pi '(vu')\) for all \(u \in V\) and \(vu' \in E\) with \(\pi '(vu') \in {\mathcal {P}}_2\). Moreover, let us denote the changed embedding of \(\bar{G}\) by \(G''\).

Fig. 5
figure 5

Introducing a bypass. Squares are group-vertices, circles are element-vertices

We observe that the new embedding \(G''\) still satisfies the assertions (a)–(d) and, in addition, it holds that

  1. (e)

    every path connecting two vertices of degree 3 over vertices of degree 2 only has length 2 (mod 4).

Next we define the weights of the element-vertices by putting

$$\begin{aligned} {\mathbf {w}}(x) = {\left\{ \begin{array}{ll} 1, &{} \quad x \in V(G'') \\ 0, &{} \quad x \in V(H') {\setminus } V(G'') \end{array}\right. } \quad \text{ for } \text{ each } \text{ element-vertex } x\hbox { of }H'. \end{aligned}$$

Assertion (d) implies that, for any subset \({\mathcal {S}} \subseteq {\mathfrak {G}}\) of size k there is an \({\mathcal {S}}' \subseteq {\mathfrak {G}}\) of size at most k such that

  • \({\mathcal {S}}' \subseteq A \cap V(G'')\), and

  • \({\mathbf {w}}(\bigcup {\mathcal {S}}') \ge {\mathbf {w}}(\bigcup {\mathcal {S}})\).

Since \({\mathbf {w}}(u)=0\) for all elements in \(B {\setminus } V(G'')\), we may thus restrict our attention to the restricted group model \({\mathfrak {G}}'= A \cap V(G'')\) on the element set \(B\cap V(G'')\).

Slightly abusing notation, any subset \({\mathcal {S}}\subseteq {\mathfrak {G}}'\) is a vertex subset of \(I({\mathfrak {G}}')\) and \({\mathbf {w}}(\bigcup {\mathcal {S}})\) equals the number of edges of \(I({\mathfrak {G}}')\) incident to the vertex set \({\mathcal {S}}\) in \(I({\mathfrak {G}}')\). Moreover, the graph \(I({\mathfrak {G}}')\) is obtained from the graph \(\bar{G}\) by subdividing each edge an even number of times.

From Lemma 3 we know that there is some number t such that \(\tau (\bar{G}) = \tau (I({\mathfrak {G}}'))-t\). Hence, \(\bar{G}\) has a vertex cover of size k if and only if \({\mathfrak {G}}'\) has a cover of size \(k'=k+t\) of total weight \(|E(I({\mathfrak {G}}'))|\). This, in turn, is the case if and only if \({\mathfrak {G}}\) admits a cover of size \(k'\) of total weight \(|E(I({\mathfrak {G}}'))|\). Since the construction of \({\mathfrak {G}}\) can be done in polynomial time, the proof is complete. \(\square \)

MEIHT for general group structures

In this section we apply the results for structured sparsity models and for expander matrices to the group model case. The Model-Expander IHT (MEIHT) algorithm is one of the exact projection algorithms with provable guarantees for tree-sparse and loopless overlapping group-sparse models using model-expander sensing matrices [3]. In this work we show how to use the MEIHT for more general group structures. The only modification of the MEIHT algorithm is the projection on these new group structures. We show MEIHT’s guaranteed convergence and polynomial runtime.

Algorithm 3 (MEIHT; pseudocode figure not reproduced here)

Note that as in [4], we are able to do model projections with an additional sparsity constraint, i.e. projection onto \(\mathcal {P}_{{\mathfrak {G}}_{G,K}}\) defined in (11). Therefore Algorithm 3 works with an extra input K and the model projection \(\mathcal {P}_{{\mathfrak {G}}_{G}}\) becomes \(\mathcal {P}_{{\mathfrak {G}}_{G,K}}\), returning a \({\mathfrak {G}}_{G,K}\)-sparse approximation to \({\mathbf {x}}\).

The convergence analysis of MEIHT with the more general group structures considered here remains the same as for loopless overlapping group models discussed in [3]. We are able to perform the exact projection of \(\mathcal {P}_{{\mathfrak {G}}_G}\) (and \(\mathcal {P}_{{\mathfrak {G}}_{G,K}}\)) as discussed in Sect. 3.6. With the possibility of doing the projection onto the model, we present the convergence results in Corollaries 1 and 2 as corollaries to Theorem 3.1 and Corollary 3.1 in [3] respectively.

Corollary 1

Consider \({\mathfrak {G}}\) to be a group model of bounded treewidth and \({\mathcal {S}}\) to be \({\mathfrak {G}}_G\)-sparse. Let the matrix \({\mathbf {A}}\in \{0,1\}^{m\times N}\) be a model expander matrix with \(\epsilon _{{\mathfrak {G}}_{3G}} < 1/12\) and d ones per column. For any \({\mathbf {x}}\in {\mathbb {R}}^N\) and \({\mathbf {e}}\in {\mathbb {R}}^m\), the sequence of updates \({\mathbf {x}}^{(n)}\) of MEIHT with \({\mathbf {y}}= {\mathbf {A}}{\mathbf {x}}+ {\mathbf {e}}\) satisfies, for any \(n\ge 0\)

$$\begin{aligned} \Vert {\mathbf {x}}^{(n)} - {\mathbf {x}}_{{\mathcal {S}}}\Vert _1 \le \alpha ^n \Vert {\mathbf {x}}^{(0)} - {\mathbf {x}}_{{\mathcal {S}}}\Vert _1 + \left( 1-\alpha ^n\right) \beta \Vert {\mathbf {A}}{\mathbf {x}}_{{\mathcal {S}}^c} + {\mathbf {e}}\Vert _1, \end{aligned}$$
(24)

where \(\alpha = 8\epsilon _{{\mathfrak {G}}_{3G}}\left( 1-4\epsilon _{{\mathfrak {G}}_{3G}}\right) ^{-1} \in (0,1)\) and \(\beta = 4d^{-1}\left( 1 - 12\epsilon _{{\mathfrak {G}}_{3G}}\right) ^{-1} \in (0,1)\).

Note that \(\epsilon _{{\mathfrak {G}}_{3G}}\) is the expansion coefficient of the underlying \((s,d,\epsilon _{{\mathfrak {G}}_{3G}})\)-model expander graph for \({\mathbf {A}}\). This ensures that \({\mathbf {A}}\) satisfies model RIP-1 over all \({\mathfrak {G}}_{3G}\)-sparse signals.

The proof of this Corollary can be done analogously to the proof of Theorem 3.1 in [3]. It is thus skipped and the interested reader is referred to [3].

Let us define the \(\ell _1\)-error of the best \({\mathfrak {G}}_G\)-term approximation to a vector \({\mathbf {x}}\in {\mathbb {R}}^N\)

$$\begin{aligned} \sigma _{{\mathfrak {G}}_G}({\mathbf {x}})_1 = \min _{{{\mathfrak {G}}_G}\text{-sparse } ~{\mathbf {z}}}\Vert {\mathbf {x}}-{\mathbf {z}}\Vert _1. \end{aligned}$$
(25)

This is then used in the following corollary.

Corollary 2

Consider the setting of Corollary 1. After \(n = \left\lceil \log \left( \frac{\Vert {\mathbf {x}}\Vert _1}{\Vert {\mathbf {e}}\Vert _1}\right) /\log \left( \frac{1}{\alpha }\right) \right\rceil \) iterations, MEIHT returns a solution \(\widehat{{\mathbf {x}}}\) satisfying

$$\begin{aligned} \Vert \widehat{{\mathbf {x}}}- {\mathbf {x}}\Vert _1 \le c_1 \sigma _{{\mathfrak {G}}_G}({\mathbf {x}})_1 + c_2 \Vert {\mathbf {e}}\Vert _1 \end{aligned}$$
(26)

where \(c_1 = \beta d\) and \(c_2 = 1+\beta \).

Proof

Without loss of generality we initialize MEIHT with \({\mathbf {x}}^{(0)} = 0\). Upper bounding \(1-\alpha ^n\) by 1 and using triangle inequalities with some algebraic manipulations, (24) simplifies to

$$\begin{aligned} \Vert {\mathbf {x}}^{(n)} - {\mathbf {x}}\Vert _1 \le \alpha ^n \Vert {\mathbf {x}}\Vert _1 + \beta \Vert {\mathbf {A}}{\mathbf {x}}_{{\mathcal {S}}^c}\Vert _1 + \beta \Vert {\mathbf {e}}\Vert _1. \end{aligned}$$
(27)

Using the fact that \({\mathbf {A}}\) is a binary matrix with d ones per column we have \(\Vert {\mathbf {A}}{\mathbf {x}}_{{\mathcal {S}}^c}\Vert _1 \le d\Vert {\mathbf {x}}_{{\mathcal {S}}^c}\Vert _1\). We also have \(\Vert {\mathbf {e}}\Vert _1 \ge \alpha ^n \Vert {\mathbf {x}}\Vert _1\) when \(n \ge \log \left( \frac{\Vert {\mathbf {x}}\Vert _1}{\Vert {\mathbf {e}}\Vert _1}\right) /\log \left( \frac{1}{\alpha }\right) \). Applying these bounds to (27) leads to

$$\begin{aligned} \Vert {\mathbf {x}}^{(n)} - {\mathbf {x}}\Vert _1 \le \beta d\Vert {\mathbf {x}}_{{\mathcal {S}}^c}\Vert _1 + \left( 1+\beta \right) \Vert {\mathbf {e}}\Vert _1 \end{aligned}$$
(28)

for \(n = \left\lceil \log \left( \frac{\Vert {\mathbf {x}}\Vert _1}{\Vert {\mathbf {e}}\Vert _1}\right) /\log \left( \frac{1}{\alpha }\right) \right\rceil \). This is equivalent to (26) with \(c_1 = \beta d\), \(c_2 = 1+\beta \), \({\mathbf {x}}^{(n)} = \widehat{{\mathbf {x}}}\) for the given n, and \(\sigma _{{\mathfrak {G}}_G}({\mathbf {x}})_1 = \Vert {\mathbf {x}}_{{\mathcal {S}}^c}\Vert _1\) because \({\mathbf {x}}_{{\mathcal {S}}}\) is the best \({\mathfrak {G}}_G\)-term approximation to the \({\mathfrak {G}}_G\)-sparse \({\mathbf {x}}\). This completes the proof.\(\square \)
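To get a feeling for the constants in Corollaries 1 and 2, the following Python sketch evaluates \(\alpha \), \(\beta \), the iteration count n, and the constants \(c_1\) and \(c_2\) for one choice of parameters. The function names and the numerical values \(\epsilon _{{\mathfrak {G}}_{3G}} = 1/16\), \(d = 32\), \(\Vert {\mathbf {x}}\Vert _1 = 100\) and \(\Vert {\mathbf {e}}\Vert _1 = 1\) are assumptions chosen for illustration only.

```python
import math

def meiht_constants(eps, d):
    """alpha and beta as defined after (24), for an expansion coefficient eps < 1/12."""
    if not 0 < eps < 1 / 12:
        raise ValueError("Corollary 1 assumes 0 < eps < 1/12")
    alpha = 8 * eps / (1 - 4 * eps)
    beta = 4 / (d * (1 - 12 * eps))
    return alpha, beta

def iterations_needed(norm_x, norm_e, alpha):
    """Iteration count n from Corollary 2, given the l1-norms of x and e (norm_e > 0)."""
    return math.ceil(math.log(norm_x / norm_e) / math.log(1 / alpha))

d = 32
alpha, beta = meiht_constants(eps=1 / 16, d=d)
n = iterations_needed(norm_x=100.0, norm_e=1.0, alpha=alpha)
c1, c2 = beta * d, 1 + beta  # constants of (26)
print(alpha, beta, n, c1, c2)
```

With these values \(\alpha = 2/3\) and \(n = 12\), and indeed \(\alpha ^{12} \Vert {\mathbf {x}}\Vert _1 \le \Vert {\mathbf {e}}\Vert _1\), which is the inequality used in the proof above.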

The runtime complexity of MEIHT still depends on the median operation and the complexity of the projection onto the model. However, as observed in [3], the projection onto the model is the dominant operation of the algorithm. Therefore, the complexity of MEIHT is of the order of the complexity of the projection onto the model. In the case of overlapping group models with bounded treewidth, MEIHT achieves a polynomial runtime complexity as shown in Proposition 3 below. On the other hand, when the treewidth of the group model is unbounded, MEIHT can still be implemented by using the Benders’ decomposition procedure in Sect. 3.6 for the projection, which may have an exponential runtime complexity.

Proposition 3

The runtime of MEIHT is \({\mathcal {O}}((N+M)(w_T^2 5^{w_T} G^2 N) \bar{n} + (N+M)(2^{{\mathcal {O}}(w_T^3)}+w_T^2))\) for the \({\mathfrak {G}}_G\)-group-sparse model with bounded treewidth \(w_T\), where \(\bar{n}\) is the number of iterations, M is the number of groups, G is the group budget and N is the signal dimension.

Proof

Before we start the MEIHT procedure we have to calculate a nice tree decomposition of the incidence graph of the group model. This can be done in \({\mathcal {O}} ((N+M)(2^{{\mathcal {O}}(w_T^3)}+w_T^2))\) time. Then in each of the iterations of the MEIHT we have to solve the exact projection onto the model, which is the dominant operation of the MEIHT. Since the projection on the group model with bounded treewidth \(w_T\) can be done through the dynamic programming algorithm that runs in \({\mathcal {O}}((N+M)(w_T^2 5^{w_T} G^2 N))\), as proven in Sect. 3.1, this proves the result. \(\square \)

Remark 2

The convergence results above hold when MEIHT is modified appropriately to solve the standard K-sparse and G-group-sparse problem with groups having bounded treewidth, where the projection becomes \(\mathcal {P}_{{\mathfrak {G}}_{G,K}}\). However, in this case the runtime complexity in each iteration grows by a factor of \({\mathcal {O}}(K)\), as indicated in Remark 1.

Exact projection for general group models

In this section we consider the most general version of group models, i.e. \(\mathfrak {G} = \{{\mathcal {G}}_1,\ldots ,{\mathcal {G}}_M\}\) is an arbitrary set of groups and \(G\in [M]\) and \(K\in [N]\) are given budgets. We study the structured sparsity model \({\mathfrak {G}}_{G,K}\) introduced in Sect. 2.4. Here, in addition to the number G of groups that can be selected, we bound the number of indices to be selected in these groups by K (i.e. we consider a group-sparse model with an additional standard sparsity constraint). Note that setting \(K = N\) reduces this model \({\mathfrak {G}}_{G,K}\) to the general group model \({\mathfrak {G}}_{G}\).

If we want to apply exact projection recovery algorithms like the Model-IHT and MEIHT to group models, the model projection problem has to be solved in each iteration, i.e. for a given signal \({\mathbf {x}}\in {\mathbb {R}}^N\) we have to find the closest signal \(\hat{{\mathbf {x}}}\) which has support in the model \({\mathfrak {G}}_{G,K}\). In this section we will derive an efficient procedure based on the idea of Benders’ decomposition to solve the projection problem. This procedure is analysed and implemented in Sect. 5.

It has been proved that the group model projection problem for group models without a sparsity condition on the support is NP-hard [4]. Therefore the projection problem on the more general model \({\mathfrak {G}}_{G,K}\) is NP-hard as well. The latter problem can be reformulated by the integer programming formulation

$$\begin{aligned} \begin{aligned} \max&\quad \mathbf{w}^\top {\mathbf {u}}\\ s.t.&\quad \sum _{i=1}^{N} u_i \le K \\&\quad \sum _{j=1}^{M} v_j \le G \\&\quad u_i\le \sum _{j:i\in {\mathcal {G}}_j} v_j \quad \forall i=1,\ldots , N \\&\quad {\mathbf {u}}\in \{ 0,1\}^N, \ {\mathbf {v}}\in \{ 0,1\}^M. \end{aligned} \end{aligned}$$
(29)

Here \({\mathbf {w}}\) are the squared entries of the signal, the \({\mathbf {v}}\)-variables represent the groups and the \({\mathbf {u}}\)-variables represent the elements which are selected. Note that by choosing \(K=N\) we obtain the projection problem for classical group models \({\mathfrak {G}}_G\).
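For very small instances, Problem (29) can be solved by plain enumeration, which is useful as a reference when testing faster methods. The following Python sketch is such a reference solver under stated assumptions: groups are given as Python sets of 0-based indices, and the function name `project_bruteforce` is hypothetical. It is exponential in the number of groups and is not the procedure proposed in this section.

```python
from itertools import combinations

def project_bruteforce(w, groups, G, K):
    """Solve (29) by enumeration: pick at most G groups, keep the K largest covered weights."""
    best_value, best_groups, best_support = 0.0, (), []
    for size in range(1, G + 1):
        for chosen in combinations(range(len(groups)), size):
            covered = set().union(*(groups[j] for j in chosen))
            # Weights are non-negative, so taking the K largest covered entries is optimal.
            support = sorted(covered, key=lambda i: w[i], reverse=True)[:K]
            value = sum(w[i] for i in support)
            if value > best_value:
                best_value, best_groups, best_support = value, chosen, sorted(support)
    return best_value, best_groups, best_support

x = [3.0, -1.0, 2.0, 0.5, -2.0]
w = [xi ** 2 for xi in x]                      # squared signal entries
groups = [{0, 1}, {1, 2, 3}, {3, 4}, {2, 4}]
print(project_bruteforce(w, groups, G=2, K=3))  # (17.0, (0, 3), [0, 2, 4])
```

In this example the groups with indices 0 and 3 cover the indices 0, 1, 2 and 4, and the three largest weights among them sum to 17.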

To derive an efficient algorithm for the projection problem we use the concept of Benders’ decomposition, which was already studied in [7, 24]. The idea of this approach is to decompose Problem (29) into a master problem and a subproblem. Then the subproblem is used iteratively to derive feasible inequalities for the master problem until no feasible inequality can be found any more. This procedure has been applied to Problem (29) without the sparsity constraint on the \({\mathbf {u}}\)-variables in [17]. The following results for the more general Problem (29) are based on the idea of Benders’ decomposition and extend the results in [17].

First we can relax the \({\mathbf {u}}\)-variables in the latter formulation without changing the optimal value, i.e. we may assume \({\mathbf {u}}\in [0,1]^N\). We can now reformulate (29) by

$$\begin{aligned} \begin{aligned} \max&\quad \mu \\ s.t. \quad&\quad \mu \le \max _{{\mathbf {u}}\in P({\mathbf {v}})} \mathbf{w^\top {\mathbf {u}}} \\&\quad \sum _{j=1}^{M} v_j \le G \\&\quad {\mathbf {v}}\in \{ 0,1\}^M \end{aligned} \end{aligned}$$
(30)

where \(P({\mathbf {v}})=\{{\mathbf {u}}\in [0,1]^N \ : \ \sum _{i=1}^{N} u_i \le K, \ u_i\le \sum _{j:i\in {\mathcal {G}}_j} v_j, \ i=1,\ldots , N\}\). Replacing the linear problem \(\max _{{\mathbf {u}}\in P({\mathbf {v}})} \mathbf{w^\top {\mathbf {u}}}\) in (30) by its dual formulation, we obtain

$$\begin{aligned} \begin{aligned} \max&\quad \mu \\ s.t. \quad&\quad \mu \le \min _{\alpha ,{\varvec{\beta }},{\varvec{\gamma }}\in P_D} \alpha K + \sum _{i=1}^{N}\beta _i\left( \sum _{j:i\in {\mathcal {G}}_j} v_j \right) + \sum _{i=1}^{N} \gamma _i \\&\quad \sum _{j=1}^{M} v_j \le G \\&\quad {\mathbf {v}}\in \{ 0,1\}^M \end{aligned} \end{aligned}$$
(31)

where \(P_D=\{\alpha ,{\varvec{\beta }},{\varvec{\gamma }}\ge 0 \ : \ \alpha + \beta _i + \gamma _i\ge w_i \ i=1,\ldots , N\}\) is the feasible set of the dual problem. Since \(P_D\) is a polyhedron and the minimum in (31) exists, the first constraint in (31) holds if and only if it holds for each vertex \((\alpha ^l,{\varvec{\beta }}^l,{\varvec{\gamma }}^l)\) of \(P_D\). Therefore Problem (31) can be reformulated by

$$\begin{aligned} \begin{aligned} \max&\quad \mu \\ s.t. \quad&\quad \mu \le \alpha ^l K + \sum _{i=1}^{N}\beta _i^l\left( \sum _{j:i\in {\mathcal {G}}_j} v_j \right) + \sum _{i=1}^{N} \gamma _i^l \quad l=1,\ldots ,t \\&\quad \sum _{j=1}^{M} v_j \le G \\&\quad {\mathbf {v}}\in \{ 0,1\}^M \end{aligned} \end{aligned}$$
(32)

where \((\alpha ^1,{\varvec{\beta }}^1,{\varvec{\gamma }}^1),\ldots ,(\alpha ^t,{\varvec{\beta }}^t,{\varvec{\gamma }}^t)\) are the vertices of \(P_D\). Each of the constraints

$$\begin{aligned} \mu \le \alpha ^l K + \sum _{i=1}^{N}\beta _i^l\left( \sum _{j:i\in {\mathcal {G}}_j} v_j \right) + \sum _{i=1}^{N} \gamma _i^l \end{aligned}$$

for \(l=1,\ldots ,t\) is called an optimality cut.

The idea of Benders’ decomposition is, starting from Problem (32) containing no optimality cut (called the master problem), to iteratively calculate the optimal \(({{\mathbf {v}}}^*,\mu ^*)\) and then find an optimality cut which cuts off the latter optimal solution. In each step the most violated optimality cut is determined by solving

$$\begin{aligned} \begin{aligned} \min&\quad \alpha K + \sum _{i=1}^{N}\beta _i\left( \sum _{j:i\in {\mathcal {G}}_j} v_j^* \right) + \sum _{i=1}^{N} \gamma _i \\ s.t. \quad&\quad \alpha + \beta _i + \gamma _i\ge w_i \ \ i=1,\ldots , N \\&\quad \alpha ,{\varvec{\beta }},{\varvec{\gamma }}\ge 0, \end{aligned} \end{aligned}$$
(33)

for the current optimal solution \({{\mathbf {v}}}^*\). If the optimal solution fulfills

$$\begin{aligned} \mu ^* > \alpha ^* K + \sum _{i=1}^{N}\beta _i^*\left( \sum _{j:i\in {\mathcal {G}}_j} v_j^* \right) + \sum _{i=1}^{N} \gamma _i^*, \end{aligned}$$

then the optimality cut related to the optimal vertex \((\alpha ^*,\varvec{\beta }^*,\varvec{\gamma }^*)\) is added to the master problem. This procedure is iterated until the latter inequality is not true any more. The last optimal \({\mathbf {v}}^*\) must then be optimal for Problem (29) since the first constraint in (31) is then true for \({\mathbf {v}}^*\).
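The cut loop described above can be sketched in Python as follows. This is a toy version under stated assumptions: the master problem (32) is solved by enumerating all selections of at most G groups instead of calling an integer programming solver, the subproblem (33) is solved with the closed-form solution stated in Lemma 4 (with \(\alpha ^* = 0\) when the maximum is taken over an empty set), and the names `subproblem` and `benders` are hypothetical.

```python
from itertools import combinations

def subproblem(w, groups, v, K):
    """Dual solution of Lemma 4 for the selected group indices v, returned as a cut.

    The cut reads mu <= const + sum of coef[j] over the selected groups j.
    """
    covered = set().union(*(groups[j] for j in v))
    ranked = sorted(covered, key=lambda i: w[i], reverse=True)
    top, rest = ranked[:K], ranked[K:]
    alpha = max((w[i] for i in rest), default=0.0)   # assumption: 0 if rest is empty
    beta = [0.0 if i in covered else w[i] for i in range(len(w))]
    const = alpha * K + sum(w[i] - alpha for i in top)  # alpha * K + sum of gamma_i
    coef = [sum(beta[i] for i in g) for g in groups]
    # beta vanishes on covered indices, so the cut evaluates to const at v itself.
    return const, (const, coef)

def benders(w, groups, G, K, tol=1e-9):
    """Add optimality cuts until the master solution satisfies the newest cut."""
    M = len(groups)
    candidates = [c for size in range(G + 1) for c in combinations(range(M), size)]
    cuts = []

    def bound(v):
        # Largest mu allowed for v by the cuts found so far (infinite without cuts).
        return min((const + sum(coef[j] for j in v) for const, coef in cuts),
                   default=float("inf"))

    while True:
        v_star = max(candidates, key=bound)          # master problem by enumeration
        mu_star = bound(v_star)
        value, cut = subproblem(w, groups, v_star, K)
        if mu_star <= value + tol:                   # no violated optimality cut
            return value, v_star
        cuts.append(cut)

w = [9.0, 1.0, 4.0, 0.25, 4.0]
groups = [{0, 1}, {1, 2, 3}, {3, 4}, {2, 4}]
print(benders(w, groups, G=2, K=3))  # (17.0, (0, 3))
```

On this instance the loop stops after two cuts. In practice the enumeration in the master step would be replaced by a solver for the integer program (32), while the subproblem step stays as cheap as shown here.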

If we use the latter Benders’ decomposition approach it is desired to use fast algorithms for the master problem (32) and the subproblem (33) in each iteration. By the following lemma an optimal solution of the subproblem can be easily calculated.

Lemma 4

For a given solution \({\mathbf {v}}\in \{ 0,1\}^M\) we define \(I_{{\mathbf {v}}}:=\left\{ i=1,\ldots ,N\ : \ \sum _{j:i\in {\mathcal {G}}_j}v_j > 0\right\} \) and let \(I_{{\mathbf {v}}}^K\) be the set of indices of the K largest values \(w_i\) for \(i\in I_{{\mathbf {v}}}\). An optimal solution of Problem (33) is then given by \((\alpha ^*,\varvec{\beta }^*,\varvec{\gamma }^*)\) where \(\alpha ^* = \max _{i\in I_{{\mathbf {v}}}{\setminus } I_{{\mathbf {v}}}^K} w_i\) and

$$\begin{aligned} (\beta _i^*,\gamma _i^*) = {\left\{ \begin{array}{ll} (w_i,0) &{} \quad \text { if } i\in [N]{\setminus } I_{{\mathbf {v}}}\\ (0,0) &{} \quad \text { if } i\in I_{\mathbf {v}}{\setminus } I_{{\mathbf {v}}}^K \\ (0, w_i-\alpha ^*) &{} \quad \text { if } i\in I_{{\mathbf {v}}}^K \\ \end{array}\right. } \end{aligned}$$

Proof

Note that for a given \({{\mathbf {v}}}^*\) the dual problem of subproblem (33) is

$$\begin{aligned} \begin{aligned} \max&\quad \mathbf{w}^\top {{\mathbf {u}}}\\ s.t. \quad&\quad \sum _{i=1}^{N} u_i \le K \\&\quad u_i\le \sum _{j:i\in {\mathcal {G}}_j} v_j^* \ \ i=1,\ldots , N\\&\quad {{\mathbf {u}}}\in [0,1]^N. \end{aligned} \end{aligned}$$
(34)

It is easy to see that there exists an optimal solution \({\mathbf {u}}^*\) of Problem (34) where \(u_i^*=1\) if \(i\in I_{{\mathbf {v}}}^K \) and \(u_i^*=0\) otherwise. We will use the complementary slackness conditions

$$\begin{aligned} \left( \sum _{j:i\in {\mathcal {G}}_j} v_j^* - u_i^*\right) \beta _i = 0 \text { and } \left( 1 - u_i^*\right) \gamma _i = 0 \ i=1,\ldots ,N, \end{aligned}$$
(35)

to derive the optimal values for \(\alpha ,\varvec{\beta },\varvec{\gamma }\). We obtain the following 4 cases:

  • Case 1 If \(u_i^*=0\) and \(\sum _{j:i\in {\mathcal {G}}_j} v_j^*>0\), i.e. \(i\in I_{{\mathbf {v}}}{\setminus } I_{{\mathbf {v}}}^K\), then by conditions (35) we have \(\beta _i=\gamma _i=0\). To ensure the constraint \(\alpha + \beta _i + \gamma _i\ge w_i\), the value of \(\alpha \) must be at least \(\max _{i\in I_{{\mathbf {v}}}{\setminus } I_{{\mathbf {v}}}^K} w_i\).

  • Case 2 If \(u_i^*=0\) and \(\sum _{j:i\in {\mathcal {G}}_j} v_j^*=0\), then \(i\in [N]{\setminus } I_{{\mathbf {v}}}\) and the objective coefficient of \(\beta _i\) in Problem (33) is 0. Hence we can increase \(\beta _i\) arbitrarily without changing the objective value, so we set \(\beta _i=w_i\) to satisfy the i-th constraint \(\alpha + \beta _i + \gamma _i\ge w_i\) and set \(\gamma _i=0\).

  • Case 3 If \(u_i^*=1\), i.e. \(i\in I_{{\mathbf {v}}}^K\), and \(\sum _{j:i\in {\mathcal {G}}_j} v_j^*>1\) then by condition (35) \(\beta _i=0\). Therefore to ensure the i-th constraint \(\alpha + \beta _i + \gamma _i\ge w_i\) the value of \(\gamma _i\) must be at least \(w_i-\alpha \) and since we minimize \(\gamma _i\) in the objective function the latter holds with equality.

  • Case 4 If \(u_i^*=1\), i.e. \(i\in I_{{\mathbf {v}}}^K\), and \(\sum _{j:i\in {\mathcal {G}}_j} v_j^*=1\) then we cannot use condition (35) to derive the values for \(\beta _i\) and \(\gamma _i\). Nevertheless in this case both variables have an objective coefficient of 1 while \(\alpha \) has an objective coefficient of K. By increasing \(\alpha \) by 1 the objective value increases by K. In Cases 1 and 2 nothing changes, while for each index in Case 3 we could decrease \(\gamma _i\) by 1 to remain feasible. But since at most K indices i fulfil the conditions of Case 3 we cannot improve our objective value by this strategy. Therefore \(\alpha \) has to be selected as small as possible in Case 4, i.e. by Case 1 we set \(\alpha = \max _{i\in I_{{\mathbf {v}}}{\setminus } I_{{\mathbf {v}}}^K} w_i\) and to ensure feasibility we set \(\gamma _i=w_i-\alpha \). This concludes the proof.\(\square \)

Theorem 3

An optimal solution of subproblem (33) can be calculated in time \({\mathcal {O}} (NM|G_\text {max}|)\).

Proof

The set \(I_{{\mathbf {v}}}\) can be calculated in time \({\mathcal {O}}(NM|G_\text {max}|)\) by going through all groups for each index \(i\in [N]\) and checking if the index is contained in one of the groups. To obtain \(I_{{\mathbf {v}}}^K\) we have to find the K-th largest element in \(\left\{ w_i \ : \ i\in I_{{\mathbf {v}}}\right\} \), which can be done in time \({\mathcal {O}}(N)\); see [46]. Afterwards we select all values which are larger than the K-th largest element, which can be done in \({\mathcal {O}} (N)\). Assigning all values to \((\alpha ,\varvec{\beta },\varvec{\gamma })\) can also be done in time \({\mathcal {O}} (N)\). \(\square \)
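The closed-form solution of Lemma 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: indices are 0-based, the weights are assumed nonnegative, the function names are hypothetical, \(\alpha \) is set to 0 when \(I_{{\mathbf {v}}}{\setminus } I_{{\mathbf {v}}}^K\) is empty, and a full sort is used in place of the linear-time selection of [46].

```python
def subproblem_solution(w, groups, v, K):
    """Closed-form (alpha, beta, gamma) for a fixed 0/1 group selection v.

    w      : nonnegative weights, one per index 0..N-1
    groups : list of index collections, one per group
    v      : 0/1 list, v[j] = 1 if group j is selected
    K      : sparsity budget
    """
    N = len(w)
    # I_v: indices covered by at least one selected group.
    covered = set()
    for j, g in enumerate(groups):
        if v[j]:
            covered.update(g)
    # I_v^K: the K covered indices of largest weight (sorting: O(N log N)).
    ranked = sorted(covered, key=lambda i: w[i], reverse=True)
    top, rest = set(ranked[:K]), ranked[K:]
    # Assumption: alpha = 0 if no covered index is left outside I_v^K.
    alpha = max((w[i] for i in rest), default=0.0)
    beta = [0.0] * N
    gamma = [0.0] * N
    for i in range(N):
        if i not in covered:
            beta[i] = w[i]            # i in [N] \ I_v
        elif i in top:
            gamma[i] = w[i] - alpha   # i in I_v^K
    return alpha, beta, gamma


def subproblem_objective(alpha, beta, gamma, groups, v, K):
    """alpha*K + sum_i beta_i * (number of selected groups containing i) + sum_i gamma_i."""
    cover_count = [0] * len(beta)
    for j, g in enumerate(groups):
        if v[j]:
            for i in g:
                cover_count[i] += 1
    return alpha * K + sum(b * c for b, c in zip(beta, cover_count)) + sum(gamma)
```

By strong duality, the objective value returned for a given \({\mathbf {v}}\) should coincide with the total weight of the K heaviest covered indices, which gives a simple sanity check.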

The following theorem states that the master problem can be solved in pseudopolynomial time if the number of constraints is fixed. Note, however, that the number of iterations of the procedure described above may be exponential in N and M.

Theorem 4

Problem (32) with t constraints can be solved in \({\mathcal {O}} (MG(NW)^t)\) where \(W=\max _{i\in [N]}w_i\).

Proof

Problem (32) with t constraints can be reformulated by

$$\begin{aligned} \begin{aligned} \max _{{{\mathbf {v}}}\in \left\{ 0,1\right\} ^M}&\min _{l=1,\ldots ,t} \alpha ^l K + \sum _{i=1}^{N}\beta _i^l\left( \sum _{j:i\in {\mathcal {G}}_j} v_j \right) + \sum _{i=1}^{N} \gamma _i^l \\ s.t. \quad&\sum _{j=1}^{M} v_j \le G. \end{aligned} \end{aligned}$$
(36)

The latter problem is a special case of the robust knapsack problem with discrete uncertainty (sometimes called robust selection problem) with an additional uncertain constant. In [12] the authors mention that the problem with an uncertain constant is equivalent to the problem without such a constant. Furthermore using the result in [41] Problem (36) can be solved in \({\mathcal {O}} (MGC^t)\) where

$$\begin{aligned}C:=\max _{l=1,\ldots , t} \alpha ^l + \sum _{i=1}^{N}\beta _i^l + \sum _{i=1}^{N}\gamma _i^l.\end{aligned}$$

Since for all solutions \((\alpha ^l,\varvec{\beta }^l,\varvec{\gamma }^l)\) generated in Lemma 4 it holds that \(\alpha ^l, \beta _i^l, \gamma _i^l \le \max _{i\in [N]} w_i\) we have \(C\le (2N+1)W\) which proves the result.\(\square \)
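To make the max-min structure of Problem (36) concrete, the following Python sketch evaluates the t cuts for every selection of at most G groups and keeps the best one. It is a brute-force enumeration intended only for tiny instances; it is not the pseudopolynomial algorithm of [41], and the function names and the 0-based indexing are assumptions made for illustration.

```python
from itertools import combinations


def cut_value(cut, cover_count, K):
    """Value of one cut (alpha, beta, gamma) for given cover counts."""
    alpha, beta, gamma = cut
    return alpha * K + sum(b * c for b, c in zip(beta, cover_count)) + sum(gamma)


def master_brute_force(cuts, groups, N, G, K):
    """Maximize the minimum cut value over all selections of at most G groups."""
    M = len(groups)
    best_v, best_val = None, float("-inf")
    for size in range(G + 1):
        for chosen in combinations(range(M), size):
            # cover_count[i] = number of chosen groups containing index i
            cover_count = [0] * N
            for j in chosen:
                for i in groups[j]:
                    cover_count[i] += 1
            val = min(cut_value(c, cover_count, K) for c in cuts)
            if val > best_val:
                best_v = [1 if j in chosen else 0 for j in range(M)]
                best_val = val
    return best_v, best_val
```

Each additional cut can only lower the optimal value of the master problem, so the values returned over successive iterations form a non-increasing sequence of upper bounds on the optimal projection value.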

Algorithms with approximation projection oracles

As mentioned in the previous sections solving the group model projection problem is NP-hard in general. Therefore to use classical algorithms such as the Model-IHT or the MEIHT we have to solve an NP-hard problem in each iteration. To tackle problems like this the authors in [27, 28] introduced an algorithm called Approximate Model-IHT (AM-IHT) which is based on the idea of the classical IHT but which does not require an exact projection oracle (see Sect. 2.5). Instead the authors show that under certain assumptions on the measurement matrix a signal can be recovered by just using two approximation variants of the projection problem which they call head- and tail-approximation (for further details again see Sect. 2.5).

In this section we apply the latter results to group models of bounded frequency: group models where the maximum number of groups an element is contained in is bounded by some number f. Note that from Theorem 2 we know that group model projection is NP-hard for group models of frequency 4. A particularly interesting case of such group structures is the graphic case, where each element is contained in at most two groups. Understanding this case was proposed as an open problem by Baldassarre et al. [4].

Furthermore we apply the theoretical results derived in [27, 28] to group models and show that the number of required measurements compared to the classical structured sparsity case increases by just a constant factor. In Sect. 5 we will computationally compare the AM-IHT to the Model-IHT and the MEIHT on interval groups.

Head- and tail-approximations for group models with low frequency

In this section we derive head- and tail-approximations for the case of group models with bounded frequency. We first recall the definition of head- and tail-approximations for the case of group models. Assume we are given a group model \({\mathfrak {G}}\) together with \(G\in {\mathbb {N}}\).

Given a vector \({\mathbf {x}}\), let \({\mathcal {H}}\) be an algorithm that computes a vector \({\mathcal {H}}({\mathbf {x}})\) with support in \({\mathfrak {G}}_{G'}\) for some \(G'\in {\mathbb {N}}\). Then, given some \(\alpha \in {\mathbb {R}}\) (typically \(\alpha < 1\)) we say that \({\mathcal {H}}\) is an \((\alpha ,G,G',p)\)-head approximation if

$$\begin{aligned} \Vert {\mathcal {H}}({\mathbf {x}}) \Vert _p \ge \alpha \cdot \Vert {\mathbf {x}}_{{\mathcal {S}}}\Vert _p \text{ for } \text{ all } {\mathcal {S}} \in {\mathfrak {G}}_G. \end{aligned}$$
(37)

In other words, \({\mathcal {H}}\) uses \(G'\) many groups to cover at least an \(\alpha \)-fraction of the maximum total weight covered by G groups. Note that \(G'\) can be chosen larger than G.

Moreover, let \({\mathcal {T}}\) be an algorithm which computes a vector \({\mathcal {T}}({\mathbf {x}})\) with support in \({\mathfrak {G}}_{G'}\). Given some \(\beta \in {\mathbb {R}}\) (typically \(\beta > 1\)) we say that \({\mathcal {T}}\) is a \((\beta ,G,G',p)\)-tail approximation if

$$\begin{aligned} \Vert {\mathbf {x}}- {\mathcal {T}}({\mathbf {x}}) \Vert _p \le \beta \cdot \Vert {\mathbf {x}}- {\mathbf {x}}_{{\mathcal {S}}}\Vert _p \text{ for } \text{ all } {\mathcal {S}} \in {\mathfrak {G}}_G. \end{aligned}$$
(38)

This means that \({\mathcal {T}}\) may use \(G'\) many groups to leave uncovered at most \(\beta \) times the minimum total weight left uncovered by G groups.
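For small instances and \(p=1\), conditions (37) and (38) can be checked directly by enumerating all selections of at most G groups. The following Python sketch does this; the helper names are hypothetical, indices are 0-based, and the candidate output of \({\mathcal {H}}\) or \({\mathcal {T}}\) is assumed to be the restriction of \({\mathbf {x}}\) to a given support, so its \(\ell _1\)-norm is the weight of that support.

```python
from itertools import combinations


def l1_on(x, support):
    """l1-norm of x restricted to the given index set."""
    return sum(abs(x[i]) for i in support)


def best_cover_weight(x, groups, G):
    """Maximum l1-weight covered by at most G groups (brute force)."""
    best = 0.0
    for size in range(1, G + 1):
        for chosen in combinations(groups, size):
            best = max(best, l1_on(x, set().union(*chosen)))
    return best


def is_head_approx(x, groups, G, support, alpha):
    """Condition (37) for p = 1 with H(x) = x restricted to support."""
    return l1_on(x, support) >= alpha * best_cover_weight(x, groups, G)


def is_tail_approx(x, groups, G, support, beta):
    """Condition (38) for p = 1 with T(x) = x restricted to support."""
    total = sum(abs(xi) for xi in x)
    best_tail = total - best_cover_weight(x, groups, G)
    return total - l1_on(x, support) <= beta * best_tail
```

The enumeration is exponential in G, so this is only meant as a way to validate approximation routines on toy examples.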

In the following we derive the head- and tail-approximation just for the case \(p=1\). Note that equivalent approximation procedures can be easily derived for the case \(p=2\) by replacing the accuracies \(\alpha \) and \(\beta \) by \(\sqrt{\alpha }\) and \(\sqrt{\beta }\) and using the weights \(x_i^2\) instead of \(|x_i|\) in the latter proofs. We will first present a result which implies the existence of a head approximation.

Theorem 5

(Hochbaum and Pathria [30]) For each \(\epsilon > 0\) there exists an \(((1-\epsilon ),G, \lceil G \log _2 (1/\epsilon ) \rceil ,1)\)-head approximation running in polynomial time.

The algorithm derived in [30] was designed to solve the Maximum \(G\)-Coverage problem and is based on a simple greedy method. The idea is to iteratively select the group which covers the largest amount of uncovered weight. It is proven by the authors that if you are allowed to select enough groups, namely \(\lceil G \log _2 (1/\epsilon )\rceil \), then the optimal value is approximated up to an accuracy of \((1-\epsilon )\). The greedy procedure is given in Algorithm 4. Note that for a given signal \({\mathbf {x}}\) and a group \({\mathcal {G}}\in {\mathfrak {G}}\) we define

$$\begin{aligned} w({\mathcal {G}} ):= \sum _{i\in {\mathcal {G}}} |x_i|. \end{aligned}$$
[Algorithm 4: greedy head-approximation procedure]
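The greedy idea described above can be sketched in Python as follows. This is an illustrative sketch of the general greedy coverage scheme rather than a transcription of Algorithm 4; the function name, the 0-based indexing and the early stop once no group adds uncovered weight are assumptions.

```python
import math


def greedy_head_approximation(x, groups, G, eps):
    """Greedily pick up to ceil(G * log2(1/eps)) groups by uncovered l1-weight.

    Returns the list of chosen group positions and the set of covered indices.
    """
    budget = math.ceil(G * math.log2(1.0 / eps))
    covered = set()
    chosen = []
    for _ in range(min(budget, len(groups))):
        best_j, best_gain = None, 0.0
        for j, g in enumerate(groups):
            if j in chosen:
                continue
            # Weight w(G_j) restricted to the indices not yet covered.
            gain = sum(abs(x[i]) for i in set(g) - covered)
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:
            break  # no remaining group covers any additional weight
        chosen.append(best_j)
        covered.update(groups[best_j])
    return chosen, covered
```

For example, with `x = [5, -1, 4, 2, -3]`, `groups = [[0, 1], [1, 2, 3], [4]]`, `G = 1` and `eps = 0.25`, the budget is two groups; the sketch first picks the second group (uncovered weight 7) and then the first one (remaining uncovered weight 5).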

Next we derive a tail-approximation for our problems based on the idea of LP rounding. Note that in contrast to the head-approximation, the run-time bound of the following tail-approximation depends on the frequency of the group model.

Theorem 6

Suppose the frequency of the group model is f. For any \(\epsilon > 0\) and \(\kappa = (1+\epsilon ^{-1}) f\) there exists an \(((1+\epsilon ),G,\kappa G,1)\)-tail approximation running in polynomial time.

Proof

Given a signal \({\mathbf {x}}\in {\mathbb {R}}^N\), we define \({\mathbf {w}}= (|x_i|)_{i \in [N]}\). We consider the following linear relaxation of the group model projection problem

$$\begin{aligned} \begin{aligned} \max&\quad {\mathbf {w}}^\top {\mathbf {u}}\\ s.t.&\quad \sum _{i \in [M]} v_i = G \\&\quad u_j \le \sum _{i: j \in {\mathcal {G}}_i} v_i \quad \text{ for } \text{ all } j \in [N] \\&\quad {\mathbf {u}}\in [0,1]^N, \ {\mathbf {v}}\in [0,1]^M. \end{aligned} \end{aligned}$$
(39)

Consider an optimal solution \(({\mathbf {u}},{\mathbf {v}})\) of (39). We compute a group cover \({\mathcal {S}}\subseteq {\mathfrak {G}}\) by

$$\begin{aligned} {\mathcal {S}}= \{ {\mathcal {G}}_i \in {\mathfrak {G}}: v_i \ge \kappa ^{-1}\}. \end{aligned}$$

Note that \({\mathcal {S}}\) contains at most \(\kappa G\) many groups, since

$$\begin{aligned} |{\mathcal {S}}| \le \kappa \sum _{i=1}^M v_i = \kappa G. \end{aligned}$$

It remains to show that \({\mathcal {S}}\) is a tail approximation. To this end let R be the set of indices only barely covered by \({\mathbf {v}}\), i.e.

$$\begin{aligned} R = \left\{ j \in [N] : \sum _{i: j \in {\mathcal {G}}_i} v_i \le (1+\epsilon ^{-1})^{-1}\right\} . \end{aligned}$$

Note that

$$\begin{aligned} {\mathcal {S}} \text{ covers } \text{ every } \text{ element } j \in [N] {\setminus } R, \end{aligned}$$
(40)

since \(j \notin R\) implies

$$\begin{aligned}\sum _{i: j \in {\mathcal {G}}_i} v_i > (1+\epsilon ^{-1})^{-1}\end{aligned}$$

and hence, because j is contained in at most f groups,

$$\begin{aligned}v_i \ge \tfrac{(1+\epsilon ^{-1})^{-1}}{f} = \kappa ^{-1}\end{aligned}$$

for some i with \(j \in {\mathcal {G}}_i\). Moreover, note that for each \(j \in R\)

$$\begin{aligned} u_j \le \sum _{i: j \in {\mathcal {G}}_i} v_i \le (1+\epsilon ^{-1})^{-1} \end{aligned}$$

and hence

$$\begin{aligned} \frac{1-u_j}{1-(1+\epsilon ^{-1})^{-1}} \ge 1 \text{ for } \text{ each } j \in R. \end{aligned}$$
(41)

We obtain the inequalities

$$\begin{aligned} \Vert x - x_{\bigcup {\mathcal {S}}}\Vert _1 = \sum _{i\in [N] {\setminus } \bigcup {\mathcal {S}}} w_i \le \sum _{i\in R} w_i&\le \sum _{j \in R} \frac{1-u_j}{1-(1+\epsilon ^{-1})^{-1}} w_j \\&\le (1+\epsilon ) \sum _{j \in R} w_j (1-u_j) \\&\le (1+\epsilon ) ({\mathbf {w}}^\top \varvec{1} - {\mathbf {w}}^\top {\mathbf {u}}), \end{aligned}$$

where we used (40) and (41). Since \({\mathbf {u}}\) is an optimal solution of the relaxed problem (39), we have

$$\begin{aligned} (1+\epsilon ) ({\mathbf {w}}^\top \varvec{1} - {\mathbf {w}}^\top {\mathbf {u}}) \le (1+\epsilon ) ({\mathbf {w}}^\top \varvec{1} - {\mathbf {w}}^\top {\mathbf {u}}^*) = (1+\epsilon )\Vert x-x_{{\mathcal {S}}^*}\Vert _1 \end{aligned}$$

where \({\mathcal {S}}^*\) is an optimal solution of the group model projection problem and \({\mathbf {u}}^*\) the corresponding optimal vector of Problem (12). Therefore the latter procedure is a tail-approximation which completes the proof.\(\square \)
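The rounding step in the proof of Theorem 6 is straightforward to sketch in Python. The sketch assumes that a fractional optimal solution \({\mathbf {v}}\) of the relaxation (39) has already been computed by some LP solver and is passed in as a list; the function names and the 0-based indexing are illustrative assumptions.

```python
def round_lp_cover(v_frac, eps, f):
    """Select all groups i with v_i >= 1/kappa, where kappa = (1 + 1/eps) * f."""
    kappa = (1.0 + 1.0 / eps) * f
    return [i for i, vi in enumerate(v_frac) if vi >= 1.0 / kappa]


def tail_l1(x, groups, selected):
    """l1-weight of x left uncovered by the selected groups."""
    covered = set()
    for i in selected:
        covered.update(groups[i])
    return sum(abs(xj) for j, xj in enumerate(x) if j not in covered)
```

Since every selected group has \(v_i \ge \kappa ^{-1}\) and the \(v_i\) sum to G, the returned selection contains at most \(\kappa G\) groups, matching the bound on \(|{\mathcal {S}}|\) in the proof.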

AM-IHT and AM-MEIHT for group models

As in the previous sections for a sensing matrix \({\mathbf {A}}\) and the true signal \({\mathbf {x}}\) we have a noisy measurement vector \({\mathbf {y}}= {\mathbf {A}}{\mathbf {x}} + {\mathbf {e}}\) for some noise vector \({\mathbf {e}}\). The task consists in recovering the original signal \({\mathbf {x}}\), or a vector close to \({\mathbf {x}}\). Furthermore we are given a group model \({\mathfrak {G}}\) together with \(G\in {\mathbb {N}}\) with frequency \(f\in [N]\).

In the last section we derived polynomial time algorithms for an \(((1-\epsilon ),G, \lceil G \log _2 (1/\epsilon ) \rceil ,2)\)-head approximation and an \(((1+\epsilon ),G,(1+\epsilon ^{-1}) f G,2)\)-tail approximation. Note that we can use any G here. Using the results in Sect. 2.5, we obtain convergence of the AM-IHT for group models if \({\mathcal {T}}\) is an \(((1+\epsilon ),G,G_T,2)\)-tail approximation, \({\mathcal {H}}\) is an \(((1-\epsilon ),G_T+G, G_H,2)\)-head approximation, where \(G_T:=(1+\epsilon ^{-1}) f G\) and \(G_H:= \lceil (G_T+G) \log _2 (1/\epsilon )\rceil \), and the sensing matrix \({\mathbf {A}}\) has \({\mathfrak {G}}_{\tilde{G}}\)-RIP with \(\tilde{G} = G+G_T+G_H\). Note that \(\tilde{G}\in {\mathcal {O}} (G)\) for fixed accuracy \(\epsilon >0\) and frequency f. Furthermore \(|{\mathfrak {G}}_{\tilde{G}}|\in {\mathcal {O}} (M^{cG})\) for a constant c. Using the bound (20) we obtain that the number of required measurements for a sub-Gaussian random matrix \({\mathbf {A}}\) having the \({\mathfrak {G}}_{\tilde{G}}\)-RIP with high probability is

$$\begin{aligned} m={\mathcal {O}}\left( \delta ^{-2} Gg_{\text {max}} \log (\delta ^{-1}) + cG\log (M) \right) \end{aligned}$$

which differs by just a constant factor from the number of measurements required in the case of exact projections (see Sect. 3). Under condition (19) convergence of the AM-IHT is ensured.
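The group budgets \(G_T\), \(G_H\) and \(\tilde{G}\) defined above are easy to evaluate for concrete parameters. The following Python sketch simply transcribes the three formulas; the function name is hypothetical, and \(G_T\) is kept as defined in the text, so it is not rounded to an integer.

```python
import math


def am_iht_group_budgets(G, eps, f):
    """Return (G_T, G_H, G_tilde) for accuracy eps and frequency f."""
    G_T = (1.0 + 1.0 / eps) * f * G                      # tail budget
    G_H = math.ceil((G_T + G) * math.log2(1.0 / eps))    # head budget
    G_tilde = G + G_T + G_H                              # RIP order
    return G_T, G_H, G_tilde
```

For \(G=5\), \(\epsilon =0.5\) and \(f=2\) this gives \(G_T=30\), \(G_H=35\) and \(\tilde{G}=70\), i.e. a constant multiple of G for fixed \(\epsilon \) and f.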

Extension to within-group sparsity and beyond

The head and tail approximation approach can be extended far beyond the standard group-sparsity model. It still works even if we are considering K-sparse and G-group-sparse (i.e. \({\mathfrak {G}}_{G,K}\)-sparse) vectors in our model, for example.

The reason is that the group model projection can be head approximated to a constant even in this case. Formally, if we search for the K weight maximal elements covered by G many groups, we are maximizing a submodular function subject to a knapsack constraint (see Note 3). This is known to be approximable to a constant factor (cf. Kulik et al. [42, 43] and related work). Suppose we delete the covered elements and run such an approximation algorithm again. Then, after a constant number of steps, we obtain a collection of groups and elements such that the total weight of the elements is at least a \((1-\epsilon )\)-fraction of the total weight that a K-sparse and G-group-sparse solution could ever have. Moreover, the sparsity budgets are exceeded only by a constant factor each.

Similarly, the analysis given in the proof of Theorem 6 works even if we impose sparsity on both groups and elements. Again, we obtain a solution that has a \((1+\epsilon )\)-tail guarantee whose support exceeds that of a G-group-sparse K-sparse vector by at most a constant factor if we assume a bounded frequency. This leads to the positive consequences detailed above.

More generally, any knapsack constraints on groups and elements can be handled, leading to head and tail approximations in the case when there are non-uniform sparsity budgets on the groups and elements. However the corresponding head approximations are rather involved, and certainly much less efficient than the simple greedy procedure of Hochbaum and Pathria [30].

Computations

In this section we present the computational results for the Model-IHT, MEIHT, AM-IHT and AM-EIHT presented in Sect. 2 for block-group instances. More precisely, we study block-groups, i.e. each group \({\mathcal {G}} \in {\mathfrak {G}}\) is a set of consecutive indices, \({\mathcal {G}}=[s,t]\cap [N]\), where \(1\le s<t\le N\) and each group has the same size \(|{\mathcal {G}}|=l\). For a given dimension N we generate blocks of size \(l=\lfloor 0.02N\rfloor \). We consider two types of block-models, one where the successive groups overlap in \(\lfloor \frac{l-1}{2}\rfloor \) items and another where they overlap in \(l-1\) items. Note that the frequency is then given by \(f=2\) or \(f=l\) respectively.
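The two block-group families can be generated as in the following Python sketch. It is a plausible reading of the description above, not the authors' generator: indices are 0-based, only blocks that fit entirely into the index range are produced, \(N\ge 50\) is assumed so that \(l\ge 1\), and the function names are hypothetical.

```python
def block_groups(N, dense_overlap):
    """Blocks of size l = floor(0.02 N) with overlap l-1 or floor((l-1)/2)."""
    l = N // 50  # floor(0.02 * N) in exact integer arithmetic
    overlap = l - 1 if dense_overlap else (l - 1) // 2
    step = l - overlap  # distance between the starts of successive blocks
    return [list(range(s, s + l)) for s in range(0, N - l + 1, step)]


def max_frequency(groups, N):
    """Largest number of groups that any single index belongs to."""
    counts = [0] * N
    for g in groups:
        for i in g:
            counts[i] += 1
    return max(counts)
```

For \(N=200\) this yields blocks of size \(l=4\): 66 groups with frequency 2 for overlap \(\lfloor \frac{l-1}{2}\rfloor =1\), and 197 groups with frequency \(l=4\) for overlap \(l-1=3\).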

We run all algorithms for random signals \({\mathbf {x}}\in {\mathbb {R}}^N\) in dimension \(N\in \left\{ 100,200,\ldots ,900\right\} \). For each dimension we vary the number of measurements \(m\in \left\{ 20,40,\ldots , N\right\} \) and generate 20 random matrices \({\mathbf {A}}\in {\mathbb {R}}^{m\times N}\) each together with a random signal \({\mathbf {x}}\in {\mathbb {R}}^N\). We assume there is no noise, i.e. \({\mathbf {e}}=0\). For a given group model \({\mathfrak {G}}\) the support of the signal \({\mathbf {x}}\) is determined as the union of G randomly selected groups. The components of \({\mathbf {x}}\) in the support are independent and identically distributed draws from a standard Gaussian distribution while all other components are set to 0. Our computations are carried out for two classes of random matrices, Gaussian matrices and expander matrices as described in Sect. 2. The Gaussian matrices are generated by drawing independent and identically distributed values from a standard Gaussian distribution for each entry of \({\mathbf {A}}\) and afterwards scaling each entry by \(\frac{1}{\sqrt{m}}\). The expander matrices are generated by randomly selecting \(d=\lfloor 2\log (N)/\log (G l)\rfloor \) indices in \(\left\{ 1,\ldots , m\right\} \) for each column of the matrix. The choice of d is motivated by the choice in [3]. Each of the algorithms is stopped if either the number of iterations reaches 1000 or if for any iteration \(i+1\) we have \(\Vert {\mathbf {x}}^{(i+1)}-{\mathbf {x}}^{(i)}\Vert _p<10^{-5}\). For the error in each iteration we use \(p=1\) for the calculations corresponding to the expander matrices and \(p=2\) for the Gaussian matrices. After the termination of the algorithm we calculate the relative error of the returned signal \(\hat{{\mathbf {x}}}\) to the true signal \({\mathbf {x}}\), i.e. we calculate

$$\begin{aligned} \frac{\Vert {\mathbf {x}}-\hat{{\mathbf {x}}}\Vert _p}{\Vert {\mathbf {x}}\Vert _p}. \end{aligned}$$

Again we use \(p=1\) for the calculations corresponding to the expander matrices and \(p=2\) for the Gaussian matrices. We call a signal recovered if the relative error is smaller than \(10^{-5}\). For the AM-IHT and the AM-EIHT the approximation accuracies of the head- and the tail-approximation algorithms are set to \(\alpha = 0.95\) and \(\beta =1.05\).
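The expander construction and the recovery criterion described above can be sketched in Python as follows. The sketch stores an expander matrix as a list of row indices per column and assumes that the selected entries are the nonzeros of a binary matrix; this representation, the 0-based row indices and the function names are assumptions made for illustration. Note that the ratio \(\log (N)/\log (Gl)\) does not depend on the base of the logarithm.

```python
import math
import random


def expander_columns(N, m, G, l, rng):
    """For each of the N columns, pick d distinct row indices out of m."""
    d = math.floor(2 * math.log(N) / math.log(G * l))
    return [sorted(rng.sample(range(m), d)) for _ in range(N)]


def relative_error(x, x_hat, p):
    """||x - x_hat||_p / ||x||_p for p = 1 (expander) or p = 2 (Gaussian)."""
    num = sum(abs(a - b) ** p for a, b in zip(x, x_hat)) ** (1.0 / p)
    den = sum(abs(a) ** p for a in x) ** (1.0 / p)
    return num / den


def is_recovered(x, x_hat, p, tol=1e-5):
    """A signal counts as recovered if its relative error is below tol."""
    return relative_error(x, x_hat, p) < tol
```

For \(N=200\), \(G=5\) and \(l=4\) the formula gives \(d=3\) nonzero positions per column.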

For the exact signal approximation problem which has to be solved in each iteration of the Model-IHT and the MEIHT we implemented the Benders’ decomposition procedure presented in Sect. 3.6. To this end the master problem is solved by CPLEX 12.6 while each optimal solution of the subproblem is calculated using the result of Lemma 4. Regarding the AM-IHT, for the head-approximation we implemented the greedy procedure given in Algorithm 4 while for the tail-approximation we implemented the procedure of Theorem 6. Again the LP in the latter procedure is solved by CPLEX 12.6. All computations were performed on a cluster of 64-bit Intel(R) Xeon(R) CPU E5-2603 processors running at 1.60 GHz with 15 MB cache.

The results of the computations are presented in the following diagrams. For all instances we measure the smallest number of measurements \(m^\#\) for which the median of the relative error to the true signal is at most \(10^{-5}\), i.e. the smallest number of measurements for which at least 50% of the signals were recovered. Furthermore we show the average number of iterations and the average time in seconds the algorithms need to successfully recover a signal, given \(m^\#\) number of measurements. We stop increasing the number of measurements m if \(m^\#\) is reached.
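The quantity \(m^\#\) can be computed from the recorded relative errors as in the following Python sketch. The input layout, a dictionary mapping each tested m to the list of relative errors over the random instances, and the function name are assumptions.

```python
from statistics import median


def smallest_sufficient_m(errors_by_m, tol=1e-5):
    """Smallest m whose median relative error is at most tol, or None."""
    for m in sorted(errors_by_m):
        if median(errors_by_m[m]) <= tol:
            return m
    return None
```

Scanning the values of m in increasing order and stopping at the first hit mirrors the rule that m is not increased further once \(m^\#\) is reached.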

Fig. 6. Smallest number of measurements \(m^\#\) for which the median of the relative error over all random matrices is at most \(10^{-5}\)

Fig. 7. Average number of iterations performed by the algorithm to successfully recover the signal with \(m^\#\) measurements

Fig. 8. Average time in seconds needed by the algorithm to successfully recover the signal with \(m^\#\) measurements

In Figs. 6, 7 and 8 we show the development of \(m^\#\), the number of iterations and the runtime in seconds of all algorithms over all dimensions \(N\in \left\{ 100,200,\ldots ,900\right\} \) for block-groups, generated as explained above, with fixed value \(G=5\).

The smallest number of measurements \(m^\#\) which leads to a median relative error of at most \(10^{-5}\) is nearly linear in the dimension; see Fig. 6. For all algorithms the corresponding \(m^\#\) is very close even for different overlap sizes. Nevertheless the number of measurements \(m^\#\) is smaller for expander matrices than for Gaussian matrices. Furthermore in the expander case the instances with overlap \(\lfloor \frac{l-1}{2}\rfloor \) have a smaller \(m^\#\). The average number of iterations performed by the algorithms fluctuates a lot for the Gaussian case. Here the value increases slowly for the Model-IHT while it increases more rapidly for the AM-IHT. In the expander case the number of iterations is very similar for all algorithms and lies between 50 and 70 most of the time; see Fig. 7. The drop from \(N=100\) to \(N=200\) is due to the small value of d when \(N=100\). Furthermore the number of iterations is much lower in the expander case.

Fig. 9. Smallest number of measurements \(m^\#\) for which the median of the relative error over all random matrices is at most \(10^{-5}\)

Fig. 10. Average number of iterations performed by the algorithm to successfully recover the signal with \(m^\#\) measurements

Fig. 11. Average time in seconds needed by the algorithm to successfully recover the signal with \(m^\#\) measurements

The average runtime for the Gaussian case is a bit larger than the runtime for the expander case, as expected, since operations with dense matrices are more costly than those with sparse ones. However, this may also be due to the larger number of iterations; see Fig. 8. Furthermore the runtime for the instances with overlap \(l-1\) is much larger in both cases. Here the AM-IHT (AM-EIHT) is faster than the Model-IHT (MEIHT) for the instances with overlap \(l-1\) while it is slightly slower for the others.

In Figs. 9, 10 and 11 we show the same properties as above, but for varying \(G\in \left\{ 1,2,\ldots ,9\right\} \), a fixed dimension of \(N=200\) and d fixed to 7 for all values of G. Similar to Fig. 6 the value \(m^\#\) seems to be linear in G (see Fig. 9). Only the Model-IHT (MEIHT) for blocks with overlap \(l-1\) seems to require an exponential number of measurements in G to guarantee a small median relative error. Here the AM-IHT (AM-EIHT) performs much better. Interestingly the number of iterations does not change a lot for increasing G for Gaussian matrices while it grows for the AM-EIHT in the expander case. This is in contrast to the iteration results with increasing N; see Fig. 7. The runtime of all algorithms increases slowly with increasing G, except for the IHT for blocks with overlap \(l-1\), where the runtime explodes. For both instance types the AM-IHT (AM-EIHT) is faster than the Model-IHT (MEIHT).

To conclude this section we selected the instances for dimension \(N=800\) and \(G=5\) and present the development of the median relative error over the number of measurements m; see Fig. 12. In the expander case the median relative error decreases rapidly and is nearly 0 already for \(\frac{m}{N}\approx 0.45\). Only for the MEIHT for blocks with overlap \(l-1\) does the relative error not come close to 0 until \(\frac{m}{N}\approx 0.55\). For the Gaussian case the results look similar with the only difference that a median relative error close to 0 is not reached until \(\frac{m}{N}\approx 0.6\).

Fig. 12. Median relative error for instances with dimension \(N=800\) and \(G=5\). The diagrams are presented in log-scale

Latent group lasso

In this section we present computational results for the latent group Lasso approach introduced in Sect. 2. We consider block-groups and all group instances are generated as described in the previous section. We study the \(\ell _1/\ell _2\) variant presented in (14) and its \(\ell _0\) counterpart presented in (15). Both problems are implemented in CPLEX 12.6. For the \(\ell _0\) counterpart we implemented the integer programming formulation (16). For the given random Gaussian matrices \({\mathbf {A}}\in {\mathbb {R}}^{m\times N}\) and their linear measurements \({\mathbf {y}}\in {\mathbb {R}}^m\) we use \(L({\mathbf {x}})=\Vert {\mathbf {A}}{\mathbf {x}}-{\mathbf {y}}\Vert _2^2\) while for expander matrices we use \(L({\mathbf {x}})=\Vert {\mathbf {A}}{\mathbf {x}}-{\mathbf {y}}\Vert _1\). The last choice is motivated by our computational tests which showed that the \(\ell _1\)-norm has a better performance for expander matrices.

We run all algorithms for random signals \({\mathbf {x}}\in {\mathbb {R}}^N\) in dimension \(N=200\). The number of measurements is varied in \(m\in \left\{ 20,40,\ldots , 2N\right\} \) and we generate 20 random matrices \({\mathbf {A}}\in {\mathbb {R}}^{m\times N}\) each together with a random signal \({\mathbf {x}}\in {\mathbb {R}}^N\). We assume there is no noise, i.e. \({\mathbf {e}}=0\). For a given group model \({\mathfrak {G}}\) the support of the signal \({\mathbf {x}}\) is determined as the union of \(G=5\) randomly selected groups. The components of \({\mathbf {x}}\) in the support are independent and identically distributed draws from a standard Gaussian distribution while all other components are set to 0. Our computations are carried out for two classes of random matrices, Gaussian matrices and expander matrices generated as described in the previous section. After each calculation we calculate the relative error of the returned signal \(\hat{{\mathbf {x}}}\) to the true signal \({\mathbf {x}}\), i.e. we calculate

$$\begin{aligned} \frac{\Vert {\mathbf {x}}-\hat{{\mathbf {x}}}\Vert _p}{\Vert {\mathbf {x}}\Vert _p} \end{aligned}$$

where we use \(p=1\) for the calculations corresponding to the expander matrices and \(p=2\) for the Gaussian matrices. Additionally we calculate the pattern recovery error

$$\begin{aligned} \frac{1}{2N}\left( |{\text {supp}}({\mathbf {x}}){\setminus } {\text {supp}}(\hat{{\mathbf {x}}})| + |{\text {supp}}(\hat{{\mathbf {x}}}){\setminus } {\text {supp}}({\mathbf {x}})| \right) , \end{aligned}$$

which was defined in [49], and the probability of recovery, i.e. the fraction of instances which were successfully recovered. We call a signal recovered if the relative error is smaller than or equal to \(10^{-4}\).
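The pattern recovery error is the size of the symmetric difference of the two supports, normalized by 2N. A minimal Python sketch follows; the function names and the optional tolerance `tol` used to decide which entries count as nonzero are assumptions.

```python
def support(x, tol=0.0):
    """Indices of the entries of x whose absolute value exceeds tol."""
    return {i for i, xi in enumerate(x) if abs(xi) > tol}


def pattern_recovery_error(x, x_hat, tol=0.0):
    """(|supp(x) \\ supp(x_hat)| + |supp(x_hat) \\ supp(x)|) / (2N)."""
    s, s_hat = support(x, tol), support(x_hat, tol)
    return (len(s - s_hat) + len(s_hat - s)) / (2 * len(x))
```

For instance, for `x = [1, 0, 2, 0]` and `x_hat = [1, 3, 0, 0]` each support contains one index missing from the other, so the error is 2/(2·4) = 0.25.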

All computations were performed for \(\lambda \in \left\{ 0.25,0.5,\ldots ,5 \right\} \) and \(d_{{\mathcal {G}}} = 1\) for all groups \({\mathcal {G}}\in {\mathfrak {G}}\). For each \(m\in \left\{ 20,40,\ldots , 2N\right\} \) the \(\lambda \) with the best average relative error is calculated and the \(\lambda \) which has the best average relative error most often over all m is chosen. For all experiments the optimal value was \(\lambda ^*=0.25\).
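The selection rule for \(\lambda \) described above can be sketched in Python as follows. The input layout, a dictionary mapping each m to a dictionary of average relative errors per \(\lambda \), and the function name are assumptions; ties are broken in favour of the value encountered first.

```python
from collections import Counter


def select_lambda(avg_error):
    """Pick the lambda that has the best average relative error for most m.

    avg_error[m][lam] is the average relative error for m measurements
    and regularization parameter lam.
    """
    winners = [min(errs, key=errs.get) for errs in avg_error.values()]
    return Counter(winners).most_common(1)[0][0]
```

The per-m winners computed in the first step are the values that a plot of the best \(\lambda \) over m would display.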

All computations were performed on a cluster of 64-bit Intel(R) Xeon(R) CPU E5-2603 processors running at 1.60 GHz with 15 MB cache.

The results of the computations are presented in Figs. 13, 14, 15, 16 and 17. For each m we show the median relative error, the probability of recovery, the average pattern recovery error and the average number of selected groups G; each value calculated over all 20 matrices and for \(\lambda ^*=0.25\). In Fig. 17 we show for each m the value of \(\lambda \) which has the smallest average relative error.

Fig. 13. Median relative error over all 20 instances

Fig. 14. Probability of recovery

Fig. 15. Average pattern recovery error over all 20 instances

Fig. 16. Average number of groups G selected in \(\hat{x}\) over all 20 instances

Fig. 17. Value of \(\lambda \) with best mean relative error

The results in Fig. 13 show that the \(\ell _0\) variant of the latent group Lasso performs very well for Gaussian matrices. Even for a small number of measurements the median relative error is 0 while for expander matrices the error is never smaller than 0.6. The \(\ell _1/\ell _2\) variant for Gaussian matrices has a larger error for a small number of measurements which decreases rapidly and is always 0 for m larger than 0.5N. In the expander case it is never smaller than 0.6 as well. The same picture holds for the pattern recovery error; see Fig. 15. The only difference here is that also for expander matrices the error tends to 0. The results indicate that the optimal support is calculated for both variants and for both types of matrices if m is large enough, but for expander matrices the latent group Lasso struggles to find the optimal component-values of \({\mathbf {x}}\) in the support. Interestingly the frequency of the groups does not significantly influence the results.

The probability of recovery for Gaussian matrices is 1 for all m for the \(\ell _0\) variant and is 1 for m larger than 0.8N for the \(\ell _1/\ell _2\) variant; see Fig. 14. In line with the results for the relative error the probability of recovery for expander matrices is 0 for all m. The number of groups selected by the latent group Lasso is close to 5 for all m for the \(\ell _0\) variant; see Fig. 16. For the \(\ell _1/\ell _2\) variant with Gaussian matrices it is close to 5 for all m larger than 0.8N while for the expander matrices it is already close to 5 for all m larger than 0.5N. The value of the optimal \(\lambda \) is large for small m and is always 0.25 for larger m; see Fig. 17.

To summarize the results, it seems that the \(\ell _0\) latent group Lasso outperforms the \(\ell _1/\ell _2\) variant. It can even compete with the iterative algorithms tested in the previous section; the number of required measurements can be even smaller for the \(\ell _0\) latent group Lasso while at the same time the support is correctly recovered. Nevertheless in a real-world application the optimal \(\lambda \) is not known and has to be found. Furthermore in contrast to the iterative algorithms studied in this work it can never be guaranteed that the recovered support and especially the number of groups calculated by the latent group Lasso is optimal. The expander variant of the latent group Lasso performs much worse than the iterative algorithms. In particular, this approach fails to recover the true signal for all instances, although the true support can be found.

Conclusion

In this paper we revisited the model-based compressed sensing problem focusing on overlapping group models with bounded treewidth and low frequency. We derived a polynomial time dynamic programming algorithm to solve the exact projection problem for group models with bounded treewidth, which is more general than the state-of-the-art considering loopless overlapping models. For general group models we derived an algorithm based on the idea of Benders’ decomposition, which may run in exponential time but often performs better than dynamic programming in practice. We proved that the latter procedure is generalizable from group-sparse models to group-sparse plus standard sparse models. The most dominant operation of iterative exact projection algorithms is the model projection. Hence our results show that the Model-IHT and the MEIHT run in polynomial time for group models with bounded treewidth. Alternatively, for group models with bounded frequency we show that another class of less accurate algorithms runs in polynomial time. More precisely the AM-IHT and the AM-EIHT are algorithms using head- and tail-approximations instead of exact projections.

Using Benders’ decomposition (with Gaussian and model-expander sensing matrices) we compared the minimum number of measurements required by, and the run-times of, each of the four algorithms (Model-IHT, MEIHT, AM-IHT and AM-EIHT) to achieve a given accuracy. In summary, the experimental results on overlapping block groups seem to indicate that the number of measurements required to recover a signal is smaller for expander matrices than for Gaussian matrices. Furthermore, we observed that the number of measurements needed to ensure a small relative error is smaller for the approximate versions of the algorithms. With increasing N, the run-time grows much faster for Gaussian matrices than for expander matrices, as might be expected when applying dense versus sparse matrices. In general, the approximate versions of the algorithms may need a larger number of iterations, but their run-time is lower. This indicates that the larger number of iterations can be compensated by the faster computation of the approximate projection problems in each iteration. In addition to the iterative algorithms, we tested the latent group Lasso approach on the same instances and showed that the \(\ell _0\) variant outperforms the \(\ell _1/\ell _2\) variant and is even competitive with the iterative algorithms.
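To make the trade-off between exact and approximate projections more concrete, the following Python sketch shows a classical greedy selection of at most G groups, the kind of procedure that underlies a head-approximation. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function name `greedy_group_cover` and its return values are hypothetical. Each round picks the group that adds the most not-yet-covered squared weight, which is the standard greedy rule for maximum coverage and is known to capture at least a \((1-1/e)\) fraction of the optimal weight.

```python
import numpy as np


def greedy_group_cover(x, groups, G):
    """Greedy selection of at most G groups covering much of the weight of x.

    Returns the restriction of x to the union of the chosen groups and the
    list of chosen group indices. Each round costs one pass over all groups.
    """
    x = np.asarray(x, dtype=float)
    weights = x ** 2
    groups = [np.asarray(g, dtype=int) for g in groups]
    covered = np.zeros(len(x), dtype=bool)
    chosen = []
    for _ in range(min(G, len(groups))):
        # Marginal gain: squared weight of the indices not covered so far.
        gains = [weights[g][~covered[g]].sum() for g in groups]
        best = int(np.argmax(gains))
        if gains[best] <= 0.0:
            break  # no remaining group adds any weight
        chosen.append(best)
        covered[groups[best]] = True
    return np.where(covered, x, 0.0), chosen
```

Compared with enumerating all choices of G groups, this needs only G passes over the groups, which matches the observation that cheaper approximate projections can offset a larger number of iterations.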

Notes

  1. Recall that a clique in a graph is a set of mutually adjacent vertices.

  2. Recall that a graph is cubic if every vertex is of degree 3, and planar if it can be drawn into the plane such that no two edges cross.

  3. Actually, we are maximizing a submodular function subject to a uniform matroid constraint, which is a simpler problem.

Acknowledgements

Open Access funding provided by Projekt DEAL. BB acknowledges the support from the funding by the German Federal Ministry of Education and Research, administered by Alexander von Humboldt Foundation, for the German Research Chair at AIMS South Africa.

Author information

Corresponding author

Correspondence to Jannis Kurtz.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Bah, B., Kurtz, J. & Schaudt, O. Discrete optimization methods for group model selection in compressed sensing. Math. Program. 190, 171–220 (2021). https://doi.org/10.1007/s10107-020-01529-7

  • DOI: https://doi.org/10.1007/s10107-020-01529-7

Keywords

  • Compressed sensing
  • Group models
  • Iterative hard thresholding
  • Maximum weight coverage problem

Mathematics Subject Classification

  • 90C10
  • 90C27