Abstract
In this article we study the problem of signal recovery for group models. More precisely, for a given set of groups, each containing a small subset of indices, and for given linear sketches of the true signal vector, which is known to be group-sparse in the sense that its support is contained in the union of a small number of these groups, we study algorithms which successfully recover the true signal from the knowledge of its linear sketches alone. We derive model projection complexity results and algorithms for more general group models than the state-of-the-art. We consider two versions of the classical iterative hard thresholding algorithm (IHT). The classical version iteratively calculates the exact projection of a vector onto the group model, while the approximate version (AM-IHT) iteratively uses a head- and a tail-approximation. We apply both variants to group models and analyse the two cases where the sensing matrix is a Gaussian matrix and a model expander matrix. To solve the exact projection problem on the group model, which is known to be equivalent to the maximum weight coverage problem, we use discrete optimization methods based on dynamic programming and Benders’ decomposition. The head- and tail-approximations are derived by a classical greedy method and LP-rounding, respectively.
Introduction
In many applications involving sensors or sensing systems an unknown sparse signal has to be recovered from a relatively small number of measurements. The reconstruction problem in standard compressed sensing attempts to recover an unknown k-sparse signal \({\mathbf {x}}\in {\mathbb {R}}^N\), i.e. one with at most k nonzero entries, from its (potentially noisy) linear measurements \({\mathbf {y}}= {\mathbf {A}}{\mathbf {x}}+ {\mathbf {e}}\). Here, \({\mathbf {A}}\in {\mathbb {R}}^{m\times N}\) for \(m\ll N\), \({\mathbf {y}}\in {\mathbb {R}}^m\) and \({\mathbf {e}}\in {\mathbb {R}}^m\) is a noise vector, typically with a bounded noise level \(\Vert {\mathbf {e}}\Vert _2 \le \eta \); see [13, 14, 19]. A well-known result is that, if \({\mathbf {A}}\) is a random Gaussian matrix, the number of measurements required for most of the classical algorithms, like \(\ell _1\)-minimization or Iterative Hard Thresholding (IHT), to successfully recover the true signal is \(m={\mathcal {O}}\left( k\log \left( N/k\right) \right) \) [10, 15].
In model-based compressed sensing we exploit second-order structures beyond the first-order sparsity and compressibility of a signal to encode it more efficiently and decode it more accurately. Efficient encoding means taking fewer measurements than in the standard compressed sensing setting, while accurate decoding means not only a smaller recovery error but also better interpretability of the recovered solution. These second-order structures of the signal are usually referred to as the structured sparsity of the signal. The idea is, besides standard sparsity, to take into account more complicated structures of the signal [5]. Most of the classical results and algorithms for standard compressed sensing can be adapted to the model-based framework [5].
Numerous applications of model-based compressed sensing exist in practice. Key amongst these is the multiple measurement vector (MMV) problem, which can be modelled as a block-sparse recovery problem [21]. The tree-sparse model has been well-exploited in a number of wavelet-based signal processing applications [5]. In the sparse matrix setting (see Sect. 2), model-based compressed sensing was used to solve the Earth Mover’s Distance (EMD) problem. The EMD problem introduced in [53] is motivated by the task of reconstructing time sequences of spatially sparse signals, e.g. seismic measurements; see also [26]. In addition, there are many more potential applications in linear sketching, including data streaming [47], graph sketching [1, 25], breaking privacy of databases via aggregate queries [20], and sparse regression codes or sparse superposition codes (SPARC) decoding [38, 54], which is also an MMV problem.
Structured sparsity models include tree-sparse, block-sparse, and group-sparse models. For instance, for block-sparse models with dense Gaussian sensing matrices it has been established in [5] that the number of measurements required to ensure recovery is \(m={\mathcal {O}}(k)\), as opposed to \(m={\mathcal {O}}\left( k\log \left( N/k\right) \right) \) in the standard compressed sensing setting. Furthermore, in the sparse matrix setting, precisely for adjacency matrices of model expander graphs (also known as model expander matrices), the tree-sparse model only requires \(m = {\mathcal {O}}\left( \log _k\left( N/k\right) \right) \) measurements [3, 34], which is smaller than the standard compressed sensing sampling rate stated above. Moreover, all proposed algorithms that perform an exact model projection, i.e. find the closest vector in the model space for a given signal, guarantee recovery of a solution belonging to the model space, which is then more interpretable than the output of off-the-shelf standard compressed sensing algorithms [3].
As the exact model projection problem used in many of the classical algorithms may become theoretically and computationally hard for specific sparsity models, approximation variants of some well-known algorithms like the Model-IHT have been introduced in [27]. Instead of iteratively solving the exact model projection problem, this algorithm, called AM-IHT, uses a head- and a tail-approximation to recover the signal, which is computationally less demanding in general. The latter computational benefit comes along with a larger number of measurements required to obtain successful recovery and with weaker recovery guarantees: the typical speed versus accuracy trade-off.
A special class of structured sparsity models are group models, where the support of the signal is known to be contained in the union of a small number of groups of indices. Group models have already been studied extensively in the compressed sensing literature; see [6, 36, 49]. Their choice is motivated by several applications, e.g. in image processing; see [44, 50, 51]. As shown in [4], the exact projection problem for group models is NP-hard in general but can be solved in polynomial time by dynamic programming if the intersection graph of the groups has no cycles. The latter case is quite restrictive, since as a consequence each element is contained in at most two groups. In this work we extend existing results for the Model-IHT algorithm and its approximation variant (AM-IHT) derived in [27] to group models and model expander matrices. We focus on deriving discrete optimization methods to solve the exact projection problem and the head- and tail-approximations for much more general classes of group models than the state-of-the-art.
In Sect. 2 we present the main preliminary results regarding compressed sensing for structured sparsity models and group models. In Sect. 3 we study recovery algorithms using exact projection oracles. We first show that for group models with low treewidth, the projection problem can be solved in polynomial time by dynamic programming, which generalizes the result in [4]. We then adapt known theoretical results for model expander matrices to these more general group models. To solve the exact projection problem for general group models we apply a Benders’ decomposition procedure. It can even be used under the more general assumption that we seek a signal which is group-sparse and additionally sparse in the classical sense. In Sect. 4 we study recovery algorithms using approximate projection oracles, namely head- and tail-approximations. We apply the known results in [27] to group models of low frequency and show that the required head- and tail-approximations for group models can be computed by a classical greedy method and LP-rounding, respectively. In Sect. 5 we test all algorithms, including Model-IHT, AM-IHT, MEIHT and AM-EIHT, on overlapping block groups and compare the number of required measurements, iterations and the runtime.
Summary of our contributions

We study the Model-Expander IHT (MEIHT) algorithm, which was analysed in [3] for tree-sparse and loopless group-sparse signals, and extend the existing results to general group models, proving convergence of the algorithm.

We extend the results in [4] by proving that the projection problem can be solved in polynomial time if the incidence graph of the underlying group model has bounded treewidth. This includes the case when the intersection graph has bounded treewidth, which generalizes the result for acyclic graphs derived in [4]. We complement the latter result with a hardness result that we use to justify the bounded treewidth approach.

We derive a Benders’ decomposition procedure to solve the projection problem for arbitrary group models, assuming no restriction on the frequency or the structure of the groups. The latter procedure even works for the more general model combining group-sparsity with classical sparsity. We integrate this procedure into the Model-IHT and MEIHT algorithms.

We apply the Approximate Model-IHT (AM-IHT) derived in [26, 28] to Gaussian and expander matrices and to the case of group models with bounded frequency, i.e. a bound on the maximal number of groups an element is contained in. In the expander case we call the algorithm AM-EIHT. To this end we derive both head- and tail-approximations of arbitrary precision, using a classical greedy method and LP-rounding. Via the AM-IHT and the results in [26, 28], this implies compressive sensing \(\ell _2/\ell _2\) recovery guarantees for group-sparse signals. We show that the number of measurements needed to guarantee successful recovery exceeds the usual model-based compressed sensing bound [5, 9] only by a constant factor.

We test the algorithms Model-IHT, MEIHT, AM-IHT and AM-EIHT on groups given by overlapping blocks, for random signals and measurement matrices. We analyse and compare the minimal number of measurements needed for recovery, the runtime and the number of iterations of the algorithms.
Preliminaries
Notation
In most of this work scalars are denoted by ordinary letters (e.g. x, N), vectors and matrices by boldface letters (e.g. \(\mathbf{x}\), \(\mathbf{A}\)), and sets by calligraphic capital letters (e.g., \(\mathcal {S}\)).
The cardinality of a set \(\mathcal {S}\) is denoted by \(|\mathcal {S}|\) and we define \([N] := \{1, \ldots , N\}\). Given \(\mathcal {S} \subseteq [N]\), its complement is denoted by \(\mathcal {S}^c := [N] {\setminus } \mathcal {S}\) and \({\mathbf {x}}_\mathcal {S}\) is the restriction of \({\mathbf {x}}\in {\mathbb {R}}^N\) to \(\mathcal {S}\), i.e.
The support of a vector \({\mathbf {x}}\in {\mathbb {R}}^N\) is defined by \({\text {supp}}({\mathbf {x}})=\left\{ i\in [N] \ : \ x_i\ne 0\right\} \). For a given \(k\in {\mathbb {N}}\) we say a vector \({\mathbf {x}}\in {\mathbb {R}}^N\) is k-sparse if \(|{\text {supp}}({\mathbf {x}})|\le k\). For a matrix \({\mathbf {A}}\), the matrix \({\mathbf {A}}_{\mathcal {S}}\) denotes the submatrix of \({\mathbf {A}}\) with columns indexed by \({\mathcal {S}}\). For a graph \(G=(V,E)\) and \({\mathcal {S}}\subseteq V\), \(\varGamma ({\mathcal {S}})\) denotes the set of neighbours of \({\mathcal {S}}\), that is the set of nodes that are connected by an edge to the nodes in \({\mathcal {S}}\). We denote by \(e_{ij} = (i,j)\) an edge connecting node i to node j. The set \({\mathcal {G}}_i\) denotes a group of size \(g_i\) and a group model is any subset of \(\mathfrak {G} = \{{\mathcal {G}}_1,\ldots ,{\mathcal {G}}_M\}\), while a group model of order \(G\in [N]\) is denoted by \({\mathfrak {G}}_G\), which is a collection of any G groups of \({\mathfrak {G}}\). For a subset of groups \({\mathcal {S}}\subset {\mathfrak {G}}\) we sometimes write
The \(\ell _p\) norm of a vector \(\mathbf{x} \in {\mathbb {R}}^N\) is defined as
Compressed sensing
Recall that the reconstruction problem in standard compressed sensing [13, 14, 19] attempts to recover an unknown k-sparse signal \({\mathbf {x}}\in {\mathbb {R}}^N\) from its (potentially noisy) linear measurements \({\mathbf {y}}= {\mathbf {A}}{\mathbf {x}}+ {\mathbf {e}}\), where \({\mathbf {A}}\in {\mathbb {R}}^{m\times N}\), \({\mathbf {y}}\in {\mathbb {R}}^m\) for \(m\ll N\) and \({\mathbf {e}}\in {\mathbb {R}}^m\) is a noise vector, typically with a bounded noise level \(\Vert {\mathbf {e}}\Vert _2 \le \eta \). The reconstruction problem can be formulated as the optimization problem
where \(\Vert {\mathbf {x}}\Vert _0\) is the number of nonzero components of \({\mathbf {x}}\). Problem (1) is usually relaxed to an \(\ell _1\)-minimization problem by replacing \(\Vert {\cdot } \Vert _0\) with the \(\ell _1\)-norm. It has been established that the solution minimizing the \(\ell _1\)-norm coincides with the optimal solution of (1) under certain conditions [14]. Besides the latter approach, the compressed sensing problem can be solved by a class of greedy algorithms, including the IHT [10]. A detailed discussion of compressed sensing algorithms can be found in [22].
The idea behind the IHT can be explained by considering the problem
Under certain choices of \(\eta \) and k the latter problem is equivalent to (1) [10]. Based on the idea of gradient descent methods, (2) can be solved by iteratively taking a gradient descent step, followed by a hard thresholding operation, which sets all components to zero except the k largest in magnitude. Starting with an initial guess \({\mathbf {x}}^{(0)} = {\varvec{0}}\), the \((n+1)\)th IHT update is given by
where \(\mathcal {H}_k:{\mathbb {R}}^N\rightarrow {\mathbb {R}}^N\) is the hard threshold operator and \({\mathbf {A}}^*\) is the adjoint matrix of \({\mathbf {A}}\).
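As an illustration, the IHT update above can be sketched in a few lines of NumPy. The function names and the unit step size are our choices, not part of [10]; convergence is only guaranteed under conditions on \({\mathbf {A}}\) such as a small operator norm and the RIP:

```python
import numpy as np

def hard_threshold(x, k):
    """H_k: keep the k largest-magnitude entries of x, set the rest to zero."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-k:]  # indices of the k largest |x_i|
    out[keep] = x[keep]
    return out

def iht(A, y, k, num_iters=100):
    """IHT iteration: x^(n+1) = H_k(x^(n) + A^T (y - A x^(n)))."""
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        x = hard_threshold(x + A.T @ (y - A @ x), k)
    return x
```

The gradient step uses \({\mathbf {A}}^T\) since \({\mathbf {A}}\) is real, matching the adjoint \({\mathbf {A}}^*\) in the update formula.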
Recovery guarantees of algorithms are typically given in terms of what is referred to as the \(\ell _p/\ell _q\) instance optimality [14]. Precisely, an algorithm has \(\ell _p/\ell _q\) instance optimality if for a given signal \({\mathbf {x}}\) it always returns a signal \(\widehat{{\mathbf {x}}}\) with the following error bound
where \(1\le q \le p \le 2\), \(c_1(k,p,q), c_2(k,p,q)\) are constants independent of the dimension of the signal and
is the best k-term approximation of the signal (in the \(\ell _q\)-norm).
Ideally, we would like to have \(\ell _2/\ell _1\) instance optimality [14]. It turns out that the instance optimality of the known algorithms depends mainly on the sensing matrix \({\mathbf {A}}\). Key amongst the tools used to analyse the suitability of \({\mathbf {A}}\) is the restricted isometry property, defined in the following.
Definition 1
(RIP) A matrix \({\mathbf {A}}\in {\mathbb {R}}^{m\times N}\) satisfies the \(\ell _p\)-norm restricted isometry property (RIP-p) of order k, with restricted isometry constant (RIC) \(\delta _k < 1\), if for all k-sparse vectors \({\mathbf {x}}\)
Typically, RIP without the subscript p refers to the case \(p=2\). We use this general definition here because we will study the case \(p=1\) later. The RIP is a sufficient condition on \({\mathbf {A}}\) that guarantees optimal recovery of \({\mathbf {x}}\) for most of the known algorithms. If the entries of \({\mathbf {A}}\) are drawn i.i.d. from a sub-Gaussian distribution and \(m = {\mathcal {O}}\left( k\log (N/k)\right) \), then \({\mathbf {A}}\) has RIP-2 with high probability, which leads to the ideal \(\ell _2/\ell _1\) instance optimality for most algorithms; see [15]. Note that the bound \({\mathcal {O}}\left( k\log (N/k)\right) \) is asymptotically tight. On the other hand, deterministic constructions of \({\mathbf {A}}\), or random \({\mathbf {A}}\) with binary entries with nonzero mean, do not achieve this optimal m, and are faced with the so-called square root bottleneck where \(m=\varOmega \left( k^2\right) \); see [16, 18].
Sparse sensing matrices from expander graphs
The computational benefits of sparse sensing matrices motivated the search for a way to circumvent the square root bottleneck for nonzero-mean binary matrices. One such class of binary matrices is the class of adjacency matrices of expander graphs (henceforth referred to as expander matrices), which satisfy the weaker RIP-1. Expander graphs are objects of interest in pure mathematics and theoretical computer science; for a detailed discourse on this subject see [31]. We define an expander graph as follows:
Definition 2
(Expander graph) Let \(H=\left( [N],[m],{\mathcal {E}}\right) \) be a left-regular bipartite graph with N left vertices, m right vertices, a set of edges \({\mathcal {E}}\) and left degree d. If for any \(\epsilon \in (0,1/2)\) and any \({\mathcal {S}}\subset [N]\) of size \(|{\mathcal {S}}|\le k\) it holds that \(|\varGamma ({\mathcal {S}})| \ge (1-\epsilon )d|{\mathcal {S}}|\), then H is referred to as a \((k,d,\epsilon )\)-expander graph.
An expander matrix is the adjacency matrix of an expander graph. Choosing \(m = {\mathcal {O}}\left( k\log (N/k)\right) \), a random bipartite graph \(H=\left( [N],[m],{\mathcal {E}}\right) \) with left degree \(d={\mathcal {O}}\left( \frac{1}{\epsilon }\log (\frac{N}{k})\right) \) is a \((k,d,\epsilon )\)-expander graph with high probability [22]. Furthermore, expander matrices achieve the suboptimal \(\ell _1/\ell _1\) instance optimality [8]. For completeness we state the lemma in [35] deriving the RIC for such matrices.
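A d-left-regular bipartite graph of this kind is easy to sample; the following sketch (our own helper, not a construction from the cited works) builds the adjacency matrix by letting each left node pick d distinct right neighbours. Note that the upper bound \(\Vert {\mathbf {A}}{\mathbf {x}}\Vert _1 \le d\Vert {\mathbf {x}}\Vert _1\) holds for any such matrix, since every column contains exactly d ones:

```python
import numpy as np

def random_left_regular_adjacency(N, m, d, seed=0):
    """Adjacency matrix (m x N) of a random d-left-regular bipartite graph:
    column j has ones exactly in the d rows chosen as neighbours of left node j."""
    rng = np.random.default_rng(seed)
    A = np.zeros((m, N), dtype=int)
    for j in range(N):
        A[rng.choice(m, size=d, replace=False), j] = 1
    return A
```

Whether a given sample satisfies the expansion property of Definition 2 is not checked here; for suitable m and d it holds with high probability.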
Lemma 1
(RIP1 for expander matrices [35]) Let \({\mathbf {A}}\) be the adjacency matrix of a \((k,d,\epsilon )\)expander graph H, then for any ksparse vector \({\mathbf {x}}\), we have
The most relevant algorithm that exploits the structure of expander matrices is the Expander-IHT (EIHT) proposed in [22]. Similar to the IHT algorithm, it performs the updates
where \({\mathcal {M}}: {\mathbb {R}}^m \rightarrow {\mathbb {R}}^N\) is the median operator, with \([{\mathcal {M}}({\mathbf {z}})]_i = \text{ median }\left( \{z_j\}_{j\in \varGamma (i)}\right) \) for each \({\mathbf {z}}\in {\mathbb {R}}^m\). For expander matrices the EIHT achieves \(\ell _1/\ell _1\) instance optimality [22].
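The median operator can be sketched as follows, assuming the bipartite graph is given as a list `neighbors` where `neighbors[i]` contains the right vertices \(\varGamma (i)\) of left node i (this representation and the names are our own):

```python
import numpy as np

def median_op(z, neighbors):
    """[M(z)]_i = median of the measurements z_j over j in Gamma(i)."""
    return np.array([np.median(z[list(nbrs)]) for nbrs in neighbors])
```

Taking the median over the d measurements touching each left node is what makes the update robust in the \(\ell _1\) sense.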
Model-based compressed sensing
Besides sparsity (and compressibility), signals do exhibit more complicated structures. When compressed sensing takes into account these more complicated structures (or models) in addition to sparsity, it is usually referred to as model-based compressed sensing or structured sparse recovery [5]. A precise definition is given in the following:
Definition 3
(Structured sparsity model [5]) A structured sparsity model is a collection of sets, \({\mathfrak {M}}=\left\{ {\mathcal {S}}_1,\ldots , {\mathcal {S}}_M\right\} \) with \(|{\mathfrak {M}}| = M\), of allowed structured supports \({\mathcal {S}}_i\subseteq [N]\).
Note that the classical k-sparsity studied in Sect. 2.2 is a special case of a structured sparsity model where all supports of size at most k are allowed. Popular structured sparsity models include tree-sparse, block-sparse, and group-sparse models [5]. In this work we study group-sparse models, which we introduce in Sect. 2.4.
Similar to the classical sparsity case, the RIP is defined for structured sparsity models.
Definition 4
(Model-RIP [5]) A matrix \({\mathbf {A}}\in {\mathbb {R}}^{m\times N}\) satisfies the \(\ell _p\)-norm model restricted isometry property (\(\mathfrak {M}\)-RIP-p) with model restricted isometry constant (\(\mathfrak {M}\)-RIC) \(\delta _{{\mathfrak {M}}} < 1\), if for all vectors \({\mathbf {x}}\) with \({\text {supp}}({\mathbf {x}})\in {\mathfrak {M}}\) it holds
In [5] it was shown that for a matrix \({\mathbf {A}}\in {\mathbb {R}}^{m\times N}\) to have the Model-RIP with high probability the required number of measurements is \(m = {\mathcal {O}}(k)\) for tree-sparse signals and \(m = {\mathcal {O}}\left( kg + \log \left( N/(kg)\right) \right) \) for block-sparse signals with block size g, when the sensing matrices are dense (typically sub-Gaussian). In general, for a given structured sparsity model \({\mathfrak {M}}\) and sub-Gaussian random matrices, the number of required measurements is \(m={\mathcal {O}}\left( \delta _{{\mathfrak {M}}}^{-2}g\log (\delta _{{\mathfrak {M}}}^{-1}) + \log (|{\mathfrak {M}}|)\right) \), where g is the cardinality of the largest support in \({\mathfrak {M}}\).
Furthermore, the authors in [5] show that classical algorithms like the IHT can be modified for structured sparsity models to achieve instance optimality. To this end, the hard thresholding operator \(\mathcal {H}_k\) used in the classical IHT is replaced by a model-projection oracle, which for a given signal \({\mathbf {x}}\in {\mathbb {R}}^N\) returns the closest signal over all signals having support in \({\mathfrak {M}}\). We define the model-projection oracle in the following.
Definition 5
(Model-projection oracle [5]) Given \(p\ge 1\), a model-projection oracle is a function \({\mathcal {P}}_{{\mathfrak {M}}}:{\mathbb {R}}^N \rightarrow {\mathbb {R}}^N\) such that for all \({\mathbf {x}}\in {\mathbb {R}}^N\) we have \({\text {supp}}({\mathcal {P}}_{{\mathfrak {M}}}({\mathbf {x}}))\in {\mathfrak {M}}\) and it holds
From the definition it directly follows that \({\mathcal {P}}_{{\mathfrak {M}}}({\mathbf {x}})_i = x_i\) if \(i\in {\text {supp}}({\mathcal {P}}_{{\mathfrak {M}}}({\mathbf {x}}))\) and 0 otherwise. Note that in the case of classical k-sparsity the model-projection oracle is given by the hard thresholding operator \(\mathcal {H}_k\). In contrast to this case, calculating the optimal model projection \({\mathcal {P}}_{{\mathfrak {M}}} ({\mathbf {x}})\) for a given signal \({\mathbf {x}}\in {\mathbb {R}}^N\) and a given structured sparsity model \({\mathfrak {M}}\) may be computationally hard. Depending on the model \({\mathfrak {M}}\), finding the optimal model projection vector may even be NP-hard; see Sect. 2.4. The modified version of the IHT derived in [5] is presented in Algorithm 1.
Note that a common halting criterion is a maximum number of iterations or a bound on the iteration error \(\Vert {\mathbf {x}}^{(n+1)}-{\mathbf {x}}^{(n)}\Vert _p\).
Model-sparse sensing matrices from expander graphs
In the sparse matrix setting the sparse matrices we consider are called model expander matrices, which are adjacency matrices of model expander graphs, defined as follows.
Definition 6
(Model expander graph) Let \(H=\left( [N],[m],{\mathcal {E}}\right) \) be a left-regular bipartite graph with N left vertices, m right vertices, a set of edges \({\mathcal {E}}\) and left degree d. Given a model \({\mathfrak {M}}\), if for any \(\epsilon _{{\mathfrak {M}}} \in (0,1/2)\) and any \({\mathcal {S}}= \cup _{{\mathcal {S}}_i \in {\mathcal {K}}}{\mathcal {S}}_i\), with \({\mathcal {K}} \subset {\mathfrak {M}}\) and \(|{\mathcal {S}}| \le s\), we have \(|\varGamma ({\mathcal {S}})| \ge (1-\epsilon _{{\mathfrak {M}}})d|{\mathcal {S}}|\), then H is referred to as an \((s,d,\epsilon _{{\mathfrak {M}}})\)-model expander graph.
In this setting the known results are suboptimal. Using model expander matrices for tree-sparse models, the required number of measurements to obtain instance optimality is \(m = {\mathcal {O}}\left( k\log \left( N/k\right) /\log \log \left( N/k\right) \right) \), which was shown in [3, 34].
A key ingredient in the analysis behind the aforementioned sample complexity results for model expanders is the model-RIP-1, which is just RIP-1 for model expander matrices (hence they are also called model-RIP-1 matrices [34]). Consequently, Lemma 1 also holds for these model-RIP-1 matrices [34].
First, in [3] the Model-Expander IHT (MEIHT) was studied for loopless overlapping group models and D-ary tree models. Similar to Algorithm 1, the MEIHT is a modification of the EIHT where the hard threshold operator \(\mathcal {H}_k\) is replaced by the projection oracle \(\mathcal {P}_{{\mathfrak {M}}}\) onto the model \({\mathfrak {M}}\). Thus the update of the MEIHT in each iteration is given by
In [3] the authors show that this algorithm always returns a solution in the model, which is highly desirable for some applications. The running time is given in the proposition below.
Proposition 1
([3, Proposition 3.1]) The runtime of MEIHT is \({\mathcal {O}}\left( kN\bar{n}\right) \) and \({\mathcal {O}}\left( M^2G\bar{n} + N\bar{n}\right) \) for D-ary tree models and loopless overlapping group models, respectively, where k is the sparsity of the tree model, G is the number of active groups (i.e. the group sparsity of the model), \(\bar{n}\) is the number of iterations, M is the number of groups and N is the dimension of the signal.
Group models
The models of interest in this paper are group models. A group model is a collection \(\mathfrak {G} = \{{\mathcal {G}}_1,\ldots ,{\mathcal {G}}_M\}\) of groups of indices, i.e. \({\mathcal {G}}_i\subset [N]\), together with a budget \(G \in [M]\). We denote by \({\mathfrak {G}}_G\) the structured sparsity model (i.e. group-sparse model) which contains all supports contained in the union of at most G groups in \({\mathfrak {G}}\), i.e.
We will always tacitly assume that \(\bigcup _{i=1}^{M} {\mathcal {G}}_i = [N]\). We say that a signal \({\mathbf {x}} \in {\mathbb {R}}^N\) is G-group-sparse if the support of \({\mathbf {x}}\) is contained in \({\mathfrak {G}}_G\). If G is clear from the context, we simply say that \({\mathbf {x}}\) is group-sparse. Let \(g_i = |{\mathcal {G}}_i|\) and denote by \(g_{\text {max}} = \max _{i\in [M]} g_i\) the maximum group size. The intersection graph of a group model is the graph which has a node for each group \({\mathcal {G}}_i\in {\mathfrak {G}}\) and an edge between \({\mathcal {G}}_i\) and \({\mathcal {G}}_j\) if the groups overlap, i.e. if \({\mathcal {G}}_i\cap {\mathcal {G}}_j \ne \emptyset \); see [4]. We call a group model loopless if the intersection graph of the group model has no cycles. We call a group model a block model if all groups have equal size and are pairwise disjoint. In this case the groups are sometimes called blocks. We define the frequency f of a group model as the maximum number of groups an element is contained in, i.e.
In [4], besides the latter group models, more general models are considered where an additional sparsity in the classical sense is required on the signal. More precisely, for a given budget \(G\in [M]\) and a sparsity \(K\in [N]\) they study the structured sparsity model
Note that for \(K=N\) we obtain a standard group model defined as above.
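The notions just introduced are straightforward to compute for small instances; the following sketch (function names ours, the group-sparsity check is deliberate brute force over all subsets of at most G groups, feasible only for small M) illustrates the frequency f and membership in \({\mathfrak {G}}_G\):

```python
from collections import defaultdict
from itertools import combinations

def frequency(groups):
    """Frequency f: the maximum number of groups any index belongs to."""
    count = defaultdict(int)
    for g in groups:
        for i in g:
            count[i] += 1
    return max(count.values())

def is_group_sparse(support, groups, G):
    """Check whether a support lies in the union of at most G of the groups."""
    support = set(support)
    return any(support <= set().union(*sel)
               for r in range(1, G + 1)
               for sel in combinations(groups, r))
```

For realistic group numbers this membership test is exactly the (NP-hard) covering question discussed below, so the brute force is for illustration only.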
Both variants of group models defined above are clearly special cases of a structured sparsity model as defined in Sect. 2.3. Therefore all results for structured sparsity models can be used for group-sparse models. To adapt Algorithm 1, a model projection oracle \({\mathcal {P}}_{{\mathfrak {G}}_G}\) (or \({\mathcal {P}}_{{\mathfrak {G}}_{G,K}}\)) has to be provided. Note that for several applications we are not only interested in the optimal support of the latter projection but also want to find at most G groups covering this support. The main work of this paper is to analyse the complexity of the latter problem for group models and to provide efficient algorithms to solve it exactly or approximately. Given a signal \({\mathbf {x}}\in {\mathbb {R}}^N\), the group model projection problem, sometimes called the signal approximation problem, is then to find a support \({\mathcal {S}}\in {\mathfrak {G}}_{G,K}\) together with G groups covering this support such that \(\Vert {\mathbf {x}}-{\mathbf {x}}_{{\mathcal {S}}}\Vert _p\) is minimal, i.e. we want to solve the problem
If the parameter K is not mentioned we assume \(K=N\).
Baldassarre et al. [4] observed the following close relation to the NP-hard Maximum \(G\)-Coverage problem. Given a signal \({\mathbf {x}} \in {\mathbb {R}}^N\), a group-sparse vector \(\hat{{\mathbf {x}}}\) for which \(\Vert {\mathbf {x}}-\hat{{\mathbf {x}}}\Vert _2^2\) is minimal satisfies \(\hat{x}_i \in \{0,x_i\}\) for all \(i \in [N]\). For a vector with the latter property,
holds, and so minimizing \(\Vert {\mathbf {x}}-\hat{{\mathbf {x}}}\Vert _2^2\) is equivalent to maximizing \(\sum _{i=1}^N \hat{x}_i^2\). Consequently, the group model projection problem with \(K=N\) is equivalent to the problem of finding an index set \({\mathcal {I}}\subset [M]\) of at most G groups, i.e. \(|{\mathcal {I}}| \le G\), maximizing \(\sum _{i\in \bigcup _{j\in {\mathcal {I}}} {\mathcal {G}}_j} x_i^2\). This problem is called Maximum \(G\)-Coverage in the literature [30]. Despite the prominence of the latter problem, we stick to the group model notation, since it is closer to the applications we have in mind and we will leave the regime of Maximum \(G\)-Coverage by introducing more constraints later.
We simplify the notation by defining \(w_i = x_i^2\) for all \(i \in [N]\). Using this notation, the group model projection problem is equivalent to finding an optimal solution of the following integer program:
Here, the variable \(u_i\) equals one if and only if the ith index is contained in the support of the signal approximation, and \(v_i\) equals one if and only if group \({\mathcal {G}}_i\) is chosen.
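For intuition, the classical greedy heuristic for maximum coverage (G times, pick the group covering the largest uncovered weight) gives the well-known \((1-1/e)\)-approximation to this integer program; this is the classical greedy method referred to for the head-approximation later. A minimal sketch under our own naming, with weights \(w_i = x_i^2\):

```python
def greedy_g_coverage(w, groups, G):
    """Greedy (1 - 1/e)-approximation for Maximum G-Coverage:
    G times, pick the group covering the largest uncovered weight."""
    covered, chosen = set(), []
    for _ in range(G):
        best = max(range(len(groups)),
                   key=lambda j: sum(w[i] for i in groups[j] - covered))
        chosen.append(best)
        covered |= groups[best]
    return chosen, covered
```

The returned index set corresponds to setting \(v_j = 1\) for the chosen groups and \(u_i = 1\) for the covered indices.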
Besides the NP-hardness for the general case, the authors in [4] show that the group model projection problem can be solved in polynomial time by dynamic programming for the special case of loopless groups. Furthermore, the authors show that if the intersection graph is bipartite, the projection problem can be solved in polynomial time by relaxing problem (12). Similar results are obtained for the more general problem, where in addition to the group-sparsity the classical K-sparsity is assumed, i.e. the additional constraint
is added to problem (12).
As stated in Sect. 2.3, the authors of [5] first study a special case of group models, namely block models, where the groups are non-overlapping and all of equal size. The sample complexity they derived in that work for sub-Gaussian measurement matrices is \(m = {\mathcal {O}}\left( Gg + \log \left( N/(Gg)\right) \right) \), where g is the fixed block size. The authors of [3] then studied group models in the sparse matrix setting; besides other results, they proposed the MEIHT algorithm for tree and group models. The result most relevant to this work shows that for loopless overlapping group-sparse models with maximum group size \(g_{\text {max}}\), using model expander measurement matrices, the number of measurements required for successful recovery is \(m = {\mathcal {O}}\left( Gg_{\text {max}}\log \left( N/(Gg_{\text {max}})\right) /\log \left( Gg_{\text {max}}\right) \right) \); see [3]. This result holds for general groups; the “looplessness” condition is only necessary for the polynomial time reconstruction using the MEIHT algorithm. Therefore, this sample complexity result also holds for the general group models we consider in this manuscript.
Group lasso
The classical Lasso approach for k-sparse signals seeks to minimize a quadratic error penalized by the \(\ell _1\)-norm [22]. More precisely, for a given \(\lambda >0\) we want to find an optimal solution of the problem
It is well known that the latter approach, for appropriate choices of \(\lambda \), leads to sparse solutions.
The Lasso approach was extended to group models in [55] and afterwards studied in several works for non-overlapping groups; see [33, 40, 45, 48]. The idea is again to minimize a loss function, e.g. the quadratic loss, and to penalize the objective value for each group by a norm of the weights of the recovered vector restricted to the items in the group. An extension which can also handle overlapping groups was studied in [37, 56]. In [49] the authors study what they call the latent group Lasso. To this end they consider a loss function \(L:{\mathbb {R}}^N\rightarrow {\mathbb {R}}\) and propose to solve the \(\ell _1 /\ell _2\)-penalized problem
for given weights \(\lambda >0\) and \(d_{{\mathcal {G}}}\ge 0\) for each \({\mathcal {G}}\in {\mathfrak {G}}\). The idea is that for suitable choices of these weights a solution of Problem (14) will be sparse and its support is likely to be a union of groups. Nevertheless, it is not guaranteed that the number of selected groups is optimal, as is the case for the iterative methods in the previous sections. Note that equivalently we can replace each norm \(\Vert {\mathbf {w}}^{{\mathcal {G}}}\Vert \) in the objective function by a variable \(z^{{\mathcal {G}}}\) and add the quadratic constraint \(\Vert {\mathbf {w}}^{{\mathcal {G}}}\Vert _2^2\le (z^{{\mathcal {G}}})^2\). Hence Problem (14) can be modelled as a quadratic problem and solved by standard solvers like CPLEX.
The \(\ell _0\) counterpart of Problem (14) was considered in [32] under the name block coding and can be formulated as
Note that in contrast to Problem (14) an easy reformulation of Problem (15) into a continuous quadratic problem is not possible. Nevertheless we can reformulate it using the mixedinteger programming formulation
where \(M_i\in {\mathbb {R}}\) can be chosen greater than or equal to the entry \(x_i\) of the true signal for each \(i\in [N]\). The variables \({\mathbf {v}}^{{\mathcal {G}}}\in \left\{ 0,1\right\} \) have value 1 if and only if group \({\mathcal {G}}\) is selected for the support of \({\mathbf {x}}\). As for the \(\ell _1/\ell _2\) variant, it is not guaranteed that the number of selected groups is optimal. Note that the latter problem is a mixed-integer problem and therefore hard to solve in high dimensions in general. Furthermore, the efficiency of classical methods such as the branch-and-bound algorithm depends on the quality of the computed lower bounds, which in turn depends on the values \(M_i\). Hence, in practical applications where the true signal is not known, good estimates of the \(M_i\) values are crucial for the success of the latter method. Another drawback is that the best values for \(\lambda \) and the weights \(d_{{\mathcal {G}}}\) are not known in advance and have to be chosen appropriately for each application.
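On tiny instances the \(\ell _0\) problem can be solved by exhaustive enumeration over group selections. The sketch below (ours, using the simple toy loss \(L({\mathbf {x}})=\Vert {\mathbf {x}}-{\mathbf {b}}\Vert _2^2\), for which the inner minimization has a closed form) illustrates the trade-off between coverage and the group penalty:

```python
from itertools import chain, combinations

def block_coding_bruteforce(b, groups, lam, d):
    """Exhaustive reference solver for the l0 group-penalized problem with
    the toy loss L(x) = ||x - b||_2^2: selecting a set of groups costs
    lam * sum of their weights d, and the best x supported on their union
    pays b_i^2 for every uncovered index i."""
    n = len(b)
    best_cost, best_sel = float("inf"), ()
    all_subsets = chain.from_iterable(
        combinations(range(len(groups)), r) for r in range(len(groups) + 1))
    for sel in all_subsets:
        covered = set().union(*(groups[j] for j in sel)) if sel else set()
        cost = (sum(b[i] ** 2 for i in range(n) if i not in covered)
                + lam * sum(d[j] for j in sel))
        if cost < best_cost:
            best_cost, best_sel = cost, sel
    return best_cost, best_sel

groups = [{0, 1}, {1, 2}, {3}]
b = [2.0, 0.1, 2.0, 0.1]
cost, sel = block_coding_bruteforce(b, groups, lam=1.0, d=[1.0, 1.0, 1.0])
# selecting the two overlapping groups covers the large entries at 0 and 2
```

The exponential enumeration is of course only feasible for very small M; the mixed-integer formulation above is the practical route.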
We study Problems (14) and (15) computationally in Sect. 5.
Approximation algorithms for model-based compressed sensing
As mentioned in the last section, solving the projection problem given in Definition 5 may be computationally hard. To overcome this problem, the authors of [27, 28] present algorithms based on the idea of IHT (and CoSaMP) which, instead of solving the projection problems exactly, use two approximation procedures called head- and tail-approximation. In this section we briefly describe the concept and the results of [27, 28]. Note that we only present results related to IHT, although similar results for CoSaMP were derived in [27, 28] as well.
Given two structured sparsity models \({\mathfrak {M}}, {\mathfrak {M}}_H\) and a vector \({\mathbf {x}}\), let \({\mathcal {H}}\) be an algorithm that computes a vector \({\mathcal {H}}({\mathbf {x}})\) with support in \({\mathfrak {M}}_H\). Then, given some \(\alpha \in {\mathbb {R}}\) (typically \(\alpha < 1\)) we say that \({\mathcal {H}}\) is an \((\alpha ,{\mathfrak {M}}, {\mathfrak {M}}_H,p)\)-head-approximation if
Note that the support of the vector calculated by \({\mathcal {H}}\) is contained in \({\mathfrak {M}}_H\) while the approximation guarantee must be fulfilled for all supports in \({\mathfrak {M}}\).
Moreover, given two structured sparsity models \({\mathfrak {M}}, {\mathfrak {M}}_T\), let \({\mathcal {T}}\) be an algorithm which computes a vector \({\mathcal {T}}({\mathbf {x}})\) with support in \({\mathfrak {M}}_T\). Given some \(\beta \in {\mathbb {R}}\) (typically \(\beta > 1\)) we say that \({\mathcal {T}}\) is a \((\beta ,{\mathfrak {M}}, {\mathfrak {M}}_T,p)\)-tail-approximation if
Note that in general a head approximation does not need to be a tail approximation and vice versa.
The cases studied in [27] are \(p=1\) and \(p=2\). For the case \(p=2\) the authors propose an algorithm called Approximate Model-IHT (AM-IHT), shown in Algorithm 2.
Assume that \({\mathcal {T}}\) is a \((\beta ,{\mathfrak {M}}, {\mathfrak {M}}_T,2)\)-tail-approximation and \({\mathcal {H}}\) is an \((\alpha ,{\mathfrak {M}}_T\oplus {\mathfrak {M}}, {\mathfrak {M}}_H,2)\)-head-approximation, where \({\mathfrak {M}}_T\oplus {\mathfrak {M}}\) is the Minkowski sum of \({\mathfrak {M}}_T\) and \({\mathfrak {M}}\). Furthermore, we assume that the condition
holds. The authors in [27] prove that for a signal \({\mathbf {x}}\in {\mathbb {R}}^N\) with \({\text {supp}}({\mathbf {x}})\in {\mathfrak {M}}\), noisy measurements \({\mathbf {y}}={\mathbf {A}}{\mathbf {x}}+{\mathbf {e}}\) where \({\mathbf {A}}\) has \({\mathfrak {M}} \oplus {\mathfrak {M}}_T\oplus {\mathfrak {M}}_H\)model RIP with RIC \(\delta \),
Algorithm 2 calculates a signal estimate \(\hat{{\mathbf {x}}}\) satisfying
where \(\tau \) depends on \(\delta ,~\alpha \) and \(\beta \). Note that condition (19) holds, e.g., for approximation accuracies \(\alpha >0.9\) and \(\beta <1.1\).
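The structure of the AM-IHT iteration can be sketched with abstract oracles. The following is an illustration of ours, not the authors' implementation; `keep_largest` is a hypothetical stand-in oracle for the trivial 1-sparse model, used only so the sketch runs end to end:

```python
def am_iht(A, y, head, tail, iters=20):
    """Sketch of the AM-IHT iteration: `head` and `tail` are projection
    oracles returning vectors whose supports lie in the head and tail
    models, respectively.  A is a dense matrix given as a list of rows."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        r = [y[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]  # A^T r
        h = head(g)                                  # head-approximation
        x = tail([x[j] + h[j] for j in range(n)])    # tail-approximation
    return x

def keep_largest(v):
    # stand-in oracle for the trivial 1-sparse model
    k = max(range(len(v)), key=lambda j: abs(v[j]))
    return [v[j] if j == k else 0.0 for j in range(len(v))]

A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [0.0, 5.0, 0.0]
x_hat = am_iht(A, y, keep_largest, keep_largest, iters=5)
```

In the group-sparse setting of this paper, `head` and `tail` are replaced by the greedy and LP-rounding procedures of Sect. 4.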
For the case \(p=1\) the authors replace Step 3 in Algorithm 2 by the update
where \({\mathcal {M}}\) is the median operator defined as in Sect. 2.2. Under the same assumptions as above, but considering \(p=1\) for the head- and tail-approximations and \({\mathbf {A}}\) having the \({\mathfrak {M}}\)-RIP1, the authors in [27] show convergence of the adapted algorithm.
Comparison to related works
In Tables 1 and 2 below we illustrate where our results stand vis-à-vis other results. In Table 1 we show the studied models together with the derived sample complexity and the studied class of measurement matrices. In Table 2 we present the names of the studied algorithms, the class of the model projection, the class of the algorithm used to solve the model projection, the runtime complexity of the projection problem and the class of instance optimality.
Algorithms with exact projection oracles
In this section we study the exact group model projection problem, which has to be solved iteratively in the Model-IHT and the MEIHT. We extend existing results for group-sparse models and pass from loopless overlapping group models (the most general class prior to this work) to overlapping group models of bounded treewidth and further to general group models without any restriction on the structure. The graph representing a loopless overlapping group model has a treewidth of 1.
We start by showing that it is possible to perform exact projections onto overlapping groups with bounded treewidth using dynamic programming; see Sect. 3.1. While this procedure has a polynomial runtime bound, it is restricted to the class of group models with bounded treewidth. Nevertheless, we prove that the exact projection problem is NP-hard if the incidence graph is a grid, which is the most basic graph structure without bounded treewidth. For the sake of completeness we solve the exact projection problem for all instances of group models by a method based on Benders’ decomposition in Sect. 3.6. Since it solves an NP-hard problem, this method does not come with a polynomial runtime guarantee, but it works well in practice, as shown in [17]. In Sect. 3.5 we present an appropriately modified algorithm (MEIHT) with exact projection oracles for the recovery of signals from structured sparsity models. We derive corollaries on the runtime and convergence of this modified algorithm for the general group-model case from existing works.
Recall the following notation: \({\mathcal {G}}_i\) denotes a group of size \(g_i\), \(i\in [M]\), and a group model is a collection \(\mathfrak {G} = \{{\mathcal {G}}_1,\ldots ,{\mathcal {G}}_M\}\). The group-sparse model of order G is denoted by \({\mathfrak {G}}_G\); it contains all supports \({\mathcal {S}}\) which are contained in the union of at most G groups of \({\mathfrak {G}}\), i.e. \({\mathcal {S}}\subseteq \bigcup _{j\in \mathcal {I}} {\mathcal {G}}_j\) for some \(\mathcal {I}\subseteq [M]\) with \(|\mathcal {I}|\le G\); see (10). We will interchangeably say that \({\mathbf {x}}\) or \({\mathcal {S}}\) is \({\mathfrak {G}}_G\)-sparse. Clearly, group models are a special case of structured sparsity models. Let \(s_{\text {max}}\) be the size of the largest support obtainable by selecting G groups out of \({\mathfrak {G}}\). For \(g_{\text {max}}\) denoting the maximal size of a single group in \({\mathfrak {G}}\), i.e.,
we have \(s_{\text {max}}\in {\mathcal {O}}\left( G g_{\text {max}}\right) \). Furthermore the number of possible supports is in \({\mathcal {O}}(M^G)\). Therefore applying the result from Sect. 2.3 we obtain
as the number of required measurements for a sub-Gaussian matrix to satisfy the group-model RIP with RIC \(\delta \) with high probability, which ensures the convergence of Algorithm 1 for small enough \(\delta \).
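Whether a given support is \({\mathfrak {G}}_G\)-sparse can be checked directly from the definition by enumerating all selections of at most G groups. The following sketch (ours, purely for illustration; exponential in G) makes the definition concrete:

```python
from itertools import combinations

def is_group_sparse(support, groups, G):
    """Brute-force test whether `support` is contained in the union of at
    most G of the given groups, i.e. whether it is G_G-sparse."""
    support = set(support)
    for r in range(min(G, len(groups)) + 1):
        for sel in combinations(range(len(groups)), r):
            union = set().union(*(groups[j] for j in sel)) if sel else set()
            if support <= union:
                return True
    return False

groups = [{1, 2}, {2, 3}, {4}]
ok = is_group_sparse({1, 2, 4}, groups, 2)    # {1,2} and {4} suffice
bad = is_group_sparse({1, 3, 4}, groups, 2)   # needs all three groups
```

The \({\mathcal {O}}(M^G)\) count of possible supports used above corresponds exactly to this enumeration.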
Group models of low treewidth
One approach to overcome the hardness of the group model projection problem is to restrict the structure of the group models considered. To this end we follow Baldassarre et al. [4] and consider two graphs associated to a group model \({\mathfrak {G}}\).
The intersection graph of \({\mathfrak {G}}\), \(I({\mathfrak {G}})\), is given by the vertex set \(V(I({\mathfrak {G}})) = {\mathfrak {G}}\), and the edge set
The incidence graph of \({\mathfrak {G}}\), \(B({\mathfrak {G}})\), is given by the vertex set \(V(B({\mathfrak {G}})) = [N] \cup {\mathfrak {G}}\), and the edge set
Note that the incidence graph is bipartite since every edge joins an element-vertex and a group-vertex. See Fig. 1 for a simple illustration of the two constructions. Baldassarre et al. [4] prove that there is a polynomial time algorithm to solve the group model projection problem in the case that the intersection graph is acyclic. Their algorithm uses dynamic programming on the acyclic structure of the intersection graph.
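Both auxiliary graphs are easily built from a group model; the following small sketch (ours) constructs their edge sets, with groups indexed by position:

```python
from itertools import combinations

def intersection_graph(groups):
    """Edge between two groups whenever they share an element."""
    return {(a, b) for a, b in combinations(range(len(groups)), 2)
            if groups[a] & groups[b]}

def incidence_graph(groups):
    """Bipartite edge (element i, group g) whenever i belongs to group g."""
    return {(i, g) for g, grp in enumerate(groups) for i in grp}

groups = [{0, 1}, {1, 2}, {3}]
I_edges = intersection_graph(groups)   # only groups 0 and 1 overlap
B_edges = incidence_graph(groups)
```

For the example groups, \(I({\mathfrak {G}})\) has a single edge while \(B({\mathfrak {G}})\) has one edge per (element, group) incidence.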
We generalize this approach and show that the same problem can be solved in polynomial time if the treewidth of the incidence graph is bounded. Following Proposition 2 below, this implies that the group model projection problem can be solved in polynomial time if the treewidth of the intersection graph is bounded. We proceed by formally introducing the relevant concepts.
Tree decomposition
Let \(\bar{G}=(V,E)\) be a graph. A tree decomposition of \(\bar{G}\) is a tree T where each node \(x \in V(T)\) of T has a bag \(B_x \subseteq V\) of vertices of \(\bar{G}\) such that the following properties hold:

1.
\(\bigcup _{x \in V(T)} B_x = V\).

2.
If \(B_x\) and \(B_y\) both contain a vertex \(v \in V\), then the bags of all nodes of T on the path between x and y contain v as well. Equivalently, the tree nodes containing the vertex v form a connected subtree of T.

3.
For every edge vw in E there is some bag that contains both v and w. That is, vertices in V can be adjacent only if the corresponding subtrees in T have a node in common.
The width of a tree decomposition is the size of its largest bag minus one, i.e. \(\max _{x \in V(T)} |B_x|-1\). The treewidth of \(\bar{G}\), \(\text{ tw }(\bar{G})\), is the minimum width among all possible tree decompositions of \(\bar{G}\).
Intuitively, the treewidth measures how ‘tree-like’ a graph is: the smaller the treewidth, the more tree-like the graph. The graphs of treewidth at most one are exactly the acyclic graphs. Figure 2 shows a graph of treewidth 2, together with a tree decomposition.
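The three defining properties can be verified mechanically for a candidate decomposition. The following sketch (ours, for small examples only; computing the treewidth itself is much harder) checks them one by one:

```python
def is_tree_decomposition(V, E, bags, tree_edges):
    """Verify the three tree-decomposition properties for the graph (V, E).
    `bags` maps each tree node to its bag; `tree_edges` are the edges of
    the decomposition tree."""
    # property 1: every graph vertex appears in some bag
    if set().union(*bags.values()) != set(V):
        return False
    # property 3: every graph edge lies inside some bag
    if not all(any({u, w} <= B for B in bags.values()) for u, w in E):
        return False
    # property 2: the tree nodes containing each vertex form a subtree
    for v in V:
        nodes = {x for x, B in bags.items() if v in B}
        seen, stack = set(), [next(iter(nodes))]
        while stack:                       # search restricted to `nodes`
            x = stack.pop()
            if x in seen:
                continue
            seen.add(x)
            stack += [b for a, b in tree_edges if a == x and b in nodes]
            stack += [a for a, b in tree_edges if b == x and a in nodes]
        if seen != nodes:
            return False
    return True

def width(bags):
    return max(len(B) for B in bags.values()) - 1

# the path 1-2-3 decomposed into bags {1,2} and {2,3} has width 1
bags = {0: {1, 2}, 1: {2, 3}}
ok = is_tree_decomposition([1, 2, 3], [(1, 2), (2, 3)], bags, [(0, 1)])
```

A path thus has treewidth 1, matching the statement that acyclic graphs are exactly the graphs of treewidth at most one.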
Before stating any algorithms, we discuss the relation of the treewidth of the intersection and the incidence graphs of a given group model.
When bounding the treewidth of the graphs associated to a group model, it makes sense to consider the incidence graph rather than the intersection graph. This is due to the following simple observation.
Proposition 2
For any group model \({\mathfrak {G}}\) it holds that \(\text{ tw }(B({\mathfrak {G}})) \le \text{ tw }(I({\mathfrak {G}}))+1\). However, for every t there exists a group model \({\mathfrak {G}}\) such that \(\text{ tw }(I({\mathfrak {G}}))-\text{ tw }(B({\mathfrak {G}})) = t\).
This statement is not necessarily new, but we quickly prove it in our language in order to be selfcontained.
Proof of Proposition 2
To see the first assertion, let \({\mathfrak {G}}\) be a group model. Let T be a tree decomposition of \(I({\mathfrak {G}})\) of width \(\text{ tw }(I({\mathfrak {G}}))\). In the following, we attach leaves to T, one for each element in [N], and obtain a tree decomposition of \(B({\mathfrak {G}})\). Each leaf will contain at most \(\text{ tw }(I({\mathfrak {G}}))+1\) many elements of \({\mathfrak {G}}\) and at most one element of [N]. Hence we get \(\text{ tw }(B({\mathfrak {G}})) \le \text{ tw }(I({\mathfrak {G}}))+1\).
To construct the tree decomposition of \(B({\mathfrak {G}})\) pick any \(i \in [N]\), and let \({\mathfrak {G}}_i\) be the set of groups in \({\mathfrak {G}}\) containing i. Since all groups in \({\mathfrak {G}}_i\) contain i, the set \({\mathfrak {G}}_i\) is a clique in \(I({\mathfrak {G}})\).^{Footnote 1} Moreover, since T is a tree decomposition of \(I({\mathfrak {G}})\), the subtrees of the groups in \({\mathfrak {G}}_i\) mutually share a node. As subtrees of a tree have the Helly property, there is at least one node x of T such that \({\mathfrak {G}}_i \subseteq B_x\). We now add a new node \(x_i\) with bag \(B_{x_i} = {\mathfrak {G}}_i \cup \{i\}\) and an edge between \(x_i\) and x in T. Doing this for all \(i\in [N]\) simultaneously, it is easy to see that we arrive at a tree decomposition \(T'\) of \(B({\mathfrak {G}})\) of width at most \(\text{ tw }(I({\mathfrak {G}}))+1\) which proves the first assertion.
To prove the second assertion consider, for any t, the group model \({\mathfrak {G}}\) where
Note that \(B({\mathfrak {G}})\) is a tree, hence \(\text{ tw }(B({\mathfrak {G}}))=1\). In \(I({\mathfrak {G}})\), however, the set \({\mathfrak {G}}\) is a clique of size \(t+2\). Thus, \(\text{ tw }(I({\mathfrak {G}}))=t+1\), which implies \(\text{ tw }(I({\mathfrak {G}}))-\text{ tw }(B({\mathfrak {G}})) = t\). \(\square \)
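The displayed construction is omitted in this excerpt; one standard family with this behaviour (an illustration of ours, not necessarily the authors' exact construction) consists of \(t+2\) groups all sharing a single common element, so that \(I({\mathfrak {G}})\) is complete while \(B({\mathfrak {G}})\) is a tree. A quick sanity check:

```python
def sunflower_model(t):
    """Groups G_j = {0, j} for j = 1, ..., t+2: all pairs meet in element 0."""
    return [{0, j} for j in range(1, t + 3)]

def gap_witness(t):
    groups = sunflower_model(t)
    M = len(groups)
    # intersection graph: every pair of groups shares element 0, so it is
    # the complete graph K_{t+2}, which has treewidth t+1
    complete = all(groups[a] & groups[b]
                   for a in range(M) for b in range(a + 1, M))
    # incidence graph: groups arranged around element 0, one pendant
    # element per group; a connected graph with |E| = |V| - 1 is a tree,
    # hence treewidth 1
    n_edges = sum(len(g) for g in groups)
    n_nodes = M + len(set().union(*groups))
    return complete and n_edges == n_nodes - 1

witness_ok = gap_witness(3)
```

Here the gap \(\text{ tw }(I({\mathfrak {G}}))-\text{ tw }(B({\mathfrak {G}}))\) equals t, as claimed.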
Consider now a tree decomposition T of a graph \(\bar{G}\). We say that T is a nice tree decomposition if every node x is of one of the following types.

Leaf: x has no children and \(B_x=\emptyset \).

Introduce: x has one child, say y, and there is a vertex \(v \notin B_y\) of \(\bar{G}\) with \(B_x = B_y \cup \{v\}\).

Forget: x has one child, say y, and there is a vertex \(v \notin B_x\) of \(\bar{G}\) with \(B_y = B_x \cup \{v\}\).

Join: x has two children y and z such that \(B_x=B_y=B_z\).
This kind of decomposition limits the structure of the difference of two adjacent nodes in the decomposition. A folklore statement (explained in detail in the classic survey by Kloks [39]) says that such a nice decomposition is easily computed given any tree decomposition of \(\bar{G}\) without increasing the width.
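The four node types are a purely local condition on bags, so a given rooted decomposition can be classified node by node. A small sketch of ours:

```python
def node_type(x, children, bags):
    """Classify a node of a rooted nice tree decomposition."""
    ch = children[x]
    if not ch:
        return "leaf" if not bags[x] else "invalid"
    if len(ch) == 2:
        y, z = ch
        return "join" if bags[x] == bags[y] == bags[z] else "invalid"
    (y,) = ch
    if len(bags[x]) == len(bags[y]) + 1 and bags[y] < bags[x]:
        return "introduce"
    if len(bags[y]) == len(bags[x]) + 1 and bags[x] < bags[y]:
        return "forget"
    return "invalid"

# a path-shaped nice decomposition: introduce a, introduce b, forget b
bags = {0: set(), 1: {"a"}, 2: {"a", "b"}, 3: {"a"}}
children = {0: [], 1: [0], 2: [1], 3: [2]}
types = [node_type(x, children, bags) for x in (0, 1, 2, 3)]
```

The dynamic program in the next section handles exactly these four cases.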
Dynamic programming
In this section we derive a polynomial time algorithm for the group model projection problem for fixed treewidth. As in Sect. 2.4 we assume we have a given signal \(x\in {\mathbb {R}}^N\) and define \(w\in {\mathbb {R}}^N\) with \(w_i=x_i^2\). In the following we use the notation
for a subset \({\mathcal {S}}\subseteq [N]\). The algorithm is presented using a nice tree decomposition of the incidence graph \(B({\mathfrak {G}})\) and uses the following concept. Fix a node x of the decomposition tree of \(B({\mathfrak {G}})\), a number i with \(0 \le i \le G\) and a map \(c:B_x\rightarrow \{0,1,1_?\}\). We say that c is a colouring of \(B_x\). We consider solutions to the problem group model projection(x, i, c) which is defined as follows.
A set \({\mathcal {S}}\subseteq {\mathfrak {G}}\) is a feasible solution of group model projection(x, i, c) if

(a)
\({\mathcal {S}}\) contains only groups that appear in some bag of a node in the subtree rooted at node x,

(b)
\(|{\mathcal {S}}| = i\),

(c)
\({\mathcal {S}}\cap B_x\) contains exactly those group-vertices of \(B_x\) that are in \(c^{-1}(1)\), that is,
$$\begin{aligned} {\mathcal {S}}\cap c^{-1}(1) = {\mathfrak {G}}\cap c^{-1}(1), \text { and} \end{aligned}$$ 
(d)
of the elements in \(B_x\), \({\mathcal {S}}\) covers exactly those that are in \(c^{-1}(1)\). Formally,
$$\begin{aligned}\left( \bigcup {\mathcal {S}}\right) \cap B_x = [N] \cap c^{-1}(1).\end{aligned}$$
The objective value of the set \({\mathcal {S}}\) is given by \(w(\bigcup {\mathcal {S}}) + w(c^{-1}(1_?))\). Intuitively, a feasible solution to group model projection(x, i, c) does not cover elements labelled 0 or \(1_?\), but covers all elements labelled 1. The elements labelled \(1_?\) are assumed to be covered by groups not yet visited in the tree decomposition.
If group model projection(x, i, c) does not admit a feasible solution, we say that group model projection(x, i, c) is infeasible. The maximum objective value attained by a feasible solution to group model projection(x, i, c), if feasible, is denoted by \(\text{ OPT }(x,i,c)\). If group model projection(x, i, c) is infeasible, we set \(\text{ OPT }(x,i,c) = -\infty \).
Assertion (d) implies that group model projection(x, i, c) is infeasible if the groups in \(c^{-1}(1)\) cover elements in \(c^{-1}(0)\) or \(c^{-1}(1_?)\). That is, group model projection(x, i, c) is infeasible if \(\bigcup \left( {\mathfrak {G}} \cap c^{-1}(1)\right) \not \subseteq [N] \cap c^{-1}(1)\). To deal with this exceptional case we call c consistent if \(\bigcup \left( {\mathfrak {G}} \cap c^{-1}(1)\right) \subseteq [N] \cap c^{-1}(1)\), and inconsistent otherwise. Note that consistency of c is necessary to ensure feasibility of group model projection(x, i, c), but not sufficient.
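The consistency condition is a simple containment test on a single bag colouring. The sketch below (ours; bag vertices are tagged tuples, an encoding we introduce only for illustration) checks it literally:

```python
def is_consistent(colouring, groups):
    """Literal check of the consistency condition: every bag element that
    is covered by a bag group coloured 1 must itself be coloured 1.  Bag
    vertices are encoded as ('e', i) for elements, ('g', j) for groups."""
    covered = set()
    for (kind, idx), col in colouring.items():
        if kind == "g" and col == 1:
            covered |= groups[idx]
    for (kind, idx), col in colouring.items():
        if kind == "e" and idx in covered and col != 1:
            return False
    return True

groups = [{0, 1}, {2}]
good = {("g", 0): 1, ("e", 0): 1, ("e", 1): 1}
bad = {("g", 0): 1, ("e", 0): 1, ("e", 1): 0}   # element 1 covered but 0
```

Inconsistent colourings are exactly those discarded with value \(-\infty \) by the dynamic program below.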
Our algorithm processes the nodes of a nice tree decomposition in a bottomup fashion. Fix a node x, a number i with \(0 \le i \le G\) and a map \(c:B_x\rightarrow \{0,1,1_?\}\). We use dynamic programming to compute the value \(\text{ OPT }(x,i,c)\), assuming we know all possible values \(\text{ OPT }(y,j,c')\) for all children y of x, all j with \(0 \le j \le G\), and all \(c': B_y\rightarrow \{0,1,1_?\}\).
In the following, for a subset \(S\subset B_x\) the function \(c_S:S\rightarrow \{0,1,1_?\}\) is the restriction of c to S. We use \(\varGamma (v)\) to denote the neighborhood of v in \(B({\mathfrak {G}})\).
If c is not consistent, we may set \(\text{ OPT }(x,i,c) = -\infty \) right away. We thus assume that c is consistent and distinguish the type of node x as follows.

Leaf: set \(\text{ OPT }(x,0,c) = 0\) and \(\text{ OPT }(x,i,c) = -\infty \) for all \(i \in [G]\).

Introduce: let y be the unique child of x and let \(v \notin B_y\) such that \(B_x = B_y \cup \{v\}\).
If \(v \in [N]\), we set
$$\begin{aligned} \text{ OPT }(x,i,c) = {\left\{ \begin{array}{ll} \text{ OPT }(y,i,c_{B_y}), &{}\quad \text { if } c(v)=0\\ \text{ OPT }(y,i,c_{B_y}) + w(v), &{}\quad \text { if } c(v)=1 \text { and } c^{-1}(1) \cap \varGamma (v) \ne \emptyset \\ \text{ OPT }(y,i,c_{B_y}) + w(v),&{}\quad \text { if } c(v)=1_?\\ -\infty , &{}\quad \text { otherwise} \end{array}\right. } \end{aligned}$$If \(v \in {\mathfrak {G}}\), we set
$$\begin{aligned} \text{ OPT }(x,i,c) = {\left\{ \begin{array}{ll} \text{ OPT }(y,i,c_{B_y}), &{}\quad \text { if } c(v)=0\\ \max \{ \text{ OPT }(y,i-1,c') : (y,c') \text { is compatible to } (x,c) \}, &{}\quad \text { if } c(v)=1 \text { and } c^{-1}(1) \cap \varGamma (v) \ne \emptyset \\ -\infty , &{}\quad \text { otherwise} \end{array}\right. } \end{aligned}$$(21)where \((y,c')\) is compatible to (x, c) if

\(c':B_y \rightarrow \{0,1,1_?\}\) is a consistent colouring of \(B_y\),

\(c^{-1}(0) = c'^{-1}(0)\), and

\(c^{-1}(1) = c'^{-1}(1) \cup (c'^{-1}(1_?) \cap \varGamma (v))\).


Forget: let y be the unique child of x and let \(v \notin B_x\) such that \(B_y = B_x \cup \{v\}\). We set
$$\begin{aligned} \text{ OPT }(x,i,c) = \max \{\text{ OPT }(y,i,c') : ~ c':B_y \rightarrow \{0,1,1_?\} \text{, } c = c'_{B_x} \text {, and } c'(v) \ne 1_?\} \end{aligned}$$(22) 
Join: we set
$$\begin{aligned} \text{ OPT }(x,i,c)= & {} \max \{\text{ OPT }(y,i_1,c') + \text{ OPT }(z,i_2,c'') \nonumber \\&- w(((B_x \cap [N]) \cup \bigcup (B_x \cap {\mathfrak {G}})) {\setminus } c^{-1}(0)) : \nonumber \\&i_1 + i_2 - |c'^{-1}(1) \cap c''^{-1}(1) \cap {\mathfrak {G}}| = i\}, \end{aligned}$$(23)where y and z are the two children of x. The maximum is taken over all consistent colourings \(c',c'':B_x \rightarrow \{0,1,1_?\}\) with \(c^{-1}(0)=c'^{-1}(0)=c''^{-1}(0)\) and \(c^{-1}(1) = c'^{-1}(1) \cup c''^{-1}(1)\).

Root: first we compute \(\text{ OPT }(x,G,c)\) for all relevant choices of c, depending on the type of node x. The algorithm then terminates with the output
$$\begin{aligned} \text{ OPT } = \max \left\{ \text{ OPT }(x,G,c) : ~ c : B_x \rightarrow \{0,1,1_?\}\right\} . \end{aligned}$$
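On small instances the recursion can be sanity-checked against a brute-force solver of the underlying maximum weight coverage problem. The following reference implementation (ours, exponential in G) computes the same optimum directly:

```python
from itertools import combinations

def group_projection_bruteforce(w, groups, G):
    """Reference solver: choose at most G groups maximising the total
    weight of the covered elements (maximum weight coverage)."""
    best_val, best_sel = 0.0, ()
    for r in range(min(G, len(groups)) + 1):
        for sel in combinations(range(len(groups)), r):
            covered = set().union(*(groups[j] for j in sel)) if sel else set()
            val = sum(w[i] for i in covered)
            if val > best_val:
                best_val, best_sel = val, sel
    return best_val, best_sel

# weights w_i = x_i^2 of a signal on 5 elements, 4 overlapping groups
w = [4.0, 1.0, 9.0, 1.0, 4.0]
groups = [{0, 1}, {1, 2}, {2, 3}, {3, 4}]
val, sel = group_projection_bruteforce(w, groups, 2)
```

The dynamic program of this section reaches the same value in polynomial time whenever the incidence graph has bounded treewidth.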
Lemma 2
The output OPT is the objective value of an optimal solution of the group model projection problem.
Proof
The proof follows the individual steps of the dynamic programming algorithm. Consider the problem group model projection(x, i, c). If c is not consistent, we correctly set \(\text{ OPT }(x,i,c)=-\infty \). We thus proceed to the case when c is consistent. Fix an optimal solution \({\mathcal {S}}\) of group model projection(x, i, c), if one exists.
The leaf-node case is clear, so we proceed to the case of x being an introduce-node. Let y be the unique child of x and let \(v \notin B_y\) such that \(B_x = B_y \cup \{v\}\). First assume that \(v \in [N]\).
Assume that \(c(v)=0\). Since c is consistent, \(c^{-1}(1) \cap \varGamma (v) = \emptyset \) holds, and so \({\mathcal {S}}\) is an optimal solution to group model projection \((y,i,c_{B_y})\). We may thus set \(\text{ OPT }(x,i,c)=\text{ OPT }(y,i,c_{B_y})\).
If \(c(v)=1\), \({\mathcal {S}}\) covers v, so we need to make sure some vertex \(u \in \varGamma (v) \cap B_x\) is contained in the solution in order for group model projection(x, i, c) to be feasible. Hence we have \(\text{ OPT }(x,i,c) = \text{ OPT }(y,i,c_{B_y}) + w(v)\) if \(c^{-1}(1) \cap \varGamma (v) \ne \emptyset \), and \(\text{ OPT }(x,i,c)=-\infty \) otherwise.
Next we assume that \(v \in {\mathfrak {G}}\). If \(c(v)=0\), we may simply put \(\text{ OPT }(x,i,c) = \text{ OPT }(y,i,c_{B_y})\). So, assume that \(c(v)=1\). If group model projection(x, i, c) is feasible and thus \({\mathcal {S}}\) exists, define \({\mathcal {S}}' = {\mathcal {S}}{\setminus } \{v\}\). Now \({\mathcal {S}}'\) is a solution to \(\text{ OPT }(y,i-1,c')\) for some \(c' : B_y \rightarrow \{0,1,1_?\}\) with \(c^{-1}(0) = c'^{-1}(0)\) and \(c^{-1}(1) = c'^{-1}(1) \cup (c'^{-1}(1_?) \cap \varGamma (v))\). Note that \((y,c')\) is compatible to (x, c). Consequently, \(\text{ OPT }(x,i,c)\) is upper bounded by the right hand side of (21).
To see that \(\text{ OPT }(x,i,c)\) is at least the right hand side of (21), let \((y,c'')\) be compatible to (x, c) and let \({\mathcal {S}}''\) be a solution to group model projection \((y,i-1,c'')\) of objective value \(\lambda \in {\mathbb {R}}\). Then \({\mathcal {S}}'' \cup \{v\}\) is a solution to group model projection(x, i, c) of objective value \(\lambda \). Consequently,
If x is a forget-node, let y be the unique child of x and let \(v \notin B_x\) such that \(B_y = B_x \cup \{v\}\). If \(v \in {\mathcal {S}}\), we have
Otherwise, if \(v \notin {\mathcal {S}}\), we have
Moreover, any solution of group model projection \((y,i,c')\), where \(c':B_y \rightarrow \{0,1,1_?\}\), \(c = c'_{B_x}\) and \(c'(v) \ne 1_?\), is a solution of group model projection(x, i, c). This proves (22).
If x is a join-node, let y and z be the two children of x and recall that \(B_x=B_y=B_z\).
Let \({\mathcal {S}}'\) be the collection of groups in \({\mathcal {S}}\) contained in the subtree rooted at y, and let \({\mathcal {S}}''\) be the collection of groups in \({\mathcal {S}}\) contained in the subtree rooted at z. Since T is a tree decomposition, \({\mathcal {S}}' \cap {\mathcal {S}}'' = {\mathcal {S}}\cap B_x\).
Note that \({\mathcal {S}}'\) is a solution of group model projection \((y,i_1,c')\) and \({\mathcal {S}}''\) is a solution of group model projection \((z,i_2,c'')\) for some \(c',c'':B_x \rightarrow \{0,1,1_?\}\) and \(i_1,i_2\) with \(i_1 + i_2 - |c'^{-1}(1) \cap c''^{-1}(1) \cap {\mathfrak {G}}| = i\). It is easy to see that \(c^{-1}(0)=c'^{-1}(0)=c''^{-1}(0)\) and \(c^{-1}(1) = c'^{-1}(1) \cup c''^{-1}(1)\).
The objective value of \({\mathcal {S}}\) equals
This shows that \(\text{ OPT }(x,i,c)\) is at most the right hand side of (23).
Now let \(\tilde{{\mathcal {S}}}\) be an optimal solution of group model projection \((y,j_1,\tilde{c})\) and let \(\hat{{\mathcal {S}}}\) be an optimal solution of group model projection \((z,j_2,\hat{c})\) where

\(\tilde{c}, \hat{c}:B_x \rightarrow \{0,1,1_?\}\) are both consistent,

\(c^{-1}(0)=\tilde{c}^{-1}(0)=\hat{c}^{-1}(0)\) and \(c^{-1}(1) = \tilde{c}^{-1}(1) \cup \hat{c}^{-1}(1)\), and

\(j_1 + j_2 - |\tilde{c}^{-1}(1) \cap \hat{c}^{-1}(1) \cap {\mathfrak {G}}| = i\).
Note that \(\tilde{{\mathcal {S}}}\) and \(\hat{{\mathcal {S}}}\) exist since, as we have shown earlier, the colourings \(c'\) and \(c''\) satisfy the above assertions. Then \(\hat{{\mathcal {S}}} \cup \tilde{{\mathcal {S}}}\) is a solution of group model projection(x, i, c) with objective value
Consequently, \(\text{ OPT }(x,i,c)\) is at least the right hand side of (23) and thus (23) holds. \(\square \)
By storing the best current solution alongside the OPT(x, i, c)values we can compute an optimal solution together with OPT.
Runtime of the algorithm
The computational complexity of the individual steps is as follows.

(a)
Given the incidence graph \(B({\mathfrak {G}})\) on \(n=M+N\) vertices of treewidth \(w_T\), one can compute a tree decomposition of width \(w_T\) in time \({\mathcal {O}}(2^{{\mathcal {O}}(w_T^3)}n)\) using Bodlaender’s algorithm [11]. The number of nodes of the decomposition is in \({\mathcal {O}}(n)\).

(b)
Given a tree decomposition of width \(w_T\) with t nodes, one can compute a nice tree decomposition of width \(w_T\) on \({\mathcal {O}}(w_Tt)\) nodes in \({\mathcal {O}}(w_T^2t)\) time in a straightforward way [39].
The running time of the dynamic programming is bounded as follows.
Theorem 1
The dynamic programming algorithm can be implemented to run in \({\mathcal {O}}(w_T 5^{w_T} G^2 N t)\) time given a nice tree decomposition of \(B({\mathfrak {G}})\) of width \(w_T\) on t nodes.
Note that we can assume that \(t = {\mathcal {O}}(w_T n)\) with \(n=M+N\). Together with the running time of the construction of the nice tree decomposition, we can solve the exact projection problem on graphs with treewidth \(w_T\) in time \({\mathcal {O}}((N+M)(w_T^2 5^{w_T} G^2 N + 2^{{\mathcal {O}}(w_T^3)}+w_T^2))\).
Proof of Theorem 1
Since the join-nodes are clearly the bottleneck of the algorithm, we discuss how to organize the computation so that the desired running time bound of \({\mathcal {O}}(w_T 5^{w_T} G^2 N)\) holds for a node of this type.
So, let x be a join-node and assume that y and z are the two children of x. We want to compute \(\text{ OPT }(x,i,c)\) for all colourings \(c:B_x \rightarrow \{0,1,1_?\}\) and all i with \(0 \le i \le G\). Recall that we need to compute this value according to (23), that is,
where the maximum is taken over all consistent colourings \(c',c'':B_x \rightarrow \{0,1,1_?\}\) with \(c^{-1}(0)=c'^{-1}(0)=c''^{-1}(0)\) and \(c^{-1}(1) = c'^{-1}(1) \cup c''^{-1}(1)\).
We enumerate all \(5^{w_T+1}\) colourings \(C: B_x \rightarrow \{(0,0),(1,1),(1,1_?),(1_?,1),(1_?,1_?)\}\) and derive c, \(c'\), and \(c''\). We put
If any of c, \(c'\), or \(c''\) is inconsistent, we discard this choice of C. In this way we capture every consistent colouring \(c:B_x \rightarrow \{0,1,1_?\}\) and all consistent choices of \(c'\) and \(c''\) satisfying \(c^{-1}(0)=c'^{-1}(0)=c''^{-1}(0)\) and \(c^{-1}(1) = c'^{-1}(1) \cup c''^{-1}(1)\).
It remains to discuss the computation of the value \(w((c^{-1}(1) \cap [N]) \cup \bigcup (c^{-1}(1) \cap {\mathfrak {G}}))\). This value can be computed in \({\mathcal {O}}(w_TN)\) time, since we are computing differences and unions of at most \(w_T\) groups of size at most N each. We arrive at a total running time in \({\mathcal {O}}(w_T 5^{w_T} G^2 N)\). \(\square \)
Remark 1
The dynamic programming algorithm can be extended to include a sparsity restriction on the support of the signal approximation itself. That is, we can compute an optimal K-sparse G-group-sparse signal approximation if the treewidth of the bipartite incidence graph of the studied group model is bounded. The running time of the algorithm increases by a factor of \({\mathcal {O}}(K)\).
Hardness on grid-like group structures
An \(r \times r\)-grid is a graph with vertex set \([r]\times [r]\), in which two vertices \((a,b),(c,d) \in [r]\times [r]\) are adjacent if and only if \(|a-c|=1\) and \(|b-d|=0\), or \(|a-c|=0\) and \(|b-d|=1\). We also say that r is the size of the grid. Figure 3 shows a \(6 \times 6\)-grid.
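The adjacency condition can be stated compactly as \(|a-c| + |b-d| = 1\); the following small sketch (ours) generates a grid from it:

```python
def grid_edges(r):
    """Edges of the r x r grid: (a,b) and (c,d) are adjacent iff
    |a-c| + |b-d| = 1."""
    V = [(a, b) for a in range(1, r + 1) for b in range(1, r + 1)]
    return [(u, v) for u in V for v in V
            if u < v and abs(u[0] - v[0]) + abs(u[1] - v[1]) == 1]

E3 = grid_edges(3)   # an r x r grid has 2 r (r-1) edges: here 12
```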
Recall that the group model projection problem can be solved efficiently when the treewidth of the incidence graph of the group structure is bounded, as shown in Sect. 3.1.
Definition 7
(Graph minor) Let \(G_1\) and \(G_2\) be two graphs. The graph \(G_2\) is called a minor of \(G_1\) if \(G_2\) can be obtained from \(G_1\) by deleting edges and/or vertices and by contracting edges.
A classical theorem by Robertson and Seymour [52] says that in a graph class \({\mathcal {C}}\) closed under taking minors either the treewidth is bounded or \({\mathcal {C}}\) contains all grid graphs.
Consequently, if \({\mathcal {C}}\) is a class of graphs that does not have bounded treewidth, it contains all grids. Our next theorem shows that group model projection is NP-hard on group models \({\mathfrak {G}}\) for which \(B({\mathfrak {G}})\) is a grid, thus complementing Theorem 1.
Theorem 2
The group model projection problem is NP-hard even if restricted to instances \({\mathfrak {G}}\) for which \(B({\mathfrak {G}})\) is a grid graph and the weight of any element is either 0 or 1.
Consider the following problem: given an \(n \times n\)-pixel black-and-white image, pick k \(2 \times 2\)-pixel windows to cover as many black pixels as possible. This problem can be modelled as the group model projection problem on a grid graph where the weight of any element is either 0 or 1. See Fig. 4 for an illustration. Note that we added artificial pixel-nodes with weight 0 to the boundary of the graph to obtain the desired grid structure. This group model is of frequency at most 4, and so we can do an approximate model projection and signal recovery using the result of Sect. 4.
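The window-covering instance above can be written down directly as a group model and solved by brute force on tiny images. A sketch of ours:

```python
from itertools import combinations

def window_groups(n):
    """One group per 2 x 2 window of an n x n pixel image; each pixel
    then lies in at most 4 groups (frequency at most 4)."""
    return [{(a + da, b + db) for da in (0, 1) for db in (0, 1)}
            for a in range(n - 1) for b in range(n - 1)]

def best_k_windows(black, n, k):
    """Brute force: pick k windows covering the maximum number of black
    pixels (weight 1 for black, 0 for white)."""
    best = 0
    for sel in combinations(window_groups(n), k):
        covered = set().union(*sel)
        best = max(best, sum(1 for p in covered if p in black))
    return best

# 3 x 3 image with black pixels at the four corners; every 2 x 2 window
# contains exactly one corner, so two windows cover two black pixels
score = best_k_windows({(0, 0), (0, 2), (2, 0), (2, 2)}, 3, 2)
```

Theorem 2 says that, in general, no polynomial-time algorithm can replace this enumeration unless P = NP.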
Our proof is a reduction from the Vertex Cover problem. Recall that for a graph \(\bar{G}\), a vertex cover is a subset of the vertices of \(\bar{G}\) such that any edge of \(\bar{G}\) has at least one endpoint in this subset. The size of the smallest vertex cover of \(\bar{G}\), the vertex cover number, is denoted \(\tau (\bar{G})\).
Given a graph \(\bar{G}\) and a number k as input, the task in the Vertex Cover problem is to decide whether \(\bar{G}\) admits a vertex cover of size k, that is, whether \(\tau (\bar{G}) \le k\). This problem is NP-complete even if restricted to cubic planar graphs [23].^{Footnote 2}
We use the following simple lemma in our proof.
Lemma 3
Let \(\bar{G}\) be a graph and let \(G'\) be the graph obtained by subdividing some edge of \(\bar{G}\) twice. Then \(\tau (\bar{G}) = \tau (G')-1\).
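The lemma can be checked by brute force on a small example; the following sketch (ours) verifies it for a triangle whose edge (1, 2) is subdivided twice via new vertices 4 and 5:

```python
from itertools import combinations

def vertex_cover_number(V, E):
    """Smallest vertex cover, by brute force over all subsets."""
    for r in range(len(V) + 1):
        for C in combinations(V, r):
            if all(u in C or v in C for u, v in E):
                return r
    return len(V)

# triangle, and the triangle with edge (1, 2) subdivided twice
V1, E1 = [1, 2, 3], [(1, 2), (2, 3), (1, 3)]
V2, E2 = [1, 2, 3, 4, 5], [(1, 4), (4, 5), (5, 2), (2, 3), (1, 3)]
tau1, tau2 = vertex_cover_number(V1, E1), vertex_cover_number(V2, E2)
```

As the lemma predicts, the vertex cover number increases by exactly one under the double subdivision.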
We can now prove our theorem.
Proof of Theorem 2
The reduction is from Vertex Cover on planar cubic graphs. Consider \(\bar{G}=(V,E)\) to be a planar cubic graph and let k be some number. Our aim is to compute an instance \(({\mathfrak {G}},{\mathbf {w}},k')\) of the group model projection problem where \(B({\mathfrak {G}})\) is a grid such that \(\bar{G}\) has a vertex cover of size k if and only if \({\mathfrak {G}}\) admits a selection of \(k'\) groups that together cover elements of a total weight at least some threshold t.
First we embed the graph \(\bar{G}\) in some grid H of polynomial size, meaning the vertices of \(\bar{G}\) get mapped to the vertices of the grid and edges get mapped to mutually internally disjoint paths in the grid connecting its endvertices. This can be done in polynomial time using an algorithm for orthogonal planar embedding [2]. We denote the mapping by \(\pi \), hence \(\pi (u)\) is some vertex of H and \(\pi (vw)\) is a path from \(\pi (v)\) to \(\pi (w)\) in H for all \(u \in V\) and \(vw \in E\).
Next we subdivide each edge of the grid 9 times, so that a vertical/horizontal edge of H becomes a vertical/horizontal path of length 10 in some new, larger grid \(H'\). We choose \(H'\) such that the corners of H are mapped to the corners of \(H'\). In particular, \(|V(H')| \le 100 |V(H)|\). Let us denote the obtained embedded subdivision of \(\bar{G}\) in \(H'\) by \(G'\), and let \(\pi '\) denote the embedding. Moreover, let \(\phi \) be the corresponding embedding of H into the subdivided grid \(H'\). Note that \(\text{ im }~\pi '|_{V} \subseteq \text{ im }~\phi |_{V}\).
Let (A, B) be a bipartition of \(H'\). We may assume that \(\pi '(u)\) is in A for all \(u \in V\). We consider \(H'\) to be the incidence graph \(B({\mathfrak {G}})\) of a group model \({\mathfrak {G}}\) where the vertices in B correspond to the elements and the vertices in A correspond to the groups of \({\mathfrak {G}}\). We refer to the vertices in A as group-vertices and to the vertices in B as element-vertices. Slightly abusing notation, we identify each group with its group-vertex and each element with its element-vertex and write \({\mathfrak {G}}=A\).
We observe that

(a)
\(G'\) is an induced subgraph of \(H'\),

(b)
every vertex \(\pi '(u)\), \(u \in V\), has degree 3 in \(G'\) and is a group-vertex,

(c)
every other vertex has degree 2 in \(G'\), and

(d)
for every group-vertex \(x \in V(H') {\setminus } V(G')\) there is some group-vertex \(u \in V(G')\) with
$$\begin{aligned} \varGamma _{H'}(x) \cap V(G') \subseteq \varGamma _{H'}(u) \cap V(G'). \end{aligned}$$
Next we tweak the embedding of \(\bar{G}\) a bit to remove paths \(\pi (uv)\) with the wrong parity. We do so in a way that preserves the properties (a)–(d). Let \({\mathcal {P}}_0 \subseteq \{\pi '(uv) : uv \in E(H)\}\) be the set of all paths of length 0 (mod 4), and let \({\mathcal {P}}_2 = \{\pi '(uv) : uv \in E(H)\} {\setminus } {\mathcal {P}}_0\). We want to substitute each path in \({\mathcal {P}}_0\) by a path of length 2 (mod 4). For this, let \(u'\) be the neighbour of u in the path \(\pi (uv)\). Note that the path \(\pi '(uu')\) in \(H'\) starts with a vertical or horizontal path P from \(\pi '(u)\) to \(\pi '(u')\) of length 10. We bypass the middle vertex of this path (an element-vertex) by going over two new element-vertices and one group-vertex instead. See Fig. 5 for an illustration.
To keep the notation easy we denote the newly obtained path by \(\pi ''(uv)\). Note that, after adding the bypass, the new path \(\pi ''(uv)\) is two edges longer and thus has length 2 (mod 4). We complete \(\pi ''\) to an embedding of \(\bar{G}\) by putting \(\pi ''(u) = \pi '(u)\) and \(\pi ''(vu') = \pi '(vu')\) for all \(u \in V\) and \(vu' \in E\) with \(\pi '(vu') \in {\mathcal {P}}_2\). Moreover, let us denote the changed embedding of \(\bar{G}\) by \(G''\).
We observe that the new embedding \(G''\) still satisfies the assertions (a)–(d) and, in addition, it holds that

(e)
every path connecting two vertices of degree 3 whose internal vertices all have degree 2 has length 2 (mod 4).
Next we define the weights of the element-vertices by putting
Assertion (d) implies that, for any subset \({\mathcal {S}} \subseteq {\mathfrak {G}}\) of size k there is an \({\mathcal {S}}' \subseteq {\mathfrak {G}}\) of size at most k such that

\({\mathcal {S}}' \subseteq A \cap V(G'')\), and

\({\mathbf {w}}(\bigcup {\mathcal {S}}') \ge {\mathbf {w}}(\bigcup {\mathcal {S}})\).
Since \({\mathbf {w}}(u)=0\) for all elements in \(B {\setminus } V(G'')\), we may thus restrict our attention to the restricted group model \({\mathfrak {G}}'= A \cap V(G'')\) on the element set \(B\cap V(G'')\).
Slightly abusing notation, any subset \({\mathcal {S}}\subseteq {\mathfrak {G}}'\) is a vertex subset of \(I({\mathfrak {G}}')\), and \({\mathbf {w}}(\bigcup {\mathcal {S}})\) equals the number of edges of \(I({\mathfrak {G}}')\) incident to the vertex set \({\mathcal {S}}\) in \(I({\mathfrak {G}}')\). Moreover, the graph \(I({\mathfrak {G}}')\) is obtained from the graph \(\bar{G}\) by subdividing each edge an even number of times.
From Lemma 3 we know that there is some number t such that \(\tau (\bar{G}) = \tau (I({\mathfrak {G}}'))-t\). Hence, \(\bar{G}\) has a vertex cover of size k if and only if \({\mathfrak {G}}'\) has a cover of size \(k'=k+t\) of total weight \(|E(I({\mathfrak {G}}'))|\). This, in turn, is the case if and only if \({\mathfrak {G}}\) admits a cover of size \(k'\) of total weight \(|E(I({\mathfrak {G}}'))|\). Since the construction of \({\mathfrak {G}}\) can be done in polynomial time, the proof is complete. \(\square \)
MEIHT for general group structures
In this section we apply the results for structured sparsity models and for expander matrices to the group model case. The Model-Expander IHT (MEIHT) algorithm is one of the exact projection algorithms with provable guarantees for tree-sparse and loopless overlapping group-sparse models using model-expander sensing matrices [3]. In this work we show how to use the MEIHT for more general group structures. The only modification of the MEIHT algorithm is the projection onto these new group structures. We show MEIHT's guaranteed convergence and polynomial runtime.
Note that as in [4], we are able to do model projections with an additional sparsity constraint, i.e. projection onto \(\mathcal {P}_{{\mathfrak {G}}_{G,K}}\) defined in (11). Therefore Algorithm 3 works with an extra input K and the model projection \(\mathcal {P}_{{\mathfrak {G}}_{G}}\) becomes \(\mathcal {P}_{{\mathfrak {G}}_{G,K}}\), returning a \({\mathfrak {G}}_{G,K}\)-sparse approximation to \({\mathbf {x}}\).
The convergence analysis of MEIHT with the more general group structures considered here remains the same as for loopless overlapping group models discussed in [3]. We are able to perform the exact projection of \(\mathcal {P}_{{\mathfrak {G}}_G}\) (and \(\mathcal {P}_{{\mathfrak {G}}_{G,K}}\)) as discussed in Sect. 3.6. With the possibility of doing the projection onto the model, we present the convergence results in Corollaries 1 and 2 as corollaries to Theorem 3.1 and Corollary 3.1 in [3] respectively.
Corollary 1
Consider \({\mathfrak {G}}\) to be a group model of bounded treewidth and \({\mathcal {S}}\) to be \({\mathfrak {G}}_G\)sparse. Let the matrix \({\mathbf {A}}\in \{0,1\}^{m\times N}\) be a model expander matrix with \(\epsilon _{{\mathfrak {G}}_{3G}} < 1/12\) and d ones per column. For any \({\mathbf {x}}\in {\mathbb {R}}^N\) and \({\mathbf {e}}\in {\mathbb {R}}^m\), the sequence of updates \({\mathbf {x}}^{(n)}\) of MEIHT with \({\mathbf {y}}= {\mathbf {A}}{\mathbf {x}}+ {\mathbf {e}}\) satisfies, for any \(n\ge 0\)
where \(\alpha = 8\epsilon _{{\mathfrak {G}}_{3G}}\left( 1-4\epsilon _{{\mathfrak {G}}_{3G}}\right) ^{-1} \in (0,1)\) and \(\beta = 4d^{-1}\left( 1 - 12\epsilon _{{\mathfrak {G}}_{3G}}\right) ^{-1} \in (0,1)\).
Note that \(\epsilon _{{\mathfrak {G}}_{3G}}\) is the expansion coefficient of the underlying \((s,d,\epsilon _{{\mathfrak {G}}_{3G}})\)-model expander graph for \({\mathbf {A}}\). This ensures that \({\mathbf {A}}\) satisfies the model RIP-1 over all \({\mathfrak {G}}_{3G}\)-sparse signals.
The proof of this corollary is analogous to that of Theorem 3.1 in [3] and is therefore omitted; the interested reader is referred to [3].
Let us define the \(\ell _1\)-error of the best \({\mathfrak {G}}_G\)-term approximation to a vector \({\mathbf {x}}\in {\mathbb {R}}^N\) as
This is then used in the following corollary.
Corollary 2
Consider the setting of Corollary 1. After \(n = \left\lceil \log \left( \frac{\Vert {\mathbf {x}}\Vert _1}{\Vert {\mathbf {e}}\Vert _1}\right) /\log \left( \frac{1}{\alpha }\right) \right\rceil \) iterations, MEIHT returns a solution \(\widehat{{\mathbf {x}}}\) satisfying
where \(c_1 = \beta d\) and \(c_2 = 1+\beta \).
Proof
Without loss of generality we initialize MEIHT with \({\mathbf {x}}^{(0)} = 0\). Upper bounding \(1-\alpha ^n\) by 1 and using triangle inequalities with some algebraic manipulations, (24) simplifies to
Using the fact that \({\mathbf {A}}\) is a binary matrix with d ones per column we have \(\Vert {\mathbf {A}}{\mathbf {x}}_{{\mathcal {S}}^c}\Vert _1 \le d\Vert {\mathbf {x}}_{{\mathcal {S}}^c}\Vert _1\). We also have \(\Vert {\mathbf {e}}\Vert _1 \ge \alpha ^n \Vert {\mathbf {x}}\Vert _1\) when \(n \ge \log \left( \frac{\Vert {\mathbf {x}}\Vert _1}{\Vert {\mathbf {e}}\Vert _1}\right) /\log \left( \frac{1}{\alpha }\right) \). Applying these bounds to (27) leads to
for \(n = \left\lceil \log \left( \frac{\Vert {\mathbf {x}}\Vert _1}{\Vert {\mathbf {e}}\Vert _1}\right) /\log \left( \frac{1}{\alpha }\right) \right\rceil \). This is equivalent to (26) with \(c_1 = \beta d\), \(c_2 = 1+\beta \), \({\mathbf {x}}^{(n)} = \widehat{{\mathbf {x}}}\) for the given n, and \(\sigma _{{\mathfrak {G}}_G}({\mathbf {x}})_1 = \Vert {\mathbf {x}}_{{\mathcal {S}}^c}\Vert _1\) because \({\mathbf {x}}_{{\mathcal {S}}}\) is the best \({\mathfrak {G}}_G\)-term approximation to the \({\mathfrak {G}}_G\)-sparse \({\mathbf {x}}\). This completes the proof.\(\square \)
The runtime complexity of MEIHT still depends on the median operation and on the complexity of the projection onto the model. However, as observed in [3], the projection onto the model is the dominant operation of the algorithm. Therefore, the complexity of MEIHT is of the order of the complexity of the projection onto the model. In the case of overlapping group models with bounded treewidth, MEIHT achieves a polynomial runtime complexity, as shown in Proposition 3 below. On the other hand, when the treewidth of the group model is unbounded, MEIHT can still be implemented by using the Benders' decomposition procedure of Sect. 3.6 for the projection, which may have an exponential runtime complexity.
Proposition 3
The runtime of MEIHT is \({\mathcal {O}}((N+M)(w_T^2 5^{w_T} G^2 N) \bar{n} + (N+M)(2^{{\mathcal {O}}(w_T^3)}+w_T^2))\) for the \({\mathfrak {G}}_G\)-group-sparse model with bounded treewidth \(w_T\), where \(\bar{n}\) is the number of iterations, M is the number of groups, G is the group budget and N is the signal dimension.
Proof
Before we start the MEIHT procedure we have to calculate a nice tree decomposition of the incidence graph of the group model. This can be done in \({\mathcal {O}} ((N+M)(2^{{\mathcal {O}}(w_T^3)}+w_T^2))\). Then in each iteration of the MEIHT we have to solve the exact projection onto the model, which is the dominant operation of the MEIHT. Since the projection onto a group model with bounded treewidth \(w_T\) can be done through the dynamic programming algorithm that runs in \({\mathcal {O}}((N+M)(w_T^2 5^{w_T} G^2 N))\), as shown in Sect. 3.1, the result follows. \(\square \)
Remark 2
The convergence results above hold when MEIHT is modified appropriately to solve the standard K-sparse and G-group-sparse problem with groups having bounded treewidth, where the projection becomes \(\mathcal {P}_{{\mathfrak {G}}_{G,K}}\). However, in this case the runtime complexity of each iteration grows by a factor of \({\mathcal {O}}(K)\), as indicated in Remark 1.
Exact projection for general group models
In this section we consider the most general version of group models, i.e. \(\mathfrak {G} = \{{\mathcal {G}}_1,\ldots ,{\mathcal {G}}_M\}\) is an arbitrary set of groups and \(G\in [M]\) and \(K\in [N]\) are given budgets. We study the structured sparsity model \({\mathfrak {G}}_{G,K}\) introduced in Sect. 2.4. Here, in addition to the number G of groups that can be selected, we bound the number of indices selected in these groups by K (i.e. we consider a group-sparse model with an additional standard sparsity constraint). Note that setting \(K = N\) reduces the model \({\mathfrak {G}}_{G,K}\) to the general group model \({\mathfrak {G}}_{G}\).
If we want to apply exact projection recovery algorithms like the ModelIHT and MEIHT to group models, the model projection problem has to be solved in each iteration, i.e. for a given signal \({\mathbf {x}}\in {\mathbb {R}}^N\) we have to find the closest signal \(\hat{{\mathbf {x}}}\) which has support in the model \({\mathfrak {G}}_{G,K}\). In this section we derive an efficient procedure based on the idea of Benders' decomposition to solve the projection problem. This procedure is analysed and implemented in Sect. 5.
It has been proved that the group model projection problem for group models without a sparsity condition on the support is NP-hard [4]. Therefore the projection problem for the more general model \({\mathfrak {G}}_{G,K}\) is NP-hard as well. The latter problem can be formulated as the integer program
Here \({\mathbf {w}}\) contains the squared entries of the signal, the \({\mathbf {v}}\)-variables represent the groups and the \({\mathbf {u}}\)-variables represent the elements which are selected. Note that by choosing \(K=N\) we obtain the projection problem for classical group models \({\mathfrak {G}}_G\).
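For intuition, the projection (29) can be checked on small instances by brute force: enumerate selections of G groups and keep the K largest weights inside their union. The sketch below uses hypothetical names and is not the solver developed in this section:

```python
from itertools import combinations

def project_G_K(w, groups, G, K):
    """Brute-force projection onto the (G, K)-group-sparse model:
    enumerate selections of G groups, keep the K largest weights in the union."""
    best_val, best_support = 0.0, set()
    for sel in combinations(range(len(groups)), G):
        union = set().union(*(groups[j] for j in sel))
        top_k = sorted(union, key=lambda i: w[i], reverse=True)[:K]
        val = sum(w[i] for i in top_k)
        if val > best_val:
            best_val, best_support = val, set(top_k)
    return best_val, best_support

w = [9.0, 1.0, 4.0, 25.0, 16.0]            # squared signal entries
groups = [{0, 1}, {1, 2, 3}, {3, 4}]
val, support = project_G_K(w, groups, G=2, K=2)
assert val == 41.0 and support == {3, 4}
```

The enumeration is exponential in G, which is exactly why the NP-hardness result motivates the decomposition approach that follows.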
To derive an efficient algorithm for the projection problem we use the concept of Benders' decomposition, which was already studied in [7, 24]. The idea of this approach is to decompose Problem (29) into a master problem and a subproblem. The subproblem is then used iteratively to derive violated inequalities (cuts) for the master problem until no violated inequality can be found any more. This procedure was applied to Problem (29) without the sparsity constraint on the \({\mathbf {u}}\)-variables in [17]. The following results for the more general Problem (29) are based on the idea of Benders' decomposition and extend the results in [17].
First we can relax the \({\mathbf {u}}\)-variables in the latter formulation without changing the optimal value, i.e. we may assume \({\mathbf {u}}\in [0,1]^N\). We can now reformulate (29) as
where \(P({\mathbf {v}})=\{{\mathbf {u}}\in [0,1]^N \ : \ \sum _{i=1}^{N} u_i \le K, \ u_i\le \sum _{j:i\in {\mathcal {G}}_j} v_j, \ i=1,\ldots , N\}\). Replacing the linear problem \(\max _{{\mathbf {u}}\in P({\mathbf {v}})} \mathbf{w^\top {\mathbf {u}}}\) in (30) by its dual formulation, we obtain
where \(P_D=\{\alpha ,{\varvec{\beta }},{\varvec{\gamma }}\ge 0 \ : \ \alpha + \beta _i + \gamma _i\ge w_i, \ i=1,\ldots , N\}\) is the feasible set of the dual problem. Since \(P_D\) is a polyhedron and the minimum in (31) exists, the first constraint in (31) holds if and only if it holds for each vertex \((\alpha ^l,{\varvec{\beta }}^l,{\varvec{\gamma }}^l)\) of \(P_D\). Therefore Problem (31) can be reformulated as
where \((\alpha ^1,{\varvec{\beta }}^1,{\varvec{\gamma }}^1),\ldots ,(\alpha ^t,{\varvec{\beta }}^t,{\varvec{\gamma }}^t)\) are the vertices of \(P_D\). Each of the constraints
for \(l=1,\ldots ,t\) is called an optimality cut.
The idea of Benders' decomposition is, starting from Problem (32) containing no optimality cut (called the master problem), to iteratively calculate an optimal \(({{\mathbf {v}}}^*,\mu ^*)\) and then find an optimality cut which cuts off this solution. In each step the most violated optimality cut is determined by solving
for the current optimal solution \({{\mathbf {v}}}^*\). If the optimal solution fulfils
then the optimality cut related to the optimal vertex \((\alpha ^*,\varvec{\beta }^*,\varvec{\gamma }^*)\) is added to the master problem. This procedure is iterated until the latter inequality no longer holds. The last optimal \({\mathbf {v}}^*\) must then be optimal for Problem (29), since the first constraint in (31) then holds for \({\mathbf {v}}^*\).
When using this Benders' decomposition approach it is desirable to have fast algorithms for the master problem (32) and the subproblem (33) in each iteration. By the following lemma an optimal solution of the subproblem can be calculated easily.
Lemma 4
For a given solution \({\mathbf {v}}\in \{ 0,1\}^M\) we define \(I_{{\mathbf {v}}}:=\left\{ i=1,\ldots ,N\ : \ \sum _{j:i\in {\mathcal {G}}_j}v_j > 0\right\} \) and let \(I_{{\mathbf {v}}}^K\) be the set of indices of the K largest values \(w_i\) with \(i\in I_{{\mathbf {v}}}\). An optimal solution of Problem (33) is then given by \((\alpha ^*,\varvec{\beta }^*,\varvec{\gamma }^*)\), where \(\alpha ^* = \max _{i\in I_{{\mathbf {v}}}{\setminus } I_{{\mathbf {v}}}^K} w_i\) and
Proof
Note that for a given \({{\mathbf {v}}}^*\) the dual problem of subproblem (33) is
It is easy to see that there exists an optimal solution \({\mathbf {u}}^*\) of Problem (34) with \(u_i^*=1\) if and only if \(i\in I_{{\mathbf {v}}}^K \) and \(u_i^*=0\) otherwise. We will use the complementary slackness conditions
to derive the optimal values for \(\alpha ,\varvec{\beta },\varvec{\gamma }\). We obtain the following 4 cases:

Case 1 If \(u_i^*=0\) and \(\sum _{j:i\in {\mathcal {G}}_j} v_j^*>0\), i.e. \(i\in I_{{\mathbf {v}}}{\setminus } I_{{\mathbf {v}}}^K\), then by conditions (35) we have \(\beta _i=\gamma _i=0\). To ensure the constraint \(\alpha + \beta _i + \gamma _i\ge w_i\), the value of \(\alpha \) must be at least \(\max _{i\in I_{{\mathbf {v}}}{\setminus } I_{{\mathbf {v}}}^K} w_i\).

Case 2 If \(u_i^*=0\) and \(\sum _{j:i\in {\mathcal {G}}_j} v_j^*=0\), then \(i\in [N]{\setminus } I_{{\mathbf {v}}}\) and the objective coefficient of \(\beta _i\) in Problem (33) is 0. Therefore we can increase \(\beta _i\) as much as we want without changing the objective value and therefore we set \(\beta _i=w_i\) to ensure the ith constraint \(\alpha + \beta _i + \gamma _i\ge w_i\) and set \(\gamma _i=0\).

Case 3 If \(u_i^*=1\), i.e. \(i\in I_{{\mathbf {v}}}^K\), and \(\sum _{j:i\in {\mathcal {G}}_j} v_j^*>1\) then by condition (35) \(\beta _i=0\). Therefore to ensure the ith constraint \(\alpha + \beta _i + \gamma _i\ge w_i\) the value of \(\gamma _i\) must be at least \(w_i-\alpha \), and since we minimize \(\gamma _i\) in the objective function the latter holds with equality.

Case 4 If \(u_i^*=1\), i.e. \(i\in I_{{\mathbf {v}}}^K\), and \(\sum _{j:i\in {\mathcal {G}}_j} v_j^*=1\) then we cannot use condition (35) to derive the values for \(\beta _i\) and \(\gamma _i\). Nevertheless, in this case both variables have an objective coefficient of 1, while \(\alpha \) has an objective coefficient of K. Increasing \(\alpha \) by 1 increases the objective value by K. In Cases 1 and 2 nothing changes, while for each index of Case 3 we could decrease \(\gamma _i\) by 1 to remain feasible. But since at most K indices i fulfil the conditions of Case 3, we cannot improve the objective value by this strategy. Therefore \(\alpha \) has to be selected as small as possible in Case 4, i.e. by Case 1 we set \(\alpha = \max _{i\in I_{{\mathbf {v}}}{\setminus } I_{{\mathbf {v}}}^K} w_i\), and to ensure feasibility we set \(\gamma _i=w_i-\alpha \). This concludes the proof.\(\square \)
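The closed form of Lemma 4 is straightforward to implement. The sketch below uses hypothetical names and assumes \(\alpha = 0\) when \(I_{{\mathbf {v}}}{\setminus } I_{{\mathbf {v}}}^K\) is empty; it also makes strong duality visible, since the dual objective \(K\alpha + \sum _i \beta _i \sum _{j:i\in {\mathcal {G}}_j} v_j + \sum _i \gamma _i\) equals the weight of the K largest covered entries:

```python
def lemma4_dual(w, groups, v, K):
    """Optimal dual solution (alpha, beta, gamma) of subproblem (33),
    following the case analysis in the proof of Lemma 4."""
    N = len(w)
    covered = {i for i in range(N)
               if any(v[j] and i in groups[j] for j in range(len(groups)))}
    top_k = set(sorted(covered, key=lambda i: w[i], reverse=True)[:K])
    alpha = max((w[i] for i in covered - top_k), default=0.0)        # Cases 1 and 4
    beta = [0.0 if i in covered else w[i] for i in range(N)]         # Case 2
    gamma = [w[i] - alpha if i in top_k else 0.0 for i in range(N)]  # Cases 3 and 4
    return alpha, beta, gamma

w = [9.0, 1.0, 4.0, 25.0, 16.0]
groups = [{0, 1}, {1, 2, 3}, {3, 4}]
v = [1, 0, 1]                     # select groups {0, 1} and {3, 4}
alpha, beta, gamma = lemma4_dual(w, groups, v, K=2)
cov = [sum(v[j] for j in range(len(groups)) if i in groups[j]) for i in range(len(w))]
dual = 2 * alpha + sum(b * c for b, c in zip(beta, cov)) + sum(gamma)
assert dual == 25.0 + 16.0        # weight of the 2 largest covered entries
```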
Theorem 3
An optimal solution of subproblem (33) can be calculated in time \({\mathcal {O}} (NMG_\text {max})\).
Proof
The set \(I_{{\mathbf {v}}}\) can be calculated in time \({\mathcal {O}}(NMG_\text {max})\) by going through all groups for each index \(i\in [N]\) and checking if the index is contained in one of the groups. To obtain \(I_{{\mathbf {v}}}^K\) we have to find the Kth largest element in \(\left\{ w_i \ : \ i\in I_{{\mathbf {v}}}\right\} \), which can be done in time \({\mathcal {O}}(N)\); see [46]. Afterwards we select all values which are larger than the Kth largest element, which can be done in \({\mathcal {O}} (N)\). Assigning all values to \((\alpha ,\varvec{\beta },\varvec{\gamma })\) can also be done in time \({\mathcal {O}} (N)\). \(\square \)
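Putting the pieces together, the decomposition loop can be sketched as follows. For illustration only, the master problem (32) is solved by brute-force enumeration rather than as an integer program (the implementation in Sect. 5 uses CPLEX for this step), cuts are generated via the closed form of Lemma 4, and all names are hypothetical:

```python
from itertools import combinations

def benders_projection(w, groups, G, K, tol=1e-9):
    """Benders' decomposition sketch for problem (29)."""
    N, M = len(w), len(groups)
    cuts = []  # vertices (alpha, beta, gamma) of P_D found so far

    def covered_by(sel):
        return {i for i in range(N) if any(i in groups[j] for j in sel)}

    def true_value(sel):  # weight of the K largest covered entries
        return sum(sorted((w[i] for i in covered_by(sel)), reverse=True)[:K])

    def cut_value(cut, sel):
        alpha, beta, gamma = cut
        cov = [sum(1 for j in sel if i in groups[j]) for i in range(N)]
        return K * alpha + sum(b * c for b, c in zip(beta, cov)) + sum(gamma)

    def lemma4_cut(sel):  # closed-form optimal dual solution of subproblem (33)
        covered = covered_by(sel)
        top_k = set(sorted(covered, key=lambda i: w[i], reverse=True)[:K])
        alpha = max((w[i] for i in covered - top_k), default=0.0)
        beta = [0.0 if i in covered else w[i] for i in range(N)]
        gamma = [w[i] - alpha if i in top_k else 0.0 for i in range(N)]
        return alpha, beta, gamma

    while True:
        # Master: maximize mu over selections of at most G groups.
        best_sel, best_mu = (), -1.0
        for g in range(G + 1):
            for sel in combinations(range(M), g):
                mu = min((cut_value(c, sel) for c in cuts), default=sum(w))
                if mu > best_mu:
                    best_sel, best_mu = sel, mu
        if best_mu <= true_value(best_sel) + tol:  # no violated optimality cut left
            return set(best_sel), true_value(best_sel)
        cuts.append(lemma4_cut(best_sel))

sel, val = benders_projection([9.0, 1.0, 4.0, 25.0, 16.0],
                              [{0, 1}, {1, 2, 3}, {3, 4}], G=2, K=2)
assert val == 41.0
```

Each generated cut is tight for the selection that produced it (strong duality), so every iteration either terminates or cuts off the current selection; with finitely many selections the loop terminates.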
The following theorem states that the master problem can be solved in pseudo-polynomial time if the number of constraints is fixed. Nevertheless, note that the number of iterations of the procedure described above may be exponential in N and M.
Theorem 4
Problem (32) with t constraints can be solved in \({\mathcal {O}} (MG(NW)^t)\) where \(W=\max _{i\in [N]}w_i\).
Proof
Problem (32) with t constraints can be reformulated as
The latter problem is a special case of the robust knapsack problem with discrete uncertainty (sometimes called the robust selection problem) with an additional uncertain constant. In [12] the authors mention that the problem with an uncertain constant is equivalent to the problem without such a constant. Furthermore, using the result in [41], Problem (36) can be solved in \({\mathcal {O}} (MGC^t)\) where
Since for all solutions \((\alpha ^l,\varvec{\beta }^l,\varvec{\gamma }^l)\) generated in Lemma 4 it holds that \(\alpha ^l, \beta _i^l, \gamma _i^l \le \max _{i\in [N]} w_i\) we have \(C\le (2N+1)W\) which proves the result.\(\square \)
Algorithms with approximation projection oracles
As mentioned in the previous sections, solving the group model projection problem is NP-hard in general. Therefore, to use classical algorithms such as the ModelIHT or the MEIHT we have to solve an NP-hard problem in each iteration. To tackle such problems the authors in [27, 28] introduced an algorithm called Approximate ModelIHT (AMIHT), which is based on the idea of the classical IHT but does not require an exact projection oracle (see Sect. 2.5). Instead the authors show that, under certain assumptions on the measurement matrix, a signal can be recovered by just using two approximation variants of the projection problem, which they call head- and tail-approximation (for further details again see Sect. 2.5).
In this section we apply the latter results to group models of bounded frequency, i.e. group models where the maximum number of groups an element is contained in is bounded by some number f. Note that from Theorem 2 we know that group model projection is NP-hard already for group models of frequency 4. A particularly interesting case of such group structures is the graphic case, in which each element is contained in at most two groups. Understanding this case was proposed as an open problem by Baldassarre et al. [4].
Furthermore, we apply the theoretical results derived in [27, 28] to group models and show that the number of required measurements increases by just a constant factor compared to the classical structured sparsity case. In Sect. 5 we computationally compare the AMIHT to the ModelIHT and the MEIHT on interval groups.
Head- and tail-approximations for group models with low frequency
In this section we derive head- and tail-approximations for the case of group models with bounded frequency. We first recall the definition of head- and tail-approximations for the case of group models. Assume we are given a group model \({\mathfrak {G}}\) together with \(G\in {\mathbb {N}}\).
Given a vector \({\mathbf {x}}\), let \({\mathcal {H}}\) be an algorithm that computes a vector \({\mathcal {H}}({\mathbf {x}})\) with support in \({\mathfrak {G}}_{G'}\) for some \(G'\in {\mathbb {N}}\). Then, given some \(\alpha \in {\mathbb {R}}\) (typically \(\alpha < 1\)) we say that \({\mathcal {H}}\) is an \((\alpha ,G,G',p)\)-head approximation if
In other words, \({\mathcal {H}}\) uses \(G'\) many groups to cover at least an \(\alpha \)fraction of the maximum total weight covered by G groups. Note that \(G'\) can be chosen larger than G.
Moreover, let \({\mathcal {T}}\) be an algorithm which computes a vector \({\mathcal {T}}({\mathbf {x}})\) with support in \({\mathfrak {G}}_{G}\). Given some \(\beta \in {\mathbb {R}}\) (typically \(\beta > 1\)) we say that \({\mathcal {T}}\) is a \((\beta ,G,G',p)\)-tail approximation if
This means that \({\mathcal {T}}\) may use \(G'\) many groups to leave at most a \(\beta \)fraction of weight uncovered compared to the minimum total weight left uncovered by G groups.
In the following we derive the head- and tail-approximation just for the case \(p=1\). Note that equivalent approximation procedures can easily be derived for the case \(p=2\) by replacing the accuracies \(\alpha \) and \(\beta \) by \(\sqrt{\alpha }\) and \(\sqrt{\beta }\) and using the weights \(x_i^2\) instead of \(x_i\) in the corresponding proofs. We first present a result which implies the existence of a head approximation.
Theorem 5
(Hochbaum and Pathria [30]) For each \(\epsilon > 0\) there exists a \(((1-\epsilon ),G, \lceil G \log _2 (1/\epsilon ) \rceil ,1)\)-head approximation running in polynomial time.
The algorithm derived in [30] was designed to solve the Maximum \(G\)-Coverage problem and is based on a simple greedy method. The idea is to iteratively select the group which covers the largest amount of uncovered weight. The authors prove that if one is allowed to select enough groups, namely \(\lceil G \log _2 (1/\epsilon )\rceil \), then the optimal value is approximated up to an accuracy of \((1-\epsilon )\). The greedy procedure is given in Algorithm 4. Note that for a given signal \({\mathbf {x}}\) and a group \({\mathcal {G}}\in {\mathfrak {G}}\) we define
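Independently of the exact notation, the greedy selection itself can be sketched as follows (hypothetical names; the weights \(w_i\) stand for the values \(|x_i|\)):

```python
def greedy_head(w, groups, budget):
    """Pick `budget` groups greedily, each time taking the group that
    covers the largest amount of currently uncovered weight."""
    covered, selected = set(), []
    for _ in range(budget):
        gains = [sum(w[i] for i in g - covered) for g in groups]
        j = max(range(len(groups)), key=gains.__getitem__)
        if gains[j] <= 0:   # nothing left to cover
            break
        selected.append(j)
        covered |= groups[j]
    return selected, sum(w[i] for i in covered)

w = [9.0, 1.0, 4.0, 25.0, 16.0]
groups = [{0, 1}, {1, 2, 3}, {3, 4}]
selected, weight = greedy_head(w, groups, budget=2)
assert selected == [2, 0] and weight == 51.0
```

With `budget` set to \(\lceil G \log _2 (1/\epsilon )\rceil \), this is exactly the selection rule behind Theorem 5.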
Next we derive a tail-approximation for our problems based on the idea of LP rounding. Note that, in contrast to the head-approximation, the runtime bound of the following tail-approximation depends on the frequency of the group model.
Theorem 6
Suppose the frequency of the group model is f. For any \(\epsilon > 0\) and \(\kappa = (1+\epsilon ^{-1}) f\) there exists a \(((1+\epsilon ),G,\kappa G,1)\)-tail approximation running in polynomial time.
Proof
Given a signal \({\mathbf {x}}\in {\mathbb {R}}^N\), we define \({\mathbf {w}}= (x_i)_{i \in [N]}\). We consider the following linear relaxation of the group model projection problem
Consider an optimal solution \(({\mathbf {u}},{\mathbf {v}})\) of (39). We compute a group cover \({\mathcal {S}}\subseteq {\mathfrak {G}}\) by
Note that \({\mathcal {S}}\) contains at most \(\kappa G\) many groups, since
It remains to show that \({\mathcal {S}}\) is a tail approximation. To this end, let R be the set of indices only barely covered by \({\mathbf {v}}\), i.e.
Note that
since \(j \notin R\) implies
and hence
for some i with \(j \in {\mathcal {G}}_i\). Moreover, note that
and hence
We obtain the inequalities
where we used (40) and (41). Since \({\mathbf {u}}\) is an optimal solution of the relaxed problem (39), we have
where \({\mathcal {S}}^*\) is an optimal solution of the group model projection problem and \({\mathbf {u}}^*\) the corresponding optimal vector of Problem (12). Therefore the latter procedure is a tail-approximation, which completes the proof.\(\square \)
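The rounding step of this proof can be sketched as follows, assuming the group cover is obtained by keeping every group whose fractional LP value \(v_i\) is at least \(1/\kappa \) (the counting argument in the proof then bounds \(|{\mathcal {S}}|\) by \(\kappa G\), since \(\sum _i v_i \le G\)). Solving the LP relaxation itself is omitted here; a fractional solution is taken as input, and all names are hypothetical:

```python
def round_groups(v, G, f, eps):
    """LP-rounding step of the tail approximation (Theorem 6 sketch):
    keep groups with fractional value >= 1/kappa, kappa = (1 + 1/eps) * f."""
    kappa = (1.0 + 1.0 / eps) * f
    S = [i for i, vi in enumerate(v) if vi >= 1.0 / kappa]
    assert len(S) <= kappa * G  # at most kappa * G groups are selected
    return S

# Fractional group variables from the LP relaxation (39), frequency f = 2.
v = [0.6, 0.05, 0.85]
assert round_groups(v, G=2, f=2, eps=1.0) == [0, 2]
```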
AMIHT and AMEIHT for group models
As in the previous sections, for a sensing matrix \({\mathbf {A}}\) and the true signal \({\mathbf {x}}\) we have a noisy measurement vector \({\mathbf {y}}= {\mathbf {A}}{\mathbf {x}} + {\mathbf {e}}\) for some noise vector \({\mathbf {e}}\). The task is to recover the original signal \({\mathbf {x}}\), or a vector close to \({\mathbf {x}}\). Furthermore, we are given a group model \({\mathfrak {G}}\) with frequency \(f\in [N]\), together with \(G\in {\mathbb {N}}\).
In the last section we derived polynomial time algorithms for a \(((1-\epsilon ),G, \lceil G \log _2 (1/\epsilon ) \rceil ,2)\)-head approximation and a \(((1+\epsilon ),G,(1+\epsilon ^{-1}) f G,2)\)-tail approximation. Note that we can use any G here. Using the results in Sect. 2.5, we obtain convergence of the AMIHT for group models if \({\mathcal {T}}\) is a \(((1+\epsilon ),G,G_T,2)\)-tail approximation, \({\mathcal {H}}\) is a \(((1-\epsilon ),G_T+G, G_H,2)\)-head approximation, where \(G_T:=(1+\epsilon ^{-1}) f G\) and \(G_H:= \lceil (G_T+G) \log _2 (1/\epsilon )\rceil \), and the sensing matrix \({\mathbf {A}}\) has the \({\mathfrak {G}}_{\tilde{G}}\)-RIP with \(\tilde{G} = G+G_T+G_H\). Note that \(\tilde{G}\in {\mathcal {O}} (G)\) for fixed accuracy \(\epsilon >0\) and frequency f. Furthermore \(|{\mathfrak {G}}_{\tilde{G}}|\in {\mathcal {O}} (M^{cG})\) for a constant c. Using the bound (20) we obtain that the number of required measurements for a sub-Gaussian random matrix \({\mathbf {A}}\) having the \({\mathfrak {G}}_{\tilde{G}}\)-RIP with high probability is
which differs by just a constant factor from the number of measurements required in the case of exact projections (see Sect. 3). Under condition (19) convergence of the AMIHT is ensured.
Extension to within-group sparsity and beyond
The head and tail approximation approach extends far beyond the standard group-sparsity model. It still works, for example, if we consider K-sparse and G-group-sparse (i.e. \({\mathfrak {G}}_{G,K}\)-sparse) vectors in our model.
The reason is that the group model projection admits a constant-factor head approximation even in this case. Formally, if we search for the K elements of maximum weight covered by G many groups, we are maximizing a submodular function subject to a knapsack constraint.^{Footnote 3} This is known to be approximable to within a constant factor (cf. Kulik et al. [42, 43] and related work). Suppose we delete the covered elements and run such an approximation algorithm again. Then, after a constant number of steps, we obtain a collection of groups and elements such that the total weight of the elements is at least a \((1-\epsilon )\)-fraction of the total weight that a K-sparse and G-group-sparse solution could ever have. Moreover, the sparsity budgets are exceeded only by a constant factor each.
Similarly, the analysis given in the proof of Theorem 6 works even if we impose sparsity on both groups and elements. Again, assuming bounded frequency, we obtain a solution with a \((1+\epsilon )\)-tail guarantee whose support exceeds that of a G-group-sparse K-sparse vector by at most a constant factor. This leads to the positive consequences detailed above.
More generally, any knapsack constraints on groups and elements can be handled, leading to head and tail approximations in the case of non-uniform sparsity budgets on the groups and elements. However, the corresponding head approximations are rather involved, and certainly much less efficient than the simple greedy procedure of the Hochbaum and Pathria algorithm [30].
Computations
In this section we present the computational results for the ModelIHT, MEIHT, AMIHT and AMEIHT presented in Sect. 2 for block-group instances. More precisely, we study block groups, i.e. each group \({\mathcal {G}} \in {\mathfrak {G}}\) is a set of consecutive indices, \({\mathcal {G}}=[s,t]\cap [N]\) with \(1\le s<t\le N\), and each group has the same size \(|{\mathcal {G}}|=l\). For a given dimension N we generate blocks of size \(l=\lfloor 0.02N\rfloor \). We consider two types of block models, one where successive groups overlap in \(\lfloor \frac{l-1}{2}\rfloor \) items and another where they overlap in \(l-1\) items. Note that the frequency is then given by \(f=2\) or \(f=l\), respectively.
We run all algorithms for random signals \({\mathbf {x}}\in {\mathbb {R}}^N\) in dimension \(N\in \left\{ 100,200,\ldots ,900\right\} \). For each dimension we vary the number of measurements \(m\in \left\{ 20,40,\ldots , N\right\} \) and generate 20 random matrices \({\mathbf {A}}\in {\mathbb {R}}^{m\times N}\), each together with a random signal \({\mathbf {x}}\in {\mathbb {R}}^N\). We assume there is no noise, i.e. \({\mathbf {e}}=0\). For a given group model \({\mathfrak {G}}\) the support of the signal \({\mathbf {x}}\) is determined as the union of G randomly selected groups. The components of \({\mathbf {x}}\) in the support are drawn independently from a standard Gaussian distribution, while all other components are set to 0. Our computations are processed for two classes of random matrices, Gaussian matrices and expander matrices, as described in Sect. 2. The Gaussian matrices are generated by drawing independent standard Gaussian values for each entry of \({\mathbf {A}}\) and afterwards normalizing each entry by \(\frac{1}{\sqrt{m}}\). The expander matrices are generated by randomly selecting \(d=\lfloor 2\log (N)/\log (G l)\rfloor \) indices in \(\left\{ 1,\ldots , m\right\} \) for each column of the matrix. The choice of d is motivated by the choice in [3]. Each of the algorithms is stopped if either the number of iterations reaches 1000 or if for any iteration \(i+1\) we have \(\Vert {\mathbf {x}}^{(i+1)}-{\mathbf {x}}^{(i)}\Vert _p<10^{-5}\). For the error in each iteration we use \(p=1\) for the calculations corresponding to the expander matrices and \(p=2\) for the Gaussian matrices. After termination of the algorithm we calculate the relative error of the returned signal \(\hat{{\mathbf {x}}}\) to the true signal \({\mathbf {x}}\), i.e. we calculate
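The two sensing-matrix ensembles described above can be generated as in the following sketch (hypothetical names, using NumPy; d is passed as a parameter rather than computed from N, G and l):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_matrix(m, N):
    """i.i.d. standard Gaussian entries, normalized by 1/sqrt(m)."""
    return rng.standard_normal((m, N)) / np.sqrt(m)

def expander_matrix(m, N, d):
    """Binary matrix with exactly d ones per column, at random rows."""
    A = np.zeros((m, N))
    for col in range(N):
        A[rng.choice(m, size=d, replace=False), col] = 1.0
    return A

A = expander_matrix(m=20, N=100, d=4)
assert A.shape == (20, 100) and (A.sum(axis=0) == 4).all()
```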
Again we use \(p=1\) for the calculations corresponding to the expander matrices and \(p=2\) for the Gaussian matrices. We call a signal recovered if the relative error is smaller than \(10^{-5}\). For the AMIHT and the AMEIHT the approximation accuracies of the head and the tail approximation algorithms are set to \(\alpha = 0.95\) and \(\beta =1.05\).
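The stopping rule and the recovery criterion just described amount to the following (a sketch with hypothetical helper names):

```python
import numpy as np

def converged(x_new, x_old, p, tol=1e-5):
    """Stop when consecutive iterates are closer than tol in the l_p norm."""
    return np.linalg.norm(x_new - x_old, ord=p) < tol

def relative_error(x_hat, x, p):
    """Relative error of the returned signal to the true signal."""
    return np.linalg.norm(x_hat - x, ord=p) / np.linalg.norm(x, ord=p)

x = np.array([0.0, 2.0, 0.0, -1.0])
x_hat = x + 1e-8
assert converged(x_hat, x, p=2)
assert relative_error(x_hat, x, p=2) < 1e-5   # counts as recovered
```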
For the exact signal approximation problem, which has to be solved in each iteration of the Model-IHT and the MEIHT, we implemented the Benders' decomposition procedure presented in Sect. 3.6. To this end the master problem is solved by CPLEX 12.6, while each optimal solution of the subproblem is calculated using the result of Lemma 4. Regarding the AM-IHT, for the head-approximation we implemented the greedy procedure given in Algorithm 4, while for the tail-approximation we implemented the procedure of Theorem 6. Again the LP in the latter procedure is solved by CPLEX 12.6. All computations were carried out on a cluster of 64-bit Intel(R) Xeon(R) CPU E5-2603 processors running at 1.60 GHz with 15 MB cache.
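The head-approximation is based on the classical greedy method for the underlying maximum weight coverage problem. A minimal sketch of this greedy idea — not a reproduction of Algorithm 4; the interface is hypothetical — picks G groups, each time the one covering the most remaining squared weight:

```python
def greedy_group_cover(w, groups, G):
    # w: signal values; groups: list of index lists; G: group budget.
    # Repeatedly pick the group with the largest additional covered
    # squared weight; classical greedy for maximum weight coverage.
    covered, chosen = set(), []
    for _ in range(G):
        gains = [sum(w[i] ** 2 for i in set(g) - covered) for g in groups]
        best = max(range(len(groups)), key=lambda g: gains[g])
        if gains[best] == 0:
            break
        chosen.append(best)
        covered |= set(groups[best])
    return chosen, covered
```

This greedy scheme is the one that yields the well-known \((1-1/e)\)-approximation guarantee for maximum coverage.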
The results of the computations are presented in the following diagrams. For all instances we measure the smallest number of measurements \(m^\#\) for which the median of the relative error to the true signal is at most \(10^{-5}\), i.e. the smallest number of measurements for which at least 50% of the signals were recovered. Furthermore we show the average number of iterations and the average time in seconds the algorithms need to successfully recover a signal, given \(m^\#\) measurements. We stop increasing the number of measurements m once \(m^\#\) is reached.
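The quantity \(m^\#\) can be extracted from the recorded errors as follows (a sketch under our own naming, not the authors' evaluation code):

```python
import numpy as np

def smallest_recovering_m(errors_by_m, tol=1e-5):
    # errors_by_m maps a measurement count m to the list of relative
    # errors over the 20 random trials; return the smallest m whose
    # median error is at most tol, i.e. >= 50% of signals recovered.
    for m in sorted(errors_by_m):
        if np.median(errors_by_m[m]) <= tol:
            return m
    return None
```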
In Figs. 6, 7 and 8 we show the development of \(m^\#\), the number of iterations and the runtime in seconds of all algorithms over all dimensions \(N\in \left\{ 100,200,\ldots ,900\right\} \) for block-groups, generated as explained above, with fixed value \(G=5\).
The smallest number of measurements \(m^\#\) which leads to a median relative error of at most \(10^{-5}\) is nearly linear in the dimension; see Fig. 6. For all algorithms the corresponding \(m^\#\) is very close, even for different amounts of overlap. Nevertheless the number of measurements \(m^\#\) is smaller for expander matrices than for Gaussian matrices. Furthermore, in the expander case the instances with overlap \(\lfloor \frac{l-1}{2}\rfloor \) have a smaller \(m^\#\). The average number of iterations performed by the algorithms fluctuates a lot in the Gaussian case. Here the value increases slowly for the Model-IHT while it increases more rapidly for the AM-IHT. In the expander case the numbers of iterations are very close for all algorithms and lie between 50 and 70 most of the time; see Fig. 7. The drop from \(N=100\) to \(N=200\) is due to the small value of d when \(N=100\). Furthermore the number of iterations is much lower in the expander case.
The average runtime in the Gaussian case is somewhat larger than in the expander case, as expected, since operations with dense matrices are more costly than with sparse ones. However, it may also be due to the larger number of iterations; see Fig. 8. Furthermore the runtime for the instances with overlap \(l-1\) is much larger in both cases. Here the AM-IHT (AM-EIHT) is faster than the Model-IHT (MEIHT) for the instances with overlap \(l-1\) while it is slightly slower for the others.
In Figs. 9, 10 and 11 we show the same quantities as above, but for varying \(G\in \left\{ 1,2,\ldots ,9\right\} \), a fixed dimension of \(N=200\) and d fixed to 7 for all values of G. Similar to Fig. 6 the value \(m^\#\) seems to be linear in G (see Fig. 9). Only the Model-IHT (MEIHT) for blocks with overlap \(l-1\) seems to require an exponential number of measurements in G to guarantee a small median relative error. Here the AM-IHT (AM-EIHT) performs much better. Interestingly, the number of iterations does not change much with increasing G for Gaussian matrices, while it grows for the AM-EIHT in the expander case. This is in contrast to the iteration results with increasing N; see Fig. 7. The runtime of all algorithms increases slowly with increasing G, except for the IHT variants for blocks with overlap \(l-1\), whose runtime explodes. For both instances the AM-IHT (AM-EIHT) is faster than the Model-IHT (MEIHT).
To conclude this section we selected the instances with dimension \(N=800\) and \(G=5\) and present the development of the median relative error over the number of measurements m; see Fig. 12. In the expander case the median relative error decreases rapidly and is nearly 0 already for \(\frac{m}{N}\approx 0.45\). Only for the MEIHT for blocks with overlap \(l-1\) is the relative error not close to 0 before \(\frac{m}{N}\approx 0.55\). For the Gaussian case the results look similar, with the only difference that a median relative error close to 0 is not reached before \(\frac{m}{N}\approx 0.6\).
Latent group lasso
In this section we present computational results for the latent group Lasso approach introduced in Sect. 2. We consider block-groups, and all group instances are generated as described in the previous section. We study the \(\ell _1/\ell _2\) variant presented in (14) and its \(\ell _0\) counterpart presented in (15). Both problems are implemented in CPLEX 12.6. For the \(\ell _0\) counterpart we implemented the integer programming formulation (16). For the given random Gaussian matrices \({\mathbf {A}}\in {\mathbb {R}}^{m\times N}\) and their linear measurements \({\mathbf {y}}\in {\mathbb {R}}^m\) we use \(L({\mathbf {x}})=\Vert {\mathbf {A}}{\mathbf {x}}-{\mathbf {y}}\Vert _2^2\), while for expander matrices we use \(L({\mathbf {x}})=\Vert {\mathbf {A}}{\mathbf {x}}-{\mathbf {y}}\Vert _1\). The latter choice is motivated by our computational tests, which showed that the \(\ell _1\)-norm performs better for expander matrices.
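Formulations (14)–(16) are not reproduced here, but in the standard latent group Lasso of [49] the signal is decomposed as \({\mathbf {x}}=\sum _{{\mathcal {G}}}{\mathbf {v}}_{{\mathcal {G}}}\) with \(\mathrm {supp}({\mathbf {v}}_{{\mathcal {G}}})\subseteq {\mathcal {G}}\), and a weighted sum of the group norms is penalized. A hedged sketch of evaluating such an objective (interface and names are ours, not the paper's formulation):

```python
import numpy as np

def latent_objective(A, y, V, lam, weights, p=2):
    # V: one latent vector v_G per group, supported on that group;
    # the candidate signal is x = sum_G v_G. The loss is the squared
    # l2 residual (Gaussian case) or the l1 residual (expander case),
    # plus lam times the weighted sum of the group l2-norms.
    x = np.sum(V, axis=0)
    r = A @ x - y
    loss = np.sum(r ** 2) if p == 2 else np.sum(np.abs(r))
    penalty = sum(d * np.linalg.norm(v) for d, v in zip(weights, V))
    return loss + lam * penalty
```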
We run all algorithms for random signals \({\mathbf {x}}\in {\mathbb {R}}^N\) in dimension \(N=200\). The number of measurements is varied in \(m\in \left\{ 20,40,\ldots , 2N\right\} \) and we generate 20 random matrices \({\mathbf {A}}\in {\mathbb {R}}^{m\times N}\), each together with a random signal \({\mathbf {x}}\in {\mathbb {R}}^N\). We assume there is no noise, i.e. \({\mathbf {e}}=0\). For a given group model \({\mathfrak {G}}\) the support of the signal \({\mathbf {x}}\) is determined as the union of \(G=5\) randomly selected groups. The components of \({\mathbf {x}}\) on the support are drawn independently and identically from a standard Gaussian distribution, while all other components are set to 0. Our computations are performed for two classes of random matrices, Gaussian matrices and expander matrices, generated as described in the previous section. After each run we calculate the relative error of the returned signal \(\hat{{\mathbf {x}}}\) to the true signal \({\mathbf {x}}\), i.e. we calculate \(\Vert \hat{{\mathbf {x}}}-{\mathbf {x}}\Vert _p/\Vert {\mathbf {x}}\Vert _p\),
where we use \(p=1\) for the calculations corresponding to the expander matrices and \(p=2\) for the Gaussian matrices. Additionally we calculate the pattern recovery error
which was defined in [49], and the probability of recovery, i.e. the fraction of instances which were successfully recovered. We call a signal recovered if the relative error is smaller than or equal to \(10^{-4}\).
All computations were performed for \(\lambda \in \left\{ 0.25,0.5,\ldots ,5 \right\} \) and \(d_{{\mathcal {G}}} = 1\) for all groups \({\mathcal {G}}\in {\mathfrak {G}}\). For each \(m\in \left\{ 20,40,\ldots , 2N\right\} \) we determine the \(\lambda \) with the best average relative error, and then choose the \(\lambda \) which attains the best average relative error most often over all m. For all experiments the optimal value was \(\lambda ^*=0.25\).
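The \(\lambda \)-selection rule just described can be stated compactly (an illustrative sketch; the names are ours):

```python
from collections import Counter

def select_lambda(avg_err):
    # avg_err: for each m, a dict mapping lambda to the average
    # relative error; pick, for each m, the best lambda, then return
    # the lambda that is best most often over all m.
    best_per_m = [min(d, key=d.get) for d in avg_err.values()]
    return Counter(best_per_m).most_common(1)[0][0]
```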
All computations were carried out on a cluster of 64-bit Intel(R) Xeon(R) CPU E5-2603 processors running at 1.60 GHz with 15 MB cache.
The results of the computations are presented in Figs. 13, 14, 15, 16 and 17. For each m we show the median relative error, the probability of recovery, the average pattern recovery error and the average number of selected groups G, each value calculated over all 20 matrices and for \(\lambda ^*=0.25\). In Fig. 17 we show for each m the value of \(\lambda \) which has the smallest average relative error.
The results in Fig. 13 show that the \(\ell _0\) variant of the latent group Lasso performs very well for Gaussian matrices. Even for a small number of measurements the median relative error is 0, while for expander matrices the error is never smaller than 0.6. The \(\ell _1/\ell _2\) variant for Gaussian matrices has a larger error for a small number of measurements, which decreases rapidly and is always 0 for m larger than 0.5N. In the expander case it is never smaller than 0.6 either. The same picture holds for the pattern recovery error; see Fig. 15. The only difference here is that for expander matrices the error also tends to 0. The results indicate that the optimal support is calculated for both variants and both types of matrices if m is large enough, but for expander matrices the latent group Lasso struggles to find the optimal component values of \({\mathbf {x}}\) on the support. Interestingly, the frequency of the groups does not significantly influence the results.
The probability of recovery for Gaussian matrices is 1 for all m for the \(\ell _0\) variant and is 1 for m larger than 0.8N for the \(\ell _1/\ell _2\) variant; see Fig. 14. In line with the results for the relative error, the probability of recovery for expander matrices is 0 for all m. The number of groups selected by the latent group Lasso is close to 5 for all m for the \(\ell _0\) variant; see Fig. 16. For the \(\ell _1/\ell _2\) variant with Gaussian matrices it is close to 5 for all m larger than 0.8N, while for expander matrices it is already close to 5 for all m larger than 0.5N. The value of the optimal \(\lambda \) is large for small m and is always 0.25 for larger m; see Fig. 17.
To summarize, the \(\ell _0\) latent group Lasso seems to outperform the \(\ell _1/\ell _2\) variant. It can even compete with the iterative algorithms tested in the previous section; the number of required measurements can be even smaller for the \(\ell _0\) latent group Lasso, while at the same time the support is correctly recovered. Nevertheless, in a real-world application the optimal \(\lambda \) is not known and has to be found. Furthermore, in contrast to the iterative algorithms studied in this work, it can never be guaranteed that the recovered support, and especially the number of groups calculated by the latent group Lasso, is optimal. The expander variant of the latent group Lasso performs much worse than the iterative algorithms. In particular, this approach fails to recover the true signal on all instances, although the true support can be found.
Conclusion
In this paper we revisited the model-based compressed sensing problem, focusing on overlapping group models with bounded treewidth and low frequency. We derived a polynomial time dynamic programming algorithm to solve the exact projection problem for group models with bounded treewidth, which is more general than the state-of-the-art considering loopless overlapping models. For general group models we derived an algorithm based on the idea of Benders' decomposition, which may run in exponential time but often performs better than dynamic programming in practice. We proved that the latter procedure generalizes from group-sparse models to group-sparse plus standard sparse models. The most dominant operation of iterative exact projection algorithms is the model projection. Hence our results show that the Model-IHT and the MEIHT run in polynomial time for group models with bounded treewidth. Alternatively, for group models with bounded frequency we show that another class of less accurate algorithms runs in polynomial time. More precisely, the AM-IHT and the AM-EIHT are algorithms using head- and tail-approximations instead of exact projections.
Using Benders' decomposition (with Gaussian and model-expander sensing matrices) we compare the minimum number of measurements required by, and runtimes of, each of the four algorithms (Model-IHT, MEIHT, AM-IHT and AM-EIHT) to achieve a given accuracy. In summary, the experimental results on overlapping block groups indicate that the number of measurements required to recover a signal is smaller for expander matrices than for Gaussian matrices. Furthermore, we could observe that the number of measurements needed to ensure a small relative error is smaller for the approximate versions of the algorithms. The runtime grows much faster with increasing N for Gaussian matrices than for expander matrices, which is what one expects when applying dense versus sparse matrices. In general the approximate versions of the algorithms may need a larger number of iterations, but their runtime is lower. This indicates that the larger number of iterations is compensated by the faster computation of the approximate projection problems in each iteration. In addition to the iterative algorithms we tested the latent group Lasso approach on the same instances and showed that the \(\ell _0\) variant outperforms the \(\ell _1/\ell _2\) variant and is even competitive with the iterative algorithms.
Notes
 1.
Recall that a clique in a graph is a set of mutually adjacent vertices.
 2.
Recall that a graph is cubic if every vertex is of degree 3, and planar if it can be drawn into the plane such that no two edges cross.
 3.
Actually, we are maximizing a submodular function subject to a uniform matroid constraint which is a simpler problem.
References
 1.
Ahn, K.J., Guha, S., McGregor, A.: Graph sketches: sparsification, spanners, and subgraphs. In: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 5–14. ACM, New York (2012)
 2.
Alam, M.J., Bekos, M.A., Kaufmann, M., Kindermann, P., Kobourov, S.G., Wolff, A.: Smooth orthogonal drawings of planar graphs. In: LATIN 2014: Theoretical Informatics—11th Latin American Symposium, Montevideo, Uruguay, March 31–April 4, 2014. Proceedings, pp. 144–155 (2014)
 3.
Bah, B., Baldassarre, L., Cevher, V.: Model-based sketching and recovery with expanders. In: SODA, pp. 1529–1543. SIAM, New York (2014)
 4.
Baldassarre, L., Bhan, N., Cevher, V., Kyrillidis, A., Satpathi, S.: Group-sparse model selection: hardness and relaxations. IEEE Trans. Inf. Theory 62(11), 6508–6534 (2016)
 5.
Baraniuk, R., Cevher, V., Duarte, M., Hegde, C.: Model-based compressive sensing. IEEE Trans. Inf. Theory 56(4), 1982–2001 (2010)
 6.
Baraniuk, R.G., Cevher, V., Wakin, M.B.: Low-dimensional models for dimensionality reduction and signal recovery: a geometric perspective. Proc. IEEE 98(6), 959–971 (2010)
 7.
Benders, J.F.: Partitioning procedures for solving mixed-variables programming problems. Numer. Math. 4(1), 238–252 (1962)
 8.
Berinde, R., Gilbert, A., Indyk, P., Karloff, H., Strauss, M.: Combining geometry and combinatorics: a unified approach to sparse signal recovery. In: 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pp. 798–805. IEEE, New York (2008)
 9.
Blumensath, T., Davies, M.E.: Sampling theorems for signals from the union of linear subspaces. IEEE Trans. Inf. Theory 2007, 30–56 (2007)
 10.
Blumensath, T., Davies, M.E.: Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal. 27(3), 265–274 (2009)
 11.
Bodlaender, H.L.: A linear-time algorithm for finding tree-decompositions of small treewidth. SIAM J. Comput. 25(6), 1305–1317 (1996)
 12.
Buchheim, C., Kurtz, J.: Robust combinatorial optimization under convex and discrete cost uncertainty. EURO J. Comput. Optim. 6(3), 211–238 (2018)
 13.
Candès, E., Tao, T.: Decoding by linear programming. IEEE Trans. Inf. Theory 51(12), 4203–4215 (2005)
 14.
Candès, E.J., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 52(2), 489–509 (2006)
 15.
Candès, E.J., Romberg, J., Tao, T.: Stable signal recovery from incomplete and inaccurate measurements. Commun. Pure Appl. Math. 59(8), 1207–1223 (2006)
 16.
Chandar, V.: A negative result concerning explicit matrices with the restricted isometry property. Technical Report (2008)
 17.
Cordeau, J., Furini, F., Ljubic, I.: Benders decomposition for very large scale partial set covering and maximal covering problems. Technical Report (2018)
 18.
DeVore, R.: Deterministic constructions of compressed sensing matrices. J. Complex. 23(4), 918–925 (2007)
 19.
Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)
 20.
Dwork, C., McSherry, F., Talwar, K.: The price of privacy and the limits of LP decoding. In: Proceedings of the 39th Annual ACM Symposium on Theory of Computing, pp. 85–94. ACM, New York (2007)
 21.
Eldar, Y.C., Mishali, M.: Robust recovery of signals from a structured union of subspaces. IEEE Trans. Inf. Theory 55(11), 5302–5316 (2009)
 22.
Foucart, S., Rauhut, H.: A Mathematical Introduction to Compressive Sensing. Springer, Berlin (2013)
 23.
Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, London (1979)
 24.
Geoffrion, A.M.: Generalized benders decomposition. J. Optim. Theory Appl. 10(4), 237–260 (1972)
 25.
Gilbert, A.C., Levchenko, K.: Compressing network graphs. In: Proceedings of the LinkKDD Workshop at the 10th ACM Conference on KDD, vol. 124 (2004)
 26.
Hegde, C., Indyk, P., Schmidt, L.: Approximation-tolerant model-based compressive sensing. In: Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1544–1561. Society for Industrial and Applied Mathematics, New York (2014)
 27.
Hegde, C., Indyk, P., Schmidt, L.: Approximation algorithms for model-based compressive sensing. IEEE Trans. Inf. Theory 61(9), 5129–5147 (2015)
 28.
Hegde, C., Indyk, P., Schmidt, L.: Fast algorithms for structured sparsity. Bull. EATCS 3, 117 (2015)
 29.
Hegde, C., Indyk, P., Schmidt, L.: A nearly-linear time framework for graph-structured sparsity. In: International Conference on Machine Learning, pp. 928–937 (2015)
 30.
Hochbaum, D.S., Pathria, A.: Analysis of the greedy approach in problems of maximum k-coverage. Nav. Res. Log. (NRL) 45(6), 615–627 (1998)
 31.
Hoory, S., Linial, N., Wigderson, A.: Expander graphs and their applications. Bull. Am. Math. Soc. 43(4), 439–562 (2006)
 32.
Huang, J., Zhang, T., Metaxas, D.: Learning with structured sparsity. J. Mach. Learn. Res. 12(November), 3371–3412 (2011)
 33.
Huang, J., Zhang, T., et al.: The benefit of group sparsity. Ann. Stat. 38(4), 1978–2004 (2010)
 34.
Indyk, P., Razenshteyn, I.: On model-based RIP-1 matrices. In: International Colloquium on Automata, Languages, and Programming, pp. 564–575. Springer, Berlin (2013)
 35.
Jafarpour, S., Xu, W., Hassibi, B., Calderbank, R.: Efficient and robust compressed sensing using optimized expander graphs. IEEE Trans. Inf. Theory 55(9), 4299–4308 (2009)
 36.
Jenatton, R., Audibert, J.Y., Bach, F.: Structured variable selection with sparsityinducing norms. J. Mach. Learn. Res. 12(October), 2777–2824 (2011)
 37.
Jenatton, R., Mairal, J., Obozinski, G., Bach, F.: Proximal methods for hierarchical sparse coding. J. Mach. Learn. Res. 12(July), 2297–2334 (2011)
 38.
Joseph, A., Barron, A.R.: Fast sparse superposition codes have near exponential error probability for \(R<C\). IEEE Trans. Inf. Theory 60(2), 919–942 (2014)
 39.
Kloks, T.: Treewidth, Computations and Approximations. Lecture Notes in Computer Science, vol. 842. Springer, Berlin (1994)
 40.
Kolar, M., Lafferty, J., Wasserman, L.: Union support recovery in multi-task learning. J. Mach. Learn. Res. 12(July), 2415–2435 (2011)
 41.
Kouvelis, P., Yu, G.: Robust Discrete Optimization and Its Applications, vol. 14. Springer, Berlin (2013)
 42.
Kulik, A., Shachnai, H., Tamir, T.: Maximizing submodular set functions subject to multiple linear constraints. In: Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 545–554. Society for Industrial and Applied Mathematics, New York (2009)
 43.
Kulik, A., Shachnai, H., Tamir, T.: Approximations for monotone and non-monotone submodular maximization with knapsack constraints. Math. Oper. Res. 38(4), 729–739 (2013)
 44.
Kyrillidis, A., Bah, B., Hasheminezhad, R., Dinh, Q.T., Baldassarre, L., Cevher, V.: Convex block-sparse linear regression with expanders–provably. In: Gretton, A., Robert, C.C. (eds.) Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, vol 51, pp. 19–27 (2016)
 45.
Lounici, K., Pontil, M., Van De Geer, S., Tsybakov, A.B., et al.: Oracle inequalities and optimal inference under group sparsity. Ann. Stat. 39(4), 2164–2204 (2011)
 46.
Musser, D.R.: Introspective sorting and selection algorithms. Softw. Pract. Exp. 27(8), 983–993 (1997)
 47.
Muthukrishnan, S., et al.: Data streams: algorithms and applications. Found. Trends\(^{\textregistered }\) Theor. Comput. Sci. 1(2), 117–236 (2005)
 48.
Negahban, S.N., Wainwright, M.J.: Simultaneous support recovery in high dimensions: benefits and perils of block \(\ell _{1}/\ell _{\infty }\)regularization. IEEE Trans. Inf. Theory 57(6), 3841–3863 (2011)
 49.
Obozinski, G., Jacob, L., Vert, J.P.: Group lasso with overlaps: the latent group lasso approach. Technical Report (2011)
 50.
Rao, N., Recht, B., Nowak, R.: Signal recovery in unions of subspaces with applications to compressive imaging. Technical Report (2012)
 51.
Rao, N.S., Nowak, R.D., Wright, S.J., Kingsbury, N.G.: Convex approaches to model wavelet sparsity patterns. In: 2011 18th IEEE International Conference on Image Processing, pp. 1917–1920. IEEE, New York (2011)
 52.
Robertson, N., Seymour, P.D.: Graph minors vs. excluding a planar graph. J. Comb. Theory Ser. B 41(1), 92–114 (1986)
 53.
Schmidt, L., Hegde, C., Indyk, P.: The constrained earth mover distance model, with applications to compressive sensing. In: 10th International Conference on Sampling Theory and Applications (SAMPTA) (2013)
 54.
Takeishi, Y., Kawakita, M., Takeuchi, J.: Least squares superposition codes with Bernoulli dictionary are still reliable at rates up to capacity. IEEE Trans. Inf. Theory 60(5), 2737–2750 (2014)
 55.
Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 68(1), 49–67 (2006)
 56.
Zhao, P., Rocha, G., Yu, B.: Grouped and hierarchical model selection through composite absolute penalties. Technical Report. Department of Statistics, UC Berkeley, p. 703 (2006)
Acknowledgements
Open Access funding provided by Projekt DEAL. BB acknowledges support from funding by the German Federal Ministry of Education and Research, administered by the Alexander von Humboldt Foundation, for the German Research Chair at AIMS South Africa.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Bah, B., Kurtz, J. & Schaudt, O. Discrete optimization methods for group model selection in compressed sensing. Math. Program. 190, 171–220 (2021). https://doi.org/10.1007/s10107-020-01529-7
DOI: https://doi.org/10.1007/s10107-020-01529-7
Keywords
 Compressed sensing
 Group models
 Iterative hard thresholding
 Maximum weight coverage problem
Mathematics Subject Classification
 90C10
 90C27