Discrete Optimization Methods for Group Model Selection in Compressed Sensing

In this article we study the problem of signal recovery for group models. More precisely for a given set of groups, each containing a small subset of indices, and for given linear sketches of the true signal vector which is known to be group-sparse in the sense that its support is contained in the union of a small number of these groups, we study algorithms which successfully recover the true signal just by the knowledge of its linear sketches. We derive model projection complexity results and algorithms for more general group models than the state-of-the-art. We consider two versions of the classical Iterative Hard Thresholding algorithm (IHT). The classical version iteratively calculates the exact projection of a vector onto the group model, while the approximate version (AM-IHT) uses a head- and a tail-approximation iteratively. We apply both variants to group models and analyse the two cases where the sensing matrix is a Gaussian matrix and a model expander matrix. To solve the exact projection problem on the group model, which is known to be equivalent to the maximum weight coverage problem, we use discrete optimization methods based on dynamic programming and Benders' Decomposition. The head- and tail-approximations are derived by a classical greedy-method and LP-rounding, respectively.


Introduction
In many applications involving sensors or sensing systems an unknown sparse signal has to be recovered from a relatively small number of measurements. The reconstruction problem in standard compressed sensing attempts to recover an unknown k-sparse signal x ∈ R N , i.e. it has at most k non-zero entries, from its (potentially noisy) linear measurements y = Ax + e. Here, A ∈ R m×N for m N , y ∈ R m and e ∈ R m is a noise vector, typically with a bounded noise level e 2 ≤ η; see [19,13,14]. A well-known result is that, if A is a random Gaussian matrix, the number of measurements required for most of the classical algorithms like 1 -minimization or Iterative Hard Thresholding (IHT) to successfully recover the true signal is m = O (k log (N/k)) [15,10].
In model-based compressed sensing we exploit second-order structures beyond the first order sparsity and compressibility structures of a signal to more efficiently encode and more accurately decode the signal. Efficient encoding means taking a fewer number of measurements than in the standard compressed sensing setting, while accurate decoding does not only include smaller recovery error but better interpretability of the recovered solution than in the standard compressed sensing setting. The second order structures of the signal are usually referred to as the structured sparsity of the signal. The idea is, besides standard sparsity, to take into account more complicated structures of the signal [5]. Nevertheless most of the classical results and algorithms for standard compressed sensing can be adapted to the model-based framework [5]. Numerous applications of model-based compressed sensing exist in practice. Key amongst these applications is the multiple measurement vector (MMV) problem, which can be modelled as a block-sparse recovery • We extend the results in [4] by proving that the projection problem can be solved in polynomial time if the incidence graph of the underlying group model has bounded treewidth. This includes the case when the intersection graph has bounded treewidth, which generalizes the result for acyclic graphs derived in [4]. We complement the latter result with a hardness result that we use to justify the bounded treewidth approach.
• We derive a Benders' decomposition procedure to solve the projection problem for arbitrary group models, assuming no restriction on the frequency or the structure of the groups. The latter procedure even works for the more general model combining group-sparsity with classical sparsity. We integrate the latter procedure into the Model-IHT and MEIHT algorithm.
• We apply the Approximate-Model IHT (AM-IHT) derived in [26,28] to Gaussian and expander matrices and to the case of group models with bounded frequency, which is the maximal number of groups an element is contained in. In the expander case we call the algorithm AM-EIHT. To this end we derive both, head-and tail-approximations of arbitrary precision using a classical greedy method and LProunding. Using the AM-IHT and the results in [26,28], this implies compressive sensing 2 / 2 recovery guarantees for group-sparse signals. We show that the number of measurements needed to guarantee a successful recovery exceeds the number needed by the usual model-based compressed sensing bound [5,9] only by a constant factor.
• We test the algorithms Model-IHT, MEIHT, AM-IHT and AM-EIHT on groups given by overlapping blocks for random signals and measurement matrices. We analyse and compare the minimal number of measurements needed for recovery, the run-time and the number of iterations of the algorithm.

Notation
In most of this work scalars are denoted by ordinary letters (e.g. x, N ), vectors and matrices by boldface letters (e.g. x, A), and sets by calligraphic capital letters (e.g., S). The cardinality of a set S is denoted by The support of a vector x ∈ R N is defined by supp(x) = {i ∈ [N ] | x i = 0}. For a given k ∈ N we say a vector x ∈ R N is k-sparse if |supp(x)| ≤ k. For a matrix A, the matrix A S denotes a sub-matrix of A with columns indexed by S. For a graph G = (V, E) and S ⊆ V , Γ(S) denotes the set of neighbours of S, that is the set of nodes that are connected by an edge to the nodes in S. We denote by e ij = (i, j) an edge connecting node i to node j. The set G i denotes a group of size g i and a group model is any subset of G = {G 1 , . . . , G M }; while a group model of order G ∈ [N ] is denoted by G G , which is a collection of any G groups of G. For a subset of groups S ⊂ G we sometimes write The p norm of a vector x ∈ R N is defined as

Compressed Sensing
Recall that the reconstruction problem in standard compressed sensing [19,13,14] attempts to recover an unknown k-sparse signal x ∈ R N , from its (potentially noisy) linear measurements y = Ax + e, where A ∈ R m×N , y ∈ R m for m N and e ∈ R m is a noise vector, typically with a bounded noise level e 2 ≤ η. The reconstruction problem can be formulated as the optimization problem where x 0 is the number of non-zero components of x. Problem (1) is usually relaxed to an 1 -minimization problem by replacing · 0 with the 1 -norm. It has been established that the solution minimizing the 1 -norm coincides with the optimal solution of (1) under certain conditions [14]. Besides the latter approach the compressed sensing problem can be solved by a class of greedy algorithms, including the IHT [10]. A detailed discussion on compressed sensing algorithms can be found in [22]. The idea behind the IHT can be explained by considering the problem Under certain choices of η and k the latter problem is equivalent to (1) [10]. Based on the idea of gradient descent methods, (2) can be solved by iteratively taking a gradient descent step, followed by a hard thresholding operation, which sets all components to zero except the largest k in magnitude. Starting with an initial guess x (0) = 0, the (n + 1)-th IHT update is given by where H k : R N → R N is the hard threshold operator and A * is the adjoint matrix of A.
Recovery guarantees of algorithms are typically given in terms of what is referred to as the p / q instance optimality [14]. Precisely, an algorithm has p / q instance optimality if for a given signal x it always returns a signal x with the following error bound x − x p ≤ c 1 (k, p, q)σ k (x) q + c 2 (k, p, q)η, where 1 ≤ q ≤ p ≤ 2, c 1 (k, p, q), c 2 (k, p, q) are constants independent of the dimension of the signal and σ k (x) q = min z: z 0≤k x − z q is the best k-term approximation of a signal (in the q -norm). Ideally, we would like to have 2 / 1 instance optimality [14]. It turned out that the instance optimality of the known algorithms depends mainly on the sensing matrix A. Key amongst the tools used to analyse the suitability of A is the restricted isometry property, which is defined in the following.
Definition 1 (RIP). A matrix A ∈ R m×N satisfies the p -norm restricted isometry property (RIP-p) of order k, with restricted isometry constant (RIC) δ k < 1, if for all k-sparse vectors x Typically RIP without the subscript p refers to case when p = 2. We use this general definition here because we will study the case p = 1 later. The RIP is a sufficient condition on A that guarantees optimal recovery of x for most of the known algorithms. If the entries of A are drawn i.i.d from a sub-Gaussian distribution and m = O (k log(N/k)), then A has RIP-2 with high probability and leads to the ideal 2 / 1 instance optimality for most algorithms; see [15]. Note that the bound O (k log(N/k)) is asymptotically tight. On the other hand, deterministic constructions of A or random A with binary entries with non-zero mean do not achieve this optimal m, and are faced with the so-called square root bottleneck where m = Ω k 2 ; see [18,16].

Sparse Sensing Matrices from Expander Graphs.
The computational benefits of sparse sensing matrices necessitated finding a way to circumvent the square root bottleneck for non-zero mean binary matrices. One such class of binary matrices is the class of adjacency matrices of expander graphs (henceforth referred to as expander matrices), which satisfy the weaker RIP-1. Expander graphs are objects of interest in pure mathematics and theoretical computer science, for a detailed discourse on this subject see [31]. We define an expander graph as follows: , E) be a left-regular bipartite graph with N left vertices, m right vertices, a set of edges E and left degree d. If for any ∈ (0, 1/2) and any S ⊂ [N ] of size |S| ≤ k it holds |Γ(S)| ≥ (1 − )dk, then H is referred to as a (k, d, )-expander graph.
An expander matrix is the adjacency matrix of an expander graph. Choosing m = O (k log(N/k)), then a random bipartite graph is an (k, d, )-expander graph with high probability [22]. Furthermore expander matrices achieve the sub-optimal 1 / 1 instance optimality [8]. For completeness we state the lemma in [35] deriving the RIC for such matrices.
Lemma 3 (RIP-1 for Expander Matrices, [35]). Let A be the adjacency matrix of a (k, d, )-expander graph H, then for any k-sparse vector x, we have The most relevant algorithm that exploits the structure of the expander matrices is the Expander-IHT (EIHT) proposed in [22]. Similar to the IHT algorithm it performs updates where M : R m → R N is the median operator and [M(z)] i = median {z j } j∈Γ(i) for each z ∈ R m . For expander matrices the EIHT achieves 1 / 1 instance optimality [22].

Model-based Compressed Sensing
Besides sparsity (and compressibility) signals do exhibit more complicated structures. When compressed sensing takes into account these more complicated structures (or models) in addition to sparsity, it is usually referred to as model-based compressed sensing or structured sparse recovery [5]. A precise definition is given in the following: Note that the classical k-sparsity studied in Section 2.2 is a special case of a structured sparsity model where all supports of size at most k are allowed. Popular structured sparsity models include tree-sparse, block-sparse, and group-sparse models [5]. In this work we study group-sparse models which we will introduce in Section 2.4.
Similar to the classical sparsity case the RIP property is defined for structured sparsity models.
In [5] it was shown that for a matrix A ∈ R m×N to have the Model-RIP with high probability the required number of measurements is m = O(k) for tree-sparse signals and m = O (kg + log (N/(kg))) for block-sparse signals with block size g, when the sensing matrices are dense (typically sub-Gaussian). In general for a given structured sparsity model M for sub-Gaussian random matrices the number of required measurements is m = O δ −2 M g log(δ −1 M ) + log(|M|) where g is the cardinality of the largest support in M. Furthermore the authors in [5] show that classical algorithms like the IHT can be modified for structured sparsity models to achieve instance optimality. To this end the hard thresholding operator H k used in the classical IHT is replaced by a model-projection oracle which for a given signal x ∈ R N returns the closest signal over all signals having support in M. We define the model-projection oracle in the following.
Definition 6 (Model-Projection Oracle [5]). Given p ≥ 1, a model-projection oracle is a function P M : R N → R N such that for all x ∈ R N we have supp(P M (x)) ∈ M and it holds From the definition it directly follows that P M (x) i = x i if i ∈ supp(P M (x)) and 0 otherwise. Note that in the case of classical k-sparsity the model-projection oracle is given by the hard thresholding operator H k . In contrast to this case, calculating the optimal model projection P M (x) for a given signal x ∈ R N and a given structured sparsity model M may be computationally hard. Depending on the model M finding the optimal model projection vector may be even NP-hard; see Section 2.4. The modified version of the IHT derived in [5] is presented in Algorithm 1.

Algorithm 1 (Model-IHT)
Input A, y, e with y = Ax + e and M Output Model-IHT(y, A, M), an M-sparse approximation to x 1: x (0) ← 0; n ← 0 2: while halting criterion false do 3: n ← n + 1 5: end while 6: Return: Model-IHT(y, A, M) ← x (n) Note that common halting criterion is given by a maximum number of iterations or a bound on the iteration error Model-Sparse Sensing Matrices from Expander Graphs.
In the sparse matrix setting the sparse matrices we consider as model expander matrices, which are adjacency matrices of model expander graphs, defined thus. In this setting the known results are sub-optimal. Using model expander matrices for tree-sparse models the required number of measurements to obtain instance optimality is m = k log (N/k) / log log (N/k) which was shown in [34,3].
A key ingredient in the analysis for the afore-mentioned sample complexity results for model expanders is the model-RIP-1, which is just RIP-1 for model expander matrices (hence they are also called model-RIP-1 matrices [34]). Consequently, Lemma 3 also holds for these model-RIP-1 matrices [34].
First, in [3] the Model Expander IHT (MEIHT) was studied for loopless overlapping groups and D-ary tree models. Similar to Algorithm 1 the MEIHT is a modification of EIHT where the hard threshold operator H k is replaced by the projection oracle P M onto the model M. Thus the update of the MEIHT in each iteration is given by In [3] the authors show that this algorithm always returns a solution in the model, which is highly desirable for some applications. The running time is given in the proposition below.
Proposition 8 (Proposition 3.1, [3]). The runtime of MEIHT is O (kNn) and O M 2 Gn + Nn for D-ary tree models and loopless overlapping group models respectively, where k is the sparsity of the tree model and G is the number of active groups (i.e. group sparsity of the model),n is the number of iterations, M is the number of groups and N is the dimension of the signal.

Group Models
The models of interest in this paper are group models. A group model is a collection . We denote G G as the structured sparsity model (i.e. group-sparse model) which contains all supports contained in the union of at most G groups in G, i.e.
We will always tacitly assume that . We say that a signal x ∈ R N is G-group-sparse if the support of x is contained in G G . If G is clear from the context, we simply say that x is group-sparse. Let g i = |G i | and denote g max = max i∈[M ] g i as the size of largest support in G G . The intersection graph of a group model is the graph which has a node for each group G i ∈ G and has an edge between G i and G j if the groups overlap, i.e. if G i ∩ G j = ∅; see [4]. We call a group model loopless if the intersection graph of the group model has no cycles. We call a group model block model if all groups have equal size and if they are pairwise disjoint. In this case the groups are sometimes called blocks. We define the frequency f of a group model as the maximum number of groups an element is contained in, i.e.
In [4] besides the latter group models, the more general models are considered where an additional sparsity in the classical sense is required on the signal. More precisely for a given budget G ∈ [M ] and a sparsity K ∈ [N ] they study the structured sparsity model Note that for K = N we obtain a standard group model defined as above. Both variants of group models defined above clearly are special cases of a structured sparsity model defined in Section 2.3. Therefore all results for structured sparsity models can be used for group-sparse models. To adapt Algorithm 1 a model projection oracle P G G (or P G G,K ) has to be provided. Note that for several applications we are not only interested in the optimal support of the latter projection but we want to find at most G groups covering this support. The main work of this paper is to analyse the complexity of the latter problem for group models and to provide efficient algorithms to solve it exactly or approximately. Given a signal x ∈ R N , the group model projection problem or sometimes called signal approximation problem is then to find a support S ∈ G G,K together with G groups covering this support such that ||x−x S || p is minimal, i.e. we want to solve the problem If the parameter K is not mentioned we assume K = N .
Baldassarre et al. [4] observed the following close relation to the NP-hard Maximum G-Coverage problem. Given a signal x ∈ R N , a group-sparse vectorx for which ||x −x|| 2 2 is minimum satisfiesx i ∈ {0, x i } for all i ∈ [N ]. For a vector with the latter property, holds and so minimizing ||x −x|| 2 2 is equivalent to maximizing N i=1x 2 i . Consequently, the group model projection problem with K = N is equivalent to the problem of finding an index set I ⊂ [M ] of at most G groups, i.e. |I| ≤ G, maximizing i∈ j∈I Gj x 2 i . This problem is called Maximum G-Coverage in the literature [30]. Despite the prominence of the latter problem, we will stick to the group model notation, since it is closer to the applications we have in mind and we will leave the regime of Maximum G-Coverage by introducing more constraints later.
We simplify the notation by defining w i = x 2 i for all i ∈ [N ]. Using this notation, the group model projection problem is equivalent to finding an optimal solution of the following integer program: Here, the variable u i is one if and only if the i-th index is contained in the support of the signal approximation, and v i is one if and only if the group G i is chosen.
Besides the NP-hardness for the general case the authors in [4] show that the group model projection problem can be solved in polynomial time by dynamic programming for the special case of loopless groups. Furthermore the authors show that if the intersection graph is bipartite the projection problem can be solved in polynomial time by relaxing problem (12). Similar results are obtained for the more general problem, where additional to the group-sparsity the classical K-sparsity is assumed, i.e. the additional constraint i∈ [N ] u i ≤ K is added to problem (12).
As stated in Section 2.3, the authors of [5] first study a special case of group models, i.e. block models, where the groups are non-overlapping and are all of equal size. The sample complexity they derived in that work for sub-Gaussian measurement matrices is m = O (Gg + log (N/(Gg))), where g is the fixed block size. However, in [3] the authors studied group models in the sparse matrix setting, besides other results they proposed the MEIHT algorithm for tree and group models. The more relevant result to this work show that for loopless overlapping group-sparse models with maximum group size g max , using model expander measurement matrices, the number of measurements required for successful recovery is m = Gg max log (N/(Gg max )) / log (Gg max ); see [3]. This results holds for general groups, the "looplessness" condition is only necessary for the polynomial time reconstruction using the MEIHT algorithm. Therefore, this sample complexity result also holds for the general group models we consider in this manuscript.

Group Lasso
The classical Lasso approach for k-sparse signals seeks to minimize a quadratic error penalized by the 1 -norm [22]. More precisely, for a given λ > 0 we want to find an optimal solution of the problem It is well known that using the latter approach for appropriate choices of λ leads to sparse solutions. The Lasso approach was already extended to group models in [55] and afterwards studied in several works for non-overlapping groups; see [32,40,45,48]. The idea is again to minimize a loss function, e.g. the quadratic loss, and to penalize the objective value for each group by a norm of the weights of the recovered vector restricted to the items in each group. An extension which can also handle overlapping groups was studied in [56,37]. In [49] the authors study what they call the latent group Lasso. To this end they consider a loss function L : R N → R and propose to solve the 1 / 2 -penalized problem for given weights λ > 0 and d G ≥ 0 for each G ∈ G. The idea is that for ideal choices of the latter weights a solution of Problem (14) will be sparse and its support is likely to be a union of groups. Nevertheless it is not guaranteed that the number of selected groups is optimal as it is the case for the iterative methods in the previous sections. Note that equivalently we can replace each norm w G by a variable z G in the objective function and add the quadratic constraint w G 2 2 ≤ (z G ) 2 . Hence Problem (14) can be modelled as a quadratic problem and can be solved by standard solvers like CPLEX.
The l 0 counterpart of Problem (14) was considered in [33] under the name block coding and can be formulated as Note that in contrast to Problem (14) an easy reformulation of Problem (15) into a continuous quadratic problem is not possible. Nevertheless we can reformulate it using the mixed-integer programming formulation where M i ∈ R can be chosen larger or equal to the entry |x i | of the true signal for each i ∈ [N ]. The variables v G ∈ {0, 1} have value 1 if and only if group G is selected for the support of x. As for the 1 / 2 variant it is not guaranteed that the number of selected groups is optimal. Note that the latter problem is a mixed-integer problem and therefore hard to solve in large dimension in general. Furthermore the efficiency of classical methods as the branch & bound algorithm depend on the quality of the calculated lower bound which depends on the values M i . Hence in practical applications where the true signal is not known good estimations of the M i values are crucial for the success of the latter method. Another drawback is that the best values for λ and the weights d G are not known in advance and have to be chosen appropriately for each application.

Approximation Algorithms for Model-Based Compressed Sensing
As mentioned in the last section solving the projection problem, given in Definition 6, may be computationally hard. To overcome this problem the authors in [28,27] present algorithms, based on the idea of IHT (and CoSaMP), which instead of solving the projection problems exactly, use two approximation procedures called head-and tail-approximation. In this section we will shortly describe the concept and the results in [28,27]. Note that we just present results related to the IHT, although similar results for the CoSaMP were derived as well in [28,27]. Given two structured sparsity models M, M H and a vector x, let H be an algorithm that computes a vector H(x) with support in M H . Then, given some α ∈ R (typically α < 1) we say that H is an Note that the support of the vector calculated by H is contained in M H while the approximation guarantee must be fulfilled for all supports in M. Moreover given two structured sparsity models M, M T let T be an algorithm which computes a vector Note that in general a head approximation does not need to be a tail approximation and vice versa. The cases studied in [27] are p = 1 and p = 2. For the case p = 2 the authors propose an algorithm called Approximate Model-IHT (AM-IHT), shown in Algorithm 2.

Algorithm 2 (AM-IHT)
Input A, y, e with Ax = y + e and M Output AM-IHT(y, A, M), an approximation to x 1: x 0 ← 0; n ← 0 2: while halting criterion is false do 3: holds. The authors in [27] prove that for a signal x ∈ R N with supp(x) ∈ M, noisy measurements y = Ax+e where A has M ⊕ M T ⊕ M H -model RIP with RIC δ, Algorithm 2 calculates a signal estimatex satisfying where τ depends on δ, α and β. Note that the condition (19) holds e.g. for approximation accuracies α > 0.9 and β < 1.1.
For the case p = 1 the authors replace Step 3 in Algorithm 2 by the update where M is the median operator defined as in Section 2.2. Under the same assumptions as above, but considering p = 1 for the head-and tail-approximations and A having the M-RIP-1, the authors in [27] show convergence of the adapted algorithm.

Comparison to Related Works
In a more illustrative way in Tables 1 and 2 below we show where our results stand vis-a-vis other results.
In Table 1 we show the studied models together with the derived sample complexity and the studied class of measurements matrices. In Table 2 we present the names of the studied algorithms, the class of the model projection, the class of the algorithm used to solve the model projection, the runtime complexity of the projection problem and the class of instance optimality. Here N is the dimension of the signal, k is the standard sparsity of a signal, G is the number of active groups (also referred to as group/block sparsity), g is the block size, g max is the size of the largest group, s is the size of the support of a forest (union of sub-graphs) in a given graphs, c is the maximum number of connected components in the forest, B is a bound on the total weights of edges in the forest, and ρ(H) is the weight-degree of a graph H, see respective references of the complexities for further explanation of these terms.

Expander This
Groups with overlaps

Algorithms with Exact Projection Oracles
In this section we study the exact group model projection problem which has to be solved iteratively in the Model-IHT and the MEIHT. We extend existing results for group-sparse models and pass from loopless overlapping group models (which was the most general prior to this work) to overlapping group models  Table 1. Here M is the number of groups,n is the number of iterations the algorithm takes to get a given solution, 21 refers to the mixed 1 and 2 group norm, and w T is the treewidth of the incidence graph. Abbreviations: SOCP -second order cone program, DP -dynamic programming, BD -Bender's decomposition, CD -covariate duplication, and BCD -blockcoordinate descent. For space purposes we neglect the big-O notation in the complexities. For algorithms whose runtimes are not explicitly stated but implied to be either polynomial or exponential in their respective references, we just state this without the explicit expression.

Reconstruction Algorithm
Model Projection Runtime Instance name projection algorithm complexity optimality [49] Group-LASSO Exact Model-based Approximate -Exponential 1 / 1 sparse recovery [3] MEIHT Exact DP of bounded treewidth and to general group models without any restriction on the structure. The graph representing a loopless overlapping group model has a treewidth of 1.
We start by showing that it is possible to perform exact projections onto overlapping groups with bounded treewidth using dynamic programming, see Section 3.1. While this procedure has a polynomial run-time bound it is restricted to the class of group models with bounded treewidth. Nevertheless we prove that the exact projection problem is NP -hard if the incidence graph is a grid which is the most basic graph structure without bounded treewidth. For the sake of completeness we solve the exact projection problem for all instances of group models by a method based on Benders' decomposition in Section 3.6. Solving an NP-hard problem this method does not yield a polynomial run-time guarantee but works well in practice as shown in [17]. In Section 3.5 we present an appropriately modified algorithm (MEIHT) with exact projection oracles for the recovery of signals from structured sparsity models. We derive corollaries for the general group-model case from existing works about run-time and convergence of this modified algorithm.
Recall the following notation: , and a group model is a collection The group-sparse model of order G is denoted by G G , which contains all supports S which are contained in the union of at most G groups of G, i.e. S ⊆ j∈I G j , I ⊆ [M ] and |I| ≤ G; see (10). We will interchangeably say that x or S is G G -sparse. Clearly group models are a special case of structured sparsity models. Assume s max is the size of the maximal support which is possible by selecting G groups out of G. For g max denoting the maximal size of a single group in G, i.e., we have s max ∈ O (Gg max ). Furthermore the number of possible supports is in O(M G ). Therefore applying the result from Section 2.3 we obtain (20) as the number of required measurements for a sub-Gaussian matrix to obtain the group-model-RIP with RIC δ with high probability, which induces the convergence of Algorithm 1 for small enough δ.

Group models of Low Treewidth
One approach to overcome the hardness of the group model projection problem is to restrict the structure of the group models considered. To this end we follow Baldassarre et al. [4] and consider two graphs associated to a group model G.
The intersection graph of G, I(G), is given by the vertex set V (I(G)) = G, and the edge set The incidence graph of G, B(G), is given by the vertex set V (B(G)) = [N ] ∪ G, and the edge set Note that the incidence graph is bipartite since an edge is always adjacent to an element e and a group S. See Fig. 1 for a simple illustration of the two constructions. Baldassarre et al. [4] prove that there is a polynomial time algorithm to solve the group model projection problem in the case that the intersection graph is an acyclic graph. Their algorithm uses dynamic programming on the acyclic structure of the intersection graph.
We generalize this approach and show that the same problem can be solved in polynomial time if the treewidth of the incidence graph is bounded. Following Proposition 9 below, this implies that the group model projection Problem can be solved in polynomial time if the treewidth of the intersection graph is bounded. We proceed by formally introducing the relevant concepts.

Tree Decomposition
LetḠ = (V, E) be a graph. A tree decomposition ofḠ is a tree T where each node x ∈ V (T ) of T has a bag B x ⊆ V of vertices ofḠ such that the following properties hold: 2. If B x and B y both contain a vertex v ∈ V , then the bags of all nodes of T on the path between x and y contain v as well. Equivalently, the tree nodes containing the vertex v form a connected subtree of T .
3. For every edge vw in E there is some bag that contains both v and w. That is, vertices in V can be adjacent only if the corresponding subtrees in T have a node in common.
The width of a tree decomposition is the size of its largest bag minus one, i.e. max x∈V (T ) |B x | − 1. The treewidth ofḠ, tw(Ḡ), is the minimum width among all possible tree decompositions ofḠ. Intuitively, the treewidth measures how 'treelike' a graph is: the smaller the treewidth is, the more treelike is the graph. The graphs of treewidth one are the acyclic graphs. Fig. 2 shows a graph of treewidth 2, together with a tree decomposition. Before stating any algorithms, we discuss the relation of the treewidth of the intersection and the incidence graphs of a given group model.
When bounding the treewidth of the graphs associated to a group model, it makes sense to consider the incidence graph rather than the intersection graph. This is due to the following simple observation.
Proposition 9. For any group model G it holds that tw(B(G)) ≤ tw(I(G)) + 1. However, for every t there exists a group model G such that tw(I(G)) − tw(B(G)) = t.
This statement is not necessarily new, but we quickly prove it in our language in order to be self-contained.
Proof of Proposition 9. To see the first assertion, let G be a group model. Let T be a tree decomposition of I(G) of width tw(I(G)). In the following, we attach leaves to T , one for each element in [N ], and obtain a tree decomposition of B(G). Each leaf will contain at most tw(I(G)) many elements of G and at most one element of [N ]. Hence we get tw(B(G)) ≤ tw(I(G)) + 1.
To construct the tree decomposition of B(G) pick any i ∈ [N ], and let G i be the set of groups in G containing i. Since all groups in G i contain i, the set G i is a clique in I(G). 1 Moreover, since T is a tree decomposition of I(G), the subtrees of the groups in G i mutually share a node. As subtrees of a tree have the Helly property, there is at least one node x of T such that G i ⊆ B x . We now add a new node x i with bag B xi = G i ∪ {i} and an edge between x i and x in T . Doing this for all i ∈ [N ] simultaneously, it is easy to see that we arrive at a tree decomposition T of B(G) of width at most tw(I(G)) + 1 which proves the first assertion.
To prove the second assertion consider, for any t, the group model G where Note that B(G) is a tree, hence tw(B(G)) = 1. In I(G), however, the set G is a clique of size t + 2. Thus, tw(I(G)) = t + 1 which implies tw(I(G)) − tw(B(G)) = t.
Consider now a tree decomposition T of a graphḠ. We say that T is a nice tree decomposition if every node x is of one of the following types.
• Leaf: x has no children and B x = ∅.
• Introduce: x has one child, say y, and there is a vertex v / ∈ B y ofḠ with B x = B y ∪ {v}.
• Forget: x has one child, say y, and there is a vertex v / ∈ B x ofḠ with B y = B x ∪ {v}.
• Join: x has two children y and z such that This kind of decomposition limits the structure of the difference of two adjacent nodes in the decomposition. A folklore statement (explained in detail in the classic survey by Kloks [39]) says that such a nice decomposition is easily computed given any tree decomposition ofḠ without increasing the width.

Dynamic Programming
In this section we derive a polynomial time algorithm for the group model projection problem for fixed treewidth. As in Section 2.4 we assume we have a given signal x ∈ R N and define w ∈ R N with w i = x 2 i . In the following we use the notation . The algorithm is presented using a nice tree decomposition of the incidence graph B(G) and uses the following concept. Fix a node x of the decomposition tree of B(G), a number i with 0 ≤ i ≤ G and a map c : (a) S contains only groups that appear in some bag of a node in the subtree rooted at node x, (d) of the elements in B x , S covers exactly those that are in c −1 (1). Formally, The objective value of the set S is given by w( S) + w(c −1 (1 ? )). Intuitively, a feasible solution to group model projection(x, i, c) does not cover elements labelled 0 or 1 ? , but covers all elements labelled 1. The elements labelled 1 ? are assumed to be covered by groups not yet visited in the tree decomposition. If group model projection(x, i, c) does not admit a feasible solution, we say that group model projection(x, i, c) is infeasible. The maximum objective value attained by a feasible solution to group , and inconsistent otherwise. Note that consistency of c is necessary to ensure feasibility of group model projection(x, i, c), but not sufficient.
Our algorithm processes the nodes of a nice tree decomposition in a bottom-up fashion. Fix a node x, a number i with 0 ≤ i ≤ G and a map c : B x → {0, 1, 1 ? }. We use dynamic programming to compute the value OPT(x, i, c), assuming we know all possible values OPT(y, j, c ) for all children y of x, all j with 0 ≤ j ≤ G, and all c : B y → {0, 1, 1 ? }.
In the following, for a subset S ⊂ B x the function c| S : S → {0, 1, 1 ? } is the restriction of c to S. We use Γ(v) to denote the neighborhood of v in B(G).
If c is not consistent, we may set OPT(x, i, c) = −∞ right away. We thus assume that c is consistent and distinguish the type of node x as follows.
• Introduce: let y be the unique child of x and let v / ∈ B y such that B x = B y ∪ {v}.
• Join: we set where y and z are the two children of x. The maximum is taken over all consistent colourings c , c : • Proof. The proof follows the individual steps of the dynamic programming algorithm. Consider the problem group model projection(x, i, c). If c is not consistent, we correctly set OPT(x, i, c) = −∞. We thus proceed to the case when c is consistent. Fix an optimal solution S of group model projection(x, i, c) if existent.
The leaf -node case is clear, so we proceed to the case of x being an introduce-node. Let y be the unique child of x and let v / . Assume that c(v) = 0. Since c is consistent, c −1 (1) ∩ Γ(v) = ∅ holds, and so S is an optimal solution to group model projection(y, i, c| By ). We may thus set OPT(x, i, c) = OPT(y, i, c| By ).
If c(v) = 1, S covers v, so we need to make sure some vertex u ∈ Γ(v) ∩ B x is contained in the solution in order for group model projection(x, i, c) to be feasible. Hence we have OPT(x, i, c) = OPT(y, i, c| Next we assume that v ∈ G. If c(v) = 0, we may simply put OPT(x, i, c) = OPT(y, i, c| By ). So, assume that c(v) = 1. If group model projection(x, i, c) is feasible and thus S exists, define S = S \ {v}. Now S is a solution to OPT(y, i, c ) for some c : ). Note that (y, c ) is compatible to (x, c). Consequently, OPT(x, i, c) is upper bounded by the right hand side of (21).
To see that OPT(x, i, c) is at least the right hand side of (21), let (y, c ) be compatible to (x, c) and let S be a solution to group model projection(y, i, c ) of objective value λ ∈ R. Then S ∪ {v} is a solution to group model projection(x, i, c) of objective value λ. Consequently, If x is a forget-node, let y be the unique child of x and let v / . This proves (22). If x is a join-node, let y and z be the two children of x and recall that B x = B y = B z . Let S be the collection of groups in S contained in the subtree rooted at y, and let S be the collection of groups in S contained in the subtree rooted at z. Since T is a tree decomposition, S ∩ S = S ∩ B x .
Note that S is a solution of group model projection(y, i 1 , c ) and S is a solution of group model projection(z, i 1 , c ) for some c , c : B x → {0, 1, 1 ? } and i 1 , The objective value of S equals This shows that OPT(x, i, c) is at most the right hand side of (23). Now letS be an optimal solution of group model projection(y, j 1 ,c) and letŜ be an optimal solution of group model projection(z, j 2 ,ĉ) where Note thatS andŜ exist since, as we have shown earlier, the colourings c and c satisfy the above assertions. ThenŜ ∪S is a solution of group model projection(x, i, c) with objective value max{OPT(y, j 1 ,c) + OPT(z, Consequently, OPT(x, i, c) is at least the right hand side of (23) and thus (23) holds.
By storing the best current solution alongside the OPT(x, i, c)-values we can compute an optimal solution together with OPT.

Runtime of the Algorithm
The computational complexity of the individual steps are as follows.
(a) Given the incidence graph B(G) on n = M + N vertices of treewidth w T , one can compute a tree decomposition of width w T in time O(2 O(w 3 T ) n) using Bodlaender's algorithm [11]. The number of nodes of the decomposition is in O(n).
(b) Given a tree decomposition of width w T with t nodes, one can compute a nice tree decomposition of width w T on O(w T t) nodes in O(w 2 T t) time in a straightforward way [39]. The running time of the dynamic programming is bounded as follows.
Theorem 11. The dynamic programming algorithm can be implemented to run in O(w T 5 w T G 2 N t) time given a nice tree decomposition of B(G) of width w T on t nodes.
Note that we can assume that t = O(w T n) with n = M + N . Together with the running time of the construction of the nice tree decomposition we can solve the exact projection problem on graphs with ). Proof of Theorem 11. Since the join-nodes are clearly the bottleneck of the algorithm, we discuss how to organize the computation in a way that the desired running time bound of O(w T 5 w T G 2 N ) holds in a node of this type.
So, let x be a join-node and assume that y and z are the two children of x. We want to compute OPT(x, i, c), for all colourings c : B x → {0, 1, 1 ? } and all i with 0 ≤ i ≤ G. Recall that we need to compute this value according to (23), that is,  ). This value can be computed in O(w T N ) time, since we are computing differences and unions of at most w T groups of size N each. We arrive at a total running time in O(w T 5 w T G 2 N ).

Remark 12.
The dynamic programming algorithm can be extended to include a sparsity restriction on the support of the signal approximation itself. So, we can compute an optimal K-sparse G-group-sparse signal approximation if the bipartite incidence graph of the studied group models is bounded. The running time of the algorithm increases by a factor of O(K).

Hardness on Grid-Like Group Structures
An r × r-grid is a graph with vertex set [r] × [r], and two vertices (a, b), (c, d) ∈ [r] × [r] are adjacent if and only if |a − c| = 1 and |b − d| = 0, or if |a − c| = 0 and |b − d| = 1. We also say that r is the size of the grid. Fig. 3 shows a 6 × 6-grid.

Figure 3: A 6 × 6-grid
Recall the group model projection problem can be solved efficiently when the treewidth of the incidence graph of the group structure is bounded, as shown in Section 3.1.
Definition 13 (Graph minor). Let G 1 and G 2 be two graphs. The graph G 2 is called a minor of G 1 if G 2 can be obtained from G 1 by deleting edges and/or vertices and by contracting edges.
A classical theorem by Robertson and Seymour [52] says that in a graph class C closed under taking minors either the treewidth is bounded or C contains all grid graphs.
Consequently, if C is a class of graphs that does not have a bounded treewidth, it contains all grids. Our next theorem shows that group model projection is NP-hard on group models G for which B(G) is a grid, thus complementing Theorem 11. Theorem 14. The group model projection problem is NP-hard even if restricted to instances G for which B(G) is a grid graph and the weight of any element is either 0 or 1.
Consider the following problem: given an n × n-pixel black-and-white image, pick k 2 × 2-pixel windows to cover as many black pixels as possible. This problem can be modeled as the group model projection problem on a grid graph where the weight of any element is either 0 or 1. See Fig. 4 for an illustration. Note that this group model is of frequency at most 4, and so we can do an approximate model projection and signal recovery using the result of Section 4.
Our proof is a reduction from the Vertex Cover problem. Recall that for a graphḠ, a vertex cover is a subset of the vertices ofḠ such that any edge ofḠ has at least one endpoint in this subset. The size of the smallest vertex cover ofḠ, the vertex cover number, is denoted τ (Ḡ).
Given a graphḠ and a number k as input, the task in the Vertex Cover problem is to decide whether G admits a vertex cover of size k. That is, whether τ (Ḡ) ≤ k. This problem is NP-complete even if restricted to cubic planar graphs [23]. 2 We use the following simple lemma in our proof.  We can now prove our theorem.
Proof of Theorem 14. The reduction is from Vertex Cover on planar cubic graphs. ConsiderḠ = (V, E) to be a planar cubic graph and let k be some number. Our aim is to compute an instance (G, w, k ) of the group model projection problem where B(G) is a grid such thatḠ has a vertex cover of size k if and only if G admits a selection of k groups that together cover elements of a total weight at least some threshold t.
First we embed the graphḠ in some grid H of polynomial size, meaning the vertices ofḠ get mapped to the vertices of the grid and edges get mapped to mutually internally disjoint paths in the grid connecting its endvertices. This can be done in polynomial time using an algorithm for orthogonal planar embedding [2]. We denote the mapping by π, hence π(u) is some vertex of H and π(vw) is a path from π(v) to π(w) in H for all u ∈ V and vw ∈ E.
Next we subdivide each edge of the grid 9 times, so that a vertical/horizontal edge of H becomes a vertical/horizontal path of length 10 in some new, larger grid H . We choose H such that the corners of H are mapped to the corners of H . In particular, |V (H )| ≤ 100|V (H)|. Let us denote the obtained embedded subdivision ofḠ in H by G , and let π denote the embedding. Moreover, let φ be the corresponding embedding of H into the subdivided grid H . Note that im π | V ⊆ im φ| V .
Let (A, B) be a bipartition of H . We may assume that π (u) is in A for all u ∈ V . We consider H to be the incidence graph B(G) of a group model G where the vertices in B correspond to the elements and the vertices in A correspond to the groups of G. We refer to the vertices in A as group-vertices and to vertices in B as element-vertices. Slightly abusing notation, we identify each group with its group-vertex and each element with its element-vertex and write G = A.
We observe that (a) G is an induced subgraph of H , (b) every vertex π (u), u ∈ V , has degree 3 in G and is a group-vertex, (c) every other vertex has degree 2 in G , and Next we will tweak the embedding ofḠ a bit, to get rid of paths π(uv) with the wrong parity. We do so in a way that preserves the properties (a)-(d). Let P 0 ⊆ {π (uv) : uv ∈ E(H)} be the set of all paths with a length 0 (mod 4), and let P 2 = {π (uv) : uv ∈ E(H)} \ P 0 . We want to substitute each path in P 0 by a path of length 2 (mod 4). For this, let u be the neighbour of u in the path π(uv). Note that the path π (uu ) in H starts with a vertical or horizontal path P from π (u) to π (u ) of length 10. We bypass the middle vertex of this path (an element-vertex) by going over two new element-vertices and one group-vertex instead. See Fig. 5 for an illustration.
To keep the notation easy we denote the newly obtained path by π (uv). Note that, after adding the bypass, the new path π (uv) is two edges longer and thus has length 2 (mod 4). We complete π to an embedding ofḠ by putting π (u) = π (u) and π (vu ) = π (vu ) for all u ∈ V and vu ∈ E with π (vu ) ∈ P 2 . Moreover, let us denote the changed embedding ofḠ by G . We observe that the new embedding G still satisfies the assertions (a)-(d) and, in addition, it holds that (e) every path connecting two vertices of degree 3 over vertices of degree 2 only has length 2 (mod 4).
Next we define the weights of the element-vertices by putting Assertion (d) implies that, for any subset S ⊆ G of size k there is an S ⊆ G of size at most k such that
Since w(u) = 0 for all elements in B \ V (G ), we may thus restrict our attention to the restricted group model G = A ∩ V (G ) on the element set B ∩ V (G ). Slightly abusing notation, any subset S ⊆ G is a vertex subset of I(G ) and w( S) equals the number of edges of I(G ) adjacent to the vertex set S in I(G ). Moreover, the graph I(G ) is obtained from the graphḠ by subdividing each edge an even number of times.
From Lemma 15 we know that there is some number t such that τ (Ḡ) = τ (I(G)) − t. Hence,Ḡ has a vertex cover of size k if and only if M has a cover of size k = k + t of total weight |E(I(G ))|. This, in turn, is the case if and only if M admits a cover of size k of total weight |E(I(G ))|. Since the construction of G can be done in polynomial time, the proof is complete.

MEIHT for General Group Structures
In this section we apply the results for structured sparsity models and for expander matrices to the group model case. The Model-Expander IHT (MEIHT) algorithm is one of the exact projection algorithms with provable guarantees for tree-sparse and loopless overlapping group-sparse models using model-expander sensing matrices [3]. In this work we show how to use the MEIHT for more general group structures. The only modification of the MEIHT algorithm is the projection on these new group structures. We show MEIHT's guaranteed convergence and polynomial runtime.
Note that as in [4], we are able to do model projections with an additional sparsity constraint, i.e. projection onto P G G,K defined in (11). Therefore Algorithm 3 works with an extra input K and the model projection P G G becomes P G G,K , retuning a G G,K -sparse approximation to x.

Algorithm 3 (MEIHT)
Input A, y, e with Ax = y + e, G, and G Output MEIHT(y, A, G, G), a G G -sparse approximation to x 1: x (0) ← 0; n ← 0 2: while halting criterion is false do 3: n ← n + 1 5: end while 6: Return: MEIHT(y, A, G The convergence analysis of MEIHT with the more general group structures considered here remains the same as for loopless overlapping group models discussed in [3]. We are able to perform the exact projection of P G G (and P G G,K ) as discussed in Section 3.6. With the possibility of doing the projection onto the model, we present the convergence results in Corollaries 16 and 17 as corollaries to Theorem 3.1 and Corollary 3.1 in [3] respectively. Corollary 16. Consider G to be a group model of bounded treewidth and S to be G G -sparse. Let the matrix A ∈ {0, 1} m×N be a model expander matrix with G 3G < 1/12 and d ones per column. For any x ∈ R N and e ∈ R m , the sequence of updates x (n) of MEIHT with y = Ax + e satisfies, for any n ≥ 0 Note that G 3G is the expansion coefficient of the underlying (s, d, G 3G )-model expander graph for A. This ensures that A satisfied model RIP-1 over all G 3G -sparse signals.
The proof of this Corollary can be done analogously to the proof of Theorem 3.1 in [3]. It is thus skipped and the interested reader is referred to [3].
Let us define the 1 -error of the best G G -term approximation to a vector This is then used in the following corollary.
Proof. Without loss of generality we initialize MEIHT with x (0) = 0. Upper bounding 1 − α n by 1 and using triangle inequalities with some algebraic manipulations, (24) simplifies to Using the fact that A is a binary matrix with d ones per column we have Ax S c 1 ≤ d x S c 1 . We also have e 1 ≥ α n x 1 when n ≥ log x 1 e 1 / log 1 α . Applying these bounds to (27) leads to for n = log x 1 e 1 / log 1 α . This is equivalent to (26) with c 1 = βd, c 2 = 1 + β, x (n) = x for the given n, and σ G G (x) 1 = x S c 1 because x S is the best G G -term approximation to the G G -sparse x. This completes the proof.
The runtime complexity of MEIHT still depends on the median operation and the complexity of the projection onto the model. However, as observed in [3], the projection onto the model is the dominant operation of the algorithm. Therefore, the complexity of MEIHT is of the order of the complexity of the projection onto the model. In the case of overlapping group models with bounded treewidth, MEIHT achieves a polynomial runtime complexity as shown in Proposition 18 below. On the other hand, when the treewidth of the group model is unbounded MEIHT can still be implemented by using the Bender's decomposition procedure in Section 3.6 for the projection which may have an exponential runtime complexity.
for the G G -group-sparse model with bounded treewidth w T , wheren is the number of iterations, M is the number of groups, G is the group budget and N is the signal dimension.
Proof. Before we start the MEIHT procedure we have to calculate a nice tree decomposition of the incidence graph of the group model. This can be done in . Then in each of the iterations of the MEIHT we have to solve the exact projection onto the model which is the dominant operation of the MEIHT. Since the projection on the group model with bounded treewidth w T can be done through the dynamic programming algorithm that runs in O((N + M )(w 2 T 5 w T G 2 N )), as proven in Section 3.1, this proves the result.
Remark 19. The convergence results above hold when MEIHT is modified appropriately to solve the standard K-sparse and G-group-sparse problem with groups having bounded treewidth, where the projection becomes P G G,K . However, in this case the runtime complexity in each iteration grows by a factor of O(K), as indicated in Remark 12.

Exact Projection for General Group Models
In this section we consider the most general version of group models, i.e. G = {G 1 , . . . , G M } is an arbitrary set of groups and G ∈ [M ] and K ∈ [N ] are given budgets. We study the structured sparsity model G G,K introduced in Section 2.4. Here additional to the number G of groups that can be selected we bound the number of indices to be selected in these groups by K (i.e. we consider a group-sparse model with an additional standard sparsity constraint). Note that setting K = N reduces this model G G,K to the general group model G G .
If we want to apply exact projection recovery algorithms like the Model-IHT and MEIHT to group models, iteratively the model projection problem has to be solved, i.e. in each iteration for a given signal x ∈ R N we have to find the closest signalx which has support in the model G G,K . In this section we will derive an efficient procedure based on the idea of Benders' Decomposition to solve the projection problem. This procedure is analysed and implemented in Section 5.
It has been proved that the group model projection problem for group models without a sparsity condition on the support is NP-hard [4]. Therefore the projection problem on the more general model G G,K is NP-hard as well. The latter problem can be reformulated by the integer programming formulation max w u Here w are the squared entries of the signal, the v-variables represent the groups and the u-variables represent the elements which are selected. Note that by choosing K = N we obtain the projection problem for classical group models G G .
To derive an efficient algorithm for the projection problem we use the concept of Benders' Decomposition which was already studied in [7,24]. The idea of this approach is to decompose Problem 29 into a master problem and a subproblem. Then iteratively the subproblem is used to derive feasible inequalities for the master problem until no feasible inequality can be found any more. This procedure has been applied to Problem 29 without the sparsity constraint on the u-variables in [17]. The following results for the more general Problem 29 are based on the the idea of Benders' Decomposition and extend the results in [17].
First we can relax the u-variables in the latter formulation without changing the optimal value, i.e. we may assume u ∈ [0, 1] N . We can now reformulate (29) by max µ Replacing the linear problem max u∈P (v) w u in (30) by its dual formulation, we obtain max µ where P D = {α, β, γ ≥ 0 | α + β i + γ i ≥ w i i = 1, . . . , N } is the feasible set of the dual problem. Since P D is a polyhedron and the minimum in (31) exists, the first constraint in (31) holds if and only if it holds for each vertex (α l , β l , γ l ) of P D . Therefore Problem (31) can be reformulated by max µ where (α 1 , β 1 , γ 1 ), . . . , (α t , β t , γ t ) are the vertices of P D . Each of the constraints for l = 1, . . . , t is called optimality cut.
The idea of Benders' Decomposition is, starting from Problem (32) containing no optimality cut (called the master problem), to iteratively calculate the optimal (v * , µ * ) and then find a optimality cut which cuts off the latter optimal solution. In each step the most violating optimality cut is determined by solving for the actual optimal solution v * . If the optimal solution fulfills then the optimality cut related to the optimal vertex (α * , β * , γ * ) is added to the master problem. This procedure is iterated until the latter inequality is not true any more. The last optimal v * must then be optimal for Problem (29) since the first constraint in (31) is then true for v * . If we use the latter Benders' Decomposition approach it is desired to use fast algorithms for the master problem (32) and the subproblem (33) in each iteration. By the following lemma an optimal solution of the subproblem can be easily calculated.
Proof. Note that for a given v * the dual problem of subproblem (33) is It is easy to see that there exists an optimal solution u * of Problem (34) where u i = 1 if and only if i ∈ I K v and u * i = 0 otherwise. We will use the dual slack conditions to derive the optimal values for α, β, γ. We obtain the following 4 cases: Case 1: If u * i = 0 and j:i∈Gj v * j > 0 , i.e. i ∈ I v \ I K v , then by conditions (35) we have β i = γ i = 0. To ensure the constraint α + β i + γ i ≥ w i , the value of α must be at least max i∈Iv\I K v w i .
Case 2: If u * i = 0 and j:i∈Gj v * j = 0, then i ∈ [N ] \ I v and the objective coefficient of β i in Problem (33) is 0. Therefore we can increase β i as much as we want without changing the objective value and therefore we set β i = w i to ensure the i-th constraint α + β i + γ i ≥ w i and set γ i = 0.
Case 3 If u * i = 1, i.e. i ∈ I K v , and j:i∈Gj v * j > 1 then by condition (35) β i = 0. Therefore to ensure the i-th constraint α + β i + γ i ≥ w i the value of γ i must be at least w i − α and since we minimize γ i in the objective function the latter holds with equality.
Case 4: If u * i = 1, i.e. i ∈ I K v , and j:i∈Gj v * j = 1 then we cannot use condition (35) to derive the values for β i and γ i . Nevertheless in this case both variables have an objective coefficient of 1 while α has an objective coefficient of K. By increasing α by 1 the objective value increases by K. In Cases 1 and 2 nothing changes, while for each index in Case 3 we could decrease γ i by 1 to remain feasible. But since at most K indices i fulfil the conditions of Case 3 we cannot improve our objective value by this strategy. Therefore α has to be selected as small as possible in Case 4, i.e. by Case 1 we set α = max i∈Iv\I K v w i and to ensure feasibility we set γ i = w i − α. This concludes the proof. The following theorem states that the masterproblem can be solved in pseudopolynomial time if the number of constraints is fixed. Nevertheless note that the number of iterations of the procedure described above may be exponential in N and M .
The latter problem is a special case of the robust knapsack problem with discrete uncertainty (sometimes called robust selection problem) with additional uncertain constant. In [12] the authors mention that the problem with an uncertain constant is equivalent to the problem without such a constant. Furthermore using the result in [41] Problem (36) Since for all solutions (α l , β l , γ l ) generated in Lemma 20 it holds that α l , β l i , γ l i ≤ max i∈[N ] w i we have C ≤ (2N + 1)W which proves the result.

Algorithms with Approximation Projection Oracles
As mentioned in the previous sections solving the group model projection problem is NP -hard in general. Therefore to use classical algorithms as the Model-IHT or the MEIHT we have to solve an NP -hard problem in each iteration. To tackle problems like this the authors in [28,27] introduced an algorithm called Approximate Model-IHT (AM-IHT) which is based on the idea of the classical IHT but which does not require an exact projection oracle (see Section 2.5). Instead the authors show that under certain assumptions on the measurement matrix a signal can be recovered by just using two approximation variants of the projection problem which they call head-and tail-approximation (for further details again see Section 2.5).
In this section we apply the latter results to group models of bounded frequency: group models where the maximum number of groups an element is contained in is bounded by some number f . Note that from Theorem 14 we know that group model projection is NP-hard for group models of frequency 4. A particularly interesting case of such group structures is the graphic case, where each element is contained in at most two groups. Understanding this case was proposed as an open problem by Baldassarre et al. [4].
Furthermore we apply the theoretical results derived in [28,27] to group models and show that the number of required measurements compared to the classical structured sparsity case increases by just a constant factor. In Section 5 we will computationally compare the AM-IHT to the Model-IHT and the MEIHT on interval groups.

Head-and Tail-Approximations for Group Models with Low Frequency
In this section we derive head-and tail-approximations for the case of group models with bounded frequency. We first recall the definition of head-and tail-approximations for the case of group models. Assume we have given a group model G together with G ∈ N.
Given a vector x, let H be an algorithm that computes a vector H(x) with support in G G for some G ∈ N. Then, given some α ∈ R (typically α < 1) we say that H is an (α, G, G , p)-head approximation if In other words, H uses G many groups to cover at least an α-fraction of the maximum total weight covered by G groups. Note that G can be chosen larger than G. Moreover, let T be an algorithm which computes a vector T (x) with support in G G . Given some β ∈ R (typically β > 1) we say that T is a (β, G, G , p)-tail approximation if This means that T may use G many groups to leave at most a β-fraction of weight uncovered compared to the minimum total weight left uncovered by G groups. In the following we derive the head-and tail-approximation just for the case p = 1. Note that equivalent approximation procedures can be easily derived for the case p = 2 by replacing the accuracies α and β by √ α and √ β and using the weights x 2 i instead of |x i | in the latter proofs. We will first present a result which implies the existence of a head approximation.
The algorithm derived in [30] was designed to solve the Maximum G-Coverage problem and is based on a simple greedy method. The idea is to iteratively select the group which covers the largest amount of uncovered weight. It is proven by the authors that if you are allowed to select enough groups, namely G log 2 (1/ ) , then the optimal value is approximated up to an accuracy of (1 − ). The greedy procedure is given in Algorithm 4. Note that for a given signal x and a group G ∈ G we define w(G) := i∈G |x i |.
Next we derive a tail-approximation for our problems based on the idea of LP rounding. Note that in contrast to the head-approximation, the run-time bound of the following tail-approximation depends on the frequency of the group model. G, G, ), a head-approximation to x 1: z ← 0 2: for i = 1 . . . G do 3: Sort the groups in G by decreasing total weight:

5:
Delete the indices in G i1 from all groups in G. Proof. Given a signal x ∈ R N , we define w = (|x i |) i∈ [N ] . We consider the following linear relaxation of the group model projection problem max w u Consider an optimal solution (u, v) of (39). We compute a group cover S ⊆ G by Note that S contains at most κG many groups, since It remains to show that S is a tail approximation. To this end let R be the set of indices only barely covered by v, i.e.
for some i with j ∈ G i . Moreover, note that We obtain the inequalities where we used (40) and (41). Since u is an optimal solution of the relaxed problem (39), we have where S * is an optimal solution of the group model projection problem and u * the corresponding optimal vector of Problem (12). Therefore the latter procedure is a tail-approximation which completes the proof.

AM-IHT and AM-MEIHT for Group Models
As in the previous sections for a sensing matrix A and the true signal x we have a noisy measurement vector y = Ax + e for some noise vector e. The task consists in recovering the original signal x, or a vector close to x. Furthermore we have given a group model G together with G ∈ N with frequency f ∈ [N ].
In the last section we derived polynomial time algorithms for an ((1− ), G, G log 2 (1/ ) , 2)-head approximation and an ((1 + ), G, (1 + −1 )f G, 2)-tail approximation. Note that we can use any G here. Using the results in Section 2.5, we obtain convergence of the AM-IHT for group models if T is an ((1 + ), G, G T , 2)tail approximation, H is an ((1 − ), G T + G, G H , 2)-head approximation, where G T := (1 + −1 )f G and G H := (G T + G) log 2 (1/ ) , and the sensing matrix A has GG-RIP withG = G + G T + G H . Note that G ∈ O(G) for fixed accuracy > 0 and frequency f . Furthermore |GG| ∈ O(M cG ) for a constant c. Using the bound (20) we obtain that the number of required measurements for a sub-Gaussian random matrix A having the GG-RIP with high probability is which differs by just a constant factor from the number of measurements required in the case of exact projections (see Section 3). Under condition (19) convergence of the AM-IHT is ensured.

Extension to within-group sparsity and beyond
The head and tail approximation approach can be extended far beyond the standard group-sparsity model. It still works even if we are considering K-sparse and G-group-sparse (i.e. G G,K -sparse) vectors in our model, for example.
The reason is that the group model projection can be head approximated to a constant even in this case. Formally, if we search for the K weight maximal elements covered by G many groups, we are maximizing a submodular function subject to a knapsack constraint. 3 This is known to be approximable to a constant factor (cf. Kulik et al. [42,43] and related work). Suppose we delete the covered elements and run such an approximation algorithm again. Then, after a constant number of steps, we obtain a collection of groups and elements such that the total weight of the elements is at least an (1 − )-fraction of the total weight that a K-sparse and G-group-sparse solution could ever have. Moreover, the sparsity budgets are exceeded only by a constant factor each.
Similarly, the analysis given in the proof of Theorem 24 works even if we impose sparsity on both groups and elements. Again, we obtain a solution that has a (1 + )-tail guarantee whose support exceeds that of a G-group-sparse K-sparse vector by at most a constant factor if we assume a bounded frequency. This leads to the positive consequences detailed above.
More generally, any knapsack constraints on groups and elements can be handled, leading to head and tail approximations in the case when there are non-uniform sparsity budgets on the groups and elements. However the corresponding head approximations are rather involved, and certainly much less efficient than the simple greedy procedure proposed by the Hochbaum and Pathria algorithm [30].

Computations
In this section we present the computational results for the Model-IHT, MEIHT, AM-IHT and AM-EIHT presented in Section 2 for block-group instances. Precisely, we study block-groups, i.e. each group G ∈ G is a set of sequenced indices, G = [s, t] ∩ [N ], where 1 ≤ s < t ≤ N and each group has the same size |G| = l. For a given dimension N we generate blocks of size l = 0.02N . We consider two types of block-models, one where the successive groups overlap in l−1 2 items and another where they overlap in l − 1 items. Note that the frequency is then given by f = 2 or f = l respectively.
We run all algorithms for random signals x ∈ R N in dimension N ∈ {100, 200, . . . , 900}. For each dimension we vary the number of measurements m ∈ {20, 40, .., N } and generate 20 random matrices A ∈ R m×N each together with a random signal x ∈ R N . We assume there is no noise, i.e. e = 0. For a given group model G the support of the signal x is determined as the union of G randomly selected groups. The components of x in the support are calculated as identical and independent draws from a standard Gaussian distribution while all other components are set to 0. Our computations are processed for two classes of random matrices, Gaussian matrices and expander matrices as described in Section 2. The Gaussian matrices are generated by drawing identical and independent values from a standard Gaussian distribution for each entry of A and afterwards normalizing each entry by 1 √ m . The expander matrices are generated by randomly selecting d = 2 log(N )/ log(Gl) indices in {1, . . . , m} for each column of the matrix. The choice of d is motivated by the choice in [3]. Each of the algorithms is stopped if either the number of iterations reaches 1000 or if for any iteration i + 1 we have x (i+1) − x (i) p < 10 −5 . For the error in each iteration we use p = 1 for the calculations corresponding to the expander matrices and p = 2 for the Gaussian matrices. After the determination of the algorithm we calculate the relative error of the returned signalx to the true signal x, i.e. we calculate x −x p x p .
Again we use p = 1 for the calculations corresponding to the expander matrices and p = 2 for the Gaussian matrices. We call a signal recovered if the relative error is smaller than 10 −5 . For the AM-IHT and the AM-EIHT the approximation accuracy of the head-and the tail approximation algorithms are set to α = 0.95 and β = 1.05. For the exact signal approximation problem which has to be solved in each iteration of the Model-IHT and the MEIHT we implemented the Benders' decomposition procedure presented in Section 3.6. To this end the master problem is solved by CPLEX 12.6 while each optimal solution of the subproblem is calculated using the result of Lemma 20. Regarding the AM-IHT, for the head-approximation we implemented the greedy procedure given in Algorithm 4 while for the tail-approximation we implemented the procedure of Theorem 24. Again the LP in the latter procedure is solved by CPLEX 12.6. All computations were calculated on a cluster of 64-bit Intel(R) Xeon(R) CPU E5-2603 processors running at 1.60 GHz with 15MB cache.
The results of the computations are presented in the following diagrams. For all instances we measure the smallest number of measurements m # for which the median of the relative error to the true signal is at most 10 −5 , i.e. the smallest number of measurements for which at least 50% of the signals were recovered. Furthermore we show the average number of iterations and the average time in seconds the algorithms need to successfully recover a signal, given m # number of measurements. We stop increasing the number of measurements m if m # is reached.  The smallest number of measurements m # which leads to a median relative error of at most 10 −5 is nearly linear in the dimension; see Figure 6. For all algorithms the corresponding m # is very close even for different number of overlaps. Nevertheless the number of measurements m # is smaller for expander matrices than for Gaussian matrices. Furthermore in the expander case the instances with overlap l−1 The average runtime for the Gaussian case is a bit larger than the runtime for the expander case as expected, since operations with dense matrices are more costly than the sparse ones. However, it may be due to the larger number of iterations; see Figure 8. Furthermore the runtime for the instances with l − 1 overlap is much larger in both cases. Here the AM-IHT (AM-EIHT) is faster than the Model-IHT (MEIHT) for the instances with overlap l − 1 while it is slightly slower for the others.  In Figures 9, 10 and 11 we show the same properties as above, but for varying G ∈ {1, 2, . . . , 9}, a fixed dimension of N = 200 and d fixed to 7 for all values of G. Similar to Figure 6 the value m # seems to be linear in G (see Figure 9). Just the Model-IHT (MEIHT) for blocks with overlap l − 1 seems to require an exponential number of measurements in G to guarantee a small median relative error. Here the AM-IHT (AM-EIHT) performs much better. Interestingly the number of iterations does not change a lot for increasing G for Gaussian matrices while it grows for the AM-EIHT in the expander case. This is in contrast to the iteration results with increasing N ; see Figure 7. The runtime of all algorithms increases slowly with increasing G, except for the IHT for blocks with overlap l − 1 the runtime explodes. For both instances the AM-IHT (AM-EIHT) is faster than the Model-IHT (MEIHT).
To conclude this section we selected the instances for dimension N = 800 and G = 5 and present the development of the median relative error over the number of measurements m; see Figure 12. In the expander case the median relative error decreases rapidly and is nearly 0 already for m N ≈ 0.45. Just for the MEIHT for blocks with overlap l − 1 the relative error is close to 0 not until m N ≈ 0.55. For the Gaussian case the results look similar with the only difference that a median relative error close to 0 is reached not until m N ≈ 0.6.

Latent Group Lasso
In this section we present computational results for the latent group Lasso approach introduced in Section 2. We consider block-groups and all group instances are generated as described in the previous section. We study the 1 / 2 variant presented in (14) and its 0 counterpart presented in (15). Both problems are implemented in CPLEX 12.6. For the 0 counterpart we implemented the integer programming formulation (16). For the given random Gaussian matrices A ∈ R m×N and its linear measurements y ∈ R m we use L(x) = Ax − y 2 2 while for expander matrices we use L(x) = Ax − y 1 . The last choice is motivated by our comnputational tests which showed that the 1 -norm has a better performance for expander matrices.
We run all algorithms for random signals x ∈ R N in dimension N = 200. The number of measurements is varied in m ∈ {20, 40, .., 2N } and we generate 20 random matrices A ∈ R m×N each together with a random signal x ∈ R N . We assume there is no noise, i.e. e = 0. For a given group model G the support of the signal x is determined as the union of G = 5 randomly selected groups. The components of x in the support are calculated as identical and independent draws from a standard Gaussian distribution while all other components are set to 0. Our computations are processed for two classes of random matrices, Gaussian matrices and expander matrices generated as described in the previous section. After each calculation we calculate the relative error of the returned signalx to the true signal x, i.e. we calculate x −x p x p where we use p = 1 for the calculations corresponding to the expander matrices and p = 2 for the Gaussian matrices. Additionally we calculate the pattern recovery error which was defined in [49], and the probability of recovery, i.e. the fraction of instances which were successfully recovered. We call a signal recovered if the relative error is smaller or equal than 10 −4 . All computations were performed for λ ∈ {0.25, 0.5, . . . , 5} and d G = 1 for all groups G ∈ G. For each m ∈ {20, 40, .., 2N } the λ with the best average relative error is calculated and the λ which has the best average relative error most often over all m is chosen. For all experiments the optimal value was λ * = 0.25.
All computations were calculated on a cluster of 64-bit Intel(R) Xeon(R) CPU E5-2603 processors running at 1.60 GHz with 15MB cache.
The results of the computations are presented in Figures 13,14,15,16 and 17. For each m we show the median relative error, the probability of recovery, the average pattern recovery error and the average number of selected groups G; each value calculated over all 20 matrices and for λ * = 0.25. In Figure 17 we show for each m the value of λ with has the smallest average relative error.
The results in Figure 13 show that the 0 variant of the latent group Lasso performs very well for Gaussian matrices. Even for a small number of measurements the median relative error is 0 while for expander matrices the error never is smaller than 0.6. The 1 / 2 variant for Gaussian matrices has a larger error for small number of measurements which decreases rapidly and is always 0 for m larger than 0.5N . In the expander case it is never smaller than 0.6 as well. The same picture holds for the pattern recovery error; see Figure 15. The only difference here is that also for expander matrices the error tends to 0. The results indicate that the optimal support is calculated for both variants and for both types of matrices if m is large enough, but for expander matrices the latent group Lasso struggles to find the optimal component-values of x in the support. Interestingly the frequency of the groups does not significantly influence the results.
The probability of recovery for Gaussian matrices is 1 for all m for the 0 variant and is 1 for m larger than 0.8N for the 1 / 2 variant; see Figure 14. According to the results for the relative error the probability of recovery for expander matrices is 0 for all m. The number of groups selected by the latent group Lasso is close to 5 for all m for the 0 variant; see Figure 16. For the 1 / 2 variant with Gaussian matrices it is close  to 5 for all m larger than 0.8N while for the expander matrices it is already close to 5 for all m larger than 0.5N . The value of the optimal λ is large for small m and is always 0.25 for larger m; see Figure 17.  Figure 17: Value of λ with best mean relative error.
To summarize the results, it seems that the 0 latent group Lasso outperforms the 1 / 2 variant. It can even compete with the iterative algorithms tested in the previous section; the number of required measurements can be even smaller for the 0 latent group Lasso while at the same time the support is correctly recovered. Nevertheless in a real-word application the optimal λ is not known and has to be found. Furthermore in contrast to the iterative algorithms studied in this work it can never be guaranteed that the recovered support and especially the number of groups calculated by the latent group Lasso is optimal. The expander variant of the latent group Lasso performs much worse than for the iterative algorithms. Especially this approach fails to recover the true signal for all instances, while the true support can be found.

Conclusion
In this paper we revisited the model-based compressed sensing problem focusing on overlapping group models with bounded treewidth and low frequency. We derived a polynomial time dynamic programming algorithm to solve the exact projection problem for group models with bounded treewidth, which is more general than the state-of-the-art considering loopless overlapping models. For general group models we derived an algorithm based on the idea of Bender's decomposition, which may run in exponential time but often performs better than dynamic programming in practice. We proved that the latter procedure is generalizable from group-sparse models to group-sparse plus standard sparse models. The most dominant operation of iterative exact projection algorithms is the model projection. Hence our results show that the Model-IHT and the MEIHT run in polynomial time for group models with bounded treewidth. Alternatively, for group models with bounded frequency we show that another class of less accurate algorithms run in polynomial time. More precisely the AM-IHT and the AM-EIHT are algorithms using head-and tail-approximations instead of exact projections.
Using Benders' Decomposition (with Gaussian and model-expander sensing matrices) we compare the minimum number of measurements required by, and runtimes of, each of the four algorithms (Model-IHT, MEIHT, AM-IHT and AM-EIHT) to achieve a given accuracy. In summary the experimental results on overlapping block groups seem to indicate that the number of required measurements to recover a signal is smaller for expander matrices than for Gaussian matrices. Furthermore, we could observe that the number of measurements to ensure a small relative error is smaller for the approximate versions of the algorithms. The run-time gets much larger for Gaussian matrices with increasing N than for expander matrices, which might be just what is expected when applying dense versus sparse matrices. In general the approximate versions of the algorithms may have a larger number of iterations but the run-time is lower. This indicates that the larger number of iterations can be compensated by the faster computation of the approximate projection problems in each iteration. Additionally to the iterative algorithms we test the latent group Lasso approach on the same instances and show that the 0 variant outperforms the 1 / 2 variant and is even competitive to the iterative algorithms.