# Convex relaxations of penalties for sparse correlated variables with bounded total variation

DOI: 10.1007/s10994-015-5511-2

- Cite this article as:
- Belilovsky, E., Argyriou, A., Varoquaux, G. et al. Mach Learn (2015) 100: 533. doi:10.1007/s10994-015-5511-2


## Abstract

We study the problem of statistical estimation with a signal known to be sparse, spatially contiguous, and containing many highly correlated variables. We take inspiration from the recently introduced *k*-support norm, which has been successfully applied to sparse prediction problems with correlated features, but lacks any explicit structural constraints commonly found in machine learning and image processing. We address this problem by incorporating a total variation penalty in the *k*-support framework. We introduce the (*k*, *s*) *support total variation norm* as the *tightest* convex relaxation of the intersection of a set of sparsity and total variation constraints. We show that this norm leads to an intractable combinatorial graph optimization problem, which we prove to be NP-hard. We then introduce a tractable relaxation with approximation guarantees that scale well for grid structured graphs. We devise several first-order optimization strategies for statistical parameter estimation with the described penalty. We demonstrate the effectiveness of this penalty on classification in the low-sample regime, classification with M/EEG neuroimaging data, and background-subtracted image recovery with synthetic and real data. We extensively analyse the application of our penalty to the complex task of identifying predictive regions from low-sample, high-dimensional fMRI brain data, and show that our method is particularly useful compared to existing methods in terms of accuracy, interpretability, and stability.

### Keywords

Structured sparsity · Feature selection · Brain decoding · *k*-Support · Total variation

## 1 Introduction

Regularization methods utilizing the \(\ell _1\) norm such as Lasso (Tibshirani 1996) have been used widely for feature selection. They have been particularly successful at learning problems in which very sparse models are required. However, in many problems a better approach is to balance sparsity against an \(\ell _2\) constraint. One reason is that very often features are correlated and it may be better to combine several correlated features than to select fewer of them, in order to obtain a lower variance estimator and better interpretability. This has led to the method of *elastic net* in statistics (Zou and Hastie 2005), which regularizes with a weighted sum of \(\ell _1\) and \(\ell _2\) penalties. More recently, it has been shown that the elastic net is not in fact the tightest convex penalty that approximates sparsity (\(\ell _0\)) and \(\ell _2\) constraints at the same time (Argyriou et al. 2012). The tightest convex penalty is given by the *k*-support norm, which is parametrized by an integer *k*, and can be computed efficiently. This norm has been successfully applied to a variety of sparse vector prediction problems (Belilovsky et al. 2015; Gkirtzou et al. 2013; McDonald et al. 2014; Misyrlis et al. 2014).

We study the problem of introducing structural constraints to sparsity and \(\ell _2\), from first principles. In particular, we seek to introduce a total variation smoothness prior in addition to sparsity and \(\ell _2\) constraints. Total variation is a popular regularizer used to enforce local smoothness in a signal (Michel et al. 2011; Rudin et al. 1992; Tibshirani et al. 2005). It has successfully been applied in image de-noising and has recently become of particular interest in the neuroimaging community, where it can be used to reconstruct sparse but locally smooth brain activation (Baldassarre et al. 2012b; Michel et al. 2011). Two kinds of total variation are commonly considered in the literature, isotropic \(TV_I(w)= \Vert \nabla w\Vert _{2,1}\) and anisotropic \(TV_A(w)= \Vert \nabla w\Vert _1\) (Beck and Teboulle 2009). In our theoretical analysis we focus on the anisotropic penalty.
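To make the difference between the two total variation penalties concrete, the following is a minimal sketch for a 2-D image. It assumes forward differences with a zero difference at the image border, which is one common discretization and not necessarily the one used in the experiments; the function names are illustrative.

```python
import numpy as np

def forward_differences(img):
    """Vertical and horizontal forward differences of a 2-D array.

    Differences that would reach past the border are set to zero
    (an assumed boundary convention).
    """
    img = np.asarray(img, dtype=float)
    dv = np.zeros_like(img)
    dh = np.zeros_like(img)
    dv[:-1, :] = img[1:, :] - img[:-1, :]
    dh[:, :-1] = img[:, 1:] - img[:, :-1]
    return dv, dh

def tv_anisotropic(img):
    """TV_A(w) = ||grad w||_1: sum of absolute differences."""
    dv, dh = forward_differences(img)
    return float(np.abs(dv).sum() + np.abs(dh).sum())

def tv_isotropic(img):
    """TV_I(w) = ||grad w||_{2,1}: sum over pixels of the gradient's l2 norm."""
    dv, dh = forward_differences(img)
    return float(np.sqrt(dv ** 2 + dh ** 2).sum())

checker = np.array([[0.0, 1.0], [1.0, 0.0]])      # diagonal structure
step = np.array([[0.0, 0.0, 1.0, 1.0]] * 3)       # axis-aligned edge

print(tv_anisotropic(checker), tv_isotropic(checker))
print(tv_anisotropic(step), tv_isotropic(step))
```

The two penalties coincide on axis-aligned edges and differ when both difference directions are active at the same pixel, in which case the isotropic value is the smaller one.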

To derive a penalty incorporating these constraints we follow the approach of Argyriou et al. (2012) by taking the convex hull of the intersection of our desired penalties and then recovering a norm by applying the gauge function. We then derive a formulation for the dual norm which leads us to a combinatorial optimization problem, which we prove to be NP-hard. We find an approximation to this penalty and prove a bound on the approximation error. Since the *k*-support norm is the tightest relaxation of sparsity and \(\ell _2\) constraints, we propose to use the intersection of the TV norm ball and the *k*-support norm ball. This leads to a convex optimization problem in which (sub)gradient computation can be achieved with a computational complexity no worse than that of the total variation. Furthermore, our approximation can be computed for variation on an arbitrary graph structure.

We discuss and utilize several first-order optimization schemes, including stochastic subgradient descent, iterative Nesterov-smoothing methods, and FISTA with an estimated proximal operator. We demonstrate the tractability and utility of the norm through applications of classification on MNIST with few samples, M/EEG classification, and background-subtracted image recovery. For the problem of identifying predictive regions in fMRI we show that we can get improved accuracy, stability, and interpretability, and we provide the user with several potential tools and heuristics to visualize the resulting predictive models. These include several interesting properties that apply to the special case of *k*-support norm optimization as well.

## 2 Convex relaxation of sparsity, \(\ell _2\) and total variation

In this section we formulate the (*k*, *s*) *support total variation* norm, a tight convex relaxation of sparsity, \(\ell _2\), and total variation (TV) constraints. We derive its dual norm which results in an intractable optimization problem. Finally we describe a looser convex relaxation of these penalties which leads to a tractable optimization problem.

### 2.1 Derivation of the norm

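We start by considering the set of vectors with bounded sparsity, bounded \(\ell _2\) norm, and bounded variation,

\[ Q_{k,s}^2 := \left\{ w \in \mathbb {R}^d : \Vert w\Vert _0 \le k,\ \Vert w\Vert _2 \le 1,\ \Vert Dw\Vert _0 \le s \right\} , \]

where \(D \in \mathbb {R}^{m\times d}\) is a prescribed matrix. *D* generally takes the form of a discrete difference operator, but the discussion in the following sections is more general than that. It is easy to see that the set \(Q_{k,s}^2\) is not convex due to the presence of the \(\Vert \cdot \Vert _0\) terms. Hence using \(Q_{k,s}^2\) in a regularization method is impractical. Thus we consider instead the convex hull of \(Q_{k,s}^2\):

\[ C_{k,s}^2 := \mathrm {conv}\left( Q_{k,s}^2\right) . \]

Depending on *D*, *k* and *s*, this convex set may not span the entire \(\mathbb {R}^d\), that is, it may be contained within a smaller subspace. In Sect. 2.2 we show a condition for which the set will span \(\mathbb {R}^d\) (see Proposition 1). For a matrix *D* that is the transpose of an incidence matrix representing a graph with a maximum degree of \(l_{deg}\), the value of *s* should be greater than or equal to \(l_{deg}\).

For a suitable *D*,^{1} the convex set \(C_{k,s}^2\) is the unit ball of a certain norm. We call this norm the (*k*, *s*) *support total variation* norm. It equals the gauge function of \(C_{k,s}^2\), that is,

\[ \Vert x\Vert _{k,s}^{sptv} := \inf \left\{ \lambda > 0 : x = \lambda \sum _{i=1}^r c_i z_i,\ c_i \ge 0,\ \sum _{i=1}^r c_i = 1,\ z_i \in Q_{k,s}^2,\ r \in \mathbb {N} \right\} . \tag{1} \]

For a given decomposition of *x*, setting \( v_i=\lambda c_i z_i\) gives \(\lambda = \frac{\sum \nolimits _{i=1}^r \Vert v_i\Vert _2}{ \sum \nolimits _{i=1}^r c_i\Vert z_i \Vert _2} \;.\) To minimize \(\lambda \) we maximize the denominator for fixed \(v_i\), noting that \(\sum \nolimits _{i=1}^r c_i\Vert z_i \Vert _2 \le (\sum \nolimits _{i=1}^r c_i) \max \limits _{i=1}^r\Vert z_i \Vert _2 \le 1\). The equality can be attained by applying the constraints in Eq. (1). Substituting for \(\lambda \) and removing the constraints already applied above, our norm now becomes

\[ \Vert x\Vert _{k,s}^{sptv} = \inf \left\{ \sum _{i=1}^r \Vert v_i\Vert _2 : \sum _{i=1}^r v_i = x,\ \Vert v_i\Vert _0 \le k,\ \Vert Dv_i\Vert _0 \le s,\ r \in \mathbb {N} \right\} . \tag{2} \]

For \(s=m\) the variation constraint is vacuous and the norm reduces to the *k*-support norm (Argyriou et al. 2012), which trades off between the \(\ell _1\) norm (\(k=1,s=m\)) and the \(\ell _2\) norm (\(k=d,s=m\)). Formula (2) is combinatorial in nature and hence is difficult to directly include in an optimization problem.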

### 2.2 Derivation of the dual norm

A standard approach for analyzing structured norms is through analysis of the dual norm (Argyriou et al. 2012; Bach et al. 2012; Mairal and Yu 2013). As such, it will be useful to derive an expression for the dual norm of \(\Vert \cdot \Vert _{k,s}^{sptv}\). This will allow us to connect the norm with an optimization problem on a graph, to show that computing the norm is NP-hard, and to derive an approximation bound (Proposition 2).

To derive the dual of the (*k*, *s*) support TV norm we first consider a more general class of norms. Each norm in this class is associated with a set of subspaces \(S_1,\dots ,S_n\) and a set of norms \(\Vert \cdot \Vert _{(1)},\dots ,\Vert \cdot \Vert _{(n)}\). We assume that these subspaces span \(\mathbb {R}^d\), that is, \(\sum _{i=1}^n S_i = \mathbb {R}^d\), where the summation denotes addition of sets (\(S_1+S_2 = \{x: x=x_1+x_2 , x_1\in S_1, x_2\in S_2\}\)). We may now define the following norm

\[ \Vert x\Vert := \inf \left\{ \sum _{i=1}^n \Vert v_i\Vert _{(i)} : v_i \in S_i,\ \sum _{i=1}^n v_i = x \right\} . \tag{3} \]

Both the *k*-support and the (*k*, *s*) support TV norms can be written in the form (3) by specifying all *n* norms to be the \(\ell _2\) norm and the linear subspaces to correspond to the constraints on the supports.

We note that this definition is equivalent to an infimal convolution of *n* norms. Let \(\delta _S\) denote the indicator function of a subspace *S*, and write the infimal convolution \((f_1 \,\Box \,\dots \,\Box \,f_n)\) of *n* functions as \({{{\mathrm{\,\Box \,}}}}_{i=1}^n f_i\). Using this notation, the norm \(\Vert \cdot \Vert \) can be written equivalently as \(\Vert \cdot \Vert = {{{\mathrm{\,\Box \,}}}}_{i=1}^n \left( \Vert \cdot \Vert _{(i)} + \delta _{S_i}\right) \;.\) We may derive the general form of the dual norm \(\Vert \cdot \Vert ^{*}\) of \(\Vert \cdot \Vert \) by a direct application of standard duality results from convex analysis.

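**Lemma 1**

*The dual norm of* \(\Vert \cdot \Vert \) *defined in (3) is*

\[ \Vert x\Vert ^{*} = \max _{i=1}^n \, \min _{q\in S_i^{\perp }} \Vert x-q\Vert _{(i),*} \;, \tag{4} \]

*where* \(\Vert \cdot \Vert _{(i),*}\) *denotes the dual norm of* \(\Vert \cdot \Vert _{(i)}\) *and* \(S_i^{\perp }\) *the orthogonal complement of* \(S_i\).

*Proof*

We denote the *convex conjugate* or *Fenchel conjugate* of a function \(f:\mathbb {R}^d\rightarrow \mathbb {R}\cup \{+\infty \}\) by \(f^*\) (Bauschke and Combettes 2011). It is known that the convex conjugate of a norm equals the indicator function of its dual unit ball. Thus it holds that

\[ \left( \Vert \cdot \Vert _{(i)} + \delta _{S_i}\right) ^* = \delta _{B_{(i),*}} \,\Box \, \delta _{S_i^{\perp }} = \delta _{B_{(i),*} + S_i^{\perp }} \;, \]

where \(B_{(i),*}\) is the unit ball of \(\Vert \cdot \Vert _{(i),*}\) and \(\delta \) is extended to indicator functions of arbitrary sets. Since the conjugate of an infimal convolution is the sum of the conjugates, the unit ball of \(\Vert \cdot \Vert ^{*}\) is \(\bigcap _{i=1}^n \left( B_{(i),*} + S_i^{\perp }\right) \), which yields (4). \(\square \)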

Equation (4) for the dual norm is interpreted as the maximum of the distances of *x* (with respect to the corresponding dual norms) from the orthogonal complements. We now specialize this formula to the case of (*k*, *s*) support TV norm.

*Notation* We define \(G_k\) as all subsets of \(\{1,...,d\}\) of cardinality at most *k* and \(M_s\) as all subsets of \(\{1,...,m\}\) of cardinality at most *s*. For every \(I \in G_k\), we denote \(I^c = \{ 1,...,d\}\backslash I\) and for every \(J \in M_s, J^c = \{ 1,...,m\} \backslash J\). We denote \(D_{J^c}\) as the submatrix of *D* with only the rows indexed by \(J^c\) and for every \(u\in \mathbb {R}^d, u_I\) is the subvector of *u* with only the elements indexed by *I*.

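The number of components *r* in Eq. (2) can be assumed to be at most \(|G_k||M_s|\) (by grouping components with the same (*I*, *J*) pattern and applying the triangle inequality). Writing \(S_{I,J} := \{ w \in \mathbb {R}^d : w_{I^c} = 0,\ D_{J^c} w = 0 \}\) for the subspace of vectors supported on *I* whose variation is supported on *J*, we can now reduce the dual norm to

\[ \Vert x\Vert _{k,s}^{sptv*} = \max _{I \in G_k,\, J \in M_s} \left\| {{\mathrm{Proj}}}_{S_{I,J}}(x) \right\| _2 \;. \tag{5} \]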

**Proposition 1**

We now examine the dual norm for a particular choice of *D*, *k* and *s*. We choose *D* to be the transpose of the incidence matrix of a directed graph \(G_d=(\mathcal {V}_d,\mathcal {E}_d)\), with the vertices corresponding to the elements of *x*. Furthermore, \(G=(\mathcal {V},\mathcal {E})\) is an undirected graph with vertices \(\mathcal {V}=\mathcal {V}_d\) and an unordered set of the same edges as \(\mathcal {E}_d\). For a given *J*, we can consider the graph \(G_{J^c}\), specified by the incidence matrix \(D_{J^c}\), as the original graph with \(\left| {J}\right| \) edges removed. The notation presented is illustrated in Figure 1.

The constraint \(D_{J^c}x=0\) implies that for any two vertices *a*, *b* of the undirected graph *G*, if there exists a path between *a* and *b* in \(G_{J^c}\) then \(x_a = x_b\). More formally, if we divide \(G_{J^c}\) into all of its disjoint subgraphs, denoted by \(G_\gamma =(\mathcal {V}_\gamma ,\mathcal {E}_\gamma )\), then \(x_a = x_b\) for all \(a,b \in \mathcal {V}_\gamma \). This partitions the components of *x* into independent groups, since the projection and thereby the residual of components corresponding to vertices in disjoint groups will have independent contribution. The components of \({{\mathrm{Proj}}}_{S_{I,J}}(x)\) at \(I^c\) must be zero. Moreover, the members of any group that contains a vertex from \(I^c\) will be zero. We can therefore compute the projection norm \(E_{I,J}(x)=\Vert {{\mathrm{Proj}}}_{S_{I,J}}(x)\Vert _2\) independently for each disjoint group, and only for the groups that do not contain a vertex in \(I^c\). For each disjoint group the contribution to \(E_{I,J}^2(x)\) is

\[ \frac{1}{|\mathcal {V}_\gamma |} \Big( \sum _{i \in \mathcal {V}_\gamma } x_i \Big) ^2 \;. \]

We can additionally show that we can limit \(M_s\) to maximum cardinality sets (cardinality *s*) and \(G_k\) to maximum cardinality sets (cardinality *k*). Indeed, adding indexes in *I* or *J* cannot decrease \(S_{I,J}\) and hence cannot decrease the norm of the projection in Eq. (5). Thus we can narrow the problem to removing *s* edges and \(d-k\) nodes (with their associated subgraphs).
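The graph interpretation above can be checked numerically on very small graphs. The following is a brute-force sketch, not a practical algorithm: it enumerates every choice of *k* kept vertices and *s* removed edges, so its cost grows combinatorially. The edge-list representation and all function names are assumptions made for illustration; the per-group contribution used is the squared sum of the group divided by its size, which is the squared norm of the projection onto vectors that are constant on the group.

```python
import itertools
import numpy as np

def connected_components(num_vertices, edges):
    """Group vertices into connected components with a small union-find."""
    parent = list(range(num_vertices))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    groups = {}
    for v in range(num_vertices):
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())

def projection_norm_sq(x, edges, kept_vertices, removed_edges):
    """Squared l2 norm of the projection of x onto S_{I,J}.

    kept_vertices plays the role of I; removed_edges (indices into
    `edges`) plays the role of J.
    """
    remaining = [e for j, e in enumerate(edges) if j not in removed_edges]
    total = 0.0
    for group in connected_components(len(x), remaining):
        # Groups touching a vertex outside I are forced to zero.
        if all(v in kept_vertices for v in group):
            total += sum(x[v] for v in group) ** 2 / len(group)
    return total

def dual_norm_bruteforce(x, edges, k, s):
    """Maximize the projection norm over all (I, J) of maximal cardinality."""
    d, m = len(x), len(edges)
    best = 0.0
    for I in itertools.combinations(range(d), min(k, d)):
        for J in itertools.combinations(range(m), min(s, m)):
            best = max(best, projection_norm_sq(x, edges, set(I), set(J)))
    return float(np.sqrt(best))

# Path graph on four vertices.
x = [3.0, -1.0, 2.0, 0.5]
edges = [(0, 1), (1, 2), (2, 3)]
print(dual_norm_bruteforce(x, edges, k=2, s=2))
```

With all edges removable and all vertices kept the sketch returns the \(\ell _2\) norm of *x*, and with a single kept vertex it returns the largest absolute entry, matching the duals of the \(\ell _2\) and \(\ell _1\) special cases.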

We have now reduced the computation of the dual norm to a graph partitioning problem. Graph partition problems are often NP-hard, and we show this to be the case here as well:

**Theorem 1**

Computation of the (*k*, *s*) support total variation dual norm is NP-hard.

The proof of Theorem 1 is given in Appendix 1.

**Corollary 1**

Computation of the (*k*, *s*) support total variation norm is NP-hard.

In light of this Theorem, we are unable to incorporate the (*k*, *s*) support total variation norm in a regularized risk setting. Instead in the sequel we examine a tractable approximation with bounds that scale well for the family of graphs of interest.

### 2.3 Approximating the norm

Although the special cases where *s* equals *m* or 1 are tractable, the general case for arbitrary values of *s* leads to an NP-hard graph partitioning problem for the dual norm, implying the norm itself is intractable. We thus relax the problem by taking instead the intersection of the *k*-support norm ball and the convex relaxation of total variation. This leads to a penalty that combines the *k*-support norm \(\Vert w\Vert _k^{sp}\) with the total variation penalty *TV*(*w*).

**Proposition 2**

*Proof*

The dual of the *k*-support norm is the \(\ell _2\) norm of the largest *k* entries in |*u*|, and is known as the 2-*k* *symmetric gauge norm* (Bhatia 1997). Thus, for every \(a,w\in \mathbb {R}^d\), it holds that

*a*, we obtain

We pick *I* to be the set of indices corresponding to the largest *k* elements of |*x*|. We also pick

We note that we can fulfil the technical condition on the range of \(D^T\) by augmenting the incidence matrix in a manner that does not change the result of the regularized risk minimization. The condition that the submatrix \(D_{*I}\) has at least \(m-s\) zero rows has an intuitive interpretation when *D* is the transpose of an incidence matrix of a graph. It means that any group of *k* vertices in the graph involves at most *s* edges. This is true in many cases of interest, such as grid structured graphs if *s* is proportional to *k*. The term involving \(\Vert (D^{\scriptscriptstyle \top })^+\Vert _{\infty }^{2}\) is at most linear in the number of vertices. The quantity \(\Vert D\Vert ^2\), which corresponds to the maximum eigenvalue of the graph Laplacian, is bounded above by a constant for a given structure (e.g. a 2-D grid with a neighborhood of 4).

We have proposed a tractable approximation to the (*k*, *s*) support total variation norm, whose computation was shown to be NP-hard. We showed that the error from this approximation has a bound that scales well for the case of grid graphs. We now discuss some optimization strategies for this approximate penalty and demonstrate several experiments showing its utility.

### 2.4 Optimization

We consider three first-order strategies for optimizing objectives regularized with this penalty: iterated FISTA with Nesterov smoothing of *TV*(*w*), FISTA with an approximate computation of the proximal operator of \(\Vert w\Vert _k^{sp}+TV(w)\), and the Excessive Gap Method. A common concern in *TV*-related optimization is convergence. The former two methods have previously shown good empirical and theoretical convergence (Dohmatob et al. 2014; Dubois et al. 2014) and we describe specifics of their implementation with our objective below. However, these approaches do not provide optimality guarantees on the solution. For solving Eq. (12) we may apply the Excessive Gap Method, which has convergence guarantees on the duality gap. We describe the non-trivial analysis required for applying the Excessive Gap Method to our objective, which also requires the newly derived *k*-support ball projection operator in Sect. 2.4.1. We note that this section constitutes a preliminary proposal demonstrating that our objectives can be optimized with state-of-the-art convex optimization methods. A detailed analysis of the optimization is beyond the scope of this work, and we utilize a combination of the methods described throughout our experiments.

In Iterated FISTA, we may utilize the proximal operator for *k*-support along with Nesterov smoothing on the *TV*(*w*) term to make it differentiable (Dohmatob et al. 2014; Nesterov 2004). We can follow a strategy of repeatedly solving a FISTA problem with a progressively decreasing smoothing parameter on the *TV*(*w*) term as per Dubois et al. (2014), who provide analysis of such an approach, which they call CONESTA. This technique can be used to solve any of Eqs. (10), (11) and (12) given the relevant proximal mapping discussed in Sect. 2.4.1.
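As an illustration of the smoothing step, the sketch below evaluates a Nesterov-smoothed anisotropic TV term and its gradient. It assumes the standard smoothing \(TV_\mu (w)=\max _{\Vert u\Vert _\infty \le 1}\langle Dw,u\rangle -\frac{\mu }{2}\Vert u\Vert _2^2\), whose maximizer is obtained by clipping \(Dw/\mu \) to \([-1,1]\); the 1-D chain operator and the function names are illustrative choices, not the implementation used in the experiments.

```python
import numpy as np

def chain_difference_operator(d):
    """(d-1) x d forward-difference matrix of a 1-D chain graph."""
    D = np.zeros((d - 1, d))
    idx = np.arange(d - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D

def smoothed_tv(w, D, mu):
    """Value and gradient of the smoothed TV term for smoothing parameter mu > 0."""
    t = D @ w
    # Maximizer of <Dw, u> - (mu / 2) ||u||^2 over the l_inf unit ball.
    u = np.clip(t / mu, -1.0, 1.0)
    value = float(t @ u - 0.5 * mu * (u @ u))
    grad = D.T @ u
    return value, grad

w = np.array([0.0, 0.05, 1.0, 1.2, -0.3, -0.3])
D = chain_difference_operator(w.size)
mu = 0.1
value, grad = smoothed_tv(w, D, mu)
exact_tv = float(np.abs(D @ w).sum())
print(value, exact_tv)
```

The smoothed value lies between \(TV(w)-m\mu /2\) and \(TV(w)\), where *m* is the number of rows of *D*, which is why decreasing \(\mu \) across outer iterations tightens the approximation.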

We can estimate the proximal operator of \(\lambda _1 \Vert w\Vert _k^{sp}+\lambda _2 TV(w)\) using an accelerated proximal gradient method in the dual, as described in Beck and Teboulle (2009), and the projection operator onto the \(\Vert w\Vert _k^{sp}\) dual ball given in Chatterjee et al. (2014). This gives another approach: directly applying FISTA with the inexact proximal operator in order to solve Eq. (11).

To apply the Excessive Gap Method to *k*-support TV regularization, we note that the primal and the dual of Eq. (12) can be written as \(\min \nolimits _{\Vert w\Vert _k^{sp}\le \lambda _1} f(w)=\max \nolimits _{\Vert u\Vert _{\infty }\le 1}\phi (u)\), where the primal is given by \(f(w)=\hat{f}(w)+\max \nolimits _{\Vert u\Vert _{\infty }\le 1}\langle Dw,u \rangle \), and the dual is given by \(\phi (u)=-\hat{\phi }(u)+\langle Dw^*_u,u \rangle +\hat{f}(w^*_u)\) with \(w^*_u= \mathop {\hbox {arg min}}\nolimits _{\Vert w\Vert _k^{sp}\le \lambda _1}\{\langle Dw,u \rangle + \hat{f}(w)\}\).

The method requires the projection of a vector *z* onto the \(\ell _{\infty }\) ball, which we denote \(P_{\Vert \cdot \Vert _{\infty }\le 1}(z)\), obtained by truncating all values above magnitude 1. The relevant operations are then given by

Computing *x*(*u*) can be done using an accelerated projected gradient method and the projection onto the *k*-support ball derived in Sect. 2.4.1.

#### 2.4.1 Proximal operators associated with the *k*-support norm

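We consider the projection onto the *k*-support ball of size \(\lambda \), \(C_{\lambda }\). The *k*-support norm is given by

\[ \Vert w\Vert _k^{sp} = \left( \sum _{i=1}^{k-r-1} \left( |w|^{\downarrow }_i\right) ^2 + \frac{1}{r+1}\left( \sum _{i=k-r}^{d} |w|^{\downarrow }_i \right) ^2 \right) ^{\frac{1}{2}} , \tag{13} \]

where \(|w|^{\downarrow }_i\) denotes the *i*th largest element of |*w*| (with \(|w|^{\downarrow }_0 := +\infty \)) and \(r \in \{0,\dots ,k-1\}\) is the integer satisfying \(|w|^{\downarrow }_{k-r-1} > \frac{1}{r+1}\sum _{i=k-r}^{d} |w|^{\downarrow }_i \ge |w|^{\downarrow }_{k-r}\) (Argyriou et al. 2012). The projection onto the ball \(\Vert w\Vert _k^{sp}\le \lambda \) is given by: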

**Theorem 2**

Consider the projection onto \(C_{\lambda }\) of a vector *x*. If \(\Vert x\Vert _k^{sp}>\lambda \), define \(D_r=\sum \limits _{i=1}^{k-r-1}(|x|^{\downarrow }_i)^2\), \(T_{r,l}=\sum \limits _{i=k-r}^{l}|x|^{\downarrow }_i\), and \(n=l-k+r+1\), and construct the equation for \(\beta _{r,l}\):

The projection onto the *k*-support ball is given by finding *r*, *l* which satisfy the conditions:

Suitable *r* and *l* can be found in \(O(\log (k)\log (d-k))\).

*Proof Sketch*

Argyriou et al. (2012, Algorithm 1) specifies conditions on the proximal map of \(\frac{1}{2\beta }(\Vert w\Vert _{k}^{sp})^2\). For a given \(\beta \) there must be a corresponding \(\lambda \) such that \(\Vert w\Vert _{k}^{sp}=\lambda \), and therefore \(\Vert prox_{\frac{1}{2\beta }(\Vert w\Vert _{k}^{sp})^2}(x)\Vert _{k}^{sp}=\lambda \). Substituting Eq. (13) and the explicit form and constraints for \(prox_{\frac{1}{2\beta }(\Vert w\Vert _{k}^{sp})^2}(x)\) in Argyriou et al. (2012, Algorithm 1), we obtain Eq. (14) when the constraints are satisfied. Theorem 3 of Chatterjee et al. (2014) holds since the constraints are the same. \(\square \)
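For readers who want to experiment with the quantities used in this section, the following sketch evaluates the *k*-support norm through the closed form of Argyriou et al. (2012), together with its dual norm (the \(\ell _2\) norm of the *k* largest entries in magnitude). It is a direct linear search over *r* and not the projection algorithm of Theorem 2; the function names and the small numerical tolerance are assumptions made for illustration.

```python
import numpy as np

def k_support_norm(w, k):
    """k-support norm via the closed form of Argyriou et al. (2012)."""
    a = np.sort(np.abs(np.asarray(w, dtype=float)))[::-1]   # |w| in decreasing order
    d = a.size
    if not 1 <= k <= d:
        raise ValueError("k must satisfy 1 <= k <= d")
    tol = 1e-12 * max(1.0, a[0])
    for r in range(k):
        tail = a[k - r - 1:].sum()                          # sum_{i=k-r}^{d} |w|_i (1-based)
        avg = tail / (r + 1)
        # |w|_{k-r-1}, with the convention |w|_0 = +inf.
        upper = np.inf if k - r - 1 == 0 else a[k - r - 2]
        lower = a[k - r - 1]                                # |w|_{k-r}
        if upper >= avg - tol and avg >= lower - tol:
            head = (a[:k - r - 1] ** 2).sum()
            return float(np.sqrt(head + tail ** 2 / (r + 1)))
    raise RuntimeError("no valid r found; check the input for NaNs")

def k_support_dual_norm(u, k):
    """Dual norm: l2 norm of the k largest entries of |u|."""
    a = np.sort(np.abs(np.asarray(u, dtype=float)))[::-1]
    return float(np.sqrt((a[:k] ** 2).sum()))

w = np.array([3.0, 1.0, 1.0])
print(k_support_norm(w, 1), k_support_norm(w, 2), k_support_norm(w, 3))
```

For \(k=1\) the function returns the \(\ell _1\) norm and for \(k=d\) the \(\ell _2\) norm, consistent with the special cases discussed in Sect. 2.1.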

## 3 Experimental results

We evaluate the effectiveness of the introduced penalty on signal recovery and classification problems. We consider a sparse image recovery problem from compressed sensing, a small training sample classification task using MNIST, an M/EEG prediction task, and classification and recovery tasks for fMRI and synthetic data. We compare our regularizer against several common regularizers (\(\ell _1\) and \(\ell _2\)) and popular structured regularizers for problems with similar structure. In recent work \(\hbox {TV}+\ell _1\), which adds the TV and \(\ell _1\) constraints, has been heavily utilized for data with similar spatial assumptions (Dohmatob et al. 2014; Gramfort et al. 2013) and is thus one of our main benchmarks. Source code for learning with the *k*-support/TV regularizer is available at https://github.com/eugenium/StructuredSparsityRegularization.

### 3.1 Background subtracted image recovery

We apply *k*-support total variation regularization to a background-subtracted image reconstruction problem frequently used in the structured sparsity literature (Baldassarre et al. 2012a; Huang et al. 2009). We use a similar setup to Baldassarre et al. (2012a). Here we apply *m* random projections to a background-subtracted image along with Gaussian noise, and reconstruct the image using the projections and projection matrices. Our evaluation metric for the recovery is the mean squared pixel error. For this experiment we utilize a squared loss function and the iterative FISTA with smoothed TV described in Sect. 2.4.

We use *k*-support total variation to reconstruct the original images. We compute the normalized recovery error for different numbers of samples *m* and compare our regularizer to LASSO, \(\hbox {TV}+\ell _1\), and StructOMP. The latter is a structured regularizer which performs best on this problem in Huang et al. (2009). The average normalized recovery error is shown for different sample sizes in Figure 2(a). We used a separate set of images to set the parameters for each method.
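The measurement model of this experiment can be sketched as follows. The Gaussian projection matrix with \(1/\sqrt{m}\) scaling and the definition of the normalized error as \(\Vert \hat{x}-x\Vert _2/\Vert x\Vert _2\) are assumptions made for illustration, since the text does not specify them; the function names are hypothetical.

```python
import numpy as np

def random_projection_measurements(image, m, noise_std, rng):
    """Return (A, y) with y = A x + noise for m random projections of the image."""
    x = np.asarray(image, dtype=float).ravel()
    A = rng.standard_normal((m, x.size)) / np.sqrt(m)    # assumed scaling
    y = A @ x + noise_std * rng.standard_normal(m)
    return A, y

def normalized_recovery_error(x_hat, x_true):
    """Assumed definition: l2 error relative to the l2 norm of the true signal."""
    x_hat = np.asarray(x_hat, dtype=float).ravel()
    x_true = np.asarray(x_true, dtype=float).ravel()
    return float(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))

rng = np.random.default_rng(0)
image = np.zeros((8, 8))
image[2:5, 3:6] = 1.0                                    # a small contiguous foreground blob
A, y = random_projection_measurements(image, m=20, noise_std=0.01, rng=rng)
print(A.shape, y.shape)
```

A reconstruction method would then take `A` and `y` as input, and its output would be scored with `normalized_recovery_error` against the original image.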

In terms of recovery error we note that *k*-support total variation substantially outperforms LASSO and \(\hbox {TV}+\ell _1\), and outperforms StructOMP for low sample sizes. Further examination of the images reveals other advantages of the *k*-support total variation regularizer. An example for one image recovery scenario is shown at 2 different sample sizes in Figure 2(b). Here we can see that at low sample sizes StructOMP and LASSO can completely fail in terms of creating a visually coherent reconstruction of the image. \(\hbox {TV}+\ell _1\) recovery at the low sample size improves upon the latter methods, producing smooth regions, but still does not resemble the human shape pictured in the original image. *k*-support total variation has better visual quality at this low sample complexity, due to its ability to retain multiple groups of correlated variables in addition to the smoothness prior. For the case of a larger number of samples, illustrated by the bottom row of Figure 2(b), we note that although the recovery performance of StructOMP is better (lower error), the *k*-support total variation regularizer produces smoother and more coherent image segments.

### 3.2 Low sample complexity MNIST classification

Here *d* corresponds to the image size (\(28\times 28\)), and the cases \(k=1\) and \(k=d\) correspond to the \(\ell _1\) and \(\ell _2\) norm, respectively, when \(\lambda _2=0\). We test on the entire MNIST test set of 10000 images. We optimize a logistic loss function combined with our *k*-support total variation norm and compare to results from \(\ell _1\), \(\ell _2\), *k*-support norm, and \(\hbox {TV}/\ell _1\) penalties combined with logistic loss. We perform optimization using FISTA on the *k*-support norm (Argyriou et al. 2012; Nesterov 2004) and a smoothing applied to the total variation. For the graph structure, specified by *D*, we use a grid graph with each pixel having a neighborhood consisting of the 4 adjacent pixels. We obtain surprisingly high classification accuracy using just 18 training examples. The results in Table 1 show classification accuracy for each one versus all classifier and the average of the classifiers. In all but two cases the *k*-support TV norm outperforms the other regularizers. We note that for the digit 9 classification the difference between the best classifier and *k*-support/TV is not statistically significant.
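The grid graph described above can be written down explicitly. The sketch below builds a dense difference operator *D* (one row per edge, one column per pixel) for a 4-neighbour grid; the row-major pixel ordering, the edge orientation, and the dense storage are assumptions made for readability, and a sparse matrix would be preferable for larger images.

```python
import numpy as np

def grid_difference_operator(rows, cols):
    """Transpose of the incidence matrix of a 4-neighbour grid graph.

    Returns (D, edges): D has shape (num_edges, rows * cols) and
    edges[j] = (a, b) means row j of D computes w[b] - w[a].
    """
    index = np.arange(rows * cols).reshape(rows, cols)
    horizontal = np.stack([index[:, :-1].ravel(), index[:, 1:].ravel()], axis=1)
    vertical = np.stack([index[:-1, :].ravel(), index[1:, :].ravel()], axis=1)
    edges = np.concatenate([horizontal, vertical], axis=0)
    D = np.zeros((len(edges), rows * cols))
    rows_idx = np.arange(len(edges))
    D[rows_idx, edges[:, 0]] = -1.0
    D[rows_idx, edges[:, 1]] = 1.0
    return D, edges

D, edges = grid_difference_operator(28, 28)
max_degree = int(np.abs(D).sum(axis=0).max())
print(D.shape, max_degree)
```

For a \(28\times 28\) image this gives 1512 edges and a maximum vertex degree of 4, and \(\Vert Dw\Vert _1\) is the anisotropic total variation of the vectorized image *w*.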

Table 1: Accuracy for One versus All classifiers on MNIST using only 18 training examples and standard error computed on the test set

Class. | \(\ell _1\) | \(\ell _2\) | KS | \(\ell _1+\hbox {TV}\) | \(\hbox {KS}+\hbox {TV}\) |
---|---|---|---|---|---|
D0 | \( 93.62 \pm .01 \) | \( 93.49 \pm .01 \) | \( 93.68 \pm .02 \) | \( 96.22 \pm .01 \) | \( \varvec{96.27 \pm .01} \) |
D1 | \( 90.1 \pm .02 \) | \( 89.56 \pm .02 \) | \( 90.08 \pm .02 \) | \( 90.57 \pm .02 \) | \( \varvec{92.18 \pm .02} \) |
D2 | \( 78.28 \pm .03 \) | \( 77.28 \pm .03 \) | \( 78.25 \pm .03 \) | \( \varvec{81.47 \pm .02} \) | \( 81.39 \pm .03 \) |
D3 | \( 68.58 \pm .02 \) | \( 68.05 \pm .02 \) | \( 68.60 \pm .02 \) | \( 71.63 \pm .02 \) | \( \varvec{73.25 \pm .02} \) |
D4 | \( 83.81 \pm .01 \) | \( 82.55 \pm .01 \) | \( 83.76 \pm .01 \) | \( 84.69 \pm .01 \) | \( \varvec{84.79 \pm .01} \) |
D5 | \( 73.7 \pm .03 \) | \( 73.2 \pm .02 \) | \( 73.69 \pm .03 \) | \( 74.52 \pm .02 \) | \( \varvec{74.95 \pm .02} \) |
D6 | \( 93.48 \pm .01 \) | \( 93.37 \pm .01 \) | \( 93.51 \pm .01 \) | \( 93.71 \pm .01 \) | \( \varvec{94.08 \pm .01} \) |
D7 | \( 88.88 \pm .02 \) | \( 87.21 \pm .02 \) | \( 88.85 \pm .02 \) | \( 91.67 \pm .01 \) | \( \varvec{92.59 \pm .01} \) |
D8 | \( 70.79 \pm .02 \) | \( 72.07 \pm .03 \) | \( 72.75 \pm .02 \) | \( 73.23 \pm .02 \) | \( \varvec{73.10 \pm .02} \) |
D9 | \( 85.48 \pm .02 \) | \( 85.61 \pm .02 \) | \( 85.49 \pm .02 \) | \( 85.5 \pm .03 \) | \( 85.60 \pm .03 \) |

### 3.3 M/EEG prediction

We apply *k*-support total variation regularization to an M/EEG prediction problem from Backus et al. (2011), Zaremba et al. (2013), using the preprocessing from Zaremba et al. (2013). This results in data samples with 60 channels, each consisting of a time-series presumed to be independent across channels. Following Zaremba et al. (2013) we report results for subject 8 from this dataset. For the total variation graph structure, we impose constraints for adjacent samples within each channel, while values from different channels are not connected within the graph. In the original work a latent variable SVM with delay parameter *h* is used to improve alignment of the samples. We consider only the case \(h=0\), which reduces to the standard SVM. To directly compare our results we utilize hinge loss with a constant *C* of \(2\times 10^4\), the same regularization value used in Zaremba et al. (2013). Thus we optimize an objective consisting of the hinge loss and a weighted combination of the *k*-support and total variation norms, while maintaining a fixed weight for our regularizer comparable to Zaremba et al. (2013). We use \(k= 2500\) (approximately \(80\,\%\) of the dimensions) and \(\lambda = 0.1\). Table 2 shows the mean and standard deviation for the classification accuracy. We use the same partitioning of the data as described by Zaremba et al. (2013), and on average obtain an improvement over the original results. We note that \(\hbox {TV}+\ell _1\) regularization has relatively poor performance. We hypothesize this is because the data used is very noisy and not very sparse.

Table 2: M/EEG accuracy for SVM, *k*-support total variation regularized SVM, and \(\hbox {TV}+\ell _1\) regularized SVM computed over 5 folds

Classifier | Mean acc. (%) | Acc std. (%) |
---|---|---|
SVM (Zaremba et al. 2013) | 65.44 | 2.29 |
ksp-TV SVM | 66.84 | 3.42 |
TV-\(\ell _1\) SVM | 60.70 | 4.66 |

### 3.4 Prediction and identification in fMRI analysis

In this section we demonstrate the advantages of our sparse regularization method in the analysis of fMRI neuroimaging data. Brain activation in response to stimuli is normally assumed to be sparse and locally contiguous, so our proposed regularizer is well suited to describing our prior assumptions on this signal. An important aspect of analysing fMRI data is the ability to demonstrate how the predictive variables identified as important by an estimator correspond to relevant brain regions. Regularized risk minimization is one of few approaches which can handle the multivariate nature of this problem. However, in the presence of many highly correlated variables, such as those in brain regions with many adjacent voxels being activated by a stimulus, sparse regularization alone may admit many possible solutions with near equivalent predictive performance for small training sample sizes. Furthermore, from a practical standpoint, overly sparse solutions can be difficult to interpret when attempting to determine an implicated brain region. Thus regularization here allows us not only to converge to a good solution with lower sample complexity, but also to obtain more interpretable models from amongst the space of solutions with good prediction. Related to interpretability is solution stability: solutions which are more stable under different samples of training data, with regard to implicated voxels/regions, allow the practitioner to make a more trustworthy interpretation of the model (Misyrlis et al. 2014; Yan et al. 2014). We evaluate our approach taking all these factors into account.

We first analyze our method using a synthetic simulation of a signal similar to brain activation patterns. This gives us the opportunity to assess the true support recovery performance, which we cannot obtain with real data. We then analyze a popular block-design fMRI dataset from a study on face and object representation in the human ventral temporal cortex (Dohmatob et al. 2014) and perform experiments on predicting and in turn utilizing the predictive models for identifying the relevant regions of interest. We attempt to classify scans taken when a subject is shown a pair of scissors vs. when they observe scrambled pixels. We demonstrate that we can obtain improved accuracy, solution interpretability, and stability characteristics compared to previously applied sparse regularization methods incorporating spatial priors. For these experiments we use logistic loss and the \(TV_I(w)\) penalty, which has been shown to work better in fMRI analysis. Optimization is done using FISTA with an estimated proximal operator. As our baseline we focus on \(\hbox {TV}+\ell _1\), which has been recently popularized for fMRI applications, as well as \(\hbox {TV}+\ell _1+\ell _2\), which has been considered in structural MRI (Dubois et al. 2014).

We construct a synthetic problem with a known parameter vector *w* that we wish to recover. Figure 3 shows this ideal parameter vector. We construct data samples \(X=Yw+\varepsilon \), where *Y* is a sample from \(\{-1,1\}\) and \(\varepsilon \) is Gaussian noise. We take 150 training samples, 100 validation samples, and 1000 test samples. We consider a binary classification setting using only \(\ell _1\), \(\ell _2\), or *k*-support regularizers, Smooth-Lasso (Hebiri et al. 2011), the \(\hbox {TV}+\ell _1\) regularizer, \(\hbox {TV}+\ell _1+\ell _2\), and our *k*-support TV regularizer. For each of these scenarios we perform model selection using grid search and select the model with the highest accuracy on the validation set. We repeat this experiment with a new set of training, validation, and test samples 15 times so that we may obtain statistical significance results. The test set accuracy results for each method are shown in Table 3. For each competing method we perform a Wilcoxon signed-rank test against the *k*-support total variation results. In all listed cases the test rejects the null hypothesis (at a significance level of \(p<0.05\)) that the samples come from the same distribution. We assess the support recovery of each competing method by measuring the area under the precision-recall curve for different support thresholds. Finally we measure stability using Pearson correlation between weight vectors from different trials.
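The stability measure can be sketched as follows. The text states only that Pearson correlation between weight vectors from different trials is used; averaging the correlation over all unordered pairs of trials is an assumption made here for illustration, and the function name is hypothetical.

```python
import numpy as np

def stability_score(weight_vectors):
    """Mean pairwise Pearson correlation between weight vectors.

    weight_vectors: array of shape (num_trials, num_features), one
    learned weight vector per trial.
    """
    W = np.asarray(weight_vectors, dtype=float)
    corr = np.corrcoef(W)                         # each row is one trial
    upper = np.triu_indices(W.shape[0], k=1)      # every unordered pair once
    return float(corr[upper].mean())

# Toy example: two trials that agree up to scale, one noisier trial.
trials = np.array([
    [0.0, 1.0, 2.0, 0.0],
    [0.0, 2.0, 4.0, 0.0],
    [0.5, 1.0, 1.5, 0.0],
])
print(stability_score(trials))
```

A score close to 1 indicates that the learned maps agree across resampled training sets, while a score near 0 indicates that the implicated features change from trial to trial.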

Average test accuracy, support recovery, and test accuracy results for 15 trials of synthetic data along with *p*-value for a Wilcoxon signed-rank test performed for each method against the *k*-support/TV result, below 0.05 for all cases

| Description | Test acc. (*p*-value) | Supp. recovery | Stability |
|---|---|---|---|
| \(\ell _2\) | 67.8 % (7E-4) | 0.388 | 0.173 |
| \(\ell _1\) | 68.4 % (7E-4) | 0.377 | 0.220 |
| *k*-support | 68.1 % (7E-4) | 0.398 | 0.217 |
| Smooth-LASSO | 77.0 % (7E-4) | 0.407 | 0.464 |
| \(\hbox {TV}+\ell _1\) | 80.2 % (9E-3) | 0.739 | 0.620 |
| \(\hbox {TV}+\ell _1+\ell _2\) | 81.5 % (2E-2) | 0.796 | 0.688 |
| *k*-support/TV | 82.2 % | 0.816 | 0.719 |

In Figure 3 we visualize the weight vector and precision-recall curve produced by the various regularization methods for one trial. The *k*-support norm alone does a poor job of reconstructing a model with any of these local correlations in place. The Smooth-Lasso, \(\hbox {TV}+\ell _1\) and \(\hbox {TV}+\ell _1+\ell _2\) regularizers do a substantially better job of indicating the areas of interest for this task, but the *k*-support/TV regularizer produces more precise regions with fewer spurious patterns and substantially better classification accuracy and support recovery. The *k*-support/TV regularizer has an additional advantage over the other methods in the stability of its results across trials. Figure 3(c) also shows the effectiveness of the *k*-support/TV regularizer for varying target weight vectors.
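The two evaluation measures used here, support recovery and stability, can be sketched as below. The text does not specify how the precision-recall curve is integrated, so the sketch uses the step-wise average precision as one common estimate of the area. Taking the mean over all pairs of trials for stability is likewise an assumption, and the function names are hypothetical.

```python
import numpy as np

def pr_auc(true_support, scores):
    """Step-wise area under the precision-recall curve (average precision).

    true_support: boolean array marking the truly active variables.
    scores: nonnegative scores, e.g. |w|, thresholded at every level.
    Ties in scores are broken by index order in this sketch.
    """
    order = np.argsort(-scores, kind="stable")
    hits = true_support[order].astype(float)
    true_pos = np.cumsum(hits)
    precision = true_pos / np.arange(1, len(hits) + 1)
    # Precision is accumulated at each rank where recall increases.
    return float(np.sum(precision * hits) / hits.sum())

def stability(W):
    """Mean pairwise Pearson correlation between the rows of W.

    W has one estimated weight vector per row (one row per trial).
    """
    corr = np.corrcoef(W)
    upper = np.triu_indices(W.shape[0], k=1)
    return float(np.mean(corr[upper]))
```

For an estimated weight vector `w_hat` and a known ideal vector `w_true`, support recovery would then be `pr_auc(w_true != 0, np.abs(w_hat))`.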

*k* variable we can encourage a less sparse solution that may be more interpretable and include more highly correlated variables. Figure 4a shows this effect for maps of varying *k* values (note that \(k=1\) corresponds to \(TV+\ell _1\)).

*k* in *k*-support has an interpretable parameter setting for mixing sparsity and \(\ell _2\). We can interpret the *k* in our regularizer as an estimate of the number of voxel locations active in the brain. Thus we can set *k* based on prior knowledge. We fix the value of *k* to 500, representing approximately \(2\,\%\) sparsity. This allows us to compare directly to the state-of-the-art method for sparse regularization in fMRI, \(TV+\ell _1\), with an equal-sized search space in model selection. We show the accuracy and stability results for \(TV\,+\,\ell _1, TV\,+\,\ell _1\,+\,\ell _2\), and our \(\hbox {TV}+k\)-support regularization in Table 4.

Table 4: Average test accuracy and stability results for 20 trials, along with the *p*-value for a Wilcoxon signed-rank test performed for each method against the *k*-support/TV result

| Description | Test acc. (*p*-value) | Stability |
|---|---|---|
| \(\hbox {TV}+\ell _1\) | 84.72 (8E-4) | 0.132 |
| \(\hbox {TV}+\ell _1+\ell _2\) | 86.06 (0.15) | 0.186 |
| \(\hbox {TV}+k\)-support | 87.91 | 0.415 |

*k*-support norm. As we can see in Eq. (13), the *k*-support norm can be shown to be a combination of \(\ell _2\) penalties on the \(k-r-1\) highest-magnitude terms and an \(\ell _1\) penalty applied to the rest. Here *r* is the unique integer in \(\{0,\dots ,k-1\}\) satisfying

*k*-support ball where specific key variables fall into the \(\ell _2\) term, while we still attempt to squash the remaining terms with \(\ell _1\). This property of the optimization of our penalty implies a visualization heuristic for the final solution: take the top \(k-r-1\) variables. Another view on this heuristic comes from the implicit delineation implied by Eq. (3.4). For *k* much smaller than *d* and \(k-r-1\) greater than 0, the definition of *r* implies that the \((k-r-1)\)th largest-magnitude parameter will be a large factor (\(\frac{d-k+r}{1+r}\)) bigger than the mean of the rest of the parameters below it. Figure 5 illustrates thresholding based on a fixed threshold value and our heuristic of thresholding based on the final \(k-r-1\) value in the *k*-support TV optimization.
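The role of *r* in this heuristic can be made concrete with a short sketch. It assumes the standard definition of the *k*-support norm (Argyriou et al. 2012), in which *r* is the unique integer in \(\{0,\dots ,k-1\}\) with \(|w|^{\downarrow }_{k-r-1} > \frac{1}{r+1}\sum _{i=k-r}^{d}|w|^{\downarrow }_{i} \ge |w|^{\downarrow }_{k-r}\), where \(|w|^{\downarrow }_{i}\) is the *i*th largest magnitude and \(|w|^{\downarrow }_{0}=+\infty \). Eq. (13) may be written in a different but equivalent form. The function names below are hypothetical.

```python
import numpy as np

def find_r(a, k):
    """Return r in {0, ..., k-1} for magnitudes a sorted in decreasing order.

    Scanning from the largest r, the first r whose tail mean is at least
    |w|_{k-r} also satisfies the strict inequality, because the strict
    inequality at r is equivalent to the non-strict one failing at r + 1.
    """
    for r in range(k - 1, -1, -1):
        tail_sum = a[k - r - 1:].sum()  # sum over i = k-r, ..., d (1-indexed)
        if tail_sum >= (r + 1) * a[k - r - 1]:
            return r
    return 0  # not reached for nonnegative input: r = 0 always passes

def k_support_norm(w, k):
    """k-support norm: l2 on the top k-r-1 magnitudes, scaled l1 on the rest."""
    a = np.sort(np.abs(w))[::-1]
    r = find_r(a, k)
    head = a[:k - r - 1]
    tail_sum = a[k - r - 1:].sum()
    return float(np.sqrt(np.sum(head ** 2) + tail_sum ** 2 / (r + 1)))

def heuristic_support(w, k):
    """Indices of the k-r-1 largest-magnitude variables (thresholding heuristic)."""
    a = np.sort(np.abs(w))[::-1]
    r = find_r(a, k)
    return np.argsort(-np.abs(w), kind="stable")[:k - r - 1]
```

Under this definition, `k_support_norm(w, 1)` reduces to the \(\ell _1\) norm and `k_support_norm(w, len(w))` to the \(\ell _2\) norm, which matches the remark that \(k=1\) corresponds to \(TV+\ell _1\).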

## 4 Conclusions

We have introduced a novel norm that incorporates spatial smoothness and correlated sparsity.

This norm, called the (*k*, *s*) *support total variation norm*, extends both the total variation penalty, which is a standard in image processing, and the recently proposed *k*-support norm from machine learning. The (*k*, *s*) support TV norm is the *tightest convex penalty* that combines *sparsity*, \(\ell _2\) and *total variation* constraints jointly. We have derived a variational form for this norm for arbitrary graph structures. We have also expressed the dual norm as a combinatorial optimization problem on the graph. This graph problem is shown to be NP-hard, motivating the use of a relaxation, which is shown to be equivalent to the weighted combination of a *k*-support norm and a total variation penalty. We have shown that this norm approximates the (*k*, *s*) support TV norm within a factor that depends on properties of the graph as well as on the parameters *k* and *s*, and that this bound scales well for grid structured graphs. Moreover, we have demonstrated that joint *k*-support and TV regularization can be applied to a diverse variety of learning problems, such as classification with small samples, neural imaging and image recovery. These experiments have illustrated the utility of penalties combining *k*-support and total variation structure on problems where spatial structure, feature selection and correlations among features are all relevant. We have shown that this penalty has several unique properties that make it an excellent tool for the analysis of fMRI data. Some of our additional contributions include a generalized formulation of the dual norm of a norm which is the infimal convolution of norms, the first algorithm for projecting onto the *k*-support norm ball, and the first analysis that notes interesting practical properties of the *r* variable of the *k*-support norm.

The proof of this statement follows from the fact that optimization subject to the intersection of two constraints has a Lagrangian that is exactly a regularized risk minimization with the two corresponding penalties each with their own Lagrange multiplier.
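As a sketch of this argument, in generic symbols that are assumptions here rather than the notation used elsewhere in the paper: let \(f\) be the empirical risk, \(\Omega _1\) and \(\Omega _2\) the two penalties, and \(a\), \(b\) the corresponding bounds. The constrained problem \(\min _w f(w)\) subject to \(\Omega _1(w)\le a\) and \(\Omega _2(w)\le b\) has the Lagrangian

\[ L(w,\lambda _1,\lambda _2) = f(w) + \lambda _1\bigl (\Omega _1(w)-a\bigr ) + \lambda _2\bigl (\Omega _2(w)-b\bigr ), \qquad \lambda _1,\lambda _2\ge 0. \]

For fixed multipliers the terms \(-\lambda _1 a-\lambda _2 b\) are constant in \(w\), so minimizing \(L\) over \(w\) is the regularized risk minimization \(\min _w f(w)+\lambda _1\Omega _1(w)+\lambda _2\Omega _2(w)\).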

## Acknowledgments

We would like to thank Tianren Liu for his help with showing that computation of the (*k*, *s*) support total variation norm is an NP-hard problem.