Machine Learning, Volume 100, Issue 2–3, pp 533–553

Convex relaxations of penalties for sparse correlated variables with bounded total variation

  • Eugene Belilovsky
  • Andreas Argyriou
  • Gaël Varoquaux
  • Matthew Blaschko

Abstract

We study the problem of statistical estimation with a signal known to be sparse, spatially contiguous, and containing many highly correlated variables. We take inspiration from the recently introduced k-support norm, which has been successfully applied to sparse prediction problems with correlated features, but lacks any explicit structural constraints commonly found in machine learning and image processing. We address this problem by incorporating a total variation penalty in the k-support framework. We introduce the (k, s) support total variation norm as the tightest convex relaxation of the intersection of a set of sparsity and total variation constraints. We show that this norm leads to an intractable combinatorial graph optimization problem, which we prove to be NP-hard. We then introduce a tractable relaxation with approximation guarantees that scale well for grid structured graphs. We devise several first-order optimization strategies for statistical parameter estimation with the described penalty. We demonstrate the effectiveness of this penalty on classification in the low-sample regime, classification with M/EEG neuroimaging data, and background-subtracted image recovery with synthetic and real data. We extensively analyse the application of our penalty on the complex task of identifying predictive regions from low-sample high-dimensional fMRI brain data, showing that our method is particularly useful compared to existing methods in terms of accuracy, interpretability, and stability.


Keywords: Structured sparsity · Feature selection · Brain decoding · k-Support · Total variation

1 Introduction

Regularization methods utilizing the \(\ell _1\) norm such as Lasso (Tibshirani 1996) have been used widely for feature selection. They have been particularly successful at learning problems in which very sparse models are required. However, in many problems a better approach is to balance sparsity against an \(\ell _2\) constraint. One reason is that very often features are correlated and it may be better to combine several correlated features than to select fewer of them, in order to obtain a lower variance estimator and better interpretability. This has led to the method of elastic net in statistics (Zou and Hastie 2005), which regularizes with a weighted sum of \(\ell _1\) and \(\ell _2\) penalties. More recently, it has been shown that the elastic net is not in fact the tightest convex penalty that approximates sparsity (\(\ell _0\)) and \(\ell _2\) constraints at the same time (Argyriou et al. 2012). The tightest convex penalty is given by the k-support norm, which is parametrized by an integer k, and can be computed efficiently. This norm has been successfully applied to a variety of sparse vector prediction problems (Belilovsky et al. 2015; Gkirtzou et al. 2013; McDonald et al. 2014; Misyrlis et al. 2014).

We study the problem of introducing structural constraints to sparsity and \(\ell _2\), from first principles. In particular, we seek to introduce a total variation smoothness prior in addition to sparsity and \(\ell _2\) constraints. Total variation is a popular regularizer used to enforce local smoothness in a signal (Michel et al. 2011; Rudin et al. 1992; Tibshirani et al. 2005). It has successfully been applied to image de-noising and has recently become of particular interest in the neuroimaging community, where it can be used to reconstruct sparse but locally smooth brain activation (Baldassarre et al. 2012b; Michel et al. 2011). Two kinds of total variation are commonly considered in the literature, isotropic \(TV_I(w)= \Vert \nabla w\Vert _{2,1}\) and anisotropic \(TV_A(w)= \Vert \nabla w\Vert _1\) (Beck and Teboulle 2009). In our theoretical analysis we focus on the anisotropic penalty.
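As a concrete illustration, the two penalties differ only in how the per-pixel differences are aggregated; a minimal sketch on a 2-D image (assuming numpy; the toy image is a hypothetical example):

```python
import numpy as np

def tv_anisotropic(w):
    """Anisotropic TV: l1 norm of the discrete gradient, TV_A(w) = ||grad w||_1."""
    dx = np.diff(w, axis=0)  # vertical finite differences
    dy = np.diff(w, axis=1)  # horizontal finite differences
    return np.abs(dx).sum() + np.abs(dy).sum()

def tv_isotropic(w):
    """Isotropic TV: sum of per-pixel l2 norms of the gradient, TV_I(w) = ||grad w||_{2,1}."""
    dx = np.zeros_like(w); dx[:-1, :] = np.diff(w, axis=0)
    dy = np.zeros_like(w); dy[:, :-1] = np.diff(w, axis=1)
    return np.sqrt(dx ** 2 + dy ** 2).sum()

w = np.zeros((4, 4)); w[1:3, 1:3] = 1.0  # a 2x2 "activation" patch
print(tv_anisotropic(w))  # 8.0: four unit jumps per axis
```

Both penalties are small for locally smooth signals; the isotropic variant is rotation invariant while the anisotropic one decomposes along the axes.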

To derive a penalty incorporating these constraints we follow the approach of Argyriou et al. (2012) by taking the convex hull of the intersection of our desired penalties and then recovering a norm by applying the gauge function. We then derive a formulation for the dual norm which leads us to a combinatorial optimization problem, which we prove to be NP-hard. We find an approximation to this penalty and prove a bound on the approximation error. Since the k-support norm is the tightest relaxation of sparsity and \(\ell _2\) constraints, we propose to use the intersection of the TV norm ball and the k-support norm ball. This leads to a convex optimization problem in which (sub)gradient computation can be achieved with a computational complexity no worse than that of the total variation. Furthermore, our approximation can be computed for variation on an arbitrary graph structure.

We discuss and utilize several first-order optimization schemes including stochastic subgradient descent, iterative Nesterov-smoothing methods, and FISTA with an estimated proximal operator. We demonstrate the tractability and utility of the norm through applications to classification on MNIST with few samples, M/EEG classification, and background-subtracted image recovery. For the problem of identifying predictive regions in fMRI we show that we obtain improved accuracy, stability, and interpretability, along with providing the user with several potential tools and heuristics to visualize the resulting predictive models. This includes several interesting properties that apply to the special case of k-support norm optimization as well.

2 Convex relaxation of sparsity, \(\ell _2\) and total variation

In this section we formulate the (k, s) support total variation norm, a tight convex relaxation of sparsity, \(\ell _2\), and total variation (TV) constraints. We derive its dual norm, which results in an intractable optimization problem. Finally we describe a looser convex relaxation of these penalties which leads to a tractable optimization problem.

2.1 Derivation of the norm

We start by defining the set of points corresponding to simultaneous sparsity, \(\ell _2\) and total variation (TV) constraints:
$$\begin{aligned} Q_{k,s}^2 :=\left\{ w\in \mathbb {R}^d:\Vert w\Vert _{0}\le k,\Vert w\Vert _{2}\le 1,\Vert Dw\Vert _{0}\le s \right\} \end{aligned}$$
where \(k\in \{1,\dots ,d\}, s\in \{1,\dots ,m\}\) and \(D \in \mathbb {R}^{m\times d}\) is a prescribed matrix. The bound of one on the \(\ell _2\) term is used for convenience since the cardinality constraints are invariant under scaling. D generally takes the form of a discrete difference operator, but the discussion in the following sections is more general than that. It is easy to see that the set \(Q_{k,s}^2\) is not convex due to the presence of the \(\Vert \cdot \Vert _0\) terms. Hence using \(Q_{k,s}^2\) in a regularization method is impractical. Thus we consider instead the convex hull of \(Q_{k,s}^2\):
$$\begin{aligned} C_{k,s}^2 := {{\mathrm{conv}}}(Q_{k,s}^2)&= \Bigl \{w: w=\sum \limits _{i=1}^r c_i z_i , \sum \limits _{i=1}^r c_i = 1, c_i\ge 0,z_i\in \mathbb {R}^d, \\&\qquad \qquad \qquad \Vert z_i\Vert _{0}\le k,\Vert z_i\Vert _{2}\le 1,\Vert Dz_i\Vert _{0}\le s, r\in \mathbb {N}\Bigr \}. \end{aligned}$$
For some values of D, k and s, this convex set may not span the entire \(\mathbb {R}^d\), that is, it may be contained within a smaller subspace. In Sect. 2.2 we show a condition under which the set will span \(\mathbb {R}^d\) (see Proposition 1). For a matrix D that is the transpose of an incidence matrix representing a graph with a maximum degree of \(l_{deg}\), the value of s should be greater than or equal to \(l_{deg}\).
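For intuition, when D is the 1-D forward-difference operator (the transpose incidence matrix of a path graph), \(\Vert Dw\Vert _{0}\) counts the jumps in the signal; a minimal sketch assuming numpy:

```python
import numpy as np

def chain_difference_matrix(d):
    """D in R^{(d-1) x d}: transpose incidence matrix of a path graph,
    so (Dw)_i = w_{i+1} - w_i and ||Dw||_0 counts the jumps in w."""
    D = np.zeros((d - 1, d))
    for i in range(d - 1):
        D[i, i], D[i, i + 1] = -1.0, 1.0
    return D

D = chain_difference_matrix(6)
w = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])  # sparse, contiguous block
print(np.count_nonzero(w))       # ||w||_0 = 3
print(np.count_nonzero(D @ w))   # ||Dw||_0 = 2: one jump up, one down
```

After \(\ell _2\) normalization, this w lies in \(Q_{k,s}^2\) for any \(k\ge 3, s\ge 2\).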
Assuming some mild technical conditions on D,\(^{1}\) the convex set \(C_{k,s}^2\) is the unit ball of a certain norm. We call this norm the (k, s) support total variation norm. It equals the gauge function of \(C_{k,s}^2\), that is,
$$\begin{aligned} \Vert x\Vert _{k,s}^{sptv} :=\inf \Bigl \{\lambda \in \mathbb {R}_+ : x&=\lambda \sum \limits _{i=1}^r c_i z_i , \sum \limits _{i=1}^r c_i = 1,\nonumber \\&c_i\ge 0,z_i\in \mathbb {R}^d,\Vert z_i\Vert _{0}\le k,\Vert z_i\Vert _{2}\le 1,\Vert Dz_i\Vert _{0}\le s,r\in \mathbb {N}\Bigr \} . \end{aligned}$$
Performing a variable substitution we define a set of components of x, \( v_i=\lambda c_i z_i \Rightarrow \lambda = \frac{\sum \nolimits _{i=1}^r \Vert v_i\Vert _2}{ \sum \nolimits _{i=1}^r c_i\Vert z_i \Vert _2} \;.\) To maximize the denominator for fixed \(v_i\), we note that \(\sum \nolimits _{i=1}^r c_i\Vert z_i \Vert _2 \le (\sum \nolimits _{i=1}^r c_i) \max \limits _{i=1}^r\Vert z_i \Vert _2 = 1\). The equality can be attained by applying the constraints in Eq. (1). Substituting for \(\lambda \) and removing the constraints already applied above our norm now becomes
$$\begin{aligned} \Vert x\Vert _{k,s}^{sptv}=\inf \left\{ \sum \limits _{i=1}^r \Vert v_i\Vert _2: \sum \limits _{i=1}^r v_i = x,\Vert v_i\Vert _{0}\le k,\Vert Dv_i\Vert _{0}\le s, r\in \mathbb {N}\right\} . \end{aligned}$$
The special case \(s=m\) is simply the k-support norm (Argyriou et al. 2012), which trades off between the \(\ell _1\) norm (\(k=1,s=m\)) and the \(\ell _2\) norm (\(k=d,s=m\)). Formula (2) is combinatorial in nature and hence is difficult to include directly in an optimization problem.

2.2 Derivation of the dual norm

A standard approach for analyzing structured norms is through analysis of the dual norm (Argyriou et al. 2012; Bach et al. 2012; Mairal and Yu 2013). As such, it will be useful to derive an expression for the dual norm of \(\Vert \cdot \Vert _{k,s}^{sptv}\). This will allow us to connect the norm with an optimization problem on a graph, use this to show that computing the norm is NP-hard, and to derive an approximation bound (Proposition 2).

To obtain the dual of the (k, s) support TV norm we first consider a more general class of norms. Each norm in this class is associated with a set of subspaces \(S_1,\dots ,S_n\) and a set of norms \(\Vert \cdot \Vert _{(1)},\dots ,\Vert \cdot \Vert _{(n)}\). We assume that these subspaces span \(\mathbb {R}^d\), that is, \(\sum _{i=1}^n S_i = \mathbb {R}^d\); the summation here denotes addition of sets (\(S_1+S_2 = \{x: x=x_1+x_2 , x_1\in S_1, x_2\in S_2\}\)). We may now define the following norm
$$\begin{aligned} \Vert w\Vert :=\min \left\{ \sum _{i=1}^n \Vert v_i\Vert _{(i)} \; : \; v_i \in S_i, \; \forall i\in \mathbb {N}_n,\; \sum _{i=1}^n v_i = w \right\} \; \forall w\in \mathbb {R}^d \;. \end{aligned}$$
This is indeed a norm, since the subspaces span \(\mathbb {R}^d\) and the above minimum is attained. The (k, s) support TV norm can be written in the form (3) by specifying all n norms to be the \(\ell _2\) norm and the linear subspaces to correspond to the constraints on the supports.

We note that this definition is equivalent to an infimal convolution of n norms. Let \(\delta _S\) denote the indicator function of a subspace S, and denote the infimal convolution of n functions \((f_1 \,\Box \,\dots \,\Box \,f_n)\) by \({{{\mathrm{\,\Box \,}}}}_{i=1}^n f_i\). Using this notation, the norm \(\Vert \cdot \Vert \) can be written equivalently as \(\Vert \cdot \Vert = {{{\mathrm{\,\Box \,}}}}_{i=1}^n \left( \Vert \cdot \Vert _{(i)} + \delta _{S_i}\right) \;.\) We may derive the general form of the dual norm \(\Vert \cdot \Vert ^{*}\) of \(\Vert \cdot \Vert \) by a direct application of standard duality results from convex analysis.

Lemma 1

Let \(\Vert \cdot \Vert _{(1)}, \dots , \Vert \cdot \Vert _{(n)}\) be norms on \(\mathbb {R}^d\) with duals \(\Vert \cdot \Vert _{(1)*}, \dots , \Vert \cdot \Vert _{(n)*}\), respectively, and let \(S_1, \dots , S_n\), be linear subspaces of \(\mathbb {R}^d\) such that \(\sum _{i=1}^n S_i = \mathbb {R}^d\). Then the dual norm of \(\Vert \cdot \Vert \) defined in (3) is given by
$$\begin{aligned} \Vert u\Vert ^{*}&= \max _{i=1}^n \; \min \left\{ \Vert u-q\Vert _{(i)*} : q \in S_i^\perp \right\} = \max _{i=1}^n \; \left( \Vert \cdot \Vert _{(i)*} \,\Box \,\delta _{S_i^\perp }\right) (u) \end{aligned}$$
for all \(u\in \mathbb {R}^d\). The unit ball of \(\Vert \cdot \Vert ^*\) equals \(B_* = \bigcap _{i=1}^n \left( B_{i*} + S_i^\perp \right) \) where \(B_{i*}\) denotes the unit ball of \(\Vert \cdot \Vert _{(i)*}\) for \(i=1,\dots ,n\).


Proof

Denote the convex (Fenchel) conjugate of a function \(f:\mathbb {R}^d\rightarrow \mathbb {R}\cup \{+\infty \}\) by \(f^*\) (Bauschke and Combettes 2011). It is known that the convex conjugate of a norm equals the indicator function of its dual unit ball. Thus it holds that
$$\begin{aligned} \delta _{B_*} = \left( \mathop {{{\mathrm{\,\Box \,}}}}\limits _{i=1}^n \left( \Vert \cdot \Vert _{(i)} + \delta _{S_i}\right) \right) ^* \;. \end{aligned}$$
Moreover, the conjugate of an infimal convolution equals the sum of conjugates (Bauschke and Combettes 2011, Prop. 13.21). The converse duality also holds under Slater type conditions (Bauschke and Combettes 2011, Thm. 15.3). Applying these facts successively, we obtain that
$$\begin{aligned} \delta _{B_*}&= \sum _{i=1}^n \left( \Vert \cdot \Vert _{(i)} + \delta _{S_i} \right) ^* = \sum _{i=1}^n \left( \Vert \cdot \Vert _{(i)}^* \,\Box \,\delta _{S_i}^* \right) = \sum _{i=1}^n \left( \delta _{B_{i*}} \,\Box \,\delta _{S_i}^* \right) \;. \end{aligned}$$
We now use the facts that, for any subspace \(S, \delta _S^* = \delta _{S^\perp }\) and that, for any nonempty sets \(C,D\subseteq \mathbb {R}^d, \delta _{C}\,\Box \,\delta _{D} = \delta _{C+D}\), obtaining that
$$\begin{aligned} \delta _{B_*} = \sum _{i=1}^n \left( \delta _{B_{i*}} \,\Box \,\delta _{S_i^\perp } \right) = \sum _{i=1}^n \left( \delta _{B_{i*} + S_i^\perp } \right) . \end{aligned}$$
It follows that \(B_* = \bigcap \nolimits _{i=1}^n \left( B_{i*} + S_i^\perp \right) \). The intersection of norm balls corresponds to the maximum of the corresponding norms, which gives the formula for \(\Vert \cdot \Vert ^{*}\). \(\square \)

Equation (4) for the dual norm is interpreted as the maximum of the distances of u (with respect to the corresponding dual norms) from the orthogonal complements. We now specialize this formula to the case of the (k, s) support TV norm.

Notation We define \(G_k\) as all subsets of \(\{1,...,d\}\) of cardinality at most k and \(M_s\) as all subsets of \(\{1,...,m\}\) of cardinality at most s. For every \(I \in G_k\), we denote \(I^c = \{ 1,...,d\}\backslash I\) and for every \(J \in M_s, J^c = \{ 1,...,m\} \backslash J\). We denote \(D_{J^c}\) as the submatrix of D with only the rows indexed by \(J^c\) and for every \(u\in \mathbb {R}^d, u_I\) is the subvector of u with only the elements indexed by I.

It is the case that r in Eq. (2) can be assumed to be at most \(|G_k||M_s|\) (by grouping components with the same (IJ) pattern and applying the triangle inequality). We can now reduce the dual norm to
$$\begin{aligned} \left( \Vert x\Vert _{k,s}^{sptv}\right) ^{*} =\max \limits _{(I,J) \in G_k\times M_s}\min \left\{ \Vert x-q\Vert _2 :q \in S_{I,J}^\bot \right\} = \max \limits _{(I,J) \in G_k\times M_s} E_{I,J}(x) \end{aligned}$$
where \( S_{I,J}=\{x\,|\,D_{J^c} x=0 ~\text {and}~ x_{I^c}=0\}, S_{I,J}^\bot ={{\mathrm{range}}}(D_{J^c}^{\scriptscriptstyle \top }) +\{ x \,|\, x_{I}=0 \}\), and \(E_{I,J}\) is an energy function we will derive (cf. Eq. (6)). Before proceeding we use the described subspaces to note the conditions under which \(\Vert x\Vert _{k,s}^{sptv}\) is a full-fledged norm.

Proposition 1

If
$$\begin{aligned} \sum _{\begin{array}{c} I\subseteq \{1,\dots ,d\},|I|=k \\ J\subseteq \{1,\dots ,m\}, |J|=s \end{array}} S_{I,J} = \mathbb {R}^d\end{aligned}$$
then \({{\mathrm{span}}}C_{k,s}^2 = \mathbb {R}^d\).
This condition will depend on the choice of D, k and s. We choose D to be the transpose of the incidence matrix of a directed graph \(G_d=(\mathcal {V}_d,\mathcal {E}_d)\), with the vertices corresponding to the elements of x. Furthermore \(G=(\mathcal {V},\mathcal {E})\) is an undirected graph with vertices \(\mathcal {V}=\mathcal {V}_d\) and an unordered set of the same edges as \(\mathcal {E}_d\). For a given J, we can consider the graph \(G_{J^c}\), specified by the incidence matrix \(D_{J^c}\), as the original graph with \(\left| {J}\right| \) edges removed. The notation presented is illustrated in Fig. 1.
Fig. 1

(a) Example of the D matrix for a graph and (b) an example \(D_{J^c}\) for a given instance of J. The graph in (b) has two subgraphs, one with nodes \(x_1,x_2,x_4\) and the other the singleton \(x_3\)

We consider the linear constraints specified by \(D_{J^c} x=0\). Each row of the transposed incidence matrix \(D_{J^c}\) represents an edge \((i,j)\in \mathcal {E}_d\). Coupled with the constraint \(D_{J^c}x=0\), each of these rows corresponds to a constraint \(x_i=x_j\). We note that this constraint is independent of the ordering on the graph. For any two vertices a, b of the undirected graph G, if there exists a path between a and b then \(x_a = x_b\). More formally, if we divide \(G_{J^c}\) into all of its disjoint subgraphs denoted by \(G_\gamma =(\mathcal {V}_\gamma ,\mathcal {E}_\gamma )\),
$$\begin{aligned} G_{J^c}=\bigcup \limits _{\gamma \in \varGamma } G_\gamma ,\qquad (a,b) \in \mathcal {V}_\gamma \times \mathcal {V}_\gamma \Rightarrow x_a=x_b \;. \end{aligned}$$
Thus for any disjoint subgraph of \(G_{J^c}\) we can take any tree containing the vertices of the subgraph and the associated incidence matrix will be a representation of the subspace associated with the components represented by those vertices.
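A small sketch of this bookkeeping (assuming numpy; a hypothetical 4-cycle, not the graph of Fig. 1): removing the edges indexed by J and assigning one value per remaining connected component yields a vector in the null space of \(D_{J^c}\).

```python
import numpy as np

def incidence_transpose(edges, d):
    """Rows of D index directed edges (i, j): -1 at vertex i, +1 at vertex j."""
    D = np.zeros((len(edges), d))
    for e, (i, j) in enumerate(edges):
        D[e, i], D[e, j] = -1.0, 1.0
    return D

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # a 4-cycle (hypothetical example)
D = incidence_transpose(edges, 4)

J = [1, 2]                                  # drop edges (1, 2) and (2, 3)
D_Jc = np.delete(D, J, axis=0)              # remaining components: {0, 1, 3} and {2}
x = np.array([5.0, 5.0, -7.0, 5.0])         # constant on each component
print(np.allclose(D_Jc @ x, 0))             # True: x satisfies D_{J^c} x = 0
```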
Since each disjoint subgraph will have an independent set of constraints on its associated variables we can subdivide the linear constraints specifying \(S_{I,J}^\bot \). Divide the graph corresponding to \(D_{J^c}\) into all disjoint subgraphs enumerated by \(\varGamma = \{1,\dots ,p\}\). Let \(D_{J_{\gamma }^c}\) be the incidence matrix corresponding to each subgraph. Then \( S_{I,J}=\left\{ x \, | \, D_{J_{\gamma }^c} x=0, \forall \gamma \in \varGamma , ~\text {and}~ x_{I^c}=0 \right\} \) and \(S_{I,J}^\bot = \sum \limits _{\gamma \in \varGamma } {{\mathrm{range}}}(D_{J^c_\gamma }^{\scriptscriptstyle \top }) +\{ x \,|\, x_{I}=0 \} .\) A direct computation yields the projection on each subgraph \(V_\gamma \) as
$$\begin{aligned} P_{\gamma }= D_{J_{\gamma }^c}^{\scriptscriptstyle \top } \left( D_{J_{\gamma }^c}^{\scriptscriptstyle \top }\right) ^+=\mathbf {I}-\tfrac{1}{n_\gamma }\mathbf {1}\mathbf {1}^{\scriptscriptstyle \top } \end{aligned}$$
if the subgraph has \(n_\gamma \) vertices. \(P_{\gamma }\) is exactly a centering matrix that projects orthogonal to the vector of all ones.
To compute the value of \(E_{I,J}(x)\) we can split the parameters of x into independent groups, since the projection and thereby the residual of components corresponding to vertices in disjoint groups will have independent contribution. The components of \({{\mathrm{Proj}}}_{S_{I,J}}(x)\) at \(I^c\) must be zero. Moreover, the members of any group that contains a vertex from \(I^c\) will be zero. We can therefore compute \(E_{I,J}(x)\) independently for each disjoint group, and only for the groups that do not contain a vertex in \(I^c\). For each disjoint group the contribution to \(E_{I,J}^2(x)\) is
$$\begin{aligned} E_\gamma ^2(x) = \Vert (\mathbf {I}-P_\gamma )x\Vert ^2=\frac{1}{n_\gamma }\left( \sum \limits _{i\in V_\gamma } x_i\right) ^2 \;. \end{aligned}$$
A graph-based version of the combinatorial optimization problem is as follows. Given an undirected graph \(G=(\mathcal {V},\mathcal {E})\) and \(I \subset \mathcal {V}, J \subset \mathcal {E}\), remove the edges J and all disjoint subgraphs containing a vertex in \(I^c\) to obtain a graph \(G_{IJ}\). The energy over this graph, \(E_{I,J}^2\), can be computed as the sum of \(E_\gamma ^2\) over all disjoint subgraphs in \(G_{IJ}\). The dual norm is then given by Eq. (5).

We can additionally show that we can limit \(M_s\) to sets of maximum cardinality s and \(G_k\) to sets of maximum cardinality k. Indeed, adding indices to I or J cannot decrease \(S_{I,J}\) and hence cannot decrease the norm of the projection in Eq. (5). Thus we can narrow the problem to removing s edges and \(d-k\) nodes (with their associated subgraphs).
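On a tiny graph the dual norm of Eq. (5) can be evaluated by exhaustive enumeration over exactly these maximum-cardinality sets; a hedged brute-force sketch (assuming numpy; the path graph and x are hypothetical, and the cost is exponential so this is for illustration only):

```python
import itertools
import numpy as np

def components(edges, verts):
    """Connected components of (verts, edges) via depth-first search."""
    adj = {v: [] for v in verts}
    for i, j in edges:
        adj[i].append(j); adj[j].append(i)
    seen, comps = set(), []
    for v in verts:
        if v in seen:
            continue
        stack, comp = [v], []
        seen.add(v)
        while stack:
            u = stack.pop(); comp.append(u)
            for nb in adj[u]:
                if nb not in seen:
                    seen.add(nb); stack.append(nb)
        comps.append(comp)
    return comps

def dual_norm_bruteforce(x, edges, k, s):
    """Eq. (5): maximize E_{I,J}(x) over |I| = k vertices kept, |J| = s edges removed.
    Groups touching I^c are dropped; each kept group contributes Eq. (6)."""
    d = len(x)
    best = 0.0
    for I in itertools.combinations(range(d), k):
        for J in itertools.combinations(range(len(edges)), s):
            kept = [edges[e] for e in range(len(edges)) if e not in J]
            E2 = 0.0
            for comp in components(kept, range(d)):
                if set(comp) <= set(I):
                    E2 += sum(x[v] for v in comp) ** 2 / len(comp)
            best = max(best, np.sqrt(E2))
    return best

x = np.array([1.0, 1.0, -2.0, 0.5])
edges = [(0, 1), (1, 2), (2, 3)]  # path graph on 4 vertices (hypothetical)
print(dual_norm_bruteforce(x, edges, k=2, s=1))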

We have now reduced the computation of the dual norm to a graph partitioning problem. Graph partition problems are often NP-hard, and we show this to be the case here as well:

Theorem 1

Computation of the (k, s) support total variation dual norm is NP-hard.

The proof of Theorem 1 is given in Appendix 1.

Corollary 1

Computation of the (k, s) support total variation norm is NP-hard.

In light of this theorem, we are unable to incorporate the (k, s) support total variation norm in a regularized risk setting. Instead, in the sequel, we examine a tractable approximation with bounds that scale well for the family of graphs of interest.

2.3 Approximating the norm

Although special cases where s equals m or 1 are tractable, the general case for arbitrary values of s leads to an NP-hard graph partitioning problem for the dual norm, implying the norm itself is intractable. We thus relax the problem by taking instead the intersection of the k-support norm ball and the convex relaxation of total variation. This leads to the following penalty
$$\begin{aligned} \varOmega _{sptv}(w)= \max \left\{ \Vert w\Vert _{k}^{sp} , \tfrac{1}{\sqrt{s}\Vert D\Vert }\Vert Dw\Vert _1\right\} \end{aligned}$$
where \(\Vert \cdot \Vert \) denotes the spectral norm. We can bound the error of this approximation as follows:

Proposition 2

For every \(w\in \mathbb {R}^d\), it holds that
$$\begin{aligned} \varOmega _{sptv}(w) \le \Vert w\Vert _{k,s}^{sptv} \;. \end{aligned}$$
Moreover, suppose that \({{\mathrm{range}}}(D^{\scriptscriptstyle \top }) =\mathbb {R}^d\) and that for every \(I\in G_k\) the submatrix \(D_{*I}\) has at least \(m-s\) zero rows. Then it holds that
$$\begin{aligned} \Vert w\Vert _{k,s}^{sptv} \le \sqrt{1+\frac{s\Vert D\Vert ^2 \Vert (D^{\scriptscriptstyle \top })^+\Vert _{\infty }^2}{k}}\; \varOmega _{sptv}(w) \end{aligned}$$
where \(\Vert \cdot \Vert _{\infty }\) is the norm on \(\mathbb {R}^{m\times d}\) induced by \(\ell _\infty \), that is, \(\Vert A\Vert _{\infty } = \max \nolimits _{i=1}^m\sum \nolimits _{j=1}^d |A_{ij}|\).


Proof

First, note that \(\Vert \cdot \Vert _k^{sp}\le \Vert \cdot \Vert _{k,s}^{sptv}\). This follows directly from the definition of \(\Vert \cdot \Vert _{k,s}^{sptv}\), since
$$\begin{aligned} \Vert w\Vert _k^{sp} = \left\| \sum _{i=1}^r v_i \right\| _k^{sp} \le \sum _{i=1}^r\Vert v_i\Vert _k^{sp} = \sum _{i=1}^r\Vert v_i\Vert _2 \end{aligned}$$
for every \(v_i\in \mathbb {R}^d\) such that \(\Vert v_i\Vert _0\le k, i=1,\dots ,r,\) and \(w=\sum _{i=1}^r v_i\). Now let \(v_i\in \mathbb {R}^d\) such that \(\Vert Dv_i\Vert _0\le s, i=1,\dots ,r,\) and \(w=\sum _{i=1}^r v_i\). Then
$$\begin{aligned} \Vert Dw\Vert _1 = \left\| \sum _{i=1}^r Dv_i \right\| _1 \le \sum _{i=1}^r \Vert Dv_i\Vert _1 \le \sum _{i=1}^r \sqrt{s} \Vert Dv_i\Vert _2 \le \sqrt{s}\Vert D\Vert \sum _{i=1}^r \Vert v_i\Vert _2 \;. \end{aligned}$$
The above two inequalities imply Eq. (8).
For Eq. (9), it suffices to show the dual inequality. Recall from Argyriou et al. (2012) that the norm defined by \(\Vert u\Vert _{(k)}^{(2)}:= \left( \sum \limits _{i=1}^k (|u|^\downarrow _i)^2 \right) ^\frac{1}{2}\) is the dual of \(\Vert \cdot \Vert _k^{sp}\). This is the \(\ell _2\) norm of the largest k entries in |u|, and is known as the 2-k symmetric gauge norm (Bhatia 1997). Thus, for every \(x, a, w\in \mathbb {R}^d\), it holds that
$$\begin{aligned} \langle&x-D^{\scriptscriptstyle \top }a, w\rangle \le \Vert x-D^{\scriptscriptstyle \top }a\Vert _{(k)}^{(2)}\Vert w\Vert _{k}^{sp}\le \Vert x-D^{\scriptscriptstyle \top }a\Vert _{(k)}^{(2)}\varOmega _{sptv}(w) \; \\ \langle&D^{\scriptscriptstyle \top }a, w\rangle = \langle a, Dw\rangle \le \Vert a\Vert _\infty \Vert Dw\Vert _1 \le \sqrt{s}\Vert D\Vert \Vert a\Vert _\infty \varOmega _{sptv}(w) \; \end{aligned}$$
Adding up and taking the infimum with respect to a, we obtain
$$\begin{aligned} \langle x,w\rangle \le \inf _{a \in \mathbb {R}^d}\left\{ \Vert x-D^{\scriptscriptstyle \top }a\Vert _{(k)}^{(2)}+ \sqrt{s}\Vert D\Vert \Vert a\Vert _\infty \right\} \varOmega _{sptv}(w). \end{aligned}$$
and hence
$$\begin{aligned} \varOmega _{sptv}^*(x) \le \inf _{a \in \mathbb {R}^d}\left\{ \Vert x-D^{\scriptscriptstyle \top }a\Vert _{(k)}^{(2)}+ \sqrt{s}\Vert D\Vert \Vert a\Vert _\infty \right\} . \end{aligned}$$
Next we pick I to be the set of indices corresponding to the largest k elements of |x|. We also pick
$$\begin{aligned} a=(D^{\scriptscriptstyle \top })^+c, \qquad c_i = {\left\{ \begin{array}{ll} {{\mathrm{sgn}}}(x_i) \Vert x_{I^c}\Vert _\infty &{} \text {if}~ i\in I \\ x_i &{} \text {if}~ i\in I^c \;. \end{array}\right. } \end{aligned}$$
Since \({{\mathrm{range}}}(D^{\scriptscriptstyle \top }) = \mathbb {R}^d\), it holds that \(D^{\scriptscriptstyle \top }a=c\) and hence we obtain
$$\begin{aligned} \Vert x-D^{\scriptscriptstyle \top }a\Vert _{(k)}^{(2)}+ \sqrt{s}\Vert D\Vert \Vert a\Vert _\infty&=\sqrt{\sum _{i\in I} (|x_i|-\Vert x_{I^c}\Vert _\infty )^2} + \sqrt{s}\Vert D\Vert \Vert (D^{\scriptscriptstyle \top })^+c\Vert _\infty \\&\quad \le \sqrt{\sum _{i\in I} (x_i^2-\Vert x_{I^c}\Vert _\infty ^2)} + \sqrt{s}\Vert D\Vert \Vert (D^{\scriptscriptstyle \top })^+\Vert _{\infty } \Vert x_{I^c}\Vert _\infty \\&= \sqrt{\sum _{i\in I} x_i^2- k\Vert x_{I^c}\Vert _\infty ^2} + \sqrt{s}\Vert D\Vert \Vert (D^{\scriptscriptstyle \top })^+\Vert _{\infty } \Vert x_{I^c}\Vert _\infty \\&\quad \le \sqrt{1+\frac{s\Vert D\Vert ^2 \Vert (D^{\scriptscriptstyle \top })^+\Vert _{\infty }^2}{k}}\; \Vert x_I\Vert _2 . \end{aligned}$$
By the hypothesis, we may choose \(J\in M_s\) such that \(D_{J^cI}=0\). Then
$$\begin{aligned} \Vert x_I\Vert _2 = \max \limits _{K \in M_s}\Vert {{\mathrm{Proj}}}_{{{\mathrm{null}}}(D_{K^cI})}(x_I)\Vert _2 \le \left( \Vert x\Vert _{k,s}^{sptv}\right) ^* \end{aligned}$$
\(\square \)

We note that we can fulfil the technical condition on the range of \(D^{\scriptscriptstyle \top }\) by augmenting the incidence matrix in a manner that does not change the result of the regularized risk minimization. The condition that the submatrix \(D_{*I}\) has at least \(m-s\) zero rows has an intuitive interpretation when D is the transpose of an incidence matrix of a graph. It means that any group of k vertices in the graph involves at most s edges. This is true in many cases of interest, such as grid structured graphs if s is proportional to k. The term involving \(\Vert (D^{\scriptscriptstyle \top })^+\Vert _{\infty }^{2}\) is at most linear in the number of vertices. \(\Vert D\Vert ^2\), which corresponds to the maximum eigenvalue of the graph Laplacian, is bounded above by a constant for a given structure (e.g. a 2-D grid with a neighborhood of 4).
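To make the last remark concrete, a small numerical check (assuming numpy) that \(\Vert D\Vert ^2\), the largest eigenvalue of the Laplacian \(D^{\scriptscriptstyle \top }D\) of a 4-neighborhood grid, stays bounded by \(2\,l_{deg}=8\) as the grid grows:

```python
import numpy as np

def grid_incidence_transpose(h, w):
    """D for an h x w 4-neighborhood grid graph: one row per edge."""
    idx = lambda r, c: r * w + c
    rows = []
    for r in range(h):
        for c in range(w):
            if c + 1 < w:  # horizontal edge
                e = np.zeros(h * w); e[idx(r, c)] = -1; e[idx(r, c + 1)] = 1
                rows.append(e)
            if r + 1 < h:  # vertical edge
                e = np.zeros(h * w); e[idx(r, c)] = -1; e[idx(r + 1, c)] = 1
                rows.append(e)
    return np.array(rows)

for n in (3, 5, 8):
    D = grid_incidence_transpose(n, n)
    lam_max = np.linalg.norm(D, 2) ** 2   # largest eigenvalue of the Laplacian D^T D
    print(n, lam_max)  # stays below 8 = 2 * max degree for any grid size
```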

We have proposed a tractable approximation to the (k, s) support total variation norm, whose exact computation was shown to be NP-hard. We showed that the error of this approximation has a bound that scales well for the case of grid graphs. We now discuss some optimization strategies for this approximate penalty and demonstrate several experiments showing its utility.
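The approximate penalty of Eq. (7) is straightforward to evaluate; a minimal sketch (assuming numpy, with the k-support norm computed by the closed form of Argyriou et al. (2012), restated in Eq. (14); the test signal is a hypothetical example):

```python
import numpy as np

def k_support_norm(w, k):
    """Closed form of Argyriou et al. (2012): find the unique r in {0,...,k-1}
    splitting the sorted |w| into a quadratic part and an averaged tail."""
    u = np.sort(np.abs(w))[::-1]
    for r in range(k):
        T = u[k - r - 1:].sum()
        avg = T / (r + 1)
        upper = np.inf if k - r - 1 == 0 else u[k - r - 2]
        if upper > avg >= u[k - r - 1]:
            return np.sqrt((u[:k - r - 1] ** 2).sum() + T ** 2 / (r + 1))
    raise RuntimeError("no valid r found")

def omega_sptv(w, D, k, s):
    """Eq. (7): max of the k-support norm and the scaled anisotropic TV."""
    tv = np.abs(D @ w).sum() / (np.sqrt(s) * np.linalg.norm(D, 2))
    return max(k_support_norm(w, k), tv)

D = -np.eye(6)[:-1] + np.eye(6, k=1)[:-1]       # 1-D difference operator
w = np.array([0.0, 1.0, 1.0, 1.0, 0.0, 0.0])    # sparse, contiguous, correlated
print(omega_sptv(w, D, k=3, s=2))
```

For this 3-sparse w the k-support term equals \(\Vert w\Vert _2=\sqrt{3}\) and dominates the scaled TV term, so the max returns \(\sqrt{3}\).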

2.4 Optimization

Denote by \(\hat{f}(w)\) a loss function, by \(\varOmega _{sptv}(w)\) the penalty given in Eq. (7), and let \(\lambda >0\). It can be shown that, given appropriate parameter selection, the solution to a regularized risk minimization of \(\hat{f}(w)\) constrained by \(\varOmega _{sptv}(w)\le \lambda \) will be equivalent to optimizing any of the following objectives for some regularization parameters \(\lambda _1,\lambda _2>0\).\(^{2}\)
$$\begin{aligned} \min \limits _{w}~&\hat{f}(w)+\lambda _1\left( \Vert w\Vert _k^{sp}\right) ^2+\lambda _2 TV(w) \end{aligned}$$
$$\begin{aligned} \min \limits _{w}~&\hat{f}(w)+\lambda _1 \Vert w\Vert _k^{sp}+\lambda _2 TV(w)\end{aligned}$$
$$\begin{aligned} \min \limits _{w}~&\hat{f}(w)+\lambda _2 TV(w) ~~s.t.~ \Vert w\Vert _k^{sp}\le \lambda _1 \end{aligned}$$
We analyze several strategies for optimizing the prescribed objectives: iterated FISTA with a smoothed TV(w), FISTA with an approximate computation of the proximal operator of \(\Vert w\Vert _k^{sp}+TV(w)\), and the Excessive Gap Method. A common concern in TV-related optimization is convergence. The former two methods have previously shown good empirical and theoretical convergence (Dohmatob et al. 2014; Dubois et al. 2014) and we describe specifics of their implementation with our objective below. However, these approaches do not provide optimality guarantees on the solution. For solving Eq. (12) we may apply the Excessive Gap Method, which has convergence guarantees on the duality gap. We describe the non-trivial analysis required for applying the Excessive Gap Method to our objective, which also requires the newly derived k-support ball projection operator of Sect. 2.4.1. We note that this section constitutes a preliminary proposal demonstrating that our objectives can be optimized with state-of-the-art convex optimization methods. A detailed analysis of the optimization is beyond the scope of this work, and we utilize a combination of the methods described throughout our experiments.

In iterated FISTA, we may utilize the proximal operator for the k-support norm along with Nesterov smoothing on the TV(w) term to make it differentiable (Dohmatob et al. 2014; Nesterov 2004). We can follow a strategy of repeatedly solving a FISTA problem with a progressively decreasing smoothing parameter on the TV(w) term as per Dubois et al. (2014), who provide an analysis of such an approach, which they call CONESTA. This technique can be used to solve any of Eqs. (10), (11) and (12) given the relevant proximal mapping discussed in Sect. 2.4.1.
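A minimal sketch of one inner pass of such a scheme (assuming numpy; plain \(\ell _1\) soft-thresholding stands in for the k-support proximal operator of Chatterjee et al. (2014) so the sketch stays self-contained, \(\hat{f}\) is a least-squares loss, and the problem instance is synthetic):

```python
import numpy as np

def smoothed_tv_grad(w, D, mu):
    """Gradient of the Nesterov-smoothed anisotropic TV: D^T u_mu(w),
    with u_mu(w) = clip(Dw / (2 mu), -1, 1)."""
    return D.T @ np.clip(D @ w / (2.0 * mu), -1.0, 1.0)

def fista(grad_smooth, prox, L, w0, n_iter=300):
    """Standard FISTA for min_w g(w) + h(w), g smooth with L-Lipschitz gradient."""
    w, y, t = w0.copy(), w0.copy(), 1.0
    for _ in range(n_iter):
        w_next = prox(y - grad_smooth(y) / L, 1.0 / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = w_next + ((t - 1.0) / t_next) * (w_next - w)
        w, t = w_next, t_next
    return w

# Toy instance: least squares + lam1 * l1 (stand-in prox) + lam2 * smoothed TV.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))
beta = np.zeros(10); beta[3:6] = 1.0          # sparse, contiguous ground truth
y_obs = X @ beta
D = -np.eye(10)[:-1] + np.eye(10, k=1)[:-1]   # 1-D difference operator
lam1, lam2, mu = 0.01, 0.01, 1e-2
grad = lambda w: X.T @ (X @ w - y_obs) / 30 + lam2 * smoothed_tv_grad(w, D, mu)
prox = lambda z, step: np.sign(z) * np.maximum(np.abs(z) - lam1 * step, 0.0)
L = np.linalg.norm(X, 2) ** 2 / 30 + lam2 * np.linalg.norm(D, 2) ** 2 / (2 * mu)
w_hat = fista(grad, prox, L, np.zeros(10))
```

In the CONESTA-style outer loop this inner solve would be repeated with decreasing \(\mu \), and the stand-in prox replaced by the k-support operators of Sect. 2.4.1.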

We can estimate the proximal operator of \(\lambda _1 \Vert w\Vert _k^{sp}+\lambda _2 TV(w)\) using an accelerated proximal gradient method in the dual, as described in Beck and Teboulle (2009), and the projection operator onto the \(\Vert w\Vert _k^{sp}\) dual ball given in Chatterjee et al. (2014). This allows us another approach of directly applying FISTA, but with the inexact proximal operator in order to solve Eq. (11).

To apply the Excessive Gap Method to k-support TV regularization we note that the primal and the dual of Eq. (12) can be written as \(\min \nolimits _{\Vert w\Vert _{k}^{sp}\le \lambda _1} f(w)=\max \nolimits _{\Vert u\Vert _{\infty }\le 1}\phi (u)\), where the primal is given by \(f(w)=\hat{f}(w)+\max \nolimits _{\Vert u\Vert _{\infty }\le 1}\{\langle Dw,u \rangle \}\), and the dual by \(\phi (u)=-\hat{\phi }(u)+\langle Dw^*_u,u \rangle +\hat{f}(w^*_u)\) with \(w^*_u= \mathop {\hbox {arg min}}\nolimits _{\Vert w\Vert ^{sp}_{k}\le \lambda _1}\{\langle Dw,u \rangle + \hat{f}(w)\}\).

We can now smooth the primal function
$$\begin{aligned} f_{\mu }(w)=\hat{f}(w)+\max \limits _{\Vert u\Vert _{\infty }\le 1}\left\{ \langle Dw,u \rangle -\mu \Vert u\Vert ^2\right\} =\hat{f}(w)+\langle Dw,u_{\mu }(w) \rangle -\mu \Vert u_{\mu }(w)\Vert ^2 \end{aligned}$$
The Excessive Gap Method now allows us to take successive approximations of \(f_{\mu }(w)\) with a decreasing sequence of \(\mu \) while maintaining a bound on the duality gap proportional to \(\mu \). To apply the method we need the smooth approximations \(u_{\mu }(w)\) and the gradient mappings \(T_{\mu }(u)\), defined by Nesterov (2005). We can obtain these using the simple projection of a vector z onto the \(\ell _{\infty }\) ball, which we denote \(P_{\Vert \cdot \Vert _{\infty }\le 1}(z)\), obtained by truncating all values above magnitude 1. The relevant operations are then given by
$$\begin{aligned} u_{\mu }(w)= & {} \mathop {\hbox {arg max}}\limits _{\Vert u\Vert _{\infty }\le 1}\left\{ \langle Dw,u \rangle -\mu \Vert u\Vert ^2 \right\} =P_{\Vert \cdot \Vert _{\infty }\le 1}\left( \frac{Dw}{2\mu }\right) \\ T_{\mu }(u)= & {} \mathop {\hbox {arg max}}\limits _{\Vert y\Vert _{\infty }\le 1}\left\{ \langle \nabla \phi (u),y-u \rangle -\frac{L_{\phi }}{2}\Vert y-u\Vert ^2\right\} =P_{\Vert \cdot \Vert _{\infty }\le 1}\left( u+\frac{Dx(u)}{L_{\phi }}\right) \end{aligned}$$
The sub-problem of finding \(w^*_u\) can be solved using an accelerated projected gradient method together with the projection onto the k-support ball derived in Sect. 2.4.1.
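Both building blocks above are elementwise operations; as a minimal sketch (the helper names `project_linf` and `u_mu` are ours, and `Dw` denotes the precomputed vector of graph differences \(Dw\)):

```python
def project_linf(z):
    # Projection onto the l-infinity unit ball: clip each entry to [-1, 1].
    return [max(-1.0, min(1.0, v)) for v in z]

def u_mu(Dw, mu):
    # Smoothed dual variable u_mu(w) = P_{||.||_inf <= 1}(Dw / (2 mu)).
    return project_linf([v / (2.0 * mu) for v in Dw])
```

For example, `project_linf([2.0, -3.0, 0.5])` returns `[1.0, -1.0, 0.5]`.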

2.4.1 Proximal operators associated with the k-support norm

The proximal operator for \((\Vert w\Vert _k^{sp})^2\), associated with Eq. (10), is given by McDonald et al. (2014). The proximal operator for \(\Vert w\Vert _k^{sp}\), associated with Eq. (11), is given by Chatterjee et al. (2014). In turn we can obtain the projection onto the dual ball using the Moreau decomposition (Parikh et al. 2014). The projection onto the \(\Vert w\Vert _k^{sp}\) ball (the proximal operator of its indicator function) has, to the best of our knowledge, not yet been addressed in the literature, and we show below how to obtain it. We define \(\delta _{C_{\lambda }}\) as the indicator function of the k-support ball of radius \(\lambda \), denoted \(C_{\lambda }\). We note that the k-support norm is given by
$$\begin{aligned} \Vert w\Vert _{k}^{sp} = \left( \sum _{i=1}^{k-r-1} ( |w|_{i}^{\downarrow })^2 + \frac{1}{r+1} \left( \sum _{i=k-r}^{d} |w|_{i}^{\downarrow } \right) ^2 \right) ^{\frac{1}{2}} \end{aligned}$$
where \(|w|_{i}^{\downarrow }\) is the ith largest element of w in absolute value. The projection onto the \(\Vert w\Vert _k^{sp}\) ball is given by:

Theorem 2

Given \(\lambda >0\) and \(x \in R^p\), if \(\Vert x\Vert _k^{sp}\le \lambda \), then the projection, \(w^*=prox_{\delta _{C_{\lambda }}}(x)\), is simply x. If \(\Vert x\Vert _k^{sp}>\lambda \), define \(D_r=\sum \limits _{i=1}^{k-r-1}(|x|_i^{\downarrow })^2\), \(T_{r,l}=\sum \limits _{i=k-r}^{l}|x|_i^{\downarrow }\), and \(n=l-k+r+1\), and construct the equation for \(\beta _{r,l}\):
$$\begin{aligned} \beta ^2 D_r+\left( \frac{(\beta +1)\beta (r+1)T_{r,l}}{n+\beta (r+1)}\right) ^2-\lambda ^2(\beta +1)^2=0 \end{aligned}$$
The projection onto the k-support ball is then given by finding r and l which satisfy the conditions:
$$\begin{aligned} |x|^{\downarrow }_{k-r-1}>\frac{(\beta +1)T_{r,l}}{n+\beta (r+1)}\ge |x|^{\downarrow }_{k-r}~,~|x|^{\downarrow }_{l}>\frac{T_{r,l}}{n+\beta (r+1)}\ge |x|^{\downarrow }_{l+1} \end{aligned}$$
where \(\beta \) is a non-negative solution to Eq. (14). Furthermore, the binary search specified in Chatterjee et al. (2014, Algorithm 2) with Eq. (14) can be used to find the appropriate r and l in \(O(\log (k)\log (d-k))\).

Proof Sketch

Argyriou et al. (2012, Algorithm 1) specifies conditions on the proximal map of \(\frac{1}{2\beta }(\Vert w\Vert _{k}^{sp})^2\). For a given \(\beta \) there must be a corresponding \(\lambda \) such that \(\Vert prox_{\frac{1}{2\beta }(\Vert w\Vert _{k}^{sp})^2}(x)\Vert _{k}^{sp}=\lambda \). Substituting Eq. (13) and the explicit form and constraints for \(prox_{\frac{1}{2\beta }(\Vert w\Vert _{k}^{sp})^2}(x)\) from Argyriou et al. (2012, Algorithm 1), we obtain Eq. (14) when the constraints are satisfied. Theorem 3 of Chatterjee et al. (2014) holds since the constraints are the same. \(\square \)
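As an illustration, Eq. (13) can be evaluated directly once the r satisfying the ordering condition is found. A small Python sketch (the function name and the linear scan over r are ours; the binary search of Chatterjee et al. (2014) would be faster):

```python
import math

def ksupport_norm(w, k):
    # Eq. (13): l2 penalty on the k - r - 1 largest magnitudes,
    # scaled l1 penalty on the remaining entries.
    a = sorted((abs(v) for v in w), reverse=True)   # |w|_i^down, descending
    for r in range(k):
        tail = sum(a[k - r - 1:])                   # sum_{i=k-r}^{d} |w|_i^down
        upper = a[k - r - 2] if k - r - 2 >= 0 else float("inf")
        if upper > tail / (r + 1) >= a[k - r - 1]:  # the unique valid r
            head = sum(v * v for v in a[:k - r - 1])
            return math.sqrt(head + tail * tail / (r + 1))
```

For \(k=1\) this reduces to the \(\ell _1\) norm and for \(k=d\) to the \(\ell _2\) norm, e.g. `ksupport_norm([3, 2, 1], 1)` returns `6.0`.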

3 Experimental results

We evaluate the effectiveness of the introduced penalty on signal recovery and classification problems. We consider a sparse image recovery problem from compressed sensing, a small-training-sample classification task using MNIST, an M/EEG prediction task, and classification and recovery tasks for fMRI and synthetic data. We compare our regularizer against several common regularizers (\(\ell _1\) and \(\ell _2\)) and popular structured regularizers for problems with similar structure. In recent work \(\hbox {TV}+\ell _1\), which adds the TV and \(\ell _1\) constraints, has been heavily utilized for data with similar spatial assumptions (Dohmatob et al. 2014; Gramfort et al. 2013) and is thus one of our main benchmarks. Source code for learning with the k-support/TV regularizer is available at

3.1 Background subtracted image recovery

We apply k-support total variation regularization to a background subtracted image reconstruction problem frequently used in the structured sparsity literature (Baldassarre et al. 2012a; Huang et al. 2009). We use a similar setup to Baldassarre et al. (2012a): we apply m random projections to a background-subtracted image along with Gaussian noise, and reconstruct the image using the projections and projection matrices. Our evaluation metric for the recovery is the mean squared pixel error. For this experiment we utilize a squared loss function and FISTA with the smoothed TV described in Sect. 2.4.
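A compact sketch of this scheme for the squared loss, keeping only the smoothed-TV gradient term (Python/NumPy; the function name is ours, and the k-support proximal step is omitted for brevity, so this shows only the smoothed-TV portion of the algorithm):

```python
import math
import numpy as np

def fista_smoothed_tv(A, y, D, lam, mu=0.01, n_iter=300):
    """Minimize 0.5 ||Aw - y||^2 + lam * TV_mu(w), where TV_mu is the
    mu-smoothed total variation with gradient D^T clip(Dw / (2 mu), -1, 1)."""
    # Lipschitz bound for the smooth objective: ||A||^2 + lam ||D||^2 / (2 mu)
    L = np.linalg.norm(A, 2) ** 2 + lam * np.linalg.norm(D, 2) ** 2 / (2 * mu)
    w = np.zeros(A.shape[1]); z = w.copy(); t = 1.0
    for _ in range(n_iter):
        u = np.clip(D @ z / (2 * mu), -1.0, 1.0)        # smoothed dual variable
        grad = A.T @ (A @ z - y) + lam * (D.T @ u)
        w_new = z - grad / L                             # gradient step
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = w_new + ((t - 1.0) / t_new) * (w_new - w)    # momentum step
        w, t = w_new, t_new
    return w
```

As a sanity check, denoising (A the identity) of a noisy step signal yields a reconstruction with a lower TV-regularized objective value than the observation itself.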

We selected 50 images from the background segmented dataset and converted them to grayscale. We use squared loss and k-support total variation to reconstruct the original images. We compute the normalized recovery error for different numbers of samples m and compare our regularizer to LASSO, \(\hbox {TV}+\ell _1\), and StructOMP. The latter is the structured regularizer which performs best on this problem in Huang et al. (2009). The average normalized recovery error is shown for different sample sizes in Figure 2(a). We used a separate set of images to set the parameters for each method.
Fig. 2

a Average model error for background subtracted image reconstruction for various sample sizes. b Image example for different methods and sample sizes. k-support/TV regularization gives the best recovery error for 216 samples, and gives smoother recovery results than the other methods for both sample sizes

In terms of recovery error we note that k-support total variation substantially outperforms LASSO and \(\hbox {TV}+\ell _1\), and outperforms StructOMP for low sample sizes. Further examination of the images reveals other advantages of the k-support total variation regularizer. An example of one image recovery scenario is shown at two different sample sizes in Figure 2(b). Here we can see that at low sample sizes StructOMP and LASSO can completely fail to create a visually coherent reconstruction of the image. \(\hbox {TV}+\ell _1\) recovery at the low sample size improves upon the latter methods, producing smooth regions, but still does not resemble the human shape pictured in the original image. k-support total variation has better visual quality at this low sample complexity, due to its ability to retain multiple groups of correlated variables in addition to the smoothness prior. For the case of a larger number of samples, illustrated by the bottom row of Figure 2(b), we note that although the recovery performance of StructOMP is better (lower error), the k-support total variation regularizer produces smoother and more coherent image segments.

3.2 Low sample complexity MNIST classification

We consider a simple classification problem using the MNIST data set (LeCun and Cortes 2010). We select a very small subset of the data to train with in order to demonstrate the effectiveness of our regularizer. We train a one versus all classifier for each digit. For each digit we take 9 negative training samples, one from each other digit, and 9 positive training samples of the digit. We use a validation set consisting of 8000 examples to perform parameter selection. We use a regularized risk function of the form (10) with logistic loss. Optimization for a single parameter setting took on the order of one second for a MATLAB implementation on a 2.8 GHz core. We choose the best model parameters from \(k\in \{1,2^3,2^5,2^7,2^9,d\}\), \(\lambda _1 \in \{\frac{10^5}{N},\dots ,\frac{10^2}{N}\}\), and \(\lambda _2 \in \{0,\frac{10^3}{N},\dots ,\frac{10^{-1}}{N}\}\), where N is the training set size. Here d corresponds to the image size (\(28\times 28\)) and the cases \(k=1\) and \(k=d\) correspond to the \(\ell _1\) and \(\ell _2\) norm, respectively, when \(\lambda _2=0\). We test on the entire MNIST test set of 10000 images. We optimize a logistic loss function combined with our k-support total variation norm and compare to results from \(\ell _1\), \(\ell _2\), k-support norm, and \(\hbox {TV}+\ell _1\) penalties combined with logistic loss. We perform optimization using FISTA on the k-support norm (Argyriou et al. 2012; Nesterov 2004) and a smoothing applied to the total variation. For the graph structure, specified by D, we use a grid graph with each pixel having a neighborhood consisting of the 4 adjacent pixels. We obtain surprisingly high classification accuracy using just 18 training examples. The results in Table 1 show classification accuracy for each one versus all classifier and the average of the classifiers. In all but two cases the k-support TV norm outperforms the other regularizers. We note that for the digit 9 classification the difference between the best classifier and k-support/TV is not statistically significant.
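The grid-graph difference operator D referred to above can be represented as an edge list, one row of D per edge. A minimal sketch (helper names are ours), where each edge \((i,j)\) contributes the difference \(w_i - w_j\):

```python
def grid_edges(height, width):
    # Edges of a 4-neighborhood grid graph over pixels indexed row-major.
    idx = lambda r, c: r * width + c
    edges = []
    for r in range(height):                  # horizontal neighbors
        for c in range(width - 1):
            edges.append((idx(r, c), idx(r, c + 1)))
    for r in range(height - 1):              # vertical neighbors
        for c in range(width):
            edges.append((idx(r, c), idx(r + 1, c)))
    return edges

def apply_D(edges, w):
    # Dw: vector of pixel differences across the grid edges.
    return [w[i] - w[j] for i, j in edges]
```

For a \(28\times 28\) image this yields \(2\cdot 28\cdot 27 = 1512\) rows of D.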
Table 1

Accuracy for one versus all classifiers on MNIST using only 18 training examples, with standard error computed on the test set

| Digit | \(\ell _1\) | \(\ell _2\) | KS | \(\ell _1+\hbox {TV}\) | \(\hbox {KS}+\hbox {TV}\) |
|---|---|---|---|---|---|
| 0 | \( 93.62 \pm .01 \) | \( 93.49 \pm .01 \) | \( 93.68 \pm .02 \) | \( 96.22 \pm .01 \) | \( \varvec{96.27 \pm .01} \) |
| 1 | \( 90.1 \pm .02 \) | \( 89.56 \pm .02 \) | \( 90.08 \pm .02 \) | \( 90.57 \pm .02 \) | \( \varvec{92.18 \pm .02} \) |
| 2 | \( 78.28 \pm .03 \) | \( 77.28 \pm .03 \) | \( 78.25 \pm .03 \) | \( \varvec{81.47 \pm .02} \) | \( 81.39 \pm .03 \) |
| 3 | \( 68.58 \pm .02 \) | \( 68.05 \pm .02 \) | \( 68.60 \pm .02 \) | \( 71.63 \pm .02 \) | \( \varvec{73.25 \pm .02} \) |
| 4 | \( 83.81 \pm .01 \) | \( 82.55 \pm .01 \) | \( 83.76 \pm .01 \) | \( 84.69 \pm .01 \) | \( \varvec{84.79 \pm .01} \) |
| 5 | \( 73.7 \pm .03 \) | \( 73.2 \pm .02 \) | \( 73.69 \pm .03 \) | \( 74.52 \pm .02 \) | \( \varvec{74.95 \pm .02} \) |
| 6 | \( 93.48 \pm .01 \) | \( 93.37 \pm .01 \) | \( 93.51 \pm .01 \) | \( 93.71 \pm .01 \) | \( \varvec{94.08 \pm .01} \) |
| 7 | \( 88.88 \pm .02 \) | \( 87.21 \pm .02 \) | \( 88.85 \pm .02 \) | \( 91.67 \pm .01 \) | \( \varvec{92.59 \pm .01} \) |
| 8 | \( 70.79 \pm .02 \) | \( 72.07 \pm .03 \) | \( 72.75 \pm .02 \) | \( 73.23 \pm .02 \) | \( \varvec{73.10 \pm .02} \) |
| 9 | \( 85.48 \pm .02 \) | \( 85.61 \pm .02 \) | \( 85.49 \pm .02 \) | \( 85.5 \pm .03 \) | \( 85.60 \pm .03 \) |

In all but two cases, k-support/TV regularization gives the best performance with significance. For digit '9' k-support/TV regularization is statistically tied for best performance

Bold values indicate that standard error bars do not overlap between the best method and any other method

3.3 M/EEG prediction

We apply k-support total variation regularization to an M/EEG prediction problem from Backus et al. (2011) and Zaremba et al. (2013), using the preprocessing from Zaremba et al. (2013). This results in data samples with 60 channels, each consisting of a time series presumed to be independent across channels. Following Zaremba et al. (2013) we report results for subject 8 from this dataset. For the total variation graph structure, we impose constraints on adjacent samples within each channel, while values from different channels are not connected within the graph. In the original work a latent variable SVM with delay parameter h is used to improve alignment of the samples. We consider only the case \(h=0\), which reduces to the standard SVM. To directly compare our results we utilize hinge loss with a constant C of \(2\times 10^4\), the same regularization value used in Zaremba et al. (2013). Thus we optimize the following objective
$$\begin{aligned} R(w)=\frac{C}{N}\sum _{i=1}^N&\max \left\{ 0,1-y_{i} \langle w, x_{i} \rangle \right\} + (1-\lambda ) \left( \Vert w\Vert _{k}^{sp}\right) ^2 + \lambda \Vert Dw\Vert _1 \end{aligned}$$
where \(\lambda \) allows us to easily trade off between the k-support and total variation norms, while maintaining a fixed weight for our regularizer comparable to Zaremba et al. (2013). We use \(k= 2500\) (approximately \(80\,\%\) of the dimensions) and \(\lambda = 0.1\). Table 2 shows the mean and standard deviation of the classification accuracy. We use the same partitioning of the data as described by Zaremba et al. (2013), and on average obtain an improvement over the original results. We note that \(\hbox {TV}+\ell _1\) regularization has relatively poor performance. We hypothesize this is because the data used is very noisy and not very sparse.
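Written out, the objective above is a weighted sum of three terms. A minimal Python sketch (names are ours; the squared k-support term is passed in as a callable so that, for example, the \(k=d\) case \((\Vert w\Vert _d^{sp})^2 = \Vert w\Vert _2^2\) can be plugged in directly):

```python
def meeg_objective(w, X, y, edges, C, lam, ksp_sq):
    # R(w) = (C/N) sum_i hinge(y_i <w, x_i>) + (1 - lam) (||w||_k^sp)^2 + lam ||Dw||_1
    N = len(X)
    hinge = sum(max(0.0, 1.0 - yi * sum(wj * xj for wj, xj in zip(w, x)))
                for x, yi in zip(X, y))
    tv = sum(abs(w[i] - w[j]) for i, j in edges)  # ||Dw||_1 over the channel graph
    return (C / N) * hinge + (1.0 - lam) * ksp_sq(w) + lam * tv
```

For instance, with `ksp_sq = lambda w: sum(v * v for v in w)` (the \(k=d\) case) and a single-edge graph, each of the three terms can be checked by hand.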
Table 2

M/EEG accuracy for SVM, k-support total variation regularized SVM, and \(\hbox {TV}+\ell _1\) regularized SVM computed over 5 folds

| Method | Mean acc. (%) | Acc. std. (%) |
|---|---|---|
| SVM (Zaremba et al. 2013) | | |
| ksp-TV SVM | | |
| TV-\(\ell _1\) SVM | | |

k-Support/TV regularization yields the best results on average

3.4 Prediction and identification in fMRI analysis

In this section we demonstrate the advantages of our sparse regularization method in the analysis of fMRI neuro-imaging data. Brain activation in response to stimuli is normally assumed to be sparse and locally contiguous, thus our proposed regularizer is ideal for describing our prior assumptions on this signal. An important aspect of analysing fMRI data is the ability to demonstrate how the predictive variables identified as important by an estimator correspond to relevant brain regions. Regularized risk minimization is one of the few approaches which can handle the multivariate nature of this problem. However, in the presence of many highly correlated variables, such as those in brain regions with many adjacent voxels being activated by a stimulus, using sparse regularization alone there may be many possible solutions with nearly equivalent predictive performance at small training sample sizes. Furthermore, from a practical standpoint, overly sparse solutions can be difficult to interpret when attempting to determine an implicated brain region. Thus regularization here allows us not only to converge to a good solution with lower sample complexity, but also to obtain more interpretable models from amongst the space of solutions with good prediction. Related to interpretability is solution stability: solutions that are more stable under different samples of training data, with regard to the implicated voxels/regions, allow the practitioner to make more trustworthy interpretations of the model (Misyrlis et al. 2014; Yan et al. 2014). We evaluate our approach taking all these factors into account.

We first analyze our method using a synthetic simulation of a signal similar to brain activation patterns. This gives us the opportunity to assess the true support recovery performance, which we cannot obtain with real data. We then analyze a popular block-design fMRI dataset from a study on face and object representation in the human ventral temporal cortex (Dohmatob et al. 2014) and perform experiments on prediction and, in turn, on utilizing the predictive models for identifying the relevant regions of interest. We attempt to classify scans taken when a subject is shown a pair of scissors versus scans taken when they observe scrambled pixels. We demonstrate that we can obtain improved accuracy, solution interpretability, and stability characteristics compared to previously applied sparse regularization methods incorporating spatial priors. For these experiments we use logistic loss and the \(TV_I(w)\) penalty, which has been shown to work better in fMRI analysis. Optimization is done using FISTA and the estimated proximal operator. As our baseline we focus on \(\hbox {TV}+\ell _1\), which has recently been popularized for fMRI applications, as well as \(\hbox {TV}+\ell _1+\ell _2\), which has been considered in structural MRI (Dubois et al. 2014).

We consider the estimation of an ideal weight vector with both spatial correlation and sparsity similar to brain activation patterns, with spatial correlations between active and inactive neurons and with the activated neurons often occurring in adjacent regions of the brain. We construct a \(25\times 25\) image with 84 % of coefficients set to zero. The non-sparse portion of the image corresponds to Gaussian blobs. This image serves as the set of parameters w we wish to recover. Figure 3 shows this ideal parameter vector. We construct data samples \(X=Yw+\varepsilon \), where Y is sampled from \(\{-1,1\}\) and \(\varepsilon \) is Gaussian noise. We take 150 training samples, 100 validation samples, and 1000 test samples. We consider a binary classification setting using only \(\ell _1\), \(\ell _2\), or k-support regularizers, Smooth-Lasso (Hebiri et al. 2011), the \(\hbox {TV}+\ell _1\) regularizer, \(\hbox {TV}+\ell _1+\ell _2\), and our k-support TV regularizer. For each of these scenarios we perform model selection using grid search and select the model with the highest accuracy on the validation set. We repeat this experiment with a new set of training, validation, and test samples 15 times so that we may obtain statistical significance results. The test set accuracy results for each method are shown in Table 3. For each competing method we perform a Wilcoxon signed-rank test against the k-support total variation results. In all listed cases the test rejects the null hypothesis (at a significance level of \(p<0.05\)) that the samples come from the same distribution. We assess the support recovery of each competing method by measuring the area under the precision-recall curve for different support thresholds. Finally we measure stability using Pearson correlation between weight vectors from different trials.
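The simulation described above can be sketched as follows (Python/NumPy; the blob locations, widths, and noise level are illustrative choices of ours, not the exact values used in the experiments):

```python
import numpy as np

def make_synthetic(n_samples, side=25, noise=1.0, seed=0):
    # Ideal weight map: a mostly-zero side x side image with a few Gaussian blobs.
    rng = np.random.RandomState(seed)
    yy, xx = np.mgrid[0:side, 0:side]
    w = np.zeros((side, side))
    for cy, cx, s in [(6, 6, 2.0), (12, 18, 2.5), (19, 8, 1.5)]:  # illustrative blobs
        w += np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * s ** 2))
    w[w < 0.1] = 0.0                       # zero out the background for sparsity
    w = w.ravel()
    # Samples X = Y w + eps, with labels Y in {-1, 1} and Gaussian noise eps.
    y = rng.choice([-1, 1], size=n_samples)
    X = np.outer(y, w) + noise * rng.randn(n_samples, w.size)
    return X, y, w
```

Training, validation, and test sets can then be drawn as independent calls with different seeds.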
Fig. 3

a (left to right, top to bottom) the ideal weight vector, followed by the weight vectors obtained with the \(\ell _1\), \(\ell _2\), k-support norm, \(\hbox {TV}+\ell _1\), and combined total variation and k-support norm (k-support/TV) regularizers. The k-support/TV regularization gives the highest accuracy, support recovery, and stability, and most closely approximates the target pattern. b Illustrates the improved precision-recall for k-support/TV versus the other methods on the support recovery for different thresholds. c Recovered support for a varying ideal weight vector. This demonstrates that the k-support/TV regularization works well for a wide range of sparsity, correlation, and smoothness

Table 3

Average test accuracy and support recovery results for 15 trials of synthetic data, along with the p value for a Wilcoxon signed-rank test performed for each method against the k-support/TV result, below 0.05 in all cases

| Method | Test acc. (p value) | Supp. recovery |
|---|---|---|
| \(\ell _2\) | 67.8 % (7E-4) | |
| \(\ell _1\) | 68.4 % (7E-4) | |
| k-support | 68.1 % (7E-4) | |
| Smooth-Lasso | 77.0 % (7E-4) | |
| \(\hbox {TV}+\ell _1\) | 80.2 % (9E-3) | |
| \(\hbox {TV}+\ell _1+\ell _2\) | 81.5 % (2E-2) | |
| k-support/TV | 82.2 % | |

k-support/TV has both the highest accuracy and the highest support recovery, as well as the highest stability. Here stability is measured by average pairwise Pearson correlation between folds

In Figure 3 we visualize the weight vector and precision-recall curve produced by the various regularization methods for one trial. We can see in Figure 3 that the k-support norm alone does a poor job of reconstructing a model with the local correlations in place. The Smooth-Lasso, \(\hbox {TV}+\ell _1\), and \(\hbox {TV}+\ell _1+\ell _2\) regularizers do a substantially better job of indicating the areas of interest for this task, but the k-support/TV regularizer produces more precise regions with fewer spurious patterns and substantially better classification accuracy and support recovery. We can see an additional advantage of the k-support/TV regularizer over the other methods in terms of stability of the results across trials. Figure 3(c) also shows the effectiveness of the k-support/TV regularizer for varying target weight vectors.

In the analysis of fMRI data we are often concerned with using the estimator to identify the predictive regions. Specifically, the linear model is often mapped back to a brain volume and used for analysis. In this context regularization can not only improve predictive performance but also provide more interpretable brain maps. We prefer solutions which clearly indicate the areas of interest. Well-converged \(TV+\ell _1\) solutions can overemphasize sparsity. With the k variable we can encourage a less sparse solution that may be more interpretable and includes more highly correlated variables. Figure 4a shows this effect for maps with varying k values (note that \(k=1\) corresponds to \(TV+\ell _1\)).
Fig. 4

a Output map for \(k=1\) (TV-\(\ell _1\)), \(k=50\), and \(k=500\); in each case the Lateral Occipital Cortex is indicated. b Objective value of \(\hbox {TV}+k\)-support (\(k=500\)) and \(k-r-1\) over iterations

We note that, unlike the elastic-net penalty, the k in the k-support norm is an interpretable parameter for mixing sparsity and \(\ell _2\). We can interpret the k in our regularizer as an estimate of the number of voxel locations active in the brain, and thus set k based on prior knowledge. We fix the value of k to 500, representing approximately \(2\,\%\) sparsity; this allows us to directly compare to the state-of-the-art method for sparse regularization in fMRI, \(TV+\ell _1\), with an equal-sized search space in model selection. We show the accuracy and stability results for \(TV+\ell _1\), \(TV+\ell _1+\ell _2\), and our \(\hbox {TV}+k\)-support regularization in Table 4.
Table 4

Average test accuracy results for 20 trials along with the p value for a Wilcoxon signed-rank test performed for each method against the k-support/TV result

| Method | Test acc. (p value) |
|---|---|
| \(\hbox {TV}+\ell _1\) | 84.72 (8E-4) |
| \(\hbox {TV}+\ell _1+\ell _2\) | 86.06 (0.15) |
| \(\hbox {TV}+k\)-support | |

Solution stability is measured by averaging pairwise Spearman correlations between solutions from different folds of training data. We note that our accuracy is statistically significantly better than \(\hbox {TV}+\ell _1\) and we do much better in terms of solution stability

Since the size of the data is small, we often obtain equivalent average accuracies in model selection; we break ties based on intra-fold stability as measured by average pairwise Spearman correlations of the resulting weight vectors. Our result beats \(TV+\ell _1\) in terms of accuracy. Compared to \(TV+\ell _1+\ell _2\) we have better classification accuracy, though not with high statistical significance; however, we obtain much more stable solutions and more interpretable parameter settings. We describe another advantage of our approach compared to the competing methods below.
Fig. 5

Output map for fixed thresholding and thresholding based on converged \(k-r-1\) value

An additional issue in interpreting brain maps is where to threshold. Many sparse regularizers, even \(\ell _1\), have only asymptotic guarantees for sparse solutions; in practice we threshold at a specific value. This is particularly problematic when we add TV to the objective. Here we suggest a heuristic motivated by the properties of the k-support norm. As we can see in Eq. (13), the k-support norm can be shown to be a combination of an \(\ell _2\) penalty on the \(k-r-1\) highest magnitude terms and an \(\ell _1\) penalty applied to the rest. Here r is the unique integer in \(\{0,\dots ,k-1\}\) satisfying
$$\begin{aligned} |w|_{k-r-1}^{\downarrow } > \frac{1}{r+1} \sum _{i=k-r}^{d} |w|_{i}^{\downarrow } \ge |w|_{k-r}^{\downarrow }. \end{aligned}$$
Empirically, the value of \(k-r-1\) for the solution grows from 0 as the optimization progresses, as seen in Figure 4b. This can be loosely interpreted as the algorithm starting with \(\ell _1\) optimization, which attempts to push variables to zero; as the optimization progresses we have the flexibility to move onto parts of the k-support ball where specific key variables fall into the \(\ell _2\) term, while we still attempt to squash the remaining terms with \(\ell _1\). This property of the optimization of our penalty implies a visualization heuristic for the final solution: take the top \(k-r-1\) variables. Another view of this heuristic comes from the implicit delineation implied by the condition defining r. For k much smaller than d and \(k-r-1\) greater than 0, the definition of r implies the \((k-r-1)\)th largest magnitude parameter will be a large factor (\(\frac{d-k+r+1}{1+r}\)) bigger than the mean of the remaining parameters below it. Figure 5 illustrates thresholding based on a fixed threshold value and our heuristic of thresholding based on the final \(k-r-1\) value in k-support TV optimization.
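The heuristic amounts to finding the r defined by the condition above at the final iterate and keeping only the \(k-r-1\) largest magnitude entries. A minimal Python sketch (function names are ours):

```python
def find_r(w, k):
    # Unique r in {0,...,k-1} with
    # |w|_{k-r-1} > (1/(r+1)) sum_{i=k-r}^{d} |w|_i >= |w|_{k-r}.
    a = sorted((abs(v) for v in w), reverse=True)
    for r in range(k):
        t = sum(a[k - r - 1:]) / (r + 1)
        upper = a[k - r - 2] if k - r - 2 >= 0 else float("inf")
        if upper > t >= a[k - r - 1]:
            return r

def threshold_map(w, k):
    # Zero out everything except the k - r - 1 largest-magnitude entries.
    r = find_r(w, k)
    keep = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)[: k - r - 1]
    out = [0.0] * len(w)
    for i in keep:
        out[i] = w[i]
    return out
```

For a vector with a clear magnitude gap, e.g. `[10, 9, 0.5, 0.4, 0.3]` with \(k=3\), the heuristic keeps exactly the two dominant entries.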

4 Conclusions

We have introduced a novel norm that incorporates spatial smoothness and correlated sparsity.

This norm, called the (k, s) support total variation norm, extends both the total variation penalty, a standard in image processing, and the recently proposed k-support norm from machine learning. The (k, s) support TV norm is the tightest convex penalty that combines sparsity, \(\ell _2\), and total variation constraints jointly. We have derived a variational form for this norm for arbitrary graph structures. We have also expressed the dual norm as a combinatorial optimization problem on the graph. This graph problem is shown to be NP-hard, motivating the use of a relaxation, which is shown to be equivalent to a weighted combination of a k-support norm and a total variation penalty. We have shown that this norm approximates the (k, s) support TV norm within a factor that depends on properties of the graph as well as on the parameters k and s, and that this bound scales well for grid structured graphs. Moreover, we have demonstrated that joint k-support and TV regularization can be applied to a diverse variety of learning problems, such as classification with small samples, neuroimaging, and image recovery. These experiments have illustrated the utility of penalties combining k-support and total variation structure on problems where spatial structure, feature selection, and correlations among features are all relevant. We have shown that this penalty has several unique properties that make it an excellent tool for the analysis of fMRI data. Some of our additional contributions include a generalized formulation of the dual norm of a norm which is the infimal convolution of norms, the first algorithm for projecting onto the k-support norm ball, and the first analysis that notes interesting practical properties of the r variable of the k-support norm.


  1. The conditions are given in Proposition 1.

  2. The proof of this statement follows from the fact that optimization subject to the intersection of two constraints has a Lagrangian that is exactly a regularized risk minimization with the two corresponding penalties, each with its own Lagrange multiplier.


We would like to thank Tianren Liu for his help with showing that computation of the (k, s) support total variation norm is an NP-hard problem.


  1. Argyriou, A., Foygel, R., & Srebro, N. (2012). Sparse prediction with the \(k\)-support norm. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25 (pp. 1457–1465). Curran Associates, Inc.
  2. Bach, F., Jenatton, R., Mairal, J., & Obozinski, G. (2012). Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1), 1–106.
  3. Backus, A., Jensen, O., Meeuwissen, E., van Gerven, M., & Dumoulin, S. (2011). Investigating the temporal dynamics of long term memory representation retrieval using multivariate pattern analyses on magnetoencephalography data. Tech. rep.
  4. Baldassarre, L., Morales, J., Argyriou, A., & Pontil, M. (2012a). A general framework for structured sparsity via proximal optimization. In AISTATS, pp. 82–90.
  5. Baldassarre, L., Mourao-Miranda, J., & Pontil, M. (2012b). Structured sparsity models for brain decoding from fMRI data. In PRNI.
  6. Bauschke, H. H., & Combettes, P. L. (2011). Convex analysis and monotone operator theory in Hilbert spaces. CMS Books in Mathematics. Berlin: Springer.
  7. Beck, A., & Teboulle, M. (2009). Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11), 2419–2434.
  8. Belilovsky, E., Gkirtzou, K., Misyrlis, M., Konova, A. B., Honorio, J., Alia-Klein, N., et al. (2015). Predictive sparse modeling of fMRI data for improved classification, regression, and visualization using the k-support norm. Computerized Medical Imaging and Graphics. doi: 10.1016/j.compmedimag.2015.03.007.
  9. Bhatia, R. (1997). Matrix analysis. Graduate Texts in Mathematics. Berlin: Springer.
  10. Chatterjee, S., Chen, S., & Banerjee, A. (2014). Generalized Dantzig selector: Application to the \(k\)-support norm. In NIPS, pp. 1934–1942.
  11. Dohmatob, E., Gramfort, A., Thirion, B., & Varoquaux, G. (2014). Benchmarking solvers for TV-l1 least-squares and logistic regression in brain imaging. In PRNI.
  12. Dubois, M., Hadj-Selem, F., Lofstedt, T., Perrot, M., Fischer, C., Frouin, V., & Duchesnay, E. (2014). Predictive support recovery with TV-elastic net penalty and logistic regression: An application to structural MRI. In PRNI.
  13. Gkirtzou, K., Honorio, J., Samaras, D., Goldstein, R. Z., & Blaschko, M. B. (2013). fMRI analysis of cocaine addiction using \(k\)-support sparsity. In ISBI, pp. 1078–1081.
  14. Gramfort, A., Thirion, B., & Varoquaux, G. (2013). Identifying predictive regions from fMRI with TV-L1 prior. In PRNI, pp. 17–20.
  15. Hebiri, M., & van de Geer, S. (2011). The Smooth-Lasso and other \(\ell _1+\ell _2\)-penalized methods. Electronic Journal of Statistics, 5, 1184–1226.
  16. Huang, J., Zhang, T., & Metaxas, D. (2009). Learning with structured sparsity. In Proceedings of the International Conference on Machine Learning, pp. 417–424.
  17. LeCun, Y., & Cortes, C. (2010). MNIST handwritten digit database.
  18. Mairal, J., & Yu, B. (2013). Supervised feature selection in graphs with path coding penalties and network flows. JMLR, 14(1), 2449–2485.
  19. McDonald, A. M., Pontil, M., & Stamos, D. (2014). New perspectives on k-support and cluster norms. arXiv:1403.1481.
  20. Michel, V., Gramfort, A., Varoquaux, G., Eger, E., & Thirion, B. (2011). Total variation regularization for fMRI-based prediction of behavior. IEEE Transactions on Medical Imaging, 30(7), 1328–1340.
  21. Misyrlis, M., Konova, A., Blaschko, M., Honorio, J., Alia-Klein, N., Goldstein, R., & Samaras, D. (2014). Predicting cross-task behavioral variables from fMRI data using the \(k\)-support norm. In Sparsity Techniques in Medical Imaging.
  22. Nesterov, Y. (2004). Introductory lectures on convex optimization. Berlin: Springer.
  23. Nesterov, Y. (2005). Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1), 235–249.
  24. Parikh, N., & Boyd, S. (2014). Proximal algorithms. Foundations and Trends in Optimization, 1(3), 127–239.
  25. Rudin, L. I., Osher, S., & Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D, 60(1–4), 259–268.
  26. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58, 267–288.
  27. Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., & Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B, 67, 91–108.
  28. Vazirani, V. (2001). Approximation algorithms. Berlin: Springer.
  29. Yan, S., Yang, X., Wu, C., Zheng, Z., & Guo, Y. (2014). Balancing the stability and predictive performance for multivariate voxel selection in fMRI study. In Brain Informatics and Health, pp. 90–99.
  30. Zaremba, W., Kumar, M. P., Gramfort, A., & Blaschko, M. B. (2013). Learning from M/EEG data with variable brain activation delays. In IPMI, pp. 414–425.
  31. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301–320.

Copyright information

© The Author(s) 2015

Authors and Affiliations

Eugene Belilovsky, Andreas Argyriou, Gaël Varoquaux, and Matthew Blaschko: Inria Saclay, Palaiseau, France
