1 Introduction

With the growth of available medical imaging data, the need for robust methods to perform large neuroimaging studies has increased. The majority of studies use voxel-based analysis (VBA) to identify regions where two groups differ [2]. VBA generates statistical maps of p-values characterizing significant differences at the voxel level. Such methods have limited ability to identify complex population differences and pathologies that span multiple anatomical regions because they do not account for correlations between voxels and regions in the brain. In addition, the high dimensionality of the data requires correction for a large number of multiple comparisons.

To overcome these limitations of VBA [6], alternative methods reformulate region selection as a simultaneous feature selection and classification (or regression) problem. Such methods typically use a sparse \(l_1\) (LASSO) penalty and have been successfully applied to medical imaging data [12]. However, imposing sparsity alone can lead to unstable feature maps that cannot be interpreted from an anatomical viewpoint. To counter this behavior, several estimators incorporate the notion of spatial smoothness of the coefficient maps through additional penalizers. Two main types of image-based penalizers have been used in the literature. Graph net (GN) formulations use an \(l_2\) penalty on the gradients to force adjacent voxels to have similar weights [10, 11]. Alternatively, regularization can be enforced by imposing sparsity on the spatial gradients through a total variation (TV) penalty [7, 9]. These two types of penalties correspond to linear and nonlinear diffusion, respectively, and have been used in many image analysis applications such as denoising, segmentation and registration.

However, it is known that the \(l_1\) and TV penalties are inherently biased and often lead to less stable predictions [16]. To address these limitations of the LASSO penalty, the Smoothly Clipped Absolute Deviation (SCAD) penalty [8] was proposed in the context of high-dimensional regression with variable selection. SCAD has become quite popular in the statistical community and has been shown to possess desirable properties such as continuity, asymptotic unbiasedness and sparsity [8]. However, these statistical works are limited to the one-dimensional case, and there exist very few applications of SCAD in image analysis [5, 13].

In the current study we propose a novel regularized variable selection method, in the context of classification, based on the SCAD penalty. We use SCAD to enforce sparsity of the solution, and a SCAD penalty on the spatial gradients (SCADTV) as the image regularization term enforcing spatial continuity. Using synthetic and real MRI data from a multiple sclerosis study, we show that models based on the SCAD penalty are superior at variable selection compared to the classical \(l_1\) and TV penalties.

2 Methods

2.1 Sparse Classification

Let X be a \(n\times m\) data matrix of n vectorized images \(\mathbf{x}_i\) as rows, each with m voxels. Let \(\varOmega \subseteq \mathbb {R}^3\) be the image domain of \(\mathbf{x}_i\). In the context of binary classification, we are given a corresponding set of labels \(\mathbf{y}\) as a \(n\times 1\) vector where each \(y_i\) takes discrete values \(\{-1,+1\}\). The goal is to build a classifier that predicts the binary labels given the data. The most common classification method is logistic regression (LR), which can be formulated as minimizing the negative log-likelihood of the logistic model:

$$\begin{aligned} \min _{{\varvec{\beta }},b} \sum _{i=1}^n \log \left( 1+\exp (-y_i(\mathbf{x}_i{\varvec{\beta }}+b))\right) \end{aligned}$$
(1)

One main drawback of this formulation is that all coefficients in \({\varvec{\beta }}\) are usually nonzero. Sparsity constraints on the solution address this issue. However, selecting the best subset of coefficients (\(l_0\) norm) is an NP-hard problem, so an \(l_1\) approximation of the \(l_0\) penalizer is usually used [16]:

$$\begin{aligned} \min _{{\varvec{\beta }},b} \sum _{i=1}^n \log \left( 1+\exp (-y_i(\mathbf{x}_i{\varvec{\beta }}+b))\right) +\lambda \Vert {\varvec{\beta }} \Vert _1 \end{aligned}$$
(2)
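For concreteness, the objective of Eq. (2) can be written directly in NumPy. The following is a minimal sketch; all names are illustrative and not taken from the authors' implementation:

```python
import numpy as np

def l1_logistic_loss(beta, b, X, y, lam):
    """Negative log-likelihood of Eq. (1) plus the LASSO penalty of Eq. (2).

    X : (n, m) data matrix; y : (n,) labels in {-1, +1};
    beta : (m,) coefficients; b : intercept; lam : sparsity weight.
    """
    margins = y * (X @ beta + b)
    # log(1 + exp(-t)) computed stably as logaddexp(0, -t)
    nll = np.logaddexp(0.0, -margins).sum()
    return nll + lam * np.abs(beta).sum()
```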

2.2 Image-Based Penalty

Sparsity is an effective way of regularizing the classification problem, but it may select isolated voxels in the brain rather than compact and anatomically meaningful regions. Image-based penalties provide a principled way of imposing anatomical continuity of the selected regions. Two types of image-based penalizers have been explored in the context of sparse classification or regression: the GN penalty [10, 11] and the TV penalty [7, 9]. We limit our discussion to the TV penalty, as it has been shown to be superior at variable selection compared to GN [7] and provides the basis for the proposed extension to SCADTV. The TV penalty uses an \(l_1\) norm on the image gradients. We use an anisotropic formulation of the TV norm: \(\Vert \nabla {\varvec{\beta }} \Vert _1= \Vert \nabla _i {\varvec{\beta }}\Vert _1 +\Vert \nabla _j{\varvec{\beta }}\Vert _1 +\Vert \nabla _k{\varvec{\beta }}\Vert _1 \), where (i, j, k) denotes the three orthogonal dimensions of the image data. Denoting by \(\lambda ,\gamma \) two tuning parameters, the resulting penalized classification problem can be written as:

$$\begin{aligned}&\min _{{\varvec{\beta }},b} \sum \nolimits _{i=1}^n \log \left( 1+\exp (-y_i(\mathbf{x}_i{\varvec{\beta }}+b))\right) + {\mathcal P_{l_1+TV}}({ \varvec{\beta }})\end{aligned}$$
(3)
$$\begin{aligned}&{\mathcal P_{l_1+TV}} = \lambda \left( \gamma \Vert {\varvec{\beta }} \Vert _1 + (1-\gamma ) \Vert \nabla {\varvec{\beta }} \Vert _1 \right) \end{aligned}$$
(4)
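A possible NumPy sketch of the anisotropic TV norm and the combined penalty of Eq. (4), using forward differences on the 3D coefficient map (the boundary handling is our assumption):

```python
import numpy as np

def anisotropic_tv(beta_img):
    """Sum of l1 norms of the forward differences along the i, j, k axes."""
    return sum(np.abs(np.diff(beta_img, axis=ax)).sum() for ax in range(3))

def l1_tv_penalty(beta_img, lam, gamma):
    """P_{l1+TV} of Eq. (4): convex combination of sparsity and TV terms."""
    return lam * (gamma * np.abs(beta_img).sum()
                  + (1.0 - gamma) * anisotropic_tv(beta_img))
```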

2.3 SCAD and SCADTV Penalties

The SCAD penalty \(\rho _{\lambda }(\cdot )\) is most conveniently defined through its derivative:

$$\begin{aligned} \rho _{\lambda }'(t) = \lambda \left\{ I(t \le \lambda ) + \frac{(a \lambda -t)_+}{(a-1)\lambda } I(t>\lambda ) \right\} , t > 0 \end{aligned}$$
(5)

with \(\rho _{\lambda }(0)=0\), \((z)_+=\max (z,0)\), I the indicator function, and \(\lambda \) and a model parameters. As is customary, \(a=3.7\) is used [8]. Figure 1(a) shows the SCAD penalty (blue) and the \(l_1\) penalty (red).
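For reference, Eq. (5) and the closed-form penalty obtained by integrating it (a standard derivation from [8]) can be transcribed as follows; the vectorized form is our own sketch:

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """SCAD derivative rho'_lam(t) for t > 0, Eq. (5)."""
    t = np.asarray(t, dtype=float)
    return lam * ((t <= lam).astype(float)
                  + np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam)
                    * (t > lam))

def scad(t, lam, a=3.7):
    """SCAD penalty rho_lam(t): the antiderivative of Eq. (5) with rho_lam(0)=0."""
    t = np.abs(np.asarray(t, dtype=float))
    mid = (t > lam) & (t <= a * lam)
    return np.where(t <= lam, lam * t,
           np.where(mid, -(t**2 - 2.0 * a * lam * t + lam**2) / (2.0 * (a - 1.0)),
                    (a + 1.0) * lam**2 / 2.0))
```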

Fig. 1. Illustration of the SCAD penalty: (a) SCAD (blue) and \(l_1\) (red) penalty functions; thresholding functions for the \(l_1\) (b) and SCAD (c) penalties, with \(\lambda =2\).

To better understand the behavior of the SCAD penalty, consider the penalized least squares problem \(\min _\beta (z-\beta )^2 + {\mathcal P}(\beta )\), where \({\mathcal P}(\beta )\) is chosen as the LASSO or the SCAD penalty. The solution is unique, \(\hat{\beta }= S_{\lambda }(z)\), where \(S_{\lambda }\) is a thresholding function. Figure 1 displays the thresholding function for LASSO (b) and SCAD (c) with \(\lambda =2\). We notice that the SCAD penalty shrinks small coefficients to zero while keeping large coefficients intact, whereas the \(l_1\) penalty shrinks all coefficients. This unbiasedness of the SCAD penalty comes from the fact that \(\rho _{\lambda }'(t)=0\) when t is large enough.
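Both thresholding functions of Fig. 1(b, c) have well-known closed forms. The SCAD rule below is the one derived in [8] for the \(\frac{1}{2}(z-\beta )^2+\rho _{\lambda }(|\beta |)\) scaling, so the exact constants depend on how the quadratic term is normalized:

```python
import numpy as np

def soft_threshold(z, lam):
    """LASSO (soft) thresholding: shrinks every coefficient by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding: soft near zero, identity for large |z|."""
    z = np.asarray(z, dtype=float)
    absz = np.abs(z)
    mid = (absz > 2.0 * lam) & (absz <= a * lam)
    return np.where(absz <= 2.0 * lam, soft_threshold(z, lam),
           np.where(mid, ((a - 1.0) * z - np.sign(z) * a * lam) / (a - 2.0),
                    z))
```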

Extending the SCAD definition to vector data and to the discrete gradients of the coefficients, we define the combined SCAD and SCADTV penalties as:

$$\begin{aligned}&{\mathcal P_{SCAD}} = \sum \nolimits _{l=1}^m \rho _{\lambda }(|\beta _l|) \end{aligned}$$
(6)
$$\begin{aligned}&{\mathcal P_{SCADTV}} = \sum \nolimits _{l=1}^m \rho _{\lambda }(|\nabla _i \beta _l|) + \rho _{\lambda }(|\nabla _j \beta _l|) +\rho _{\lambda }(|\nabla _k \beta _l|) \end{aligned}$$
(7)

where (i, j, k) denotes the three orthogonal dimensions of the image data, as in the definition of the TV norm. Similar to SCAD, SCADTV shrinks small gradients, encouraging neighboring coefficients to take the same values, but leaves large gradients unchanged. We propose three types of penalty functions, which are compared with the classic \({\mathcal P_{l_1+TV}}\) model in the context of logistic regression classification: \({\mathcal P_{SCAD+SCADTV}}\), \({\mathcal P_{l_1+SCADTV}}\) and \({\mathcal P_{SCAD+TV}}\).
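Both penalties can be sketched by reusing the scad() helper defined after Eq. (5), with forward differences standing in for the discrete gradients (an implementation assumption on our part):

```python
import numpy as np
# assumes scad() from the sketch following Eq. (5) is in scope

def scad_penalty(beta, lam, a=3.7):
    """P_SCAD, Eq. (6): SCAD applied to each |beta_l|."""
    return scad(np.abs(beta), lam, a).sum()

def scadtv_penalty(beta_img, lam, a=3.7):
    """P_SCADTV, Eq. (7): SCAD applied to gradient magnitudes along i, j, k."""
    return sum(scad(np.abs(np.diff(beta_img, axis=ax)), lam, a).sum()
               for ax in range(3))
```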

2.4 Optimization and Parameter Tuning

Note that the SCAD penalty, unlike \(l_1\) and TV, is not convex. We solve the resulting problem using ADMM [4], which has been successfully applied to convex problems; recently it was shown [18] that several ADMM algorithms, including those with the SCAD penalty, are guaranteed to converge. The tuning parameters \(\lambda ,\gamma \) are chosen by the generalized information criterion (GIC).
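The following schematic ADMM loop illustrates the splitting structure of [4] for a SCAD-penalized logistic regression. It is a sketch only: the z-update uses the SCAD thresholding defined earlier as the proximal step (an approximation whose constants depend on \(\rho \)), the intercept is omitted, and the authors' actual splitting and stopping criteria may differ:

```python
import numpy as np
from scipy.optimize import minimize
# assumes scad_threshold() from the earlier sketch is in scope

def admm_scad_logistic(X, y, lam, rho=1.0, n_iter=50):
    """Schematic ADMM: split beta (smooth loss) from z (SCAD penalty)."""
    n, m = X.shape
    beta, z, u = np.zeros(m), np.zeros(m), np.zeros(m)
    for _ in range(n_iter):
        # beta-update: logistic loss plus quadratic coupling to z
        def obj(b):
            return (np.logaddexp(0.0, -y * (X @ b)).sum()
                    + 0.5 * rho * np.sum((b - z + u) ** 2))
        beta = minimize(obj, beta, method="L-BFGS-B").x
        # z-update: (approximate) proximal step on the nonconvex SCAD penalty
        z = scad_threshold(beta + u, lam / rho)
        # dual update
        u = u + beta - z
    return z
```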

3 Experimental Results

3.1 Synthetic Data

Medical imaging data has no available ground truth on the significant anatomical regions discriminating two populations. We therefore generated synthetic data \(\mathbf{x}_i\) of size \(32 \times 32 \times 8\) containing four \(8 \times 8 \times 4\) foreground blocks with high spatial coherence (see Fig. 2). Background values are generated from a normal distribution N(0, 1), while the correlated values inside the four blocks are drawn from a multivariate normal distribution \(N(0,\varSigma _r)\), with \(r \in \{0.25,0.5\}\). The coefficient vector \({\varvec{\beta }}\) is fixed at 0 outside the four blocks and takes piecewise smooth values inside, with the strength of the data signal increasing in the following order: top-left, top-right, bottom-right, bottom-left. Binary labels \(y_i\) are then drawn from a Bernoulli distribution with the logistic probability. Figure 2 (top-left) presents a 2D slice of the synthetic data and (bottom-left) a 3D view of the nonzero coefficients. Each dataset contains \(n=300\) subjects, making the data matrix X of size \(300 \times 8192\). For each coherence value r we repeated the test 96 times.
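A hedged reconstruction of this generation protocol is sketched below; the block placement, equicorrelated covariance structure and coefficient values are illustrative assumptions, not the paper's exact settings, and only one of the four blocks is shown:

```python
import numpy as np

rng = np.random.default_rng(0)
shape, n, r = (32, 32, 8), 300, 0.25      # r: within-block coherence

beta = np.zeros(shape)
beta[4:12, 4:12, 2:6] = 0.5               # one illustrative 8x8x4 block

X = rng.standard_normal((n,) + shape)     # N(0, 1) background
k = 8 * 8 * 4                             # voxels per block
cov = (1.0 - r) * np.eye(k) + r * np.ones((k, k))   # assumed Sigma_r
block = rng.multivariate_normal(np.zeros(k), cov, size=n)
X[:, 4:12, 4:12, 2:6] = block.reshape(n, 8, 8, 4)

# labels drawn from a Bernoulli with the logistic probability
logits = X.reshape(n, -1) @ beta.ravel()
p = 1.0 / (1.0 + np.exp(-logits))
y = np.where(rng.random(n) < p, 1, -1)
```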

Fig. 2. (top-left) A 2D slice of the ground-truth coefficients for the simulated data. (bottom-left) A 3D view of the ground-truth nonzero coefficients. The remaining panels show the significant regions on synthetic data detected by the four methods. Shaded gray regions correspond to the true nonzero coefficients; the red regions are computed from the estimated nonzero coefficients averaged over 96 trials.

3.2 Neuroimaging Data

Our neuroimaging data comes from an in-house multiple sclerosis (MS) study. Following recent research that suggests a possible pivotal role for iron in MS [15], we are investigating whether iron in deep gray matter is a potential biomarker of disability in MS. High-field (4.7T) quantitative transverse relaxation rate (R2*) images are used, as they have been shown to be highly influenced by non-heme iron [14]. Sample R2* slices are shown in Fig. 4 (top). The focus is on the subcortical deep gray matter structures: caudate, putamen, thalamus and globus pallidus. Forty subjects with relapsing-remitting MS (RRMS) and 40 age- and gender-matched controls were recruited. Ethical approval and informed consent were obtained.

Prior to analysis, the MRI data is pre-processed and aligned with an in-house unbiased template using ANTs [1]. The multimodal template is built from 10 healthy controls using both T1w and R2* images. Pre-processing involves intra-subject alignment of R2* with T1w and bias field intensity normalization for T1w [17]. Nonlinear registration into the template space is done using SyN [3]. Aligned R2* values are used as iron-related measurements. The measurement row vectors \(\mathbf{x}_i\), of size 158865, are formed by selecting only voxels inside a deep gray matter mask manually traced on the atlas.

Fig. 3. Results for the synthetic experiments: (a) classification scores for noise level \(r=0.25\); (b) Dice scores between ground-truth and estimated nonzero coefficients; (c) sum of absolute errors (SAE) between ground-truth and estimated coefficients for \(r = 0.25, 0.5\).

3.3 Evaluation Methodology

We compare the performance of the four penalized logistic regression models described in Sect. 2: \(SCAD+SCADTV\), \(SCAD+TV\), \(l_1+SCADTV\) and \(l_1+TV\). Training and test data are selected for each of the 96 synthetic datasets (200 training and 100 test samples) and for the real data (5-fold cross-validation). Results are reported on the test data using the \({\varvec{\beta }}\) coefficients computed on the training data. The sparse regions are formed by all nonzero coefficients.

Classification results are evaluated using accuracy (proportion of correctly classified samples), sensitivity (true positive rate), specificity (true negative rate) and the area under the receiver operating characteristic (ROC) curve (AUC). Variable selection accuracy compared to the ground truth for synthetic data is evaluated using the Dice score. We also compute the sum of absolute errors (SAE) between the recovered and ground-truth coefficients. For real data, we measure the stability of the detected regions using the Dice score between the estimated regions in each of the 5 folds (Dice folds).
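The Dice score used for both the ground-truth comparison and the between-folds stability measure can be computed as follows (a minimal sketch with illustrative names):

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice score 2|A ∩ B| / (|A| + |B|) between two binary voxel masks."""
    a, b = np.asarray(mask_a, bool), np.asarray(mask_b, bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0
```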

3.4 Results

Comparative results on synthetic data with two levels of coherence \(r \in \{0.25,0.5\}\) for the multivariate normal distribution are reported in the bar graphs of Fig. 3(a), (b) and (c). In terms of classification accuracy, plot (a), results are comparable for all four methods, with a mean of about \(94\%\) for \(SCAD+SCADTV\), \(SCAD+TV\) and \(l_1+TV\), and slightly lower for \(l_1+SCADTV\). However, when looking at the accuracy of variable selection using the Dice score (b), as well as the accuracy of the recovered sparse coefficients (c), the \(SCAD+SCADTV\) penalizer is clearly superior to the others: it achieves the highest Dice score and the lowest SAE of the recovered coefficients. To visualize the results of the 96 trials, we average the estimated nonzero coefficients, as binary masks, and threshold at 0.2. The results illustrated in Fig. 2 confirm the numerical evaluation, showing that the \(SCAD+SCADTV\) penalty gives the cleanest variable selection results, closest to the ground truth, while the \(l_1+TV\) penalty achieves the worst performance.

Table 1. Results for real MRI data. Means over the 5 folds are reported. Class. rate = classification rate; Sens. = sensitivity; Spec. = specificity; AUC = area under the ROC curve; Dice Folds = Dice score between the sparse regions detected in different folds. Bold highlights the best result among methods.

Comparative classification results on real neuroimaging MRI data are reported in Table 1. As ground truth on the selected sparse regions is not available for real data, we estimated the quality of the detected sparse regions through their stability across folds, measured using between-fold Dice scores (Dice folds); we report the average over the 10 distinct fold pairs. While classification results are comparable among the proposed penalizers, the results on the stability of the detected regions clearly show that the new SCAD and SCADTV penalties achieve superior results. To visualize the results, Fig. 4 displays sample axial slices and a 3D view of the regions recovered by the four methods. The regions were calculated from all data with optimal parameters for each method. Most methods recover compact regions in very similar brain locations.

Fig. 4. Illustration of the significant anatomy detected by the four methods using MRI data. Top: 2D axial slices with the R2* data as background. Bottom: a 3D view of the result. The deep gray matter mask used for selecting the voxels included in the observation vectors \(\mathbf{x}_i\) is contoured in white; the selected significant regions are shown in red.

4 Discussion

We introduced a new penalty based on SCAD for variable selection in the context of sparse classification of high-dimensional neuroimaging data. While the SCAD penalty was proposed in the statistical literature to overcome the inherent bias of the \(l_1\) and TV penalties, it had not yet been used in medical imaging population studies. We experimentally showed on simulated and real MRI data that the proposed SCAD-based models are better at selecting the true nonzero coefficients and achieve higher accuracy. As part of our future work, we are looking at deriving theoretical results on coefficient bounds and variable selection accuracy for the SCAD-based models. Extending this work, similar penalizers could be used for regression or data representation (e.g. PCA, CCA).