
1 Introduction

In recent years, the issue of model interpretability has attracted increasing attention in voxel-based neuroimage analysis for disease prediction, e.g. [5, 9]. For example, features preprocessed from structural Magnetic Resonance Imaging (sMRI) images usually contain the following voxel-wise feature types: (1) lesion features, which contribute to the disease; (2) procedural bias, introduced during the preprocessing steps and shown to be helpful in classification [3, 12]; (3) irrelevant or null features, which are uncorrelated with the disease label. Our goal is to stably select the non-null features, i.e. lesion features and procedural bias, with high power/recall and a low false discovery rate (FDR).

Lesion features have been the main focus in disease prediction. In dementia such as Alzheimer’s Disease (AD), these features are thought to be geometrically clustered in atrophied regions (hippocampus, medial temporal lobe, etc.), as shown by the red voxels in Fig. 1(A). To exploit such spatial patterns, multivariate models with Total Variation (TV) [10] regularization can be applied to enforce smoothness on neighboring voxels; e.g., the \(n^{2}\)GFL [15] can stably identify the early damaged regions in AD by harnessing the lesion features.

Recently, another type of feature, called procedural bias, which is introduced during the preprocessing steps, was found to be helpful for disease prediction [12]. Again taking AD as an example, the procedural bias refers to mistakenly enlarged Gray Matter (GM) voxels surrounding enlarged cerebrospinal fluid (CSF) spaces, e.g. the lateral ventricle, as shown in Fig. 1(A). This type of feature had been ignored in the literature until recently, when GSplit LBI [12] was proposed to capture both types of features via a split between the task of TV regularization (for lesions) and the task of disease prediction with a general linear model (including procedural bias). By leveraging such bias, it outperforms models that focus only on lesions in terms of prediction power and interpretability.

However, GSplit LBI may suffer from inaccurate feature selection due to the following limitations in high-dimensional feature spaces: (1) multicollinearity: high correlation among features in multivariate models [14]; (2) “heterogenous features”: the procedural bias and lesion features differ in volumetric change (enlarged vs. atrophied) and, in particular, in spatial pattern (surroundingly distributed vs. spatially cohesive). Specifically, multicollinearity can lead to the selection of spurious null features that are correlated with non-nulls. Moreover, GSplit LBI fails to account for this heterogeneity, since it enforces spatial correlation on all features without differentiation. Together, these problems may result in inaccurate selection of non-nulls, especially procedural bias. As shown in Fig. 1(B) and Table 2, the procedural bias selected by GSplit LBI is unstably scattered over regions less informative than the ventricle. Moreover, collinearity among features tends to select only a subset of the correlated features, as discussed in [16]. This limitation causes GSplit LBI to miss many meaningful regions (such as the medial temporal lobe, thalamus, etc.) when selecting lesion features, as marked by the purple frames (identified by FDR-HS) in Fig. 1(B). Both problems may worsen as dimensionality grows: in our experiments at a fine resolution (\(4\times 4\times 4\) mm\(^3\), 20,091 features), the prediction accuracy of GSplit LBI deteriorates to \(89.77\%\) (as shown in Table 3), lower than the \(90.91\%\) reported in [12] at a coarse resolution (\(8\times 8\times 8\) mm\(^3\), 2,527 features).

Fig. 1. A: the features selected by FDR-HS (green denotes procedural bias; red denotes lesion features, which are geometrically clustered). B: comparison with GSplit LBI.

To resolve the problems above, we propose a “two-groups” empirical Bayes method to identify heterogenous features, called FDR-HS (“FDR Heterogenous Smoothing”) in this paper. As a univariate FDR-control method, it avoids the collinearity problem by proceeding voxel-by-voxel, as discussed in [7]. Moreover, it handles heterogeneity by regularizing feature groups with different levels of spatial coherence, which remedies the loss of spatial patterns that most conventional mass-univariate models suffer from, such as the two-sample t-test, BH\(_{q}\) [4], and LocalFDR [7]. By introducing a binary latent variable, the problem becomes a convex optimization that can be solved efficiently via an EM algorithm, as in [13]. The method is applied to voxel-based sMRI analysis of AD at a fine resolution (\(4\times 4\times 4\) mm\(^3\), 20,091 features). As a result, the proposed method exhibits much more stable feature selection than GSplit LBI and achieves better classification accuracy, at \(91.48\%\).

2 Method

Our dataset consists of p voxels and N samples \(\{x_{i},y_{i}\}_{1}^{N}\), where \(x_{ij}\) denotes the intensity value of the \(j^{th}\) voxel of the \(i^{th}\) sample and \(y_{i} \in \{\pm 1\}\) indicates the disease status (\(-1\) denotes AD). The FDR-HS method is proposed to select the non-null features. It combines a “two-groups” model with heterogenous regularization, as illustrated in Fig. 2 and discussed below.

Fig. 2. Illustration of the FDR-HS model.

Model Formulation. Assume that for each voxel \(i \in \{1,...,p\}\), the statistic \(z_{i}\) is sampled from the following mixture:

$$\begin{aligned} z_{i}&\sim \sum _{k=0}^{1} \mathrm {p}(s_{i} = k)\mathrm {p}(z_{i} | s_{i} = k) = c_{i} f_{1}(z_{i}) + (1 - c_{i})f_{0}(z_{i}), \end{aligned}$$
(2.1)

where \(s_{i}\) is a latent variable indicating whether voxel i belongs to the group of null features (\(s_i=0\)) or of non-null features (\(s_i=1\)), \(c_i = \mathrm {p}(s_{i} = 1) = \mathrm {sigmoid}(\beta _{i}) = e^{\beta _{i}} / \left( 1 + e^{\beta _{i}}\right) \), and \(z_{i} = \mathrm {\Phi }^{-1}\left( F_{N-2}(t_{i})\right) \) with \(t_{i}\) computed by a two-sample t-test. Correspondingly, \(f_{0}(\cdot )\) is the density of the nulls, i.e. features uncorrelated with AD, and \(f_{1}(\cdot )\) is that of the non-nulls, i.e. procedural bias and lesions. The loss function is defined as the negative log-likelihood of \(z\):

$$\begin{aligned} \ell (\beta ) = -\sum _{i=1}^{p} \log {\left( \frac{e^{\beta _{i}}}{1 + e^{\beta _{i}}}f_{1}(z_{i}) + \frac{1}{1 + e^{\beta _{i}}}f_{0}(z_{i})\right) } \end{aligned}$$
(2.2)

which can be viewed as a logistic regression with identity design matrix (when \(f_{0}\) and \(f_{1}\) are replaced with binary indicators, as in (2.6)), since (2.1) proceeds voxel-by-voxel. Hence, it does not suffer from multicollinearity.
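As a concrete sketch of this per-voxel construction (Python with numpy/scipy; the function name `voxelwise_z` is ours, not from the paper), the statistics \(z_{i} = \mathrm{\Phi}^{-1}(F_{N-2}(t_{i}))\) can be computed as:

```python
import numpy as np
from scipy import stats

def voxelwise_z(X, y):
    """z_i = Phi^{-1}(F_{N-2}(t_i)): one statistic per voxel (column of X)."""
    N = X.shape[0]
    # two-sample t-statistic per voxel between the two diagnosis groups
    t, _ = stats.ttest_ind(X[y == 1], X[y == -1], axis=0)
    # push t through its CDF (df = N-2), then through the normal quantile
    return stats.norm.ppf(stats.t.cdf(t, df=N - 2))
```

With the default pooled-variance t-test, the degrees of freedom are \(N-2\), matching \(F_{N-2}\) above.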

Selecting Features. To select features, we compute the posterior distribution of \(s_{i}\) conditioned on \(z_{i}\) and \(\widehat{\beta }_{i}\) (the estimate of \(\beta _{i}\)), and the features satisfying

$$\begin{aligned} \mathrm {p}(s_{i} = 0 | z_{i}, \widehat{\beta }_{i}) = \frac{(1 - \widehat{c}_{i})f_{0}(z_{i})}{\widehat{c}_{i}f_{1}(z_{i}) + (1 - \widehat{c}_{i})f_{0}(z_{i})} < \gamma \ \left( \widehat{c}_{i} = e^{\widehat{\beta }_{i}} / \left( 1 + e^{\widehat{\beta }_{i}} \right) \right) \end{aligned}$$
(2.3)

are selected. Here \(\gamma \in (0,1)\) is a pre-set threshold parameter.
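A minimal sketch of this selection rule (2.3) in Python; `select_nonnulls` and its argument names are our illustrative choices, with the densities \(f_{0}\), \(f_{1}\) passed in as callables:

```python
import numpy as np

def select_nonnulls(z, beta_hat, f0, f1, gamma):
    """Return indices i whose posterior null probability falls below gamma."""
    c_hat = 1.0 / (1.0 + np.exp(-beta_hat))              # sigmoid(beta_hat_i)
    null_post = (1 - c_hat) * f0(z) / (c_hat * f1(z) + (1 - c_hat) * f0(z))
    return np.flatnonzero(null_post < gamma)
```

For instance, with \(f_0\) a standard normal and \(f_1\) centered away from zero, a large \(|z_i|\) yields a small posterior null probability and voxel i is selected.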

Heterogenous Spatial Smoothing. However, (2.1) may lose the spatial structure of the non-nulls, especially the lesion features. Note also that procedural bias and lesion features are heterogenous in volumetric change and in level of spatial coherence. Hence, to capture the spatial structure of heterogenous features, we split the graph of voxels, denoted \(\varvec{G}\), into three subgraphs, i.e. \(\varvec{G} = \varvec{G}_{1} \cup \varvec{G}_{2} \cup \varvec{G}_3\), with:

$$\begin{aligned} \varvec{G}_{1}&= \left( \varvec{V}_{1}, \varvec{E}_{1}\right) , \ \varvec{V}_{1} = \{i: z_{i} \le 0\}, \ \varvec{E} _{1} = \{(i,j) \in \varvec{E}: z_{i} \le 0, z_{j} \le 0\} \end{aligned}$$
(2.4a)
$$\begin{aligned} \varvec{G}_{2}&= \left( \varvec{V}_{2}, \varvec{E}_{2}\right) , \ \varvec{V}_{2} = \{i: z_{i}> 0 \}, \ \varvec{E}_{2} = \{(i,j) \in \varvec{E}: z_{i}> 0, z_{j} > 0\} \end{aligned}$$
(2.4b)
$$\begin{aligned} \varvec{G}_{3}&= \left( \varvec{V}_{3}, \varvec{E}_{3}\right) , \ \varvec{V}_{3} = \varvec{V}_{1} \cup \varvec{V}_{2}, \ \varvec{E}_{3} = \{(i,j) \in \varvec{E}: z_{i} > 0, z_{j} \le 0\} \end{aligned}$$
(2.4c)

where \(\varvec{G}_{1}\) is the subgraph restricted to enlarged voxels (procedural bias, since \(-1\) denotes AD); \(\varvec{G}_{2}\) is the subgraph restricted to degenerate voxels (lesion features); and \(\varvec{G}_{3}\) is the bipartite graph whose edges connect enlarged and degenerate voxels. The objective function is then redefined as:

$$\begin{aligned} g(\beta ) = \ell (\beta ) + \lambda _{pro} \Vert D_{\varvec{G}_{1}} \beta \Vert _{1} + \lambda _{les} \Vert D_{\varvec{G}_{2}} \beta \Vert _{1} + \lambda _{pro\text {-}les} \Vert D_{\varvec{G}_{3}} \beta \Vert _{1} \end{aligned}$$
(2.5)

where \(D_{\varvec{G}_{k}}\) for \(k \in \{1,2,3\}\) denotes the graph difference operator on \(\varvec{G}_{k}\), i.e. \(\Vert D_{\varvec{G}_{k}} \beta \Vert _{1} = \sum _{(i,j) \in \varvec{E}_{k}} |\beta _{i} - \beta _{j}|\). By setting the regularization hyper-parameters \(\{\lambda _{pro}, \lambda _{les}, \lambda _{pro\text {-}les}\}\) to different values, we can enforce spatial smoothness on the three subgraphs at different levels, in contrast to the traditional homogeneous regularization in [13]. The choice of each hyper-parameter, as in [13], is a trade-off between over-fitting and over-smoothing: too small a value tends to select more features than needed, while too large a value over-smooths, so that the features are less clustered. Since lesion features are more spatially coherent than procedural bias and the two are located in different regions, reasonable choices of the regularization hyper-parameters tend to satisfy \(\lambda _{les} \le \lambda _{pro} \le \lambda _{pro\text {-}les}\).
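The constructions above are straightforward to sketch in code (Python with numpy/scipy; both helper names are ours). The first helper partitions the edge set by the signs of the endpoint z-statistics, following (2.4a)-(2.4c); the second builds a graph difference operator \(D_{\varvec{G}_k}\) with one \((+1,-1)\) row per edge, so that \(\Vert D_{\varvec{G}_k}\beta \Vert_1\) sums \(|\beta_i - \beta_j|\) over edges:

```python
import numpy as np
from scipy import sparse

def split_edges(edges, z):
    """Partition edges into E1 (both z<=0), E2 (both z>0), E3 (mixed signs)."""
    E1, E2, E3 = [], [], []
    for i, j in edges:
        if z[i] <= 0 and z[j] <= 0:
            E1.append((i, j))            # within G1 (enlarged voxels)
        elif z[i] > 0 and z[j] > 0:
            E2.append((i, j))            # within G2 (degenerate voxels)
        else:
            E3.append((i, j))            # bipartite edges of G3
    return E1, E2, E3

def diff_operator(edges, p):
    """Sparse difference operator: row per edge (i,j), +1 at column i, -1 at j."""
    m = len(edges)
    rows = np.repeat(np.arange(m), 2)
    cols = np.asarray(edges).ravel()
    vals = np.tile([1.0, -1.0], m)
    return sparse.csr_matrix((vals, (rows, cols)), shape=(m, p))
```

A TV-type penalty term in (2.5) is then, e.g., `lam_pro * abs(diff_operator(E1, p) @ beta).sum()`.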

Optimization. Note that the function (2.5) is not convex. Hence we adopt the same idea as [13] and introduce the latent variables \(s_{i}\), with \(s_{i}=1\) if \(z_{i} \sim f_{1}(z)\) and \(s_{i}=0\) if \(z_{i} \sim f_{0}(z)\). The losses \(\ell (\beta )\) and \(g(\beta )\) are then modified as:

$$\begin{aligned} \ell (\beta ,s)&= \sum _{i=1}^{p} \left\{ \log \left( 1 + e^{\beta _{i}} \right) - s_{i}\beta _{i}\right\} \end{aligned}$$
(2.6)
$$\begin{aligned} g(\beta ,s)&= \ell (\beta ,s) + \lambda _{pro} \Vert D_{\varvec{G}_{1}} \beta \Vert _{1} + \lambda _{les} \Vert D_{\varvec{G}_{2}} \beta \Vert _{1} + \lambda _{pro\text {-}les} \Vert D_{\varvec{G}_{3}} \beta \Vert _{1} \end{aligned}$$
(2.7)

To solve (2.7), we implement an Expectation-Maximization (EM) algorithm that alternately solves for \(\beta \) and s. Suppose we are in the \((k+1)^{th}\) iteration.

In the E-step, we estimate \(s_{i}\) by its conditional expectation given \((\beta ^k,z_{i})\): \(\tilde{s}_{i} = \mathrm {E}(s_{i} | \beta ^k, z_{i}) = \frac{c^k_{i} f_{1}(z_{i})}{c^k_{i} f_{1}(z_{i}) + (1 - c^k_{i})f_{0}(z_{i})}\).

In the M-step, we plug \(\tilde{s}_{i}\) into (2.7), denote \(\widetilde{D}_{\varvec{G}} = \left[ D_{\varvec{G}_{1}}^T, \frac{\lambda _{les}}{\lambda _{pro}}D_{\varvec{G}_{2}}^T, \frac{\lambda _{pro\text {-}les}}{\lambda _{pro}}D_{\varvec{G}_{3}}^T\right] ^T\), and expand \(\ell (\beta | \tilde{s}^k)\) using a second-order Taylor approximation at \(\beta ^k\). The M-step then becomes a generalized lasso problem with squared loss:

$$\begin{aligned} \min _{\beta } \frac{1}{2} \Vert \tilde{y} - \widetilde{X}\beta \Vert _{2}^2 + \lambda _{pro} \Vert \widetilde{D}_{\varvec{G}} \beta \Vert _{1} \end{aligned}$$
(2.8)

where \(\widetilde{X} = \mathrm{diag}\{\sqrt{w_{1}},...,\sqrt{w_{p}}\}\) and \(\tilde{y}_{i} = \sqrt{w_{i}}\left( \beta ^k_{i} - \nabla _{\beta } \ell (\beta | \tilde{s}^k_i)_{|_{\beta ^k}} / w_{i}\right) \) with \(w_{i} = \nabla _{\beta }^2 \ell (\beta | \tilde{s}_i)_{|_{\beta ^k}}\). Note that \(\widetilde{X}\) and \(\widetilde{D}_{\varvec{G}}\) are sparse matrices, so (2.8) can be solved efficiently by the Alternating Direction Method of Multipliers (ADMM) [6], with a complexity of \(O(p\log {p})\).
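The two EM updates can be sketched as follows (Python; function names are ours). We only build the E-step posterior and the IRLS surrogate quantities \((w_i, \tilde{y}_i)\); the generalized-lasso solve of (2.8) itself, e.g. via ADMM, is omitted here. For the complete-data loss (2.6), the gradient is \(\mathrm{sigmoid}(\beta_i) - \tilde{s}_i\) and the Hessian diagonal is \(\mathrm{sigmoid}(\beta_i)(1 - \mathrm{sigmoid}(\beta_i))\):

```python
import numpy as np

def sigmoid(b):
    return 1.0 / (1.0 + np.exp(-b))

def e_step(beta_k, z, f0, f1):
    """Posterior non-null probability s~_i = E(s_i | beta^k, z_i)."""
    c = sigmoid(beta_k)
    num = c * f1(z)
    return num / (num + (1 - c) * f0(z))

def m_step_surrogate(beta_k, s_tilde):
    """IRLS quantities for (2.8): weights w_i (Hessian of (2.6) at beta^k)
    and working response y~_i."""
    grad = sigmoid(beta_k) - s_tilde               # gradient of (2.6) at beta^k
    w = sigmoid(beta_k) * (1 - sigmoid(beta_k))    # Hessian diagonal of (2.6)
    y_tilde = np.sqrt(w) * (beta_k - grad / w)
    return w, y_tilde
```

`w` and `y_tilde` define \(\widetilde{X}\) and \(\tilde{y}\) above, after which any generalized-lasso solver can be applied.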

Estimation of \(f_{0}\) and \(f_{1}\). Before the iterations, we need to estimate \(f_{0}(z)\) and \(f_{1}(z)\). The marginal distribution of z can be regarded as a mixture with p components: \(z \sim \frac{1}{p} \sum _{i=1}^{p} g_{i}(z)\), \(g_{i}(z) = \mathrm{p}(s_{i})\mathrm{p}(z | s_{i}) = c_{i}f_{1}(z) + (1 - c_{i})f_{0}(z)\). Hence, the marginal distribution of z is \(f(z) = \bar{c}f_{1}(z) + (1 - \bar{c})f_{0}(z)\) with \(\bar{c} = \frac{1}{p}\sum _{i=1}^{p} c_{i}\), which is equivalent to LocalFDR [7]. We therefore implement the Central Matching (CM) method [7] to estimate \(\{f_{0}(z),\bar{c}\}\) and kernel density estimation for f(z). The \(f_{1}(z)\) is then given by \(\left( f(z) - (1 - \bar{c})f_{0}(z)\right) / \bar{c}\).
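As a rough illustration only (this is a simplified stand-in, not the central-matching estimator of [7]): fit a Gaussian \(f_0\) to the central z-values, a kernel density estimate for the marginal f, and an Efron-style null fraction from the density ratio at the center; the window and all function names are our assumptions:

```python
import numpy as np
from scipy import stats

def estimate_densities(z, window=(-1.5, 1.5)):
    """Crude stand-in for central matching: Gaussian f0 from central z's,
    KDE for the marginal f, then f1 = (f - (1 - cbar) f0) / cbar."""
    zc = z[(z > window[0]) & (z < window[1])]      # central values, mostly nulls
    mu0, sd0 = zc.mean(), zc.std()
    f = stats.gaussian_kde(z)
    f0 = lambda t: stats.norm.pdf(t, mu0, sd0)
    # Efron-style null fraction: assume f1 is negligible near the null center
    pi0 = min(1.0, float(f(mu0)[0]) / f0(mu0))
    cbar = 1.0 - pi0                               # average non-null probability
    f1 = lambda t: np.maximum(f(t) - pi0 * f0(t), 0.0) / max(cbar, 1e-8)
    return f0, f1, cbar
```

This follows from \(f(z) = \bar{c}f_{1}(z) + (1-\bar{c})f_{0}(z)\): subtracting the estimated null component and renormalizing recovers \(f_1\).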

3 Experimental Results

In this section, we evaluate the proposed method on the ADNI database (http://adni.loni.ucla.edu). The database is split into 1.5T and 3.0T datasets (namely 15 and 30) according to MRI scanner magnetic field strength. The 15 dataset contains 64 AD, 110 MCI (Mild Cognitive Impairment), and 90 NC subjects, while the 30 dataset contains 66 AD and 110 NC subjects. After applying the DARTEL VBM [2] preprocessing pipeline with a \(4 \times 4 \times 4\) mm\(^{3}\) voxel size, there are in total 20,091 voxels whose average GM value on the population template is greater than 0.1; these serve as the input features. We designed experiments on the 1.5T AD/NC, 1.5T MCI/NC, and 3.0T AD/NC tasks, namely 15ADNC, 15MCINC, and 30ADNC, respectively.

3.1 Prediction Results

To test the efficacy of the features selected by FDR-HS against other univariate models (listed in Table 1), we feed them into an elastic net classifier, which is among the state of the art for neuroimage prediction [11]. The hyper-parameters are determined by grid search. In detail, the p-value threshold for the t-test and the q-value threshold for BH\(_{q}\) are optimized over \(\{0.001,0.01,0.02,0.05,0.1\}\); the threshold for selecting non-nulls, i.e. \(\gamma \) in (2.3) for FDR-HS and its counterpart in LocalFDR [7], is chosen from \(\{0.1,0.2,...,0.5\}\). The regularization parameters \(\lambda _{pro}\), \(\lambda _{les}\), and \(\lambda _{pro\text {-}les}\) of FDR-HS range over \(\{0.1,0.2,...,2\}\). For the elastic net, the regularization parameter is chosen from \(\{0.1,0.2,...,2,5,10\}\) and the mixture parameter \(\alpha \) from \(\{0,0.01,...,1\}\). Moreover, we compare our model to GSplit LBI and the elastic net, adopting the same hyper-parameter optimization strategy as [12] (the top 300 negative voxels are identified as procedural bias [12]); the elastic net applied after the univariate models is tuned as described above.

A 10-fold cross-validation strategy is applied, and the classification results for all tasks are summarized in Table 1. As shown, our method yields better results than the others in all cases. This suggests that: (1) FDR-HS selects features with more prediction power than other univariate models, owing to its ability to capture heterogenous spatial patterns; (2) FDR-HS achieves better classification results than multivariate methods in high-dimensional settings, in which non-nulls may be replaced by nulls that are highly correlated with them.

Table 1. Comparison between FDR-HS and others on 10-fold classification result

3.2 Feature Selection Analysis

We use 2-d slices of 30ADNC to visualize the features selected by all methods under the hyper-parameters giving the best accuracy. As shown in Fig. 3, the lesion features selected by FDR-HS are clustered in early damaged regions, while the procedural bias surrounds the lateral ventricle. This result is obtained with \(\lambda _{les}< \lambda _{pro} < \lambda _{pro\text {-}les}\), which agrees with the observation that a larger value yields features with a lower level of spatial coherence. In contrast, the lesions selected by the t-test and BH\(_{q}\) are scattered and redundant, and some procedural bias around the lateral ventricle is missed by BH\(_{q}\) and LocalFDR. Moreover, GSplit LBI selects procedural bias in regions where the CSF space is less enlarged than the lateral ventricle, and it ignores lesions located in the medial temporal lobe, thalamus, fusiform gyrus, etc., which are believed to be early damaged regions [1, 8].

Fig. 3. Comparison of FDR-HS with the other methods in terms of feature selection (30ADNC). Red denotes lesions; blue denotes procedural bias.

Table 2. Comparison between FDR-HS and others on stability (measured by mDC)

We also evaluate the stability of the selected features using the multi-set Dice Coefficient (mDC) defined in [15]; a larger mDC implies more stable feature selection. As shown in Table 2, our model obtains more stable results than GSplit LBI, which suffers from the collinearity problem.
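For intuition, one natural multi-set generalization of the Dice coefficient over K selected-feature sets is \(\mathrm{mDC} = K\,|S_1 \cap \cdots \cap S_K| / \sum_{k} |S_k|\); we sketch this common form below, though the exact definition in [15] should be consulted:

```python
def mdc(selections):
    """Multi-set Dice coefficient over K selected-feature sets (a common form,
    not necessarily the exact definition of [15])."""
    sets = [set(s) for s in selections]
    common = set.intersection(*sets)
    total = sum(len(s) for s in sets)
    return len(sets) * len(common) / total if total else 0.0
```

Identical selections across all folds give mDC = 1; disjoint selections give mDC = 0.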

4 Conclusions

In this paper, a “two-groups” empirical Bayes model is proposed to stably and efficiently select interpretable heterogenous features in voxel-based neuroimage analysis. By modeling the prior probability voxel-by-voxel and using heterogenous regularization, the model avoids multicollinearity and exploits the spatial patterns of features. In experiments on the ADNI database, the features selected by our model show better interpretability and prediction power than the alternatives.