Exploring stability-based voxel selection methods in MVPA using cognitive neuroimaging data: a comprehensive study

Feature selection plays a key role in multi-voxel pattern analysis because functional magnetic resonance imaging data are typically noisy, sparse, and high-dimensional. Although the conventional evaluation criterion is the classification accuracy, selecting a stable feature set that is not sensitive to the variance in dataset may provide more scientific insights. In this study, we aim to investigate the stability of feature selection methods and test the stability-based feature selection scheme on two benchmark datasets. Top-k feature selection with a ranking score of mutual information and correlation, recursive feature elimination integrated with support vector machine, and L1 and L2-norm regularizations were adapted to a bootstrapped stability selection framework, and the selected algorithms were compared based on both accuracy and stability scores. The results indicate that regularization-based methods are generally more stable in StarPlus dataset, but in Haxby dataset they failed to perform as well as others.


Introduction
Exploring the mysteries of brain function is one of the most challenging and fascinating tasks in the domain of science. In recent years, with the advent of machine learning techniques, the interdisciplinary field of machine learning and neuroscience has drawn growing attention to both communities. With the aid of modern neuroimaging techniques, the capability of machine learning algorithms to identify distributed patterns of voxels in response to stimuli allows for decoding brain activities using data-driven models. A comprehensive review of previous studies has been provided in [1][2][3]. In this study, we would like to focus on multi-voxel pattern analysis (MVPA) [4], which is a commonly used methodological framework for analyzing functional magnetic resonance imaging (fMRI) data with machine learning algorithms (see Fig. 1). fMRI is a popular, non-invasive neuroimaging technique to measure brain activity via blood-oxygen-level dependent (BOLD) signals, recorded as time series in a three-dimensional (3D) brain space. The precise spatial localization of brain activation, therefore, is an essential advantage of fMRI compared to other non-invasive neuroimaging techniques. Unlike conventional univariate approaches, MVPA constructs a pattern classification problem to decode neural information processing by characterizing multivariate brain activity patterns [5].
However, fMRI-based data analysis using machine learning approaches has a challenging small-n largep problem, i.e., there are many thousands of voxels in the brain, but the sample size is relatively small because of the expensive cost of fMRI data collection. Moreover, only a portion of the brain will be activated with respect to specific stimulus or mental states. Hence, selecting the active voxels associated with particular stimuli or states is an important and challenging task before training classifiers in MVPA, which is called feature selection or feature reduction. In current studies, a common criterion of evaluating the subset selection is classification accuracy. This evaluation criterion may suffer from the variance in training data with a limited sample size and result in unstable generalization error when the trained model is applied to an unknown dataset. Selecting stable features across various datasets, on the other hand, has not been completely investigated. Therefore, the objective of our study is to explore for an integrated stability-based feature selection approach.
The remainder of this paper is organized as follows: Section 2 provides a brief review of existing studies, including stability selection algorithms and their applications to neuroimaging data. Section 3 illustrates the methodology, including experimental settings, data description, feature extraction and selection methods, classification algorithms, and methodological framework. Results are reported and discussed in Sect. 4, followed by the conclusion and possible directions for future work in Sect. 5.

Literature review
A major challenge in MVPA, as stated previously, comes from the high dimensionality and sparsity in fMRI data. Hence, the regularized logistic regression (LR) such as least absolute shrinkage and selection operator (LASSO) and elastic net (or ENet for short) are found to be particularly useful in addressing sparsity. Another general objective of feature selection is to build interpretable models which are able to support or reject hypothesis with domain knowledge. To this end, selecting a stable subset that is robust to the variance in samples is of great importance. Numerous studies have discussed the stability issue using various types of feature selection methods from statistician's perspective [6][7][8][9]. Numerous metrics to quantify the stability in feature selection were proposed, but no standard guideline for comparing various feature selection methods has been acknowledged up to date [6,7,10,11]. In this section, a brief review of existing studies of stability selection is provided in terms of methodology and applications to neuroimaging data.
Before Meinshausen and Buhlmann [8] proposed their methodological framework of stability selection, some early studies have implied the usefulness of re-sampling strategy such as bootstrap of improving the stability of feature selection [7,12]. In Meinshausen and Buhlmann's work, the subset selection is performed via repeatedly running LASSO on re-sampled subsets, while each subset is half the size of original samples. A feature is able to enter the model only if the frequency of being selected is greater than a user-defined threshold (denoted as H below). This method was later improved in [9] by changing the re-sampling mechanism such that if one half of the dataset was sampled, the other complimentary half should also be used. This Complimentary Pairs Stability Selection (CPSS) method has been mathematically proved to provide an improved bound for the estimation error control. An interesting aspect of stability selection is that although original stability selection approach was claimed not to be sensitive to the selection of H in a range of [0.6, 0.9], it was reported in the CPSS article [9] that the choice of H may have an impact. In general, stability selection is a topic that has not been fully discovered.
Stability-based data analysis approaches based on neuroimaging data have drawn growing interest from neuroscientists in recent years, and have been widely adopted as a methodological framework in existing studies. The great potential of stability selection lies in its adaptability, which allows users to develop their own approaches with various focuses as well as domain knowledge in order to construct more powerful knowledge discovery systems. The existing applications are limited in quantity, but rich in diversity from the following aspects. First of all, in current studies, stability selection has been used to satisfy a variety of research purposes such as exploring the brain functionality in visual pathways [11], functional networks [13,14], resting-state networks [15], or the localization (or identification) of significant biomarkers relevant to specific mental states [16] or diagnose brain-related disorders [17]. Second, in terms of methodology, numerous variations were made utilizing the concept of stability selection. For example, the possible features for stability selection can be extracted from the functional network; in addition to voxels (or nodes in network science), selecting discriminative connectivity (edges) is also helpful to understand the mechanism underlying functional networks [13,14,18,19]. Moreover, some studies integrated other machine learning algorithms such as clustering [15,20,21], graphical lasso [18], and support vector machine (SVM) [14]. A novel variation of LASSO was proposed to search for similar but not identical voxels in feature selection across multiple human subjects [22]. Finally, although the original stability selection was proposed as a data-driven model, some novel methods also utilized anatomical information, topological structure, or other structural information underlying features to enhance its stability and predictive power [16,23].

StarPlus dataset
This dataset is named StarPlus because of the visual stimuli presented to subjects during the experiments. Subjects were instructed to focus on the visual stimulus on the screen when fMRI data was recorded. In one half of all experiment trials, a sentence (semantic stimulus) was presented first for 4 s (e.g., ''It is true that the star is above the plus.''), followed by an image (symbolic stimulus) showing similar information for another 4 s (see Fig. 1a). The subjects need to press a button to indicate whether if the sentence and image matches each other. In remaining trials, the sequence of presenting sentences and images switches. 40 trials were conducted during this experiment, each of which contains 2 samples labeled by the type of stimulus (semantic = '0,' symbolic = '1').
The fMRI data was collected at 500 ms sampling rate in a 3D space of 64 Â 64 Â 8 voxels, and the pre-processed data of 6 subjects is available to public. The scanned area contains 25-30 anatomical regions of interest (ROIs), which have approximately 4000 voxels. Particularly, 7 ROIs are highlighted by the proposer as they are most relevant to this task. Thus, the number of voxels to be analyzed in our study is reduced to around 2000, varying from subject to subject.

Haxby dataset
Haxby dataset contains the fMRI scans of 6 subjects. The experiment has 12 trials, each of which lasts for about 24 s, separated by rest periods (see Fig. 1b). In each trial, 8 images presenting 8 types of objects including houses, human faces, cats, and so on. Images were shown on the screen for 500 ms of each; the inter-stimulus interval is 1500 ms. The entire experiment was then partitioned into 12 Â 8 ¼ 96 samples from each individual with only one trial removed from subject 5 who was corrupted during this trial. The fMRI scans were collected in a space of 40 Â 64 Â 64 voxels, corresponding to a voxel size of 3:5 Â 3:75 Â 3:75 mm 3 , and a volume repetition time of 2.5 s [4]. Similarly, instead of examining the whole brain, our study is focused on the visual cortex area which consists of up to 675 voxels based on the anatomical information of our subjects.

Feature extraction
General linear model (GLM) approach as introduced in [25] was applied to the time series data for feature extraction. The basic concept is to characterize BOLD signals by fitting GLM to a haemodynamic response function (HRF) that describes blood-oxygen-level responses to the given stimulus as a function of time. The estimates of the coefficientŝ b ¼ fb 1 ; :::b m g T in GLM model: Y ¼ Xb represent the time-related response of each individual voxel to the stimulus of interest. Usingb as features results in an m-dimensional feature space, where each voxel is represented by its beta valueb j ; j 2 f1; :::; mg. In our study, pre-processing and feature extraction were implemented in Matlab 8.3 [26] using a toolbox [27]. Figure 2 illustrates extracting beta values as features for subject 1 in the Haxby dataset, where the samples (stimuli) are ordered in the same sequence as presented in the experiment.

Feature selection
Current feature selection methods are categorized into three classes based on how the subset-search algorithm is combined with the classification procedure: filter, wrapper, and embedded [28,29]. In this subsection, the selected feature selection methods are reviewed under this framework.

Filter approach
Filter methods are relatively independent of classification comparing to other types of methods, and the interactions among features are completely ignored. With a pre-defined metric of relevance between individual features and class labels, all features are ranked and the top-n features  comprise the subset selection. In this study, Pearson correlation (referred to as Corr) and mutual information (MI) were employed as they are commonly used metrics. Moreover, the size of subset to be selected is not arbitrarily determined, but optimized using a cross-validation scheme.
Since the classifier used in combination with these filter methods is SVM, these approaches will be referred to as SVM-MI, SVM-Corr in the following sections.

Wrapper approach
Instead of evaluating the similarity between individual features and class labels, wrapper methods seek for a best subset of features by evaluating the subset as a whole based on classification performance. The recursive feature elimination (RFE) integrated with SVM, referred to as SVM-RFE, was chosen as an example of wrapper approach in our study. It is a backward feature selection approach which starts with the entire feature set and iteratively removes a proportion of features after evaluation using SVM, which was implemented using a toolbox in Matlab 8.3 [30]. However, a significant disadvantage of wrapper methods is the computational cost: the classification algorithm need to be performed repeatedly for every subset in the candidate pool, which will largely increase the computational time especially with high-dimensional data. In order to be consistent with filter methods, the size of subset in feature selection was also optimized using a cross-validation scheme.

Embedded approach
The embedded methods utilize regression models with regularization. In such models, the feature selection is embedded in the training process of classification algorithm by optimizing a penalty parameter k. With an appropriate k selected using a cross-validation scheme, all redundant features are removed from the model by forcing their coefficients to be zero. In this study, we employ both LASSO and ENet as embedded approaches. More details about these algorithms are to be discussed later in Sect. 3.4.

Classification algorithms
Consider that in a binary classification problem, the input data are a set of data points X ¼ fx 1 ; :::; x n g in an m-dimensional feature space, i.e., x i 2 R m 8 i 2 f1; :::; ng, where n is the number of data points and m is the number of features. The corresponding target values T ¼ ft 1 ; :::; t n g are the class labels. The predictions of class labels are denoted by Y ¼ fy 1 ; :::; y n g. The objective of classification algorithms is to estimate the optimal parameters w and b, such that the mapping f : X ! Y best captures the relationship between inputs and targets.

Support vector machine
SVM is a classifier that optimizes the decision boundary with a maximum geometrical margin, i.e., the distance between decision boundary and the closest data points in each class. The soft-margin SVM with a linear kernel is formulated as follows: arg min s:t: t i ðw T x i þ bÞ ! 1 À n i 8i 2 f1; :::; ng; ð2Þ n i ! 0 8 i 2 f1; :::; ng; where slack variables n i are introduced to give tolerance to the misclassified data points lying in between support vectors, parameter C controls the tolerance level, and the target values t i 2 fÀ1; 1g. The decision boundary of a linear classifier is a hyperplane described by the function:

Regularized logistic regression
LASSO is a shrinkage method proposed by Tibshirani [31], which is applicable to both linear and logistic regression models; ENet is a widely used variation of LASSO proposed by Zou and Hastie [32]. In linear regression, the Fig. 2 The feature (beta values) matrix is extracted by applying a general linear model to fMRI BOLD signals. Subject 1 in Haxby dataset is used as an illustrative example objective of LASSO is to find optimal solution for the following problem: where k is a tuning parameter which controls the shrinkage. This formulation can be generalized to logistic regression models by replacing Eq. (4) with the cost function in LR model. Similarly, the formulation of ENet shares the same objective function as in Eq. (4) but the constraint is as follows: where a controls the trade-off between ridge regression and LASSO. In our study, a ¼ 0:8 is used as a common selection.

Methodological framework
Although the concepts of stability selection were utilized in this study, the setup of experiment differs in two datasets. Table 1 presents the cross-validation settings of both datasets. Let D T , D G ; and D V denote training, test and validation set, respectively. The general framework is demonstrated as follows: -Step 1: Randomly take a subset D S out of training set D T ; -Step 2: Run the feature selection method on set D S while using D V to control the tuning parameters in selected algorithm; -Step 3: Repeat step 1 and 2 n times; -Step 4: Use a set of most frequently selected features S as the future feature set; -Step 5: Train the model with selected features on D T and D V ; -Step 6: Evaluate the performance on D G ; -Step 7: Repeatedly perform Step 1 to 6 according to selected cross-validation scheme. In Step 2, after specifying the D T , D G ; and D V , the resampling was performed 50 times on D T for feature selection with simultaneous validation on D V . Provided that stability selection method proposed a re-sampling scheme with embedded feature selection methods [8,9], our approach was designed to utilize the filter and wrapper methods in the same manner such that the results can be compared apples to apples. Further, ten replications were conducted based on different settings of D T , D G ; , and D V for StarPlus dataset, while twelve replications were performed for Haxby dataset such that each trial was used exactly once as test set.
The stability measure in our study is Jaccard Index [33], a measure of similarity between two sets. Suppose there are two subsets S a and S b , then the Jaccard Index for (S a ; S b ) is defined as where j S j is the number of elements in set S. When there are k subsets, the overall similarity is computed by averaging the pairwise Jaccard Index for all possible pairs. The formulation is given as follows: 4 Results and discussions In this section, the results are presented and discussed from the following aspects. First, a comparison among selected feature selection methods is provided based on accuracy and stability. Second, the selection of H is further examined to provide some suggestions for future studies. Finally, the localization of voxels selected by each method is discussed to provide some insights.

Feature selection methods
As shown in Tables 2-13, the classification performance has a large variance across algorithms and subjects. In this section, some discussions are separately given to two datasets since algorithms performed differently in our experiment.

Filter and wrapper methods
In StarPlus dataset, SVM-MI, SVM-RFE, and SVM-Corr performed at a comparable level as embedded algorithms in terms of accuracy, but embedded algorithms yielded a better overall stability. Moreover, it is not desirable that SVM-Corr sometimes selected a large subset although it is always highly stable. It may imply that SVM-Corr approach, according to the current experiment settings, tends to overfit in some cases. In Haxby data, however, SVM-MI, SVM-RFE, and SVM-Corr algorithms are more accurate than embedded algorithms in general. In terms of computational cost, SVM-MI and SVM-Corr are much faster than SVM-RFE. Among these three algorithms, SVM-MI is suggested based on overall accuracy, stability, and computational time, which is interestingly consistent with a previous study using same dataset without utilizing stability selection [34].

Embedded methods
In general, ENet has higher stability and standard deviation compared to LASSO; also, it selects a larger and more stable subset. It indicates that throughout all replications the ENet has more stable subsets in feature   selection, but these subsets yielded an unstable predictive power compared to LASSO. Comparison based on the best performing model, ENet yields better accuracy than LASSO in general, which is also supported by previous study using the same dataset [35]. This phenomenon may relate to the balance between variance and bias of generalization error in statistics. The stability selection scheme provides a control to help avoid the situation of having an unstable feature subset in the model. On the other side, however, by reducing the total number of   available samples for training purposes, it seems to scarify accuracy to some extent. This raises questions that, if it is possible to design a systematic approach to achieve or control the balance between stability and accuracy.
Depending on the objective of their studies, some researchers may favor an interpretable model to explore or support a hypothesis, while others may prefer a predictive one for practical use.

Threshold selection
According to the our experimental results, the selection of H within [0.6, 0.9] has a significant influence on classification accuracy. This finding is consistent with the comments in [9]. More interestingly, a rough trend seems to imply that LASSO favors a smaller H while ENet prefers a larger one. As no previous studies have reported this   behavior in stability selection based on our knowledge, we can only make intuitive inference for the possible reason.
Since group effect is encouraged in ENet, it tends to introduce more features into the model than LASSO, and thus a higher H is preferred to avoid introducing too many redundant features. Another interesting observation is the correlation to stability scores. For most subjects, the stability scores seem to be negatively correlated with H in LASSO and ENet, which indicates setting up a high threshold may have a negative impact on model stability. The size of subset to be selected after re-sampling and replications, however, does not show any correlations with H in stability selection. Moreover, the size of subset remains stable in general for the same subject with a varying H. These findings encourage further exploration for standard guidelines for the selection of H with empirical or theoretical supports. Figure 3 presents a visualization of selected voxels for subject 1 in Haxby dataset and subject 04820 in StarPlus dataset. Subset selection is determined by picking up the most stable voxels, namely, the voxels with highest selection frequency throughout all replications. In general, the algorithms with higher stability scores: SVM-Corr, LASSO, and ENet selected a cluster of voxels located in visual cortex area, which is consistent with the domainspecific knowledge, while SVM-MI and SVM-RFE had a sparse voxel distribution. This indicates that stability-based feature selection framework provides a more stable, interpretable subset selection, which is difficult to achieve by evaluating models using accuracy.

Conclusion
In this study, we conducted a comprehensive analysis for a selection of filter, wrapper, and embedded feature selection approaches on the two benchmark fMRI datasets, adopting a stability-based methodological framework. It is found that the stability of feature selection is a potential alternative criterion for model selection in addition to classification accuracy, especially for those studies whose objective is to find a model with good interpretation rather than excellent predictive power. Having noticed that it is the case for the majority of neuroimaging data-based studies, developing stability-based feature selection may be helpful for identifying important voxels to decode mental states.
The future studies may explore a reliable metric to quantify the stability of feature selection methods because it has not been clearly defined. A standard guideline for selecting a suitable feature selection approach to achieve higher stability can be developed on the basis of a reliable metric. Also, a methodological framework which enables control of the balance between accuracy and stability is another issue to be further explored. Furthermore, it would be an interesting topic to examine the stability in voxel selection across different subjects, which will also be a challenging task because the activity patterns in brain are known to have large individual variations even in the same cognitive tasks.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://crea tivecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.