Neuroinformatics, Volume 11, Issue 2, pp 227–247

Penalized Likelihood Phenotyping: Unifying Voxelwise Analyses and Multi-Voxel Pattern Analyses in Neuroimaging

Penalized Likelihood Phenotyping

Authors

  • Nagesh Adluru
    • University of Wisconsin-Madison
  • Bret M. Hanlon
    • University of Wisconsin-Madison
  • Antoine Lutz
    • University of Wisconsin-Madison
  • Janet E. Lainhart
    • University of Utah
  • Andrew L. Alexander
    • University of Wisconsin-Madison
  • Richard J. Davidson
    • University of Wisconsin-Madison
Original Article

DOI: 10.1007/s12021-012-9175-9

Cite this article as:
Adluru, N., Hanlon, B.M., Lutz, A. et al. Neuroinform (2013) 11: 227. doi:10.1007/s12021-012-9175-9

Abstract

Neuroimage phenotyping for psychiatric and neurological disorders is performed using voxelwise analyses, also known as voxel based analyses or morphometry (VBM). A typical voxelwise analysis treats measurements at each voxel (e.g. fractional anisotropy, gray matter probability) as outcome measures to study the effects of possible explanatory variables (e.g. age, group) in a linear regression setting. Furthermore, each voxel is treated independently until the stage of correction for multiple comparisons. Recently, multi-voxel pattern analyses (MVPA), such as classification, have arisen as an alternative to VBM. The main advantage of MVPA over VBM is that the former employ multivariate methods which can account for interactions among voxels in identifying significant patterns. They also provide ways for computer-aided diagnosis and prognosis at the individual subject level. However, compared to VBM, the results of MVPA are often more difficult to interpret and prone to arbitrary conclusions. In this paper, we first use penalized likelihood modeling to provide a unified framework for understanding both VBM and MVPA. We then utilize statistical learning theory to provide practical methods for interpreting the results of MVPA beyond commonly used performance metrics, such as leave-one-out cross-validation accuracy and area under the receiver operating characteristic (ROC) curve. Additionally, we demonstrate that there are challenges in MVPA when trying to obtain image phenotyping information in the form of statistical parametric maps (SPMs), which are commonly obtained from VBM, and provide a bootstrap strategy as a potential solution for generating SPMs using MVPA. This technique also allows us to maximize the use of available training data. We illustrate the empirical performance of the proposed framework using two different neuroimaging studies that pose different levels of challenge for classification using MVPA.

Keywords

Classification · Regression · Voxel based morphometry · Multi-voxel pattern analysis · Generalization risk · Image phenotyping · Penalized likelihood · Linear models

Introduction

Many neuroimaging studies are conducted with a priori hypotheses to be tested. Voxelwise analysis1 (henceforth referred to as VBM) is the most widely used framework for hypothesis testing in neuroimaging. In this framework, the measurements at each voxel (or region) are treated as outcome measures and are analyzed independently leading to a large number of univariate analyses. Depending on the study, these measurements could be any of the following: cortical thickness obtained using T1 weighted images, blood oxygen level dependent activations obtained using functional magnetic resonance imaging (fMRI), fractional anisotropy computed using diffusion tensor images (DTI), or the index of metabolic activity using positron emission tomography (PET). The relationship between the outcome measure and the experimental design variables is commonly modeled using generalized linear models (GLM) of which the linear model (LM) is a special case (McCullagh and Nelder 1989).2

With increasingly large amounts of data being collected, hypothesis testing alone fails to utilize all of the information in the data. Such studies may also be used to discover interesting patterns of regularity and to find image phenotypical information affecting individual differences in diagnosis, prognosis, or other non-imaging observations. Increasing sample sizes and multi-center studies, combined with the maturation of high dimensional statistical tools, have led to an increasing interest in multi-voxel pattern analysis (MVPA) (Norman et al. 2006; Pereira et al. 2009; Hanke et al. 2009b; Anderson and Oates 2010; Carp et al. 2011; Halchenko and Hanke 2010).3 Thus far, the majority of this work has been in the area of classification, primarily using functional magnetic resonance imaging to detect various states of mind (Pereira et al. 2009). There is a growing interest, in the spirit of computer-aided diagnosis, in performing MVPA using structural information of the brain with modalities such as T1-weighted MRI and diffusion tensor imaging (DTI). However, performing MVPA using structural brain signatures is a harder problem than using functional brain signatures. This is because, except in studies investigating atrophy, structural changes (effect-sizes) are usually much smaller and reside in higher effective-dimensions compared to functional changes, thus demanding more data for both VBM and MVPA models. Yet, surprisingly, the majority of neuroimaging studies collect significantly more functional data than structural data such as DTI. Hence, driven by improving performance scores such as cross-validation accuracies and area under the receiver operating characteristic (ROC) curves, current research has primarily focused on the following two areas. (1) The first area involves developing pre-processing methods for extracting various features, such as using topological properties of the cortical surfaces (Pachauri et al. 2011), spatial frequency representations of the cortical thickness (Cho et al. 2012), shape representations of region-specific white matter pathways (Adluru et al. 2009), or including various properties of the diffusion tensors in specific regions of interest (Lange et al. 2010; Ingalhalikar et al. 2011). Recent work in this direction has even performed meta analyses using large scale data from published articles on the web to extract specific regions in the brain relevant for a given task in fMRI studies (Yarkoni et al. 2011; Mitchell 2011). (2) The second area involves developing various classifier models, such as multi-kernel, multi-modal learning (Hinrichs et al. 2011; Batmanghelich et al. 2011; Zhang et al. 2012), incorporating spatial constraints into a linear program based boosting model (Hinrichs et al. 2009), and even ensemble classifier models (Liu et al. 2012).

However, not much attention has been paid to analyzing model behavior beyond basic metrics such as average cross-validation accuracy and ROC curves, except for a few recent articles such as Hinrichs et al. (2011) and Ingalhalikar et al. (2011), which attempt to interpret the classifier model parameters and prediction scores. Although this is promising, to the best of our knowledge there has been virtually no work on careful risk assessment of such models in neuroimaging studies employing the generalization risk theory available in the statistical machine learning literature.

This paper has four main goals: (1) to unify the key aspects of the two main families of neuroimage analysis, VBM and MVPA, by using penalized likelihood modeling (PLM); (2) to illustrate the fundamental differences between VBM and MVPA in terms of model selection and risk assessment, and to introduce practical generalization risk bounds using results from statistical learning theory; (3) to demonstrate that in obtaining statistical parametric maps (SPMs) using MVPA, one faces an issue similar to the multiple comparisons problem; and (4) to show that there are critical conceptual differences between the use of MVPA in artificial intelligence applications and in neuroimaging studies, including differences between the challenges of state-detection (diagnosis) vs. trait-prediction (prognosis). We use DTI data from two neuroimaging studies to illustrate the potential of the proposed approach. The first study investigates the role of white matter in autism spectrum disorders. The second study addresses the question of white matter plasticity effects of long-term meditation practice. We would like to note that PLM is not new to neuroimaging and has been used in the form of sparse regression by applying the \(\ell _{1}\) penalty (Carroll et al. 2009; Vounou et al. 2010; Ryali et al. 2010; Bunea et al. 2011). However, to the best of our knowledge, our work is the first to use PLM in a global context for unifying VBM and MVPA, which enables us to provide better contrast between and comparison of the two families of analysis.

The remainder of the article is organized as follows. Section “Materials and Methods” contains materials and methods. Specifically, the mathematical and statistical concepts of penalized likelihood and risk assessment for both VBM and MVPA are presented in sections “Loss and Penalty” and “Risk Assessment”, respectively. Section “Neuroimaging Data” presents the neuroimaging data sets discussed in this paper and section “Hypotheses Examined” describes the statistical hypotheses that are examined in connection to the proposed penalized likelihood phenotyping. The experimental results are presented in section “Experimental Results”. Section “Discussion and Future Directions” contains a discussion of the presented work and directions for future research in related areas. The Appendix presents an alternative approach for risk assessment of MVPA, specifically derived for support vector machines.

Materials and Methods

In this section, we first describe how penalized likelihood modeling can provide a unified framework for voxel based morphometry (VBM) and multi-voxel pattern analyses (MVPA) in section “Loss and Penalty”. We then explain how the models can be assessed and interpreted via risk assessment in section “Risk Assessment”.

Loss and Penalty

Penalized likelihood modeling (PLM), and more generally regularization, is rooted in the idea of inducing some selection bias for selecting parsimonious models in explaining the data, without much loss in estimation performance. To set the notation, let us assume we collect brain data (v voxels) and clinical covariates (p variables) on n subjects. Typically, in an imaging data set \(v \approx 10^{5}\), \(n \approx 10^{2}\), and \(p \approx 10\). For instance, in our autism study we have \(v=41{,}011\), \(n=154\) and \(p=4\) (\(\mathtt {Age}\), \(\mathtt {Group}\), Social Responsiveness Scale (\(\mathtt {SRS}\)), \(\mathtt {IQ}\)). For the meditation study, we have \(v=57{,}019\), \(n=49\) and \(p=3\) (\(\mathtt {Group}\), \(\mathtt {Age}\) and Total Lifetime Practice Hours (\(\mathtt {TLPH}\))).

We now explain how the data are modeled in VBM and classification, which is a canonical example in MVPA. VBM fits the following regression model at each voxel
$$\begin{array}{rll}Y&=&X\beta+\varepsilon,~\textrm{where }Y\in\mathbb{R}^{n\times 1},X\in\mathbb{R}^{n\times (p+1)},\\&&\beta\in\mathbb{R}^{(p+1)\times 1}.\end{array}$$
(1)
Y is the vector of outcome measures observed in the brain, X is the design matrix of the p clinical variables and a column of constants, and \(\beta \) is a vector indicating the effect of each variable on the signal. The error term (\(\varepsilon \)) is assumed to follow the standard normal distribution, i.e. \(\varepsilon \sim \mathcal {N}(0,1)\). In contrast, consider the classification problem where the data are typically modeled as
$$Y=X\beta,~\textrm{where }Y\in\{-1,1\}^{n\times 1},X\in\mathbb{R}^{n\times v},\beta\in\mathbb{R}^{v\times 1}.$$
(2)
In this setting, X is a matrix of vectorized brain signals, also commonly known as the feature matrix. Y is the diagnostic or group-label information. Here, each brain is treated as a high-dimensional vector. Hence, a key difference between VBM and classification in data modeling is in the size of \(\beta \). Specifically, in the classification setting \(\beta \) is a high-dimensional object (\(v\gg n\)), while in the VBM setting it is not (\(p\ll n\)). In addition to classification, which translates to computer aided diagnosis, one can also model a high-dimensional regression in MVPA where \(Y\in \mathbb {R}^{n\times 1}\). Such an MVPA regression model translates to computer aided prognosis.4
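To make these dimensions concrete, the following minimal numpy sketch (with synthetic data and the sizes quoted above for the autism study) contrasts the per-voxel design matrix of Eq. 1 with the high-dimensional feature matrix of Eq. 2; all variable names here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, v = 154, 4, 41011  # sizes from the autism study

# VBM (Eq. 1): one small regression per voxel; X is n x (p+1), beta is (p+1)-dimensional.
X_vbm = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y_voxel = rng.normal(size=n)                      # outcome measure at a single voxel
beta_hat, *_ = np.linalg.lstsq(X_vbm, y_voxel, rcond=None)

# MVPA (Eq. 2): one joint model over all voxels; X is n x v, beta is v-dimensional (v >> n).
X_mvpa = rng.normal(size=(n, v))                  # vectorized brain images (feature matrix)
y_group = rng.choice([-1, 1], size=n)             # diagnostic / group labels
print(beta_hat.shape, X_mvpa.shape)               # (5,) (154, 41011)
```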
The key idea behind estimating such models, for both VBM and MVPA, is to balance data fidelity (using a loss/fitness function) and over-fitting (using a penalty). We would like to quickly note that over-fitting of a model should not be confused with precise estimation of a model. Precise estimation can lead to over-fitting when the model chosen a priori is not a complete description of the “reality” under consideration. Hence the goal under PLM is to estimate the respective model parameters \(\beta \) by minimizing
$$\underset{\beta}{\operatorname{argmin}}~L(Y,X\beta)+ \lambda P(\beta),$$
(3)
where L is a loss function, P is a penalty which controls the complexity of the model, and \(\lambda \ge 0\) is a tuning parameter which controls the amount of penalization. This parameter can be thought of as reflecting the uncertainty in modeling the reality: smaller values reflect more confidence in the modeling. Typically, the loss function can be viewed as a negative log-likelihood of the model, hence the objective function in Eq. 3 is commonly known as a penalized likelihood function. Naturally, different combinations of loss and penalty result in different biases, optimization challenges and model behaviors. In VBM, the loss function most commonly implemented in popular neuroimaging packages such as SPM, AFNI, FSL, SurfStat, and fMRIStat is the following ordinary least squares (OLS) loss
$$L_{\mathtt{OLS}}(Y,X\beta)=||Y-X\beta||_{2}^{2}.$$
(4)
Let us take a closer look at the above loss function. If we expand and re-group the terms, it can be lower bounded (via the Cauchy-Schwarz inequality) by
$$\underbrace{||Y||_{2}^{2}-2||Y||_{2}||X\beta||_{2}}_{L(Y,X\beta)}+\underbrace{||X\beta||_{2}^{2}}_{\lambda P(\beta)},$$
(5)
From this decomposition we can notice that the OLS loss function has an implicit penalty on \(||\beta ||^{2}\). Of course, one could introduce an additional penalty to the OLS, resulting in, for example, ridge regression (Marquardt and Snee 1975). We can also use other robust loss functions, such as the \(\epsilon \)-insensitive \(\ell _{1}\)-loss function in combination with \(||\beta ||_{1}\) as the penalty (Adluru et al. 2012). The \(\epsilon \)-insensitive \(\ell _{1}\)-loss used in the popular support-vector regression (SVR) is defined as
$$|Y-X\beta|_{\epsilon}=\left\{\begin{array}{ll} 0 &\textrm{if }|Y-X\beta|\leq\epsilon,\\|Y-X\beta|-\epsilon &\text{otherwise}.\end{array}\right.$$
(6)
This loss-penalty combination is also very useful for computer-aided prognosis in the class of MVPA. In Table 1 below, we present some commonly used loss and penalty combinations in the families of VBM and MVPA.
Table 1

Different loss-penalty combinations resulting in different types of models which fall under either VBM or MVPA

|  | \(L(Y,X\beta)\) | \(\lambda P(\beta)\) | Family |
| --- | --- | --- | --- |
| \(\ell_{2}\)-SVM | \(\sum_{i=1}^{n}\operatorname{max}\left(0,1-Y_{i}X_{i}\beta\right)^{2}\) | \(\frac{1}{2}||\beta||_{2}^{2}\) | MVPA |
| \(\ell_{1}\)-SVM | \(\sum_{i=1}^{n}\operatorname{max}\left(0,1-Y_{i}X_{i}\beta\right)^{2}\) | \(||\beta||_{1}\) | MVPA |
| \(\ell_{1}\)-logistic regression | \(\sum_{i=1}^{n}\log\left(1+e^{-Y_{i}X_{i}\beta}\right)\) | \(||\beta||_{1}\) | VBM and MVPA |
| OLS | \(\geq ||Y||_{2}^{2}-2||Y||_{2}||X\beta||_{2}\) | \(\geq ||X\beta||_{2}^{2}\) | VBM |
| Ridge | \(||Y-X\beta||^{2}_{2}\) | \(||\beta_{\setminus 0}||_{2}^{2}\) | VBM and MVPA |
| SVR | \(|Y-X\beta|_{\epsilon}\) | \(||\beta||_{2}^{2}\) | VBM and MVPA |

Such a penalized likelihood modeling point of view allows us to tie the two families of analysis into a unified analytic framework. This also allows us to compare and contrast the differences and similarities in a principled way. \(Y_{i}\) and \(X_{i}\) denote the group label and the vectorized brain image of the ith sample, respectively. \(\beta_{\setminus 0}\) denotes the set of \(\beta\)s without the first coefficient \(\beta_{0}\), also known as the bias

Different loss and penalty functions require different optimization routines to estimate \(\beta \). For example, OLS has the closed form solution \(\widehat {\beta }=(X^{T}X)^{-1}X^{T}Y\). Fortunately, the other combinations in Table 1 are convex and computationally tractable, and the optimization details have been analyzed, implemented and improved over time (Park and Hastie 2007; Hastie et al. 2009; Friedman et al. 2010; Yuan et al. 2011).
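As a minimal illustration (on synthetic data, with scikit-learn as one of many possible solvers), the closed-form OLS solution and the \(\ell_1\)-logistic combination from Table 1 can be sketched as follows; note that scikit-learn parameterizes the penalty through \(C=1/\lambda\).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, v = 100, 500
X = rng.normal(size=(n, v))           # synthetic feature matrix
y = rng.choice([-1, 1], size=n)       # synthetic group labels

# Closed-form OLS on a low-dimensional design: beta = (X^T X)^{-1} X^T Y.
X_lm = X[:, :5]
beta_ols = np.linalg.solve(X_lm.T @ X_lm, X_lm.T @ y)

# l1-penalized logistic regression (Table 1); the l1 penalty induces sparsity.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)
print(np.count_nonzero(clf.coef_), "non-zero coefficients out of", v)
```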

We would like to note that in MVPA models each brain image is typically vectorized and treated as a single high-dimensional vector (as shown in Eq. 2); that is, the \(3D\) volume structure of the brain scans is ignored. Hence the most commonly used penalties (shown in Table 1 above) are vector norms such as \(||\beta ||_{1}\) and \(||\beta ||_{2}\). Such vector norms address the high-dimensionality issue but account only for first order interactions, not products between different \(\beta _{i}\)s.

However, one can devise classification models that keep the full \(3D\) volume structure of the brain images by treating them as tensors,5 so that higher-order interactions between the \(\beta _{i}\)s can be accounted for in the norms. Such interactions might very well be relevant for analyzing patterns in these data. Tensors offer a much richer variety of operations, but for simplicity of understanding one can model the data for the ith sample as
$$\begin{array}{rll}Y_{i}&=&\langle X_{i},\beta\rangle,~\textrm{where }Y_{i}\in\{-1,1\},X_{i}\in\mathbb{R}^{v_{x}\times v_{y}\times v_{z}}, \\&&\beta\in\mathbb{R}^{v_{x}\times v_{y}\times v_{z}}, \end{array}$$
(7)
and \(v_{x},v_{y},v_{z}\) are the numbers of voxels in the three directions. \(\langle \cdot ,\cdot \rangle \) denotes the tensorial dot product, which is numerically identical to the vector dot product of the two vectorized tensors (Signoretto et al. 2011). The main advantage of such a tensorial view comes from the ability to add structural norms such as the Frobenius norm (\(||\beta ||_F\equiv \sqrt {\langle \beta ,\beta \rangle }\)), rank, Schatten \(\{p,q\}\)-norms (\(||\beta ||_{p,q}\)) or nuclear p-norms (\(||\beta ||_{p,1}\)), whose definitions using advanced singular value decompositions of tensors can be found in Kolda and Bader (2009) and Signoretto et al. (2011). Penalizing with some of these norms (Frobenius, nuclear 1-norm) admits convex optimization routines, while penalizing with others (Schatten \(\{p,q\}\)-norms with \(p<1\)) is computationally hard. Furthermore, justifying these norms would require stronger a priori assumptions on the \(3D\) structures of the \(\beta _{i}\)s, such as low-rank or other spectral-gap related assumptions.
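The claim that the tensorial dot product reduces to the vectorized dot product is easy to verify numerically; the following sketch (synthetic data, numpy only) checks the inner product of Eq. 7 and the Frobenius norm definition above.

```python
import numpy as np

rng = np.random.default_rng(2)
vx, vy, vz = 4, 5, 6
X_i = rng.normal(size=(vx, vy, vz))    # one subject's 3D image (toy size)
beta = rng.normal(size=(vx, vy, vz))   # coefficient tensor

# Tensorial dot product <X_i, beta>: contract over all three axes.
dot_tensor = np.tensordot(X_i, beta, axes=3)
dot_vector = X_i.ravel() @ beta.ravel()    # dot product of the vectorized tensors
assert np.isclose(dot_tensor, dot_vector)

# Frobenius norm ||beta||_F = sqrt(<beta, beta>).
frob = np.sqrt(np.tensordot(beta, beta, axes=3))
assert np.isclose(frob, np.linalg.norm(beta.ravel()))
```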

Some such ideas have been applied in \(2D\) natural image analysis, such as face and gait detection, by treating images as matrices and image sequences as tensors rather than vectors (Wolf et al. 2007; Kotsia et al. 2012), and in tensor classification for online learning (Shi et al. 2011). Tensorial extensions have also been proposed for independent component analyses in brain fMRI (Beckmann and Smith 2005). But since the main goals of this manuscript are to unify VBM and MVPA and to present novel risk analysis approaches for the estimated models regardless of the loss-penalty combination, our experimental results use the most common combinations (such as the ones presented in Table 1) and focus on test-set bounds and cost-function based examination of the ROC curves.

Risk Assessment

In the previous section we saw how penalized likelihood modeling helps us unify the models explored in VBM and MVPA as different combinations of loss and penalty in different vector (or possibly tensor) spaces. One of the key distinctions between VBM and MVPA arises in the risk assessment of the estimated models. In each setting there are two levels of risk assessment, corresponding to performance and interpretation. These are summarized in Table 2 and described in detail in the following text.
Table 2

Summarization of the two levels of risk assessment in VBM and MVPA

| Level | VBM | MVPA |
| --- | --- | --- |
| 1st | Individual voxel level (type I error control via tail bounds) | Joint set of voxels (prediction error control via deviation bounds) |
| 2nd | Joint set of voxels (multiple comparisons) | Individual voxel level (variable selection) |

The key technology needed for each level is shown in brackets. In both VBM and MVPA, these two levels can sometimes be jointly addressed: for example, multivariate hypothesis testing (Kanungo and Haralick 1995) in VBM, and LASSO-type loss-penalty combinations that allow simultaneous variable selection and model estimation (Tibshirani 1996) in MVPA

Voxel Based Morphometry (Level 1 Risk)

In VBM, the model selection process is based on domain specific knowledge. First, a certain set of null and alternate hypotheses are formulated. Then, after identifying a set of variables of interest and potential nuisance variables, a controlled experiment is conducted to collect the data by randomizing on the nuisance variables. Power analysis (Mumford and Nichols 2008) is usually (though not as frequently as it should be) performed to calculate the amount of data that needs to be collected in order to confidently reject the formulated null-hypotheses. Hence the focus of risk assessment is more on false rejection of null-hypotheses and less on model selection, interpretation and generalization. This a priori model selection, preferably based on other non-imaging data, results in a clear notion of outcome measures (dependent variables) and experimental design measures (independent variables). As discussed in section “Introduction”, the relationship between dependent and independent variables is typically modeled as a linear model (LM).

A fixed LM is estimated using OLS (Eq. 4) with data from each voxel, and the risk of false rejection of the null-hypothesis at a voxel is based on a test statistic such as a p-value. The p-value is characterized by a tail bound on a null-distribution, which asymptotically is typically a Student's t, F or \(\chi ^{2}\) distribution. The assumptions of asymptotic behavior are expected to be satisfied with a sufficient number of samples. Generally one wants to test whether a linear combination of the \(\beta \)s is statistically significant. That is, at each voxel, the following hypothesis test is performed
$$H_{0,i}:~\mathcal{T}\beta=0\quad\textrm{vs.}\quad H_{1,i}:~\mathcal{T}\beta\neq 0,$$
(8)
where \(H_{0,i},~H_{1,i}\) are the null and alternate hypotheses, respectively, at the ith voxel and \(\mathcal {T}\) is an \(m\times p\) matrix typically called a contrast matrix. This method of using contrast matrices provides a general way of representing null-hypotheses. To assess the risk of falsely rejecting the null hypotheses, we need to compute the following null-conditioned probability as the test statistic
$$P\left(\overline{H_{0,i}}\right)\overset{\mathtt{def}}{=}P\left(\mathcal{T}\beta\neq 0|H_{0,i}\right).$$
(9)
The null-hypotheses can then be rejected with at least \(1-\alpha \) confidence (or at most \(\alpha \) risk) if \(P(\overline {H_{0,i}})\leq \alpha \). Typically \(\alpha \) is set to 0.05 in practice, for largely historical and empirical reasons. Below we show how Eq. 9 is typically computed using p-values. Let
$$t_{0}^{j}=\frac{\mathcal{T}^{j}\widehat{\beta}}{\widehat{\mathtt{se}}(\mathcal{T}^{j}\widehat{\beta})},~\textrm{for each independent row}\; {j}\; \textrm{in}\; \mathcal{T},$$
(10)
$$\chi_{0}^{2}=\|Y-X\widehat{\beta}\|^{2}_{2}.$$
(11)
\(\widehat {\beta }\) denotes the estimated \(\beta \) using the data. Now, under the asymptotic assumptions and standard normality of the residuals,
$$t_{0}^{j}\sim t(\mathtt{df}),$$
(12)
$$\chi_{0}^{2}\sim \chi^{2}(\mathtt{df}),$$
(13)
where \(\mathtt {df}=n-\mathtt {rank}(X)\) is the degrees-of-freedom parameter for the two distributions. Thus the t-distribution can be used if we rely on the precision of the estimated parameters, and the \(\chi ^{2}\) distribution can be used if we rely on the residuals of the estimated models. Assuming the above t and \(\chi ^{2}\) are the null-distributions, the probability of false rejection can then be computed as
$$P\left(\overline{H_{0,i}}\right)=P\left(\mathcal{T}\beta\neq 0|H_{0,i}\right)=\left\{\begin{array}{l}\underset{j}{\operatorname{max}}~P\left(x>\left|t_{0}^{j}\right|\right)~~\text{or}\\P\left(x<\chi_{0}^{2}\right)\end{array}\right.,$$
(14)
where
$$P\left(x>\left|t_{0}^{j}\right|\right)=\int_{t_{0}^{j}}^{\infty} f_{t}(x;\mathtt{df})dx+\int_{-\infty}^{-t_{0}^{j}} f_{t}(x;\mathtt{df})dx,$$
(15)
$$P\left(x<\chi_{0}^{2}\right)=\int_{-\infty}^{\chi_{0}^{2}}f_{\chi^{2}}(x;\mathtt{df})dx,$$
(16)
and \(f_{t}\) and \(f_{\chi ^{2}}\) are the probability density functions of the t and \(\chi ^{2}\) distributions, respectively. Testing the significance of contrasts of \(\beta \)s using residuals involves ratios of residuals of nested models, resulting in F-tests (please see Adluru et al. (2012) for examples and details). Adluru et al. (2012) also show that more effective (data-driven) definitions of the \(\mathtt {df}\) can be used to obtain better sensitivity in rejecting the null-hypotheses.

Thus the first level of risk assessment in VBM involves computing the p-values using a test statistic such as \(t_{0}\), \(\chi ^{2}_{0}\) or \(F_{0}\) and comparing them with \(\alpha \) at each voxel. The collection of the test-statistics at all voxels results in the so-called statistical parametric maps (SPMs), which form the basis for image based phenotyping in the brain.
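The per-voxel computation of Eqs. 10-15 is compact enough to sketch directly; the following is a minimal illustration on synthetic data for a single voxel (real analyses would use SurfStat or a similar package, and the contrast row here is illustrative).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 154, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design with intercept
y = rng.normal(size=n)                                      # outcome (e.g. FA) at one voxel
T = np.array([0.0, -1.0, 1.0, 0.0, 0.0])                    # one contrast row of T

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
df = n - np.linalg.matrix_rank(X)                           # df = n - rank(X)
sigma2 = np.sum((y - X @ beta_hat) ** 2) / df               # residual variance
se = np.sqrt(sigma2 * T @ np.linalg.inv(X.T @ X) @ T)       # se of T @ beta_hat
t0 = (T @ beta_hat) / se                                    # Eq. 10
p_value = 2 * stats.t.sf(abs(t0), df)                       # two-sided tail, Eq. 15
```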

Image Phenotyping Using VBM (Level 2 Risk)

The second level of risk arises because we will have tested v hypotheses (one at each voxel) in order to produce an SPM. To control the overall risk (also known as family-wise error rate (FWER) control) we need to control
$$P\left(\overline{H_{0,1}}\cup \overline{H_{0,2}}\cup \ldots\cup \overline{H_{0,v}}\right).$$
(17)
Although the risk at the individual voxel level can be controlled by computing p-values, the joint probability is typically very hard to compute exactly, since the interactions between the hypotheses are hard to know explicitly. But assuming there is \(\alpha \) risk of false rejection at each voxel, it can be upper bounded using the union bound as
$$\begin{array}{l} P\left(\overline{H_{0,1}}\cup \overline{H_{0,2}}\cup \ldots\cup \overline{H_{0,v}}\right) \leq P\left(\overline{H_{0,1}}\right)+ P\left(\overline{H_{0,2}}\right) \\ \qquad + \ldots + P\left(\overline{H_{0,v}}\right) \leq v\alpha. \end{array}$$
(18)
This upper bound, well known as the Bonferroni bound or Boole's inequality, says that the risk of false rejection is linearly inflated with the number of hypothesis tests in a given experiment. This bound can be very loose in many neuroimaging studies and tends to be overly conservative. One can also use the following (slightly tighter) inequality, based on De Morgan's law, commonly known as the Šidák bound
$$\begin{array}{l} P\left(\overline{H_{0,1}}\cup \overline{H_{0,2}}\cup\ldots\cup \overline{H_{0,v}}\right) \\ \quad =P\left(\overline{H_{0,1}\cap H_{0,2}\cap \ldots\cap H_{0,v}}\right) \\ \quad =1-P\left(H_{1,1}\cap H_{1,2}\cap\ldots\cap H_{1,v}\right) \\ \quad \leq 1-(1-\alpha)^{v}.\end{array}$$
(19)
Naturally \(\alpha \leq 1-(1-\alpha )^{v}\leq \alpha v\) for \(\alpha \in (0,1]\). Hence to control for false rejections with at most \(\alpha \) risk at the family level (17), we need to control the individual voxel level risk much more strictly. Specifically as follows,
$$\forall_{i}~P\left(\overline{H_{0,i}}\right)\leq\left\{\begin{array}{ll}\frac{\alpha}{v}, & \text{Bonferroni}\\1-(1-\alpha)^{\frac{1}{v}}, & \text{Šidák}\end{array}\right.$$
(20)
We can also observe that if \(P\left (\overline {H_{0,i}}\right )\) is not controlled at low values, the bounds become trivial and practically useless. For instance, at \(\alpha =0.05\) and \(v=50{,}000\), Eqs. 19 and 18 only yield statements such as “the family-wise risk is less than 1.0” and “less than 2500”, respectively.6 Such methods, however, do not take into account the amount of spatial smoothing performed on Y or the smoothness present in the SPMs. From Eq. 14, we can see that \(P(\overline {H_{0,i}})\) essentially involves computing probabilities like \(P(\xi _{0,i}\gtrless x)\), where \(\xi _{0,i}\) is the test statistic at the ith voxel in the SPM. By treating an SPM as a random field, Eq. 17 can be approximated using random field theory (RFT) as follows (Worsley et al. 2004; Taylor and Worsley 2008)
$$\begin{array}{rll}P\left(\underset{i}{\bigcup}\overline{H_{0,i}}\right)&=&P\left(\underset{i}{\bigcup}\Big\{\xi_{0,i}>x\Big\}\right)\\&=&P\left(\underset{i}{\operatorname{max}}~\xi_{0,i}>x\right)\\&=&P\left(\underset{S}{\operatorname{max}}~\textrm{SPM}(S)>x\right)\\&\approx& \underset{d=0}{\overset{3}{\sum}}\mathtt{Resels}_{d}(S)\times \mathtt{EC}_{d}(x),\end{array}$$
(21)
where \(\mathtt {Resels}_{d}\) are the resels (resolution elements) of the search region (S) and \(\mathtt {EC}_{d}\) is the Euler characteristic density of the excursion set of the SPM (thresholded SPM) in d dimensions. The expressions for these can be found in Worsley et al. (1996). In our experiments we use the implementations available in the SurfStat software package (Worsley et al. 2009). The RFT correction is very similar in spirit to, and in fact can be used to motivate, the cluster-based corrections (Smith and Nichols 2009). Permutation-based correction (Nichols and Holmes 2002) controls the overall risk by Monte-Carlo simulation of the null-distributions, rather than assuming a parametric form for them (such as \(t,\chi ^{2}\) or F) or thresholding the test statistics in the SPM clusterwise. Finally, instead of controlling the FWER one can also control the proportion of false rejections, known as false discovery rate (FDR) control (Benjamini and Hochberg 1995).
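The corrected per-voxel thresholds of Eq. 20, and the trivial uncorrected family-wise bounds quoted above, can be reproduced in a few lines (a plain-Python sketch; the RFT and permutation corrections require dedicated packages such as SurfStat).

```python
alpha, v = 0.05, 50000

# Corrected per-voxel thresholds (Eq. 20):
bonferroni = alpha / v                   # 1.0e-06
sidak = 1 - (1 - alpha) ** (1.0 / v)     # ~1.026e-06, slightly less strict
print(bonferroni, sidak)

# Without correction, the family-wise bounds of Eqs. 18 and 19 are trivial:
print(v * alpha)                         # 2500.0  (union/Bonferroni bound, Eq. 18)
print(1 - (1 - alpha) ** v)              # ~1.0    (Sidak bound, Eq. 19)
```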

The key observation is that the multiple comparisons issue in VBM arises mainly when trying to obtain image phenotyping information by controlling the overall risk of false rejection of null-hypotheses at the neuroanatomical level. Furthermore, predictions at the individual subject level are not a concern in VBM, hence there is no explicit assessment of the generalization of the model performance on “unseen” or “uncollected” data.

Multi-Voxel Pattern Analysis (Level 1 Risk)

As in the case of VBM there are two levels of risk assessment in MVPA, except that the treatment of the voxels is reversed, as shown in Table 2. Here the data are typically not collected with a priori hypotheses as in the case of VBM. Hence the notion of independent and dependent measures is less clear and not relevant in this family of analyses. The idea is to discover sufficiently flexible models, using already collected data, that perform well on future/unseen data. In VBM, either the precision of the estimated parameters or the residuals of the estimated models are used to assess the first level risk. Since model selection is not performed a priori as in VBM, and since \(v\gg n\), the risk assessment is necessarily different from that in VBM. The focus is more on generalization, model selection and interpretation, rather than on testing to reject null-hypotheses at an \(\alpha \) risk level.

For careful risk assessment of MVPA models, it is important to first understand some historical context of MVPA in neuroimaging based neuroscience. MVPA in neuroimaging studies is mainly an offspring of machine learning models used primarily in imaging based artificial intelligence (AI) applications such as computer vision and robotics.7 The goal in imaging based AI applications is fundamentally different from that in imaging based neuroscience applications. In the former, the goal is to endow machines with models that potentially mimic human capabilities such as object recognition, autonomous navigation and text classification. For example, it is easy (easy to learn) for humans to walk around autonomously, summarize textual images, and parse natural scenes. In neuroimaging, by contrast, the goal of mimicking humans is ill-defined, except in specific degenerative cases like Alzheimer's disease with gray matter atrophy and stroke, since it is essentially impossible (yet) for even an expert neuroscientist to look at brain imaging data and decide, for example, whether a person has autism. Although theoretical work on assessing generalization risk in machine learning has been addressed by many researchers, the status quo of many of the results is unsatisfactory in the practical setting, leading to findings such as “generalization risk is less than 0.9” (Langford and Shawe-taylor 2002).8 This is a less critical issue in human-mimicking AI applications, since in those applications one can have strong control on the risk by having a human in the loop.9

There are several approaches one can take in analyzing the risk of an estimated multivariate model, such as using probably approximately correct (PAC) learnability (Valiant 1984), defining the “capacity” or “effective dimensions/degrees of freedom” of the models using VC-theory (Vapnik and Chervonenkis 1971), prediction error/mistake bounds (Langford and Shawe-taylor 2002), or cross-validation based analysis (Kearns and Ron 1999). Although these different frameworks are related to one another, we will mainly focus on employing the prediction error bounds. This is due to the practical implications of the different approaches. The other frameworks are useful for general purpose understanding and for designing models of MVPA. For example, SVM estimation can be interpreted as minimizing empirical risk, since the norm of the weights (\(||\beta ||\)) is related to the VC-dimension of the class of SVMs. However, the bounds using such an analogy are either very loose or hard to compute in practice. Prediction error bounds, on the other hand, are easier to compute for practical application of the prediction theory (Kääriäinen and Langford 2005). Hence we propose those as an extension to the routinely reported metrics such as average accuracies and receiver operating characteristic (ROC) curves.

The most commonly used cross-validation (either leave-one-out or k-fold) can provide a reasonable sense of the model performance, but when computing the risk bounds, notions like algorithm stability and hypothesis stability come into play. We do not go into the details of such results in this article but refer the interested reader to an excellent article (Kearns and Ron 1999). We will, however, use the cross-validation data for dissecting the ROC curves to report confidence intervals of the true-positive rate (TPR) at various optimal operating regions (please see section “Experimental Results” for details).

Before we introduce the basic quantities needed to compute the prediction error bounds, we would like to highlight the difference between (1) MVPA regression models used for predicting continuous variables, known in neuroscience as traits, individual differences or symptom severities, and (2) classification models used for predicting discrete variables, called state, class or group prediction. The notion of robustness, although it can be defined using heuristics such as the \(\ell _{1}\) loss (Huber 1981) in the case of trait prediction, can be captured more crisply in state prediction using variants of the 0/1 loss (Valiant 1984). This also amounts to additional challenges in the risk assessment of the estimated models. Hence in this manuscript we primarily focus on computing state prediction error bounds. With these bounds, the risk that we are assessing is the underestimation of the errors made by the classifier models (false positiveness in prediction accuracies).

The empirical (or estimated) prediction error of an MVPA classifier model (\(\widehat {\beta }\)) on a test-set \(D_{\mathtt {Test}}\), is defined as
$$\widehat{\varepsilon}=n\cdot P\left(X\widehat{\beta}\neq Y|(X,Y)\sim D_{\mathtt{Test}}\right)=\overset{n}{\underset{i=1}{\sum}}\mathbf{1}_{X_{i}\widehat{\beta}\neq Y_{i}}.$$
(22)
Since we do not know parametric forms of the distribution of \(\widehat {\varepsilon }\), the best we can hope for is to model its distribution using the hypothetical true prediction error defined as
$$\varepsilon=P(X\widehat{\beta}\neq Y|(X,Y)\sim D),$$
(23)
where D is the “true” multivariate high-dimensional distribution from which the data is sampled.
The empirical error rate (\(\widehat {\varepsilon }\)) then follows a hypothetical (since \(\varepsilon \) is not known) binomial distribution of heads/tails of biased coin flips with a bias of \(\varepsilon \)
$$P(\widehat{\varepsilon}=k)=f_{\text{Bin}}(k;n,\varepsilon)=\left(\begin{array}{c}n\\k\end{array}\right)\varepsilon^{k}(1-\varepsilon)^{n-k},$$
(24)
which characterizes the probability of observing k empirical errors in n predictions when the true error rate is \(\varepsilon \). The tail probability, which is the probability of observing up to k errors in classification, can then be defined as
$$F_{\text{Bin}}(k;n,\varepsilon)\equiv P(\widehat{\varepsilon}\leq k)=\overset{k}{\underset{j=0}{\sum}}\left(\begin{array}{c}n\\j\end{array}\right)\varepsilon^{j}(1-\varepsilon)^{n-j}.$$
(25)
However, since we do not have access to \(\varepsilon \), we cannot compute the above probability exactly. In other words, we do not know which binomial distribution the empirical error rate (\(\widehat {\varepsilon }\)) follows.10 Hence the risk assessment is modeled as the probability of deviation of \(\widehat {\varepsilon }\) from \(\varepsilon \) at a pre-determined confidence (\(1-\alpha \)) or risk (\(\alpha \)) level.
Now, using standard properties of the binomial distribution one can obtain (Kääriäinen and Langford 2005),
$$\begin{array}{l} \forall D,~\forall\textrm{ classifier models},~\forall\alpha\in(0,1], \\ P\left(\varepsilon\geq\overline{F_{\text{Bin}}}\left(\alpha;n,\widehat{\varepsilon}\right)\right)\leq\alpha,\end{array}$$
(26)
where
$$\overline{F_{\text{Bin}}}(\alpha;n,\widehat{\varepsilon})\equiv \underset{\epsilon}{\text{max}}\{\epsilon:F_{\text{Bin}}(\widehat{\varepsilon};n,\epsilon)\geq\alpha\},$$
(27)
is the binomial tail inversion representing the largest bias (true error rate) such that we can observe up to \(\widehat {\varepsilon }\) errors in n prediction attempts with at least \(\alpha \) probability. This is known as the test-set upper bound, which states that the chance of a large deviation of the true error above the tail inversion obtained using the test-set error rate is bounded by \(\alpha \). Similarly, a lower bound can be obtained as (Kääriäinen and Langford 2005)
$$\begin{array}{l} \forall D,~\forall\textrm{ classifier models},~\forall\alpha\in(0,1], \\ P\left(\varepsilon \leq \underline{F_{\text{Bin}}}\left(\alpha;n,\widehat{\varepsilon}\right)\right)\leq\alpha,\end{array}$$
(28)
where
$$\underline{F_{\text{Bin}}}\left(\alpha;n,\widehat{\varepsilon}\right)\equiv\underset{\epsilon}{\text{min}}\{\epsilon:\left(1-F_{\text{Bin}}\left(\widehat{\varepsilon};n,\epsilon\right)\right)\geq\alpha\},$$
(29)
is the counterpart of Eq. 27, reflecting the smallest bias such that we can observe at least \(\widehat {\varepsilon }\) errors (i.e. \(\widehat {\varepsilon }\) or more errors) in n prediction attempts with at least \(\alpha \) probability. These upper and lower bounds provide binomial confidence intervals on the prediction error rates, thus allowing us to state the following about the true error rate. With \(1-\alpha \) confidence,
$$\varepsilon\in\left[\,\underline{F_{\text{Bin}}}(\alpha;n,\widehat{\varepsilon}),~\overline{F_{\text{Bin}}}(\alpha;n,\widehat{\varepsilon})\right].$$
(30)
This is a tighter interval than the following interval, which is sometimes reported in the literature (assuming \(\alpha \leq 0.05\)),
$$\varepsilon\in[\widehat{\mu}-2\widehat{\sigma},~\widehat{\mu}+2\widehat{\sigma}],$$
(31)
where
$$\begin{array}{l} \widehat{\mu}=\frac{\widehat{\varepsilon}}{n},~~\textrm{[sample mean of the empirical error]}\\ \widehat{\sigma}^{2}=\frac{\overset{n}{\underset{i=1}{\sum}}\left(\mathbf{1}_{X_{i}\widehat{\beta}\neq Y_{i}}-\widehat{\mu}\right)^{2}}{n-1},\\ \quad\; \textrm{[sample variance of the empirical error]}.\end{array}$$
Although at higher error rates the binomial can be approximated by the normal distribution, for lower error rates Eq. 30 is tighter (in fact, the tightest possible) compared to Eq. 31.
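The tail inversions of Eqs. 27 and 29 have no closed form but are one-dimensional root-finding problems; below is a minimal sketch using scipy, with the test-set size and error count chosen purely for illustration. (These endpoints coincide with the Clopper-Pearson binomial confidence interval.)

```python
from scipy import optimize, stats

def binom_tail_upper(alpha, n, k):
    """Largest true error rate eps with P(errors <= k; n, eps) >= alpha (Eq. 27)."""
    if k >= n:
        return 1.0
    # binom.cdf(k, n, eps) decreases monotonically from 1 to 0 as eps goes 0 -> 1.
    return optimize.brentq(lambda e: stats.binom.cdf(k, n, e) - alpha, 1e-12, 1 - 1e-12)

def binom_tail_lower(alpha, n, k):
    """Smallest true error rate eps with P(errors >= k; n, eps) >= alpha (Eq. 29)."""
    if k == 0:
        return 0.0
    return optimize.brentq(lambda e: stats.binom.sf(k - 1, n, e) - alpha, 1e-12, 1 - 1e-12)

# Example: 5 mistakes on a 50-subject test set, at alpha = 0.05 per tail (Eq. 30):
n_test, k, alpha = 50, 5, 0.05
print(binom_tail_lower(alpha, n_test, k), binom_tail_upper(alpha, n_test, k))
```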
We now discuss the practical issues encountered when reporting Eq. 30. Computing this interval requires an independent test-set for evaluating the model. That is, \(\widehat {\varepsilon }\) has to be replaced by
$$\widehat{\varepsilon}_{\text{test}}=\overset{n_{\text{test}}}{\underset{i=1}{\sum}}\mathbf{1}_{X_{i}\widehat{\beta}\neq Y_{i}},$$
(32)
and n by \(n_{\text {test}}\). In our experimental results we split the data into various training/test sets as shown in Table 4. In typical neuroimaging studies, however, there is insufficient data for training multivariate models. Hence, researchers tend to use all of the data without holding out an independent test set. Furthermore, the test-set bounds are agnostic to the specific loss-penalty combination and work purely based on the mistakes made by the model. This naturally leads to a discussion of the so-called training set bounds, which allow us to use all the available data as well as take into account the specific class of MVPA models. Classes of MVPA models are characterized by their “effective power” in prediction. The more powerful a class is, the looser the training set bound will be for a fixed amount of training data. One can draw an analogy to the degrees of freedom of an LM used in VBM: the larger the \(\mathtt {rank}(X)\), the smaller the \(\mathtt {df}\) and hence the power to reject a null-hypothesis, for fixed \(\alpha \) and n. The flip-side of taking the class into account is that the bounds are derived for families of classifier models rather than for the specific learned/estimated model. Although there are techniques to “de-randomize” the bounds, from a data analysis perspective the focus in this paper is on computing the test-set intervals (numerical results for the two studies are reported in section “Experimental Results”). Please see section “Discussion and Future Directions” and the Appendix for further discussion of the training set bounds for support vector machines.

Image Phenotyping Using MVPA (Level 2 Risk)

The second level of risk assessment specifically involves model selection rather than model performance, although both can be addressed jointly in some situations, as discussed in Table 2. Summarizing the model parameters (\(\beta \)) in MVPA to produce an SPM is a challenging open problem. The main challenges that need to be addressed are (1) the fact that the \(\beta _{i}\)s are estimated jointly, so we need to take into account the inter-dependencies, and (2) that the estimation is performed using disproportionately small amounts of data, hence the precision and stability of the estimated \(\beta _{i}\)s need to be carefully defined. A typical approach is to perform some basic normalization of the \(\widehat {\beta }_{i}\)s and define a significance threshold for declaring some of the voxels as critical. We take a more principled approach of computing z-scores to qualitatively approximate a significance map. Let \(\bar {\hat {\beta }}_{i}\) be the average of the estimated parameter corresponding to voxel i across all cross-validation folds, which serve as a bootstrap for the estimation. \(z_{i}\) is then simply the normalized score defined as
$$\frac{\bar{\hat{\beta}}_{i}-u\left(\bar{\hat{\beta}}_{i}\right)}{s\left(\bar{\hat{\beta}}_{i}\right)},$$
(33)
where u and s are the sample mean and standard deviation functions. We can then work under the assumption \(z_{i}\sim \mathcal {N}(0,1)\) and threshold at an appropriate \(\alpha \) significance level. Note that although we do not explicitly face the multiple comparisons problem as in the case of VBM, we are still faced with another challenge, that of robust model selection using limited data, commonly known as the variable selection problem in the statistics and machine learning communities. Some loss-penalty combinations, such as the LASSO (Tibshirani 1996), induce sparse solutions and allow for simultaneous variable selection and parameter estimation. For more details we refer the reader to Bickel et al. (2006), Hesterberg et al. (2008), and Fraley and Hesterberg (2009).
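As a minimal sketch of the bootstrap z-score phenotyping of Eq. 33 (reading u and s as the sample mean and standard deviation across voxels of the fold-averaged coefficients, and using a randomly generated stand-in for the fold-wise estimates):

```python
import numpy as np
from scipy import stats

def zscore_map(beta_folds):
    """Eq. 33: z-score the per-voxel averages of the fold-wise coefficient estimates."""
    beta_bar = beta_folds.mean(axis=0)                        # average over folds, per voxel
    return (beta_bar - beta_bar.mean()) / beta_bar.std(ddof=1)

beta_folds = np.random.default_rng(6).normal(size=(10, 50000))  # 10 folds x v voxels (toy)
z = zscore_map(beta_folds)

# Under the working assumption z_i ~ N(0,1), threshold two-sided at alpha = 0.05:
z_thresh = stats.norm.isf(0.025)        # ~1.96
critical_voxels = np.abs(z) > z_thresh
```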

Neuroimaging Data

We used data from two different neuroimaging studies. The first study investigates white matter substrates of autism spectrum disorders (ASD). The second study investigates the neuroplasticity effects of meditation practice on white matter. The sample characteristics of both studies are presented in Table 3. Both studies used diffusion tensor imaging (DTI) data to investigate how white matter microstructure is affected. Briefly, DTI is a modality of magnetic resonance imaging that is exquisitely sensitive to, and non-invasively maps and characterizes, the microstructural properties and macroscopic organization of the white matter (Basser et al. 1994; Jones et al. 1999; Mori et al. 2002). This is achieved by sensitizing the MR signal to the diffusion of water molecules (the protons in them). Specifically, the diffusion of the protons causes an exponential attenuation of the signal proportional to their apparent diffusion coefficient (\(\mathtt {ADC}\)) and the MR acquisition parameter known as the b-value or b-factor (Le Bihan et al. 2001). That is, \(S=S_{0}e^{-b\mathtt {ADC}}\), where S is the measured MR signal and \(S_{0}\) is the signal without diffusion weighting (\(b=0\)). The b-value represents the magnitude, duration and shape of the applied magnetic field gradients and the time between the paired gradients used to flip the precessing protons, and has units of seconds per square millimeter (s/mm\(^{2}\)) (Le Bihan et al. 2001). By measuring the attenuated signal in at least six different directions one can estimate the diffusion pattern of the protons in the three orthogonal Cartesian directions using a positive semi-definite covariance matrix, also known as the diffusion tensor.
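The signal equation above can be inverted directly; a one-line sketch with hypothetical signal values (the b-value matches the two studies):

```python
import numpy as np

# Inverting S = S0 * exp(-b * ADC) for the apparent diffusion coefficient.
b = 1000.0            # b-value in s/mm^2, as in both studies
S0, S = 1.0, 0.45     # hypothetical unweighted and diffusion-weighted signals
adc = -np.log(S / S0) / b
print(adc)            # ~8.0e-04 mm^2/s, a plausible tissue ADC
```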
Table 3

Sample characteristics such as sample sizes, gender distribution, means (standard deviations) of the continuous measures from the two studies

|  | Autism: TDC (\(n=55\)) | Autism: ASD (\(n=99\)) | Meditation: MNP (\(n=26\)) | Meditation: LTM (\(n=23\)) |
| --- | --- | --- | --- | --- |
| \(\mathtt {Age}\) | 15.25 (6.53) | 13.73 (8.94) | 49.7 (10.5) | 50.73 (9.9) |
| \(\mathtt {Gender}\) | M=55, F=0 | M=99, F=0 | M=8, F=18 | M=9, F=14 |
| \(\mathtt {IQ}\) | 117.36 (15.30) | 93.57 (22.31) | N/A | N/A |
| \(\mathtt {SRS}\) | 16.12 (11.40) (\(n=42\)) | 101.65 (27.89) (\(n=94\)) | N/A | N/A |
| \(\mathtt {TLPH}\) | N/A | N/A | N/A | 9602.8 (7979.7) |

\(\mathtt {ASD}\) stands for the autism spectrum disorders group, \(\mathtt {TDC}\) for the typically developing controls group, \(\mathtt {LTM}\) for the long-term meditators group, and \(\mathtt {MNP}\) for the meditation-naïve practitioners group. \(\mathtt {TLPH}\) (total lifetime practice hours) is computed using self-report measures of the number of retreats and the number of hours at those retreats. The Social Reciprocity/Responsiveness Scale (\(\mathtt {SRS}\)) is a quantitative, dimensional measure of social functioning across the entire distribution from normal to severely impaired functioning (Constantino et al. 2000)

In the brain white matter, which consists of packed axon fibers, the diffusion of water is anisotropic, i.e., directionally dependent, because the movement of water molecules perpendicular to the axon fibers is more hindered than in the parallel direction. From the diffusion tensor one can obtain maps of the diffusion tensor trace, the three eigenvalues, anisotropy and orientation (the direction of the largest eigenvector) (Basser and Pierpaoli 1996). Various such measures from DTI have been used to characterize differences in brain microstructure for a broad spectrum of disease processes (e.g., demyelination, edema, inflammation, neoplasia), injury, disorders, brain development and aging, and response to therapy (Alexander et al. 2007). Fractional anisotropy (FA), the most commonly used measure of diffusion anisotropy, is a normalized standard deviation of the eigenvalues that ranges between 0 and 1. Although FA can be affected by many factors, different studies have empirically indicated that the higher the value, the more organized the tissue (in a primary direction) and the greater the white matter integrity.
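Since FA is described above as a normalized standard deviation of the eigenvalues, a small sketch makes the definition concrete (using the standard FA formula, with toy eigenvalues in mm²/s):

```python
import numpy as np

def fractional_anisotropy(evals):
    """FA: a normalized standard deviation of the tensor eigenvalues, in [0, 1]."""
    l = np.asarray(evals, dtype=float)
    return np.sqrt(1.5 * np.sum((l - l.mean()) ** 2) / np.sum(l ** 2))

print(fractional_anisotropy([1.7e-3, 0.3e-3, 0.3e-3]))  # elongated tensor, ~0.80
print(fractional_anisotropy([1.0e-3, 1.0e-3, 1.0e-3]))  # isotropic tensor -> 0.0
```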

Autism Study

DTI data from a total of 154 male subjects were used in this study. The data were acquired on a Siemens Trio 3.0 Tesla scanner with an 8-channel, receive-only head coil using a single-shot, spin-echo, echo planar imaging (EPI) pulse sequence and sensitivity encoding (SENSE) based parallel imaging (undersampling factor of 2). Diffusion-weighted images were acquired in twelve non-collinear diffusion encoding directions with diffusion weighting factor \(b=1000\) s/mm\(^{2}\) in addition to a single reference image (\(b=0\)). Other acquisition parameters included the following: contiguous (no-gap) fifty 2.5 mm thick axial slices with an acquisition matrix of \(128\times 128\) over a field of view (FOV) of 256 mm, 4 averages, repetition time \((\mathtt {TR})=7000\) ms, and echo time \((\mathtt {TE})=84\) ms.

Meditation Study

DTI data from 49 subjects were used in this study. The diffusion weighted images were acquired on a GE 3.0 Tesla scanner using 48 non-collinear diffusion encoding directions with a diffusion weighting factor of \(b=1000\) s/mm\(^{2}\) in addition to eight \(b=0\) images.

Image Pre-Processing

For both studies, eddy current related distortion and head motion of each data set were corrected using the FSL software package (Smith et al. 2004). The brain region was extracted using the brain extraction tool (BET), also part of FSL. Field inhomogeneity distortions were corrected using field maps acquired in the meditation study. The tensor elements were calculated using non-linear estimation in CAMINO.11 It is important to establish spatial correspondence of voxels among all the subjects before performing VBM. The state-of-the-art diffusion tensor image registration tool DTI-TK12 was used for spatial normalization of the subjects. It performs white matter alignment using a non-parametric, highly deformable, diffeomorphic (topology preserving) registration method that incrementally estimates its displacement field using a tensor-based registration formulation (Zhang et al. 2006).

Features

For MVPA models, extracting features from the DTI data can be very sophisticated. For example, Lange et al. (2010) use various diffusion tensor invariants such as FA in combination with other geometric properties in highly specific regions chosen a priori. While such efforts are extremely critical and useful for improving classification accuracies (Chu et al. 2012), we aim to keep the features simple for two key reasons: (1) we have limited data to learn a classifier, and with a complex combination of features we risk generalizability; (2) we want to be able to interpret the features used in the classifier so that we can obtain image phenotypical information from the MVPA. In this study we perform both VBM and MVPA using fractional anisotropy (FA) in the white matter, defined as the voxels with mean \(\text {FA}>0.2\), resulting in approximately 50,000 voxels in each study. Furthermore, we restrict ourselves to linear kernels, where the estimated parameters (learned weights) can be interpreted using z-scores. The data are smoothed with a Gaussian kernel of 4 mm full-width at half-maximum (FWHM) to account for mis-registrations during spatial normalization. Following the empirically chosen defaults from DTI-TK, the final set of spatially normalized images are resampled to \(96\times 112\times 72\) voxels with \(2\times 2\times 2~\textit {mm}^{3}\) size per voxel. Hence, the FWHM 4 mm smoothing roughly accounts for misalignment up to 1.5 voxels.
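As a sketch of this smoothing step (assuming scipy's gaussian_filter as the smoother, an illustrative stand-in rather than the authors' actual pipeline), the conversion FWHM = 2√(2 ln 2)·σ gives:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

fwhm_mm, voxel_mm = 4.0, 2.0
sigma_vox = fwhm_mm / (2.0 * np.sqrt(2.0 * np.log(2.0))) / voxel_mm   # ~0.85 voxels

fa = np.random.default_rng(7).random((96, 112, 72))   # placeholder FA volume, 2 mm voxels
fa_smoothed = gaussian_filter(fa, sigma=sigma_vox)
```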

Hypotheses Examined

In this section, we present the hypotheses that are examined (estimated and tested for significance) for the penalized likelihood phenotyping, i.e. VBM and MVPA. We would first like to note that the ‘addition of covariates’ approach is not a substitute for clean randomization of a population study. In the case of VBM, nuisance variables can lead to (a) artificially reducing the sample size (by forcing the use of only a ‘controlled’ subset of the full sample) or (b) adding those variables into the models, both of which result in a reduction of the statistical power of hypothesis testing, essentially due to a decrease in the degrees of freedom of the error. Furthermore, finding a controlled sub-sample from the full sample can be tricky because (1) there could be a large number of subsets to examine, (2) one has to either verify that the sub-samples follow known parametric distributions (such as normal) on the nuisance variables or perform permutation testing using each of the sub-divisions, and (3) one has to choose the subset that maximizes the statistical power. Satisfying all three can be computationally very demanding.

Since in most realistic population studies there are always a few nuisance variables, and since choosing ‘controlled’ subsets is very challenging, adding nuisance variables to the models is often used as a remedy. In the case of MVPA, however, there is a stronger reason for not using only subsets of the data. MVPA models are orders of magnitude higher in dimensionality compared to VBM models, and larger models need more data for accuracy in estimation. Hence, if by adding one or two nuisance variables to a model already containing 50,000 variables we can salvage about 100 samples, the cost-to-benefit ratio is low, at least for building an accurate predictive model.

Although fundamentally the idea of adding covariates into a model to control for the nuisance variables is similar in both VBM and MVPA, the control in MVPA is obtained at the phenotyping stage (using their relative importance compared to the imaging covariates) and not in the predictive stage.

With these notes in mind, we examine the following set of hypotheses in the VBM category at each voxel in the white matter.

Autism Study (VBM Models)

  1. 1.
    \(\mathtt {Group}\) effect on fractional anisotropy (\(\mathtt {FA}\)). Here we test whether the null-hypothesis that there is no difference in the group-means of \(\mathtt {FA}\) between the autism spectrum disorders (\(\mathtt {ASD}\)) group and the typically developing control (\(\mathtt {TDC}\)) group can be rejected. Since the groups are not matched on \(\mathtt {IQ}\) distributions, we regress out the effect of \(\mathtt {IQ}\) on \(\mathtt {FA}\), the outcome measure in this case. The corresponding linear model (LM) can be expressed using MATLAB \(\mathtt {term}\)s (Worsley et al. 2009) as
    $$\mathtt{FA}=\beta_{0}+\beta_{1}\mathtt{Group}+\beta_{2}\mathtt{IQ}.$$
    For experiments, we estimate the over-parameterized version of the above LM as
    $$\mathtt{FA}=\beta_{0}+\beta_{1}\mathtt{ASD}+\beta_{1}'\mathtt{TDC}+\beta_{2}\mathtt{IQ},$$
    where \(\mathtt {ASD}\) and \(\mathtt {TDC}\) are the indicator variables. The contrast matrix used is \(\mathcal {T}=[0 ~-1 ~1~ 0]\), i.e. the null-hypothesis is \(\beta _{1}=\beta _{1}'\).
     
  2. 2.
    Effect of Social Reciprocity/Responsiveness Scale (\(\mathtt {SRS}\)) on \(\mathtt {FA}\). \(\mathtt {SRS}\) is a quantitative, dimensional measure of social functioning across the entire distribution from normal to severely impaired functioning (Constantino et al. 2000). In this hypothesis we measure the effect of \(\mathtt {SRS}\) on \(\mathtt {FA}\) above and beyond \(\mathtt {IQ}\) and \(\mathtt {Age}\). That is, we want to measure how important \(\mathtt {SRS}\) is in predicting \(\mathtt {FA}\) in the context of using \(\mathtt {IQ}\) and \(\mathtt {Age}\) jointly. The corresponding LM is
    $$\mathtt{FA}=\beta_{0}+\beta_{1}\mathtt{SRS}+\beta_{2}\mathtt{IQ}+\beta_{3}\mathtt{Age}.$$
    The contrast matrix is \(\mathcal {T}=[0 ~1~ 0 ~0]\), i.e. the null-hypothesis is \(\beta _{1}=0\). We test this hypothesis only using the \(\mathtt {ASD}\) group (\(n=93\)) since the variance in the measured \(\mathtt {SRS}\) for the \(\mathtt {TDC}\) group is limited.
     

Meditation Study (VBM Models)

  1.
    \(\mathtt {Group}\) effect on \(\mathtt {FA}\). In this hypothesis we are interested in whether the group means of \(\mathtt {FA}\) differ between long-term meditation practitioners (\(\mathtt {LTM}\)) and wait-list controls (\(\mathtt {WL}\)). The \(\mathtt {WL}\) group is sometimes also referred to as meditation-naïve practitioners (\(\mathtt {MNP}\)). The corresponding over-parameterized LM is
    $$\mathtt{FA}=\beta_{0}+\beta_{1}\mathtt{LTM}+\beta_{1}'\mathtt{WL},$$
    where \(\mathtt {LTM}\) and \(\mathtt {WL}\) are the indicator variables and the contrast matrix is \(\mathcal {T}=[0 ~-1 ~1]\) representing the null-hypothesis \(\beta _{1}=\beta _{1}'\).
     
  2.
    Here we are interested in the neuroplasticity effect of meditation practice. Meditation is conceptualized here as a form of mental training cultivated for various ends, including emotional balance and well-being. In the present study, the meditators have been trained in standard Buddhist meditation techniques, which cultivate emotion regulation and attention control. The amount of formal meditation practice over the lifetime is used to interrogate the effect of meditation on \(\mathtt {FA}\). Although the data are cross-sectional, we can still hope to measure the effect of total lifetime practice hours (\(\mathtt {TLPH}\)) on \(\mathtt {FA}\). \(\mathtt {TLPH}\) is computed from self-report measures of the number of retreats attended and the number of hours at those retreats. The corresponding LM is
    $$\mathtt{FA}=\beta_{0}+\beta_{1}\mathtt{TLPH}+\beta_{2}\mathtt{Age}+\beta_{3}\mathtt{Gender},$$
    and the contrast matrix is \(\mathcal {T}=[0 ~1~0~0]\).
     

We estimate all of the above LMs using the ordinary least squares implementation available in SurfStat (Worsley et al. 2009) and produce t-statistic maps.
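For readers who want the mechanics in a self-contained form, the following is a minimal sketch (in Python/NumPy rather than the authors' SurfStat/MATLAB pipeline) of OLS estimation at a single voxel together with a contrast t-statistic for the over-parameterized group model. All data, effect sizes and sample sizes below are hypothetical placeholders.

```python
# Minimal sketch (not the authors' SurfStat pipeline): ordinary least squares
# at a single voxel and a t-statistic for a contrast T*beta = 0.
import numpy as np
from scipy import stats

def contrast_t(X, y, T):
    """OLS fit of y = X beta + e and t-statistic for the estimable contrast T beta."""
    beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)  # min-norm solution
    resid = y - X @ beta
    dof = len(y) - rank                       # degrees of freedom of error
    sigma2 = resid @ resid / dof              # residual variance
    XtX_pinv = np.linalg.pinv(X.T @ X)        # pseudo-inverse handles the
                                              # over-parameterized design
    se = np.sqrt(sigma2 * T @ XtX_pinv @ T)   # standard error of T beta
    t = (T @ beta) / se
    p = 2 * stats.t.sf(abs(t), dof)           # two-sided p-value
    return t, p

# Hypothetical toy data for: FA = b0 + b1*ASD + b1'*TDC + b2*IQ
rng = np.random.default_rng(0)
n = 40
asd = (np.arange(n) < n // 2).astype(float)   # indicator variables
tdc = 1.0 - asd
iq = rng.normal(100, 15, n)
fa = 0.45 - 0.02 * asd + 0.0001 * iq + rng.normal(0, 0.02, n)

X = np.column_stack([np.ones(n), asd, tdc, iq])
t, p = contrast_t(X, fa, np.array([0.0, -1.0, 1.0, 0.0]))  # H0: beta1 = beta1'
print(f"t = {t:.3f}, p = {p:.4f}")
```

In VBM this computation is simply repeated independently at every white-matter voxel, which is what produces the t-statistic maps.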

Besides the VBM-based hypotheses, we explore the following MVPA-based hypotheses for phenotyping. We perform two types of MVPA: (1) classification, most commonly performed in the spirit of computer-aided diagnosis; and (2) high-dimensional regression, in the spirit of computer-aided prognosis.

Autism Study (MVPA Models)

  1.
    Here we ask how well the \(\mathtt {FA}\) values in the white-matter voxels can jointly predict the group label of an individual. The MVPA model for this can be expressed as
    $$\mathtt{Group}=\beta_{0}+\beta_{1}\mathtt{FA}_{1}+\beta_{2}\mathtt{FA}_{2}+\ldots+\beta_{v}\mathtt{FA}_{v}+\beta_{v+1}\mathtt{IQ}.$$
    (34)

    Here the \(\mathtt {Group}\) variable is treated as a binary variable. We include \(\mathtt {IQ}\) in the model since the samples in the two groups are not perfectly matched on it and it could therefore be a predictive feature. We would like to highlight a key difference between MVPA and VBM: in MVPA there is no explicit randomization performed on the model parameters when the data are collected, which is why MVPA models are also known as “data-driven”. Hence, as discussed at the beginning of this section, to maximize the available training data we can be agnostic to the randomizations and include such non-imaging parameters in these models as well. The z-score phenotyping will then show the relative importance of the imaging parameters compared to such variables, which are typically labeled as confounding or nuisance variables in VBM.

     
  2.
    We then examine whether \(\mathtt {SRS}\) in the \(\mathtt {ASD}\) group can be predicted using white-matter \(\mathtt {FA}\), in the context of using \(\mathtt {IQ}\) and \(\mathtt {Age}\) as predictors. The corresponding MVPA model is
    $$\mathtt{SRS}=\beta_{0}+\beta_{1}\mathtt{FA}_{1}+\beta_{2}\mathtt{FA}_{2}+\ldots+\beta_{v}\mathtt{FA}_{v}+\beta_{v+1}\mathtt{IQ}+\beta_{v+2}\mathtt{Age}.$$
    (35)
     

Meditation Study (MVPA Models)

The corresponding MVPA models for the meditation study are estimated as
  1.

    \(\mathtt {Group}=\beta _{0}+\beta _{1}\mathtt {FA}_{1}+\beta _{2}\mathtt {FA}_{2}+\ldots +\beta _{v}\mathtt {FA}_{v}\),

     
  2.

    \(\mathtt {TLPH}=\beta _{0}+\beta _{1}\mathtt {FA}_{1}+\beta _{2}\mathtt {FA}_{2}+\ldots +\beta _{v}\mathtt {FA}_{v}\,+\beta _{v+1}\mathtt {Age} +\beta _{v+2}\mathtt {Gender}\).

     

We estimate the above multivariate models using support vector machine (SVM) and support vector regression (SVR) loss-penalty combinations (see Table 1). We use the implementations available via LIBSVM (Chang and Lin 2001) and GLMNet (Friedman et al. 2010; Yuan et al. 2011). We report both the model performance as well as the image phenotyping information.
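As a rough illustration of how such a model can be estimated, the sketch below fits the classification model of Eq. 34 with a linear SVM and 10-fold cross-validation. It uses scikit-learn rather than the LIBSVM/GLMNet implementations used in the paper, and all data dimensions and values are hypothetical placeholders.

```python
# Illustrative sketch only: the paper uses LIBSVM/GLMNet; here scikit-learn's
# linear SVM is used to estimate the classification model of Eq. 34.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, v = 120, 5000                     # subjects, white-matter voxels (toy sizes)
FA = rng.normal(0.4, 0.05, (n, v))   # stand-in for voxelwise FA features
IQ = rng.normal(100, 15, (n, 1))     # non-imaging covariate enters the model
group = rng.integers(0, 2, n)        # binary Group label (e.g. ASD vs. TDC)

X = np.hstack([FA, IQ])              # feature vector: [FA_1 ... FA_v, IQ]
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=5000))

# 10-fold cross-validated accuracy; the paper repeats such folds 100 times
scores = cross_val_score(clf, X, group, cv=10)
print(f"mean CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Swapping the loss-penalty combination (e.g. epsilon-insensitive loss for SVR, or an elastic-net penalty as in GLMNet) changes the estimator but not the overall structure of the model.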

Experimental Results

In this section we present and discuss the results of the models described in the previous section, both for VBM and MVPA. Figure 1 presents the distribution of the t-statistics and the thresholds (FWER\(\leq 0.05\)) based on Bonferroni correction and random field theory. We can observe that the t-statistics do not quite reach those correction thresholds. Hence we present the corresponding statistical parametric maps (SPMs) thresholded at a significance of \(\alpha =0.005\), which is called “uncorrected” significance. Figure 2 shows SPMs and sample cluster data in the corpus callosum and brain stem from the VBM hypotheses examined in the autism study. Similar results for the meditation study are shown in Fig. 3. The corresponding figure captions provide more detailed discussion of these results.
https://static-content.springer.com/image/art%3A10.1007%2Fs12021-012-9175-9/MediaObjects/12021_2012_9175_Fig1_HTML.gif
Fig. 1

Distribution of the voxelwise t-statistics for the four VBM hypotheses examined. The corresponding LMs are shown in the titles of the plots. The uncorrected, Bonferroni and random-field theory (RFT) based t thresholds, at \(\alpha =0.005\) and FWER \(\leq 0.05\) respectively, are shown as well. The Bonferroni and RFT thresholds, although plotted in different colors, are very close and hence visually coincide
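For reference, a Bonferroni t threshold of the kind plotted in Fig. 1 follows directly from the t-distribution. Below is a minimal sketch assuming hypothetical values for the voxel count and the error degrees of freedom; the RFT threshold additionally requires image smoothness estimates and is not reproduced here.

```python
# Sketch: Bonferroni-corrected t threshold for FWER <= 0.05 over all voxels.
# The voxel count and degrees of freedom below are hypothetical placeholders.
from scipy import stats

n_voxels = 50_000     # order of magnitude of white-matter voxels in the paper
dof = 150             # hypothetical error degrees of freedom of the voxelwise LM
fwer = 0.05

t_bonf = stats.t.isf(fwer / (2 * n_voxels), dof)   # two-sided correction
t_unc = stats.t.isf(0.005 / 2, dof)                # "uncorrected" alpha = 0.005
print(f"uncorrected t: {t_unc:.2f}, Bonferroni t: {t_bonf:.2f}")
```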

https://static-content.springer.com/image/art%3A10.1007%2Fs12021-012-9175-9/MediaObjects/12021_2012_9175_Fig2_HTML.gif
Fig. 2

t-statistic maps and distributions of measures for VBM hypotheses in the autism study. The two groups in this study are individuals with autism spectrum disorders (\(\mathtt {ASD}\)) and typically developing controls (\(\mathtt {TDC}\)). Left The left figure displays a voxelwise t-statistic map for the group effect in the autism study, thresholded at a significance of \(p < 0.005\) (uncorrected). The insert displays the distributions of mean \(\mathtt {FA}\) in the cluster in the corpus callosum for the two groups. The figure clearly displays that the mean \(\mathtt {FA}\) (adjusted for confounds) in the \(\mathtt {ASD}\) group is shifted to the left, which indicates a decrease in \(\mathtt {FA}\) relative to the \(\mathtt {TDC}\) group. Right The right figure displays a similar map for measuring the effect of \(\mathtt {SRS}\) on \(\mathtt {FA}\) in the \(\mathtt {ASD}\) group. We observe that the average \(\mathtt {FA}\) (adjusted for confounds) of a cluster in the cerebellum decreases as \(\mathtt {SRS}\) increases. Not surprisingly, \(\mathtt {ASD}\) individuals tend to have higher \(\mathtt {SRS}\) than \(\mathtt {TDC}\) individuals. The insert demonstrates that, even within the \(\mathtt {ASD}\) group, higher \(\mathtt {SRS}\) is associated with a decrease in \(\mathtt {FA}\)

https://static-content.springer.com/image/art%3A10.1007%2Fs12021-012-9175-9/MediaObjects/12021_2012_9175_Fig3_HTML.gif
Fig. 3

t-statistic maps and distributions of measures for VBM hypotheses in the meditation study with long-term meditators (\(\mathtt {LTM}\)) and meditation-naïve practitioners (\(\mathtt {MNP}\)) as the two groups. Left The SPM for the group effect thresholded at a significance of \(p<0.005\) (uncorrected). The insert shows the distributions of mean \(\mathtt {FA}\) in the cluster in the inferior part of the cortico-spinal tract for the two groups. The \(\mathtt {LTM}\) group mean \(\mathtt {FA}\) is shifted to the left, indicating a decrease in \(\mathtt {FA}\). Right The effect of \(\mathtt {TLPH}\) on \(\mathtt {FA}\) in the \(\mathtt {LTM}\) group. We can observe a positive correlation between the average \(\mathtt {FA}\) in the cluster (adjusted for confounds) and \(\mathtt {TLPH}\)

In the following, we show the results of MVPA for both classification and regression in both studies. First, we present the test-set bounds, or binomial confidence intervals (Kääriäinen and Langford 2005), for the prediction performance of the linear SVM classifier in Table 4. We then show the classifier and regression performance across the 100 iterations of 10-fold cross-validation in Fig. 4, as distributions of the test-set prediction accuracies for the binary group labels and of the mean squared errors for prognosis prediction.
Table 4

Test-set bounds (binomial confidence intervals) for various train/test splits for both neuroimaging studies

Train/test proportions | Autism                 | Meditation
90/10                  | 56.25 % [45.17, 86.79] | 100 % [0, 54.93]
80/20                  | 70.97 % [54.81, 83.94] | 60 % [30.35, 85]
70/30                  | 63.83 % [50.82, 75.48] | 60 % [35.96, 80.91]
60/40                  | 72.58 % [61.76, 81.71] | 55 % [34.69, 74.13]
50/50                  | 62.34 % [52.36, 71.58] | 68 % [49.64, 82.97]

We can observe that in several of these attempts, even when the accuracy reaches beyond chance level, the bounds are quite loose; one therefore has to be careful in claiming that the classifier's performance generalizes
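Intervals of the kind reported in Table 4 are binomial confidence intervals on the test-set accuracy; one standard construction is the exact Clopper-Pearson interval, sketched below. The success counts used in the example are hypothetical and chosen only for illustration.

```python
# Sketch: exact (Clopper-Pearson) binomial confidence interval for a
# test-set accuracy, in the spirit of the test-set bounds of Table 4.
from scipy import stats

def clopper_pearson(k, m, alpha=0.05):
    """95% CI for a binomial proportion with k successes out of m trials."""
    lo = stats.beta.ppf(alpha / 2, k, m - k + 1) if k > 0 else 0.0
    hi = stats.beta.ppf(1 - alpha / 2, k + 1, m - k) if k < m else 1.0
    return lo, hi

# Hypothetical counts: 22 of 31 test subjects classified correctly
k, m = 22, 31
lo, hi = clopper_pearson(k, m)
print(f"accuracy {100*k/m:.2f} %, 95 % CI [{100*lo:.2f}, {100*hi:.2f}]")
```

The key point is that with test sets of only a few dozen subjects, even accuracies well above chance come with intervals wide enough to include chance-level performance.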

https://static-content.springer.com/image/art%3A10.1007%2Fs12021-012-9175-9/MediaObjects/12021_2012_9175_Fig4_HTML.gif
Fig. 4

Distributions of the model-performance metrics for the MVPA hypotheses. We can observe that there is more variance in predicting the group labels in the meditation study compared to the autism study. This is expected because the dichotomy of \(\mathtt {LTM}\) vs. \(\mathtt {MNP}\) may not be as strong as that of psychiatric conditions such as \(\mathtt {ASD}\) vs. \(\mathtt {TDC}\). The mean squared error also has higher variance in the meditation study

For classifier models, we show the receiver operating characteristic (ROC) curves. ROC curves plot the false positive rate (FPR) vs. the true positive rate (TPR), as shown in Figs. 5 and 6. Typically only the area under the curve (AUC) is reported, and the closer it is to 1, the better the performance of the classifier. However, it is important to understand the various operating regions of the ROC curve, which determine a trade-off between FPR and TPR. To define an operating region, we need user-defined costs for the confusion matrix, as shown in Table 5.
https://static-content.springer.com/image/art%3A10.1007%2Fs12021-012-9175-9/MediaObjects/12021_2012_9175_Fig5_HTML.gif
Fig. 5

ROC curve for classification in the autism study. The figure includes confidence intervals for TPR at three different optimal points. As expected, the confidence intervals for TPR become tighter as FPR increases

https://static-content.springer.com/image/art%3A10.1007%2Fs12021-012-9175-9/MediaObjects/12021_2012_9175_Fig6_HTML.gif
Fig. 6

ROC curve for classification in the meditation study. The figure includes confidence intervals for TPR at three different optimal points. We can observe that at the optimal point with lower FPR, the TPR is much lower compared to that in the autism study, again suggesting more difficulty in classifying \(\mathtt {LTM}\) from \(\mathtt {MNP}\) than classifying \(\mathtt {ASD}\) from \(\mathtt {TDC}\)

Table 5

The user-defined costs of making mistakes (similar in spirit to Type I and Type II errors) allow one to identify an optimal operating region on the ROC curve of a classifier

            | True class: positive | True class: negative
Predicted P | \(c(P|P)\)           | \(c(P|N)\)
Predicted N | \(c(N|P)\)           | \(c(N|N)\)

These user-defined costs allow us to identify optimal regions on the ROC curves by moving a line with slope
$$m=\frac{c(P|N)-c(N|N)}{c(N|P)-c(P|P)}\frac{N}{P},$$
(36)
from top-left (FPR = 0, TPR = 1) inward until it meets the ROC curve. In addition to the costs of making mistakes, Eq. 36 also takes into account the imbalance in the training data through the ratio \(N/P\), the ratio of the total number of training examples in the “negative” class to that in the “positive” class. If these costs can be set a priori, they can also potentially be used in choosing the MVPA model itself, so that one obtains a model with an appropriate trade-off between TPR and FPR on the ROC curve. Costs chosen a posteriori can also be useful in providing a more detailed assessment of the ROC curves, using their curvature at various points rather than just a summary feature of the curve such as AUC.
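A minimal sketch of this procedure: compute m from Eq. 36 and take, as the optimal operating point, the ROC point first touched by a line of slope m swept inward from (FPR = 0, TPR = 1), i.e. the point maximizing TPR − m·FPR. The ROC points below are hypothetical; the illustrative cost values happen to reproduce Setting B of the autism study defined later in Eq. 38.

```python
# Sketch of Eq. 36: compute the iso-cost slope m and locate the optimal
# operating point on an ROC curve as the point first touched by a line of
# slope m swept in from (FPR = 0, TPR = 1), i.e. argmax of TPR - m * FPR.
import numpy as np

def optimal_operating_point(fpr, tpr, c_PN, c_NP, c_PP=0.0, c_NN=0.0,
                            N_over_P=1.0):
    m = (c_PN - c_NN) / (c_NP - c_PP) * N_over_P   # Eq. 36
    idx = np.argmax(tpr - m * fpr)                 # first touch of the line
    return m, fpr[idx], tpr[idx]

# Hypothetical ROC points; costs and N/P as in Setting B of the autism study
fpr = np.array([0.0, 0.1, 0.2, 0.4, 0.7, 1.0])
tpr = np.array([0.0, 0.45, 0.62, 0.80, 0.93, 1.0])
m, f, t = optimal_operating_point(fpr, tpr, c_PN=0.5, c_NP=0.2764,
                                  N_over_P=0.5528)
print(f"m = {m:.3f}, optimal point: FPR = {f:.2f}, TPR = {t:.2f}")
```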
Although choosing these costs precisely (either a priori or a posteriori) requires strong domain knowledge (and sometimes might even be infeasible), we can observe that the slope m reduces to the ratio of the false-positive cost to the false-negative cost if the training data are well balanced. That is,
$$m=\frac{c(P|N)}{c(N|P)},$$
(37)
if \(N/P=1\) and \(c(P|P)=0,c(N|N)=0\). Hence one can choose the ratio of the costs rather than the individual costs themselves. If the ratio is chosen to be greater than 1, the line is more likely to touch the ROC curve in the stricter (lower) FPR regime. If a particular application demands that the cost of a miss (false negative) be higher than the cost of a false alarm (false positive), then the slope would be smaller than 1 and the optimal operating region would lie towards the higher FPR regime. The confidence intervals for the corresponding TPR at such optimal points can be computed via any reasonable bootstrapping method. In this paper they are computed using the bias-corrected accelerated (\(\mathtt {BC_{a}}\)) method (Efron 1987; Diciccio and Romano 1988).
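The BCa interval is available in SciPy. The sketch below computes a 95 % BCa confidence interval for a TPR from hypothetical per-subject detection indicators; the paper's exact resampling setup may differ.

```python
# Sketch: BCa bootstrap confidence interval for the TPR at an operating
# point, using SciPy's implementation of the bias-corrected accelerated
# method. The per-subject correctness indicators below are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical outcomes for 50 positive test subjects: 1 = detected, 0 = missed
detections = (rng.random(50) < 0.8).astype(float)

res = stats.bootstrap((detections,), np.mean, confidence_level=0.95,
                      method="BCa", random_state=rng)
ci = res.confidence_interval
print(f"TPR = {detections.mean():.2f}, "
      f"95 % BCa CI [{ci.low:.2f}, {ci.high:.2f}]")
```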
In our experiments we demonstrate the optimal region calculation (Figs. 5 and 6) using three simple values of m, obtained by setting \(c(N|N)=c(P|P)=0\) and arbitrarily setting \(c(P|N)\) as
$$m=\left\{\begin{array}{ll}\tan(60^{\circ})>1, & c(P|N)=0.5\;\;\leftarrow\mathrm{Setting}~ A,\\ \tan(45^{\circ})=1, & c(P|N)=0.5\;\;\leftarrow\mathrm{Setting}~ B,\\ \tan(30^{\circ})<1, & c(P|N)=0.8\;\;\leftarrow\mathrm{Setting}~ C.\end{array}\right.$$
(38)
The \(c(N|P)\) values are then calculated retrospectively according to Eq. 36. Table 6 shows the values of \(c(N|P)\) for the two studies.
Table 6

\(c(N|P)\) computed retrospectively using Eq. 36 for the three settings presented in Eq. 38, with \(c(P|P)=c(N|N)=0\)

\(m, c(P|N)\) | Autism study (\(N/P=0.5528\)) | Meditation study (\(N/P=1.0942\))
Setting A     | 0.1596                        | 0.3159
Setting B     | 0.2764                        | 0.5471
Setting C     | 0.7660                        | 1.5162

One can notice the dependence of the false-negative cost on the imbalance in the training data: with an increased \(N/P\) ratio, the false-negative cost increases with all other factors held constant
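As a worked check, the Table 6 entries follow from inverting Eq. 36 with \(c(P|P)=c(N|N)=0\), giving \(c(N|P)=c(P|N)\,(N/P)/m\):

```python
# Worked check of Table 6: invert Eq. 36 with c(P|P) = c(N|N) = 0 to get
# c(N|P) = c(P|N) * (N/P) / m for each setting and study.
import math

settings = {"A": (math.tan(math.radians(60)), 0.5),
            "B": (math.tan(math.radians(45)), 0.5),
            "C": (math.tan(math.radians(30)), 0.8)}
studies = {"Autism": 0.5528, "Meditation": 1.0942}

for name, (m, c_PN) in settings.items():
    row = {s: c_PN * r / m for s, r in studies.items()}
    print(f"Setting {name}: " +
          ", ".join(f"{s} c(N|P) = {v:.4f}" for s, v in row.items()))
```

Running this reproduces the four-decimal values in Table 6 (e.g. 0.1596 and 0.3159 for Setting A).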

Finally, Figs. 7 (autism study) and 8 (meditation study) show the z-score SPMs that provide image phenotyping using the MVPA classifiers and regressors. The z-score maps are obtained using the average weight vector estimated from the cross-validation folds, as described in section “Risk Assessment”. We can observe that the estimated MVPA models in both studies produce biologically plausible findings. The maps for the autism study shown in Fig. 7 also display an interhemispheric asymmetry in the significance of the weights, which is consistent with findings in the autism literature. For example, Lange et al. (2010) show that asymmetry is atypical in \(\mathtt {ASD}\) and serves as a good predictor of the group.
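One simple way such maps can be constructed is sketched below: refit the linear model on each cross-validation fold, stack the weight vectors, and standardize each voxel's weight across folds, \(z_j = \bar{w}_j/\mathrm{se}(\bar{w}_j)\). This is an approximation under stated assumptions; the exact construction is given in section “Risk Assessment”, and all data here are random placeholders.

```python
# Sketch (assumed fold-wise z-score construction, not the paper's exact
# procedure): per-voxel z = mean(weight) / standard error across CV folds.
import numpy as np
from scipy import stats
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, v = 120, 2000                           # toy subjects x voxels
X = rng.normal(size=(n, v))                # random placeholder features
y = rng.integers(0, 2, n)                  # random placeholder labels

weights = []
for train_idx, _ in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    clf = LinearSVC(C=1.0, max_iter=5000).fit(X[train_idx], y[train_idx])
    weights.append(clf.coef_.ravel())      # one weight vector per fold
W = np.array(weights)                      # folds x voxels

z = W.mean(axis=0) / (W.std(axis=0, ddof=1) / np.sqrt(len(W)))
p = 2 * stats.norm.sf(np.abs(z))           # two-sided; threshold at p < 0.05
print(f"{(p < 0.05).sum()} of {v} voxels significant (uncorrected)")
```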
https://static-content.springer.com/image/art%3A10.1007%2Fs12021-012-9175-9/MediaObjects/12021_2012_9175_Fig7_HTML.gif
Fig. 7

z-score maps thresholded at a significance of \(p<0.05\) (uncorrected) for MVPA in the autism study. Left Classification results. Observe that voxels in the cingulum region have significant weight in the classifier model. Right Regression results of predicting \(\mathtt {SRS}\) using \(\mathtt {FA}\). In both cases, the z-score maps demonstrate that the selected weights are contiguous; we emphasize that this is achieved without using an explicit spatial prior. This result suggests that the estimated MVPA models produce biologically plausible phenotyping information which also shows interhemispheric asymmetry in the importance of the voxels

https://static-content.springer.com/image/art%3A10.1007%2Fs12021-012-9175-9/MediaObjects/12021_2012_9175_Fig8_HTML.gif
Fig. 8

z-score maps thresholded at a significance of \(p<0.05\) (uncorrected) for MVPA in the meditation study. Left Classification results. We note that voxels in regions similar to those identified by the SPMs in VBM (Fig. 3) are found to have significant weights in the classifier model as well. Right Regression results for predicting \(\mathtt {TLPH}\). The z-score maps show that the selected weights are again contiguous, as in Fig. 7, reflecting biological plausibility. We observe bilateral significance in both the VBM (Fig. 3) and MVPA phenotyping; this contrasts with the asymmetry observed in the autism study

Discussion and Future Directions

In this paper we presented a unifying treatment of the two major families of neuroimage analyses, namely, the standard voxelwise analysis (VBM) and the emerging multi-voxel pattern analysis (MVPA). Penalized likelihood modeling provides a natural way to view the models estimated in both families as different combinations of loss and penalty. Hence we call this unified framework penalized likelihood phenotyping. Although it is increasingly well known that VBM involves estimating a massive number of univariate models while MVPA estimates one massive multivariate model, the proposed unification naturally allows us to identify two practically relevant levels of risk assessment of the models in both families. Such an assessment provides insight into some of the fundamental commonalities and differences between VBM and MVPA, and also into the challenges of the latter. The first level is concerned with the explanatory or predictive power of the models, while the second level of risk arises from the need to interpret the model parameters (MVPA) or collections of models (VBM). In both cases this essentially involves computing tail bounds of various probability distributions, or deviation bounds, relying on Chernoff-type inequalities (Chernoff 1952).

The first level of risk of the VBM models at individual voxels can be analyzed in a relatively straightforward manner using parametric (such as t or F) or non-parametric distributions (using Monte-Carlo simulations). But to interpret these models at the neuroanatomical level, one needs to analyze the joint risk of all the models to obtain phenotypical information about important regions in the brain. This second level of risk assessment is more complicated and, in fact, this problem, in the general statistical framework of multiple comparisons or familywise error rate, still represents a fertile area of open research in the statistics community. There is no optimal method for solving these problems, and the “best” solution is typically a choice among Bonferroni/Šidák-type corrections, cluster-based corrections (sometimes invoking results from random field theory) or controlling false discovery rates, depending on the specific instance of the VBM.

In MVPA only one model is estimated; however, it is a large multivariate model (the dimension is on the order of 50,000). Therefore, although MVPA avoids the multiple comparisons problem at the second level, the first level of risk assessment is considerably more challenging. By drawing out the fundamental differences between the use of MVPA in artificial intelligence applications versus neuroimaging studies, we demonstrated the need for metrics beyond the traditional cross-validation accuracies and ROC curves for assessing the risk in the prediction power of the models. When using MVPA to extract image-based phenotyping information, for interpretation of the brain regions, one is still susceptible to a “multiple comparisons-type” problem since one has to test the significance of each parameter of the massive model. In fact, this is studied in statistics and machine learning as the variable selection problem, which is particularly challenging in the high-dimensional, low-sample-size setting (Hesterberg et al. 2008).

For the first level of analysis in MVPA, we presented prediction error bounds for the linear SVM classifier models that give us a more precise sense of the risk in the prediction power of the classifiers. Rather than reporting the usual sample mean and variance of the empirical prediction error rates, which assumes normality of their distribution, we reported precise binomial confidence intervals using a test-set bound. Although normal and binomial distributions behave similarly when the error rates are high, the approximation is not accurate for lower error rates. We also showed how one can define optimal operating regions on the ROC curves using user-defined costs of mistakes/errors, and compute confidence intervals on the true detection rates in those regions. We note that the importance of reporting confidence intervals, in addition to test statistics such as p-values, has also been highlighted elsewhere in the cognitive science field (Cohen 1994). For interpretation of the classifier and regression MVPA models, we described a principled way to produce z-statistic maps using z-scores that reflect the relative significance of the parameters in the estimated models. Such a technique also lets us include non-imaging parameters in the models, thus allowing us to maximize the use of available training data in learning (model estimation).

In this paper we mainly evaluated linear models, both for VBM and MVPA, which have an advantage in terms of interpretability. One can develop non-parametric as well as implicit non-linear models via kernel methods for both VBM and MVPA (Scholkopf and Smola 2001; Jäkel et al. 2009; Arthur et al. 2012). Reporting the first-level risk for these types of models would be similar to the case of linear models. But the interpretation risk analysis becomes harder since, besides posing additional estimation challenges, these models do not give us explicit access to their parameters. Furthermore, the primary focus of the paper has been on model evaluation via risk assessment, not on obtaining better classification accuracies using sophisticated feature selection or advanced tensor-norm-based classifiers. We would still like to note that feature selection can be very useful in neuroimaging since the parameter selection of the model would be based on domain-specific knowledge. Such efforts can not only improve prediction performance (Chu et al. 2012) but also reduce the risk of false interpretations once the model is trained.

The importance of modeling (including machine learning models) and of careful model evaluation has also been highlighted in cognitive science (Shiffrin et al. 2008; Shiffrin 2010). Computational modeling is not only expanding in neuroimaging but is also being embraced in behavioral phenotyping of psychiatric illness (Montague et al. 2012). Hence it becomes important to be cognizant of the model evaluation frameworks and the limits of the validatory conclusions one obtains when using MVPA models in neuroscience. In the spirit of understanding the relevance and importance of MVPA in neuroscience, there has also been increasing demand for empirical validation of MVPA models across studies, resulting in packages like PyMVPA (Hanke et al. 2009a, b) which can help neuroimaging researchers perform cross-center analyses. To the best of our knowledge, this paper is the first to bring theoretical prediction risk bounds into practice in neuroimaging settings. Our original contribution is that this is the first work to address the problem of risk assessment and image phenotyping in a principled way when applying MVPA to neuroimaging-based neuroscience applications, by jointly studying VBM and MVPA.

There are several potential avenues of future research along the lines presented in this paper. (1) The test-set bounds provide useful information about the prediction error rates of the MVPA classifier models; however, they are not tuned for specific models such as support vector machines (used in this paper), decision trees and various other variants. We presented a training-set based framework for obtaining bounds for specific families of models. But the main challenge lies in coming up with clever de-randomization techniques, which are needed so that confidence intervals can apply to the specific learned model rather than to the family of the learned model. Although there are some techniques, such as using unlabeled training data for de-randomization (Kääriäinen and Langford 2005), incorporating domain-specific information from neuroscience and neuroimaging to improve the existing error bounds would be a great direction to pursue. (2) Additionally, computing these error bounds in a cross-validation setting by taking into account issues like algorithmic stability is of potential interest (Kearns and Ron 1999). Currently there are no practically applicable results in that direction. (3) Finally, instead of focusing on designing complex MVPA models (i.e. different loss-penalty combinations and non-parametric forms), which can be harder to visualize and interpret, one could focus on sample complexity results (Sivan et al. 2012) for simple linear models and develop experimental design techniques for new studies to achieve computational learnability. This would result in an MVPA analog of the power analyses used for VBM in neuroimaging (Mumford and Nichols 2008). The perspectives and techniques introduced in this paper will hopefully allow neuroscientists and neuroimaging researchers to use computational modeling to address increasingly sophisticated questions about brain-behavior relationships in a powerful and rigorous way.

Information Sharing Statement

The MATLAB scripts (along with the parameters and visualization code) and the appropriate C code and software libraries for classification, ROC optimal region search and the test-set bounds used to produce the results will be made available via the first author's website at http://brainimaging.waisman.wisc.edu/~adluru/PLP. For book-keeping purposes, email requests will initially be solicited for copies of the scripts. If there is enough demand (more than a dozen requests), all the material will be hosted on NITRC http://www.nitrc.org/.

Footnotes
1

These are also known as voxel based analyses (VBA) or voxel based morphometry (VBM). Also there are popular, mathematically equivalent, variants of VBM such as region-of-interest (ROI) analysis (Nieto-Castanon et al. 2003).

 
2

As is standard in the statistics literature we use the acronym GLM for generalized linear model, not general linear model. We broadly use the term linear model (LM) to include what some authors might call the general linear model. The key point is that the LM is a special case of GLM and assumes the error distribution is normally distributed whereas the error distribution in GLM can belong to the more general exponential dispersion family.

 
3

Whether MVPA denotes multi-variate pattern analysis or multi-voxel pattern analysis, it has the same meaning (Carp et al. 2011).

 
4

We also note that in both VBM and MVPA, Y can also belong to \(\mathbb {R}^{n\times k}\) for \(k>1\) but we shall not consider that class of models in this manuscript.

 
5

The tensors in this case are 3rd order, i.e. their elements are indexed by three numbers, while a diffusion tensor (introduced in section “Neuroimaging Data”) is of 2nd order and can be treated as a regular matrix.

 
6

We will observe similar statements arising in level 1 risk assessment of the MVPA models where the voxels are treated jointly.

 
7

MVPA is quite prevalent and spans applications beyond just those based on imaging data.

 
8

Compare this with the typical p-values in VBM that allow us to control the first-level risk under 0.05, where the multiple comparisons can potentially result in similar, practically unhelpful statements.

 
9

Humans can quickly (in one pass) isolate wrong findings such as misclassification of cats as dogs.

 
10

As a comparison, imagine a setting in which one is given an estimated mean which is expected to follow a Student t-distribution, but is not given the degrees of freedom needed to know which t-distribution to use for computing the tail probability.

 
11

Camino is an open-source Diffusion-MRI processing software library.

 

Acknowledgments

We are thankful to Kristen Zygmunt and P. Thomas Fletcher at the University of Utah, for data organization and eddy correction of the diffusion tensor imaging data of the autism study. We also thank Molly DuBray Prigge and Alyson Froehlich for providing us with the subject demographic and assessment information for the autism study. We are extremely thankful to Brianna Schuyller, Amelia Cayo and David Bachubber at the University of Wisconsin-Madison, for assisting us with the sample characteristics of the meditation study.

This work was supported by the NIMH R01 MH080826 (JEL) and R01 MH084795 (JEL) (University of Utah), the NIH Mental Retardation/Developmental Disabilities Research Center (MRDDRC Waisman Center), NIMH 62015 (ALA), the Autism Society of Southwestern Wisconsin, the NCCAM P01 AT004952-04 (RJD and AL) and the Waisman Core grant P30HD003352-45 (RJD). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Mental Health, the National Institutes of Health or the Waisman Center.

Copyright information

© Springer Science+Business Media New York 2013