Introduction

A common treatment for patients suffering from a brain tumor is surgical resection of the tumor. In order to minimize the risk of resecting brain tissue involved in essential brain functions, such as speech or language comprehension, these patients often undergo presurgical functional magnetic resonance imaging (fMRI). This is a technique that shows subject-specific neural activity changes in the brain. The resulting fMRI data can assist the surgeon in performing the tumor resection while preserving the brain tissue involved in important cognitive and sensorimotor functions (Bartsch, Homola, Biller, Solymosi, & Bendszus, 2006) and can even be used to predict the outcome of postoperative cognitive functioning (Richardson et al., 2004).

To analyze fMRI data, a huge number of statistical tests are performed simultaneously. In cognitive neuroscience, this technique is used to link neurological and neuropsychological functions with their respective locations in the brain, supporting different theories of brain function. To be confident that a brain area is associated with a task, it is essential to account for the multiple testing problem. This can be done using corrections for either the familywise error rate (Friston, Frith, Liddle, & Frackowiak, 1991; Worsley et al., 1996) or the false discovery rate (Genovese, Lazar, & Nichols, 2002). These corrections impose more stringent control over false rejections of the null hypothesis of no activation, and consequently, the probability of a false negative increases (Lieberman & Cunningham, 2009; Logan & Rowe, 2004). In cognitive neuroscience, a false positive means fallacious support for a given cognitive theory. While false positives can often be exposed by failed attempts to replicate the study, much time, effort, and money are expended in the process. As a result, the scientific discipline generally deems stringent control of false positives necessary, accepting the concomitant sacrifice in sensitivity.

In a clinical setting such as presurgical fMRI, however, a loss in power means that true activation is not discovered, and this might result in the resection of vital brain tissue (Haller & Bartsch, 2009). Conversely, false positives have a less negative impact on the surgical result (Gorgolewski, Storkey, Bastin, & Pernet, 2012). Classical hypothesis testing guards the null hypothesis: voxels are considered active only when enough evidence against the null of no activation is found. This asymmetrical way of penalizing errors in statistical inference is undesirable in this context (Johnson, Liu, Bartsch, & Nichols, 2012), and instead, the focus should be on protecting the alternative hypothesis: one wants to exclude activation only when enough evidence against activation is found. We therefore present a new hypothesis-thresholding procedure that incorporates information on both false positives and false negatives and, thus, is ideally suited for presurgical fMRI.

In classical hypothesis testing, the evidence against the null hypothesis is measured with the p-value: the probability, under the null hypothesis, of data as or more extreme than that observed. Thresholding a p-value at α produces a statistical test that controls the false positive rate at α. To allow direct control of the false negative risk, we present a symmetrical measure that quantifies evidence against the alternative hypothesis (Moerkerke, Goetghebeur, De Riek, & Roldan-Ruiz, 2006). Correspondingly, thresholding this probability measure at β ensures control of the false negative rate at β.

By combining thresholds on the classical and alternative p-values, we use information on the probability of false positives and false negatives. We show that thresholding both error measurements results in a layered statistical map for the brain, each layer marking voxels with evidence (or lack thereof) against the null and/or alternative hypothesis. One layer consists of voxels exhibiting strong evidence against the null of no activation, while a second layer is formed by voxels for which activation cannot be confidently excluded. The third level then consists of voxels for which the presence of activation can be rejected.

fMRI data can be analyzed in different ways. The most popular method is a confirmatory mass-univariate general linear model (GLM) analysis, where the measured time series in each voxel is regressed onto the design of the experiment, resulting in an estimate of the effect, for which a T-statistic with a corresponding classical p-value can be computed at each voxel. This method has been shown to be effective and robust, but its downside is its mass-univariate character. While many attempts have been made to take the spatial character of the data into account with data smoothing and peak- and cluster-thresholding, the GLM fails to recognize patterns of activation or noise. In this light, statistical techniques for multivariate data have been successfully applied to fMRI data. Independent component analysis (ICA; Beckmann & Smith, 2004) is an exploratory method used to find hidden source signals, modeling the observed data as a linear mixture of unobserved sources. ICA therefore allows one to discover spatially and temporally structured noise. Given the popularity of the GLM and the growing interest in ICA, especially in a clinical context, we introduce the thresholding procedure for both techniques and show how the ideas translate across different statistical methods.

In the Method section, we introduce and combine quantities to measure significance when testing for activation. To this end, we start with a simple setting in which test statistics are assumed to be Gaussian distributed and take the general form of the ratio of an observed effect and its standard error. These settings directly translate to the case of univariate linear modeling that makes use of T-distributions. We further demonstrate how to use the principle for ICA. In the Results section, we present the results of the procedure applied to presurgical fMRI data.

Method

Measures of evidence against the null and alternative

At each voxel i, i = 1, …, I, we assume that a linear model is fit and produces \( {\widehat{\varDelta}}_i \), an unbiased estimate of the BOLD effect of interest Δi, and an estimate of the standard deviation of \( {\widehat{\varDelta}}_i \), its “standard error” \( \mathrm{SE}\left({\widehat{\varDelta}}_i\right) \). We henceforth suppress the voxel subscript unless needed for clarity. We assume that the degrees of freedom are sufficiently large so that \( \mathrm{SE}\left(\widehat{\varDelta}\right) \) has negligible variability, as is the case for fMRI time series. We further assume that the data, model, and contrast have been scaled appropriately so that \( \widehat{\varDelta} \) has units of percent BOLD change (or at least approximately, as when global brain intensity is scaled to 100).

The null and the alternative hypotheses

The null hypothesis H0: Δ = 0 states that the true effect magnitude, the underlying difference between conditions, is zero. Classical statistical inference involves computing a test statistic, converted to a p-value, that measures the evidence against this null hypothesis. The decision procedure to reject H0 is calibrated to maintain the type I error at α. However, failing to reject H0 does not allow one to conclude that H0 is true. The reason is that the probability calculation of the p-value is based on the assumption that the null hypothesis is true. It is a logical fallacy, “affirming the consequent” or “reasoning to a foregone conclusion,” to begin by assuming something and then, eventually, conclude that the initial assumption is true. More concretely, when we fail to reject H0, it could simply be because there are only subtle deviations from H0 that go undetected or because the precision of the observed effect is too low to reach statistical significance. Scientists frequently make this mistake, and various guidelines for reporting study results have been proposed (see, e.g., Meehl, 1978; Schmidt & Hunter, 2002), all of which stress the importance of complementing p-values with effect sizes.

Our procedure considers an “alternative hypothesis” p-value, p1, that measures the evidence against Ha: Δ = Δ1, the nonzero effect magnitude expected under activation. Often, fMRI studies are preceded by power analyses for sample size calculations, which also require the specification of Δ1. In the literature, different approaches to choosing a meaningful Δ1 have been presented (Desmond & Glover, 2002; Hayasaka, Pfeiffer, Hugenschmidt, & Laurienti, 2007; Mumford & Nichols, 2008; Zarahn & Slifstein, 2001). Alternatively, in presurgical fMRI, one can estimate Δ1 on the basis of data from previous patients.

Measures of significance

At a given voxel, we have a test statistic T with observed value

$$ t=\frac{\widehat{\varDelta}}{\mathrm{SE}\left(\widehat{\varDelta}\right)}. $$
(1)

We assume that T has a known distribution under H0 (e.g., Student’s t with given degrees of freedom or Gaussian), so that we can compute the classical p-value:

$$ {p}_0=P\left(T\ge t \mid {H}_0\right). $$
(2)

That is, p0 quantifies the evidence against the null hypothesis H0 of no task-related activation.

In a symmetrical fashion, the alternative p-value is defined as in Moerkerke et al. (2006):

$$ {p}_1=P\left(T\le t \mid {H}_a\right). $$
(3)

Correspondingly, p1 measures the evidence against Ha and corresponds to the classical p-value for testing a “null” Ha versus an “alternative” H0. In general, as the evidence in favor of Ha grows, p0 becomes smaller and p1 becomes larger.

In order to compute p1, we need the distribution of T under Ha, which requires specification of Δ1. However, we expect not a single magnitude of true activation, but a distribution of different true values (Desmond & Glover, 2002). Therefore, in a Bayesian spirit, we specify a distribution of likely values of Δ1 instead of a fixed value:

$$ {\varDelta}_1\sim \mathcal{N}\left(\mu, {\tau}^2\right), $$
(4)

where μ is the expected effect magnitude under true activation and τ is the standard deviation of the Gaussian variation in that magnitude among voxels.

Assuming that T also follows a Gaussian distribution, it has the following distribution under Ha at voxel i:

$$ {T}_i\sim \mathcal{N}\left(\frac{\mu }{\mathrm{SE}\left({\widehat{\varDelta}}_i\right)},\ \frac{\mathrm{SE}{\left({\widehat{\varDelta}}_i\right)}^2+{\tau}^2}{\mathrm{SE}{\left({\widehat{\varDelta}}_i\right)}^2}\right), $$
(5)

where voxel subscripts are used to emphasize that the values of μ and τ are fixed for the entire brain and based on prior knowledge or other experiments, while \( \mathrm{SE}\left({\widehat{\varDelta}}_i\right) \) is specific to each voxel. With this distribution, we can compute p1 at each voxel. An illustration of both measures of significance can be seen in Fig. 1. Since the alternative distribution depends on the voxel-specific standard error, the distance between the null and alternative distributions will be voxel specific. In particular, a large standard error results in a large overlap between H0 and Ha, while small standard errors lead to a large distance and little overlap between H0 and Ha.
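To make these computations concrete, the following is a minimal sketch in Python (NumPy/SciPy) of both measures of significance under the Gaussian assumption; the function and variable names are ours for illustration and do not correspond to any released implementation.

```python
import numpy as np
from scipy import stats

def p0_p1(delta_hat, se, mu, tau):
    """Classical (p0) and alternative (p1) p-values, per voxel.

    delta_hat : estimated BOLD effect (percent BOLD change)
    se        : standard error of delta_hat
    mu, tau   : mean and SD of the alternative effect, Delta_1 ~ N(mu, tau^2)
    """
    t = np.asarray(delta_hat) / se                 # Eq. 1
    p0 = stats.norm.sf(t)                          # Eq. 2: P(T >= t | H0), T | H0 ~ N(0, 1)
    # Eq. 5: T | Ha ~ N(mu / SE, (SE^2 + tau^2) / SE^2)
    p1 = stats.norm.cdf(t, loc=mu / se,
                        scale=np.sqrt(se**2 + tau**2) / se)  # Eq. 3: P(T <= t | Ha)
    return p0, p1

# For the setting of Fig. 1 (t = 1.5, SE = 1, mu = 2, tau = 1):
# p0_p1(1.5, 1.0, 2.0, 1.0) gives p0 ~= 0.067 and p1 ~= 0.362.
```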

Fig. 1

The distributions of an effect under H0 and Ha are displayed for an observed effect of t = 1.5, \( \mathrm{SE}\left(\widehat{\varDelta}\right)=1 \), Δ1 = 2, and τ = 1. Note that Ha has a wider distribution than H0 due to the uncertainty in Δ1

Combining measures of significance

In classical null hypothesis significance testing, a threshold α on p0 can be translated into a threshold tα for the test statistic in Eq. 1. In parallel, a threshold β on p1 can be translated into a test statistic threshold tβ. While tα is determined by α (and degrees of freedom, if not using a Gaussian), tβ further depends on β, μ, τ, and \( \mathrm{SE}\left({\widehat{\varDelta}}_i\right) \). Thus, tβ varies over the brain depending on the (estimated) standard error.
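In the same spirit as the sketch above, both thresholds can be computed as follows: tα comes from the standard Gaussian, whereas tβ is the β-quantile of the voxel-specific alternative distribution of Eq. 5 (names are again ours, not from any released implementation).

```python
import numpy as np
from scipy import stats

def stat_thresholds(se, mu, tau, alpha=0.001, beta=0.20):
    """Translate alpha (on p0) and beta (on p1) into test-statistic thresholds.

    t_alpha is constant over the brain; t_beta varies with the voxel-wise SE.
    """
    se = np.asarray(se, dtype=float)
    t_alpha = stats.norm.isf(alpha)                      # P(T >= t_alpha | H0) = alpha
    t_beta = stats.norm.ppf(beta, loc=mu / se,
                            scale=np.sqrt(se**2 + tau**2) / se)  # P(T <= t_beta | Ha) = beta
    return t_alpha, t_beta
```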

Figure 2 shows the possible results of this testing procedure, with α and β relatively small. In what is expected to be the typical scenario, with a standard error that is large relative to the true effect magnitude, tβ < tα, and three possible outcomes can be distinguished.

Fig. 2

When thresholding p0 and p1 at significance levels α and β, two possibilities arise: tβ < tα (upper panel) or tα < tβ (lower panel)

One outcome is when voxels exhibit evidence against H0 and, at the same time, are consistent with Ha (p0 < α and p1 > β; red in Fig. 2). This is the most compelling case for the presence of true activation (Δ > 0). The opposite outcome is a large p0 and a small p1 (p0 > α and p1 < β; gray in Fig. 2). Here, the data are consistent with the null, and there is evidence to reject the alternative; this is the most compelling case for true absence of activation (Δ = 0). The third outcome is when the data are compatible with both the null and the alternative and neither can be excluded (p0 > α and p1 > β; yellow in Fig. 2).

A less frequent, albeit possible, scenario arises when the standard error is small relative to the true effect magnitude, so that tα < tβ and H0 and Ha can be clearly distinguished. Voxels with no effect or strong effects will be identified as before (p0 > α and p1 < β, no activation; p0 < α and p1 > β, activation). However, for certain data, there is evidence against both H0 and Ha (p0 < α and p1 < β; orange in Fig. 2). This indicates a case where the effect is nonzero but so small as to lack practical significance.

For presurgical fMRI, this procedure provides information on which areas are confidently safe to resect (gray areas), which areas should absolutely be avoided when resecting brain tissue (red areas), and in which areas the surgeon should take care because neither hypothesis can be rejected (yellow areas). When the fourth type of voxel is found, meaning both hypotheses can be rejected (orange areas), an abundance of caution suggests that care again be taken, since rejection of H0 does suggest some association with the task, just at a possibly very small magnitude. The specific application to real data is shown in the Results section.
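A sketch of the resulting layered map, again with hypothetical names of our own: each voxel's (p0, p1) pair is mapped onto one of the four outcomes described above.

```python
import numpy as np

def layer_map(p0, p1, alpha=0.001, beta=0.20):
    """Four-layer classification of voxels from their p0 and p1 values.

    0 = gray   (H0 not rejected, Ha rejected): confident absence of activation
    1 = yellow (neither rejected):             activation cannot be ruled out
    2 = red    (H0 rejected, Ha not rejected): confident activation
    3 = orange (both rejected):                significant but very small effect
    """
    p0, p1 = np.asarray(p0), np.asarray(p1)
    layers = np.zeros(p0.shape, dtype=int)        # default: gray
    layers[(p0 >= alpha) & (p1 >= beta)] = 1      # yellow
    layers[(p0 < alpha) & (p1 >= beta)] = 2       # red
    layers[(p0 < alpha) & (p1 < beta)] = 3        # orange
    return layers
```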

Alternative thresholding of independent component analysis

Above, we described the classical and alternative p-values for a traditional setting, where a test statistic is the ratio of an observed effect and its standard error, as is the case for T-statistics when the GLM is used. Here, we demonstrate that the technique is also applicable in more general settings—in particular, with maps from independent component analysis. Exact implementation details of ICA methods differ; our development here follows the FSL software’s implementation, MELODIC (Beckmann & Smith, 2004), but should be readily applicable to other ICA software.

ICA is a technique for multivariate data-driven analysis of fMRI data. It does not require the specification of the experimental design and produces spatiotemporal patterns that explain the variability in the data. ICA transforms the four-dimensional fMRI data into K pairs of spatial and temporal modes. Each spatial mode, or independent component (IC) image, is associated with one IC time series. The variation explained by each component is the IC time series scaled by the weights at each voxel in the IC image; equivalently, it is the spatial pattern in an IC image scaled by each value of the IC time series. Stated simply, the weights represent the association between the temporal activation pattern observed in the voxel and the temporal pattern in the K different components.

Let Y represent the J × I data matrix, where J is the number of time points and I is the number of voxels. We assume that the data at each voxel have been mean-centered; that is, the column means of Y are zero. ICA decomposes the data as per

$$ \boldsymbol{Y}\approx \boldsymbol{M}\kern0.5em {\boldsymbol{S}}_1\kern0.5em \boldsymbol{C}\kern0.5em {\boldsymbol{S}}_2{\boldsymbol{S}}_0, $$
(6)

where M is a J × K matrix with one temporal mode in each column and C is a K × I matrix with one spatial mode in each row; S1 (K × K) is a diagonal scaling matrix that ensures that the temporal modes have unit variance, and S0 and S2 (both I × I) are diagonal scaling matrices that ensure that the background noise in the spatial modes has unit variance (see Appendix 1 for detailed definitions of these scaling factors).

In the presentation and interpretation of ICA results, each of the K spatial modes in C is visualized and explored. Since they have been noise-normalized, they are often treated as z-score images and thresholded to control a nominal false positive rate. The end result is an inference that quantifies the relation between the corresponding temporal mode in M and Y. We seek to apply our alternative hypothesis thresholding procedure to these maps, but first we need to define a meaningful effect size in percent BOLD change and transform it to the scale of C.

Meaningful BOLD effect sizes with ICA

Consider a particular IC of interest, k ∊ {1, …, K}, and a particular voxel i ∊ {1, …, I} of interest in the spatial mode. Specifically, consider the contribution of the kth IC to the time series at voxel i:

$$ {\boldsymbol{m}}_k\kern0.5em {s}_{1,k}\kern0.5em {c}_{ki}\kern0.5em {s}_{2,i}\kern0.5em {s}_{0,i}, $$
(7)

where \( \boldsymbol{m}_k \) is the kth column of M, \( c_{ki}={\left(\boldsymbol{C}\right)}_{ki} \), and \( s_{1,k} \), \( s_{2,i} \), and \( s_{0,i} \) are the indicated diagonal elements of the scaling matrices.
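In code, the contribution of Eq. 7 is simple bookkeeping over the matrices of Eq. 6; the sketch below assumes M, C, and the diagonal scaling matrices are already available as NumPy arrays from an ICA fit.

```python
import numpy as np

def ic_contribution(M, S1, C, S2, S0, k, i):
    """Time series contribution of IC k at voxel i (Eq. 7).

    M : J x K temporal modes (unit-variance columns)
    C : K x I spatial modes (unit background-noise variance rows)
    S1 (K x K), S2 and S0 (I x I) : diagonal scaling matrices of Eq. 6
    """
    return M[:, k] * S1[k, k] * C[k, i] * S2[i, i] * S0[i, i]
```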

As was previously mentioned, the rows of C are normalized to have noise variance of 1, so \( c_{ki} \) has z-score (and not BOLD data) units. We need to compute a meaningful percent BOLD change effect. We will first compute this for a fixed Δ1, in the units of \( c_{ki} \), and will later impose a distribution on the effect size. Equation 7 shows that the temporal variation from IC k is determined not only by \( c_{ki} \) and the scaling factors, but also by \( \boldsymbol{m}_k \). But \( \boldsymbol{m}_k \) is scaled to unit variance and will not induce a unit BOLD change in the data. We propose scaling \( \boldsymbol{m}_k \) so that it (roughly) expresses a unit BOLD effect and, as a result, preserves the units of the other terms. Specifically, we introduce \( h_k \):

$$ {\boldsymbol{m}}_k\kern0.5em {h}_k{h}_k^{-1}\kern0.5em {s}_{1,k}\kern0.5em {c}_{ki}\kern0.5em {s}_{2,i}\kern0.5em {s}_{0,i}, $$
(8)

so that \( \boldsymbol{m}_k{h}_k \) expresses a unit BOLD effect in the data. One way to set the factor \( h_k \) is so that \( \boldsymbol{m}_k{h}_k \) has a baseline-to-peak range of 1. Another way is to regress \( \boldsymbol{m}_k \) on a covariate d that expresses the anticipated (unit) experimental effect; setting \( h_k \) to the inverse of the regression coefficient will ensure that \( \boldsymbol{m}_k{h}_k \) corresponds to an approximate unit BOLD effect.

Finally, we correct for the attenuation of the hypothesized effect based on the mismatch between \( \boldsymbol{m}_k \) and d. That is, even if we choose \( h_k \) well, \( \boldsymbol{m}_k{h}_k \) may only be weakly correlated with d. As a result, we scale the expected BOLD effect of IC k at voxel i by \( {\rho}_{{\boldsymbol{m}}_k\boldsymbol{d}} \), the correlation between \( \boldsymbol{m}_k \) and d (see Appendix 2 for details).
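Both quantities can be obtained from an ordinary least squares fit of the IC time series on the covariate; a minimal sketch (our own helper, not part of MELODIC):

```python
import numpy as np

def unit_bold_scaling(m_k, d):
    """Scaling factor h_k (Eq. 8) and attenuation rho (Eq. 9) for one IC.

    m_k : unit-variance IC temporal mode, length J
    d   : covariate expressing the anticipated unit BOLD experimental effect
    """
    d_c = d - d.mean()                      # center the covariate
    slope = (d_c @ m_k) / (d_c @ d_c)       # regression coefficient of m_k on d
    h_k = 1.0 / slope                       # m_k * h_k expresses a unit BOLD effect
    rho = np.corrcoef(m_k, d)[0, 1]         # correlation used to attenuate Delta_1
    return h_k, rho
```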

Now we can relate the expected (attenuated) percent BOLD change, \( {\rho}_{{\boldsymbol{m}}_k\boldsymbol{d}}{\varDelta}_1 \), to the units of the IC temporal mode. Let \( {\varDelta}_1^{*} \) be the expected alternative mean effect in the z-score statistic \( c_{ki} \); then,

$$ {\rho}_{{\boldsymbol{m}}_k\boldsymbol{d}}\kern0.5em {\varDelta}_1\approx {h}_k^{-1}\kern0.5em {s}_{1,k}\kern0.5em {\varDelta}_1^{*}\kern0.5em {s}_{2,i}\kern0.5em {s}_{0,i}, $$
(9)

and thus we can translate BOLD units into \( c_{ki} \) units with \( {\varDelta}_1^{*}\approx {s}_{ki}^{*}{\varDelta}_1 \), where

$$ {s}_{ki}^{*}={\rho}_{{\boldsymbol{m}}_k\boldsymbol{d}}\kern0.5em {h}_k\kern0.5em {s}_{1,k}^{-1}\kern0.5em {s}_{2,i}^{-1}\kern0.5em {s}_{0,i}^{-1}. $$
(10)

Finally, this implies that our distribution of alternative effects in \( c_{ki} \) units is

$$ {\varDelta}_1^{*}\sim \mathcal{N}\left({s}_{ki}^{*}\kern0.5em \mu, \kern0.5em {s}_{ki}^{*2}\kern0.5em {\tau}^2\right) $$
(11)

(cf. Eq. 4).

Significance procedure

Since \( c_{ki} \) has unit noise variance, with an assumption of Gaussianity, the null distribution is given by

$$ {c}_{ki} \mid {H}_0\sim \mathcal{N}\left(0,1\right). $$
(12)

Under the alternative, we consider the addition of effect \( {\varDelta}_1^{*} \) to \( c_{ki} \), yielding the alternative distribution

$$ {c}_{ki} \mid {H}_a\sim \mathcal{N}\left({s}_{ki}^{*}\kern0.5em \mu, \kern0.5em 1+{s}_{ki}^{*2}\kern0.33em {\tau}^2\right). $$
(13)
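Putting Eqs. 10, 12, and 13 together, the per-voxel significance computation for an IC map looks like the following sketch (helper and argument names are ours, assumed rather than taken from any released implementation):

```python
import numpy as np
from scipy import stats

def ica_p0_p1(c_ki, rho, h_k, s1_k, s2_i, s0_i, mu, tau):
    """Classical and alternative p-values for noise-normalized IC map values.

    c_ki : z-score value(s) from the IC spatial mode
    rho, h_k, s1_k, s2_i, s0_i : quantities from Eqs. 6-10
    mu, tau : alternative BOLD effect distribution, Delta_1 ~ N(mu, tau^2)
    """
    s_star = rho * h_k / (s1_k * s2_i * s0_i)     # Eq. 10: BOLD -> c_ki units
    p0 = stats.norm.sf(c_ki)                      # Eq. 12: c_ki | H0 ~ N(0, 1)
    p1 = stats.norm.cdf(c_ki, loc=s_star * mu,    # Eq. 13
                        scale=np.sqrt(1 + s_star**2 * tau**2))
    return p0, p1
```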

Data

We consider data from a patient suffering from a left prefrontal brain tumor. The study design was a boxcar design, in which the patient was asked to alternate between recitation of tongue-twisters and quiescence. Figure 3 shows a sagittal slice of the T2 image, with the tumor visible in the inferior prefrontal cortex. For the application to mass-univariate linear modeling, the data were analyzed with FEAT in FSL 4.1 (Smith et al., 2004). The application to independent component analysis was performed using MELODIC in FSL 4.1 (Beckmann & Smith, 2004).

Fig. 3

Anatomical scan of the patient. The tumor can be clearly seen in the prefrontal cortex

Results

Univariate linear modeling

We applied these techniques to the data described in the Data section. We derived the expected effect magnitude Δ1 and the variability of that effect τ from 5 patients who underwent the same fMRI paradigm. We thresholded each individual's statistic image using FDR control at 0.05 and computed the average effect, in percent BOLD change units, over the suprathreshold voxels of each individual. The results are shown in Table 1. We therefore specify the expected effect magnitude as μ = 0.73 percent BOLD change and the variability of that effect as \( \tau =\sqrt{\widehat{\tau}^2}=0.21 \) percent BOLD change. These results are consistent with others in the literature (see, e.g., Desmond & Glover, 2002, Fig. 7A).

Table 1 Average effect sizes in 5 previously tested patients in percent BOLD change units
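To illustrate this calibration step, the sketch below computes μ and τ as the mean and standard deviation of the per-patient average effects. The five values here are invented placeholders, not the actual Table 1 entries; they are chosen only so that the summary statistics match the values reported above.

```python
import numpy as np

# Placeholder per-patient average effects (percent BOLD change); NOT the real
# Table 1 values, only chosen to reproduce the reported mu and tau.
patient_effects = np.array([0.44, 0.61, 0.76, 0.87, 0.97])

mu = patient_effects.mean()           # expected effect magnitude: 0.73
tau = patient_effects.std(ddof=1)     # between-patient variability: ~0.21
```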

Results are shown in Fig. 4 with thresholds α = 0.001 and β = 0.20. In other words, we set the p0 threshold for declaring an activation when there is none at 1-in-1,000, and we set the p1 threshold for declaring the absence of activation when, in fact, the specified activation magnitude is present at 1-in-5. The red and the (scant) orange voxels show where H0 can be confidently rejected, and, if presurgical planning were done only on the basis of classical null hypothesis testing, all other tissue would be regarded as “safe.” Considering information on the alternative, we have the red voxels where, specifically, H0 can be rejected and Ha cannot be rejected; that is, the red voxels are incompatible with the null and compatible with the alternative and, thus, constitute strong evidence for the effect. The yellow areas are areas where neither H0 nor Ha can be rejected; here, the data are compatible with both the null and the alternative, and activation cannot confidently be ruled out. Finally, for voxels with no coloration, H0 cannot be rejected, but Ha can; the data are compatible with the null and incompatible with the alternative and, thus, provide good evidence for a lack of activation, suggesting that these brain regions can be safely resected. This shows the key strength of the procedure: Among voxels traditionally classified as “nonactive” (that is, those with insufficiently small p0 values), it distinguishes between voxels where there is compelling evidence for nonactivation (not colored) and those where we cannot rule out the possibility of activation (yellow).

Fig. 4

Sagittal slice of “layered” activation inference overlaying a grayscale T2* reference image, with threshold values of α = 0.001 and β = 0.20. Red areas show areas of high confidence of activation (H0 rejected, Ha not rejected), while yellow areas show areas where activation cannot be ruled out (neither H0 nor Ha rejected); uncolored areas have high confidence of no activation (H0 not rejected, Ha rejected), while the few orange voxels indicate voxels with a significant but surprisingly small BOLD response magnitude (H0 and Ha rejected)

The orange voxels represent voxels for which the observed effect size lies between the null hypothesis of no activation and the expected effect size: both the null and the alternative hypotheses are rejected, which can occur when the residual noise in the GLM is very low.

Independent component analysis results

We applied these techniques to the data described in the Data section. We used the same effect size and uncertainty as in the Univariate linear modeling section—that is, μ = 0.73 and τ = 0.21 percent BOLD change units.

MELODIC’s automated dimensionality estimation found 52 components. We chose one IC whose time series corresponded to the design matrix, shown in Fig. 5. Regressing this temporal mode on the design gives a coefficient of \( \widehat{\beta}=1.48 \), and thus \( h={\widehat{\beta}}^{-1}=0.677 \) is the scaling factor used to have the temporal mode express a unit BOLD effect (see Eq. 8). The pointwise correlation between the design and the chosen component is \( {\rho}_{\boldsymbol{m}\boldsymbol{d}}=0.63 \), which is used to attenuate the expected effect magnitude (see Eq. 9).

Fig. 5

The time series of the selected IC, with the least squares fit of the series regressed on the design shown in gray. The estimated response height is used to normalize the component to have a unit BOLD effect, and the pointwise correlation between the design and the selected IC is used to attenuate the expected BOLD response magnitude

The layered thresholding procedure for this IC is shown in Fig. 6, for α = 0.001 and β = 0.20. There is a set of voxels with strong evidence for activation (red; H0 rejected, Ha not rejected), but also additional voxels where both hypotheses are rejected (orange). As was mentioned above, in the setting of presurgical planning, these orange regions are best regarded as regions of possible activation and, thus, excluded from resection.

Fig. 6

Results of the alternative thresholding procedure when using ICA. Sagittal slice of “layered” activation inference overlaying a grayscale T2* reference image, with threshold values of α = 0.001 and β = 0.20. Red areas show areas of high confidence of activation (H0 rejected, Ha not rejected), and orange areas show voxels with a significant but surprisingly small BOLD response magnitude (H0 and Ha rejected); uncolored areas have high confidence of no activation (H0 not rejected, Ha rejected)

This result is quite different from the GLM results and is a reflection of the dramatically lower voxel-wise variance in the IC spatial mode relative to the GLM statistic image. The explanation is that the GLM result accounts for all noise variance, while the IC spatial map reflects only the noise in the subspace corresponding to the IC temporal mode (Beckmann & Smith, 2004).

Crucially, we stress that our thresholding procedure seeks only to improve the interpretability of the ICA result and does not produce confirmatory inferences; IC selection is intrinsically post hoc and subsequent inferences are circular, and all we attempt to do here is improve the thresholding of a selected IC spatial map.

Discussion

Statistical thresholding in the context of multiple tests is generally driven by the need to limit false positives. These stringent testing procedures in fMRI research lead to an abundance of false negatives (Lieberman & Cunningham, 2009) and are, therefore, less useful in the context of presurgical fMRI, where a false negative can have dire consequences. While many attempts have been made to propose more liberal testing criteria—for example, by controlling the FDR instead of the FWER (Genovese et al., 2002)—the focus is still on protecting the type I error rate. The unilateral focus on preventing false positives leads to a bias toward large, obvious effects and against complex cognitive and affective effects (Lieberman & Cunningham, 2009). We therefore propose a measure that quantifies the evidence against the alternative hypothesis, as introduced in Moerkerke et al. (2006). We use this quantity, p1, in addition to the classical p0 value, in a procedure that yields a statistical map with multiple layers of significance. One layer consists of voxels exhibiting strong evidence of activation (red in Figs. 4 and 6), another layer shows voxels with ambiguous evidence (yellow and orange), and a final layer consists of voxels for which the presence of activation can be confidently rejected (an absence of overlaid statistic values). We thereby treat false positives and false negatives in a more symmetrical manner.

We have chosen to focus on voxel-wise inference instead of other topological features, such as peaks (Chumbley, Worsley, Flandin, & Friston, 2010) or clusters (Chumbley & Friston, 2009). These topological inference methods have reduced spatial specificity relative to voxel-wise inference and are, therefore, less suitable for presurgical fMRI, where maximal spatial precision is needed.

To use the procedure described in this article, an expected effect size and its variance need to be defined on a BOLD scale. This choice is necessarily somewhat arbitrary, but several approaches are available. Desmond and Glover (2002) reported the distribution of percent signal change for a specific experimental paradigm, finding an average BOLD effect size of 0.48 percent BOLD change. Another possibility is to base the expected effect size on previous research. Since, in presurgical fMRI, the same experiment is repeated over most patients, the effect size can be derived from patients who have already undergone the experiment and surgery. The degree to which brain activation in previous patients is representative of the particular setting for which estimates are needed depends highly on the context and should be carefully judged. It can be expected that different methods will affect the estimates of μ and τ; however, we found that the estimates we obtained by averaging over voxels are close to the effect sizes reported in the literature.

The two analytical approaches we used, the GLM and ICA, showed somewhat different results. While both analyses found similar sets of voxels that were confidently activated (H0 rejected, Ha not), the GLM analysis yielded many voxels that showed evidence against neither the null nor the alternative (yellow in Fig. 4). The explanation for this outcome is the high level of noise present in the data and, thus, confusion about the veracity of either H0 or Ha. In contrast, in the ICA analysis, almost no voxels have this ambiguity; instead, we find voxels that have evidence against both the null and the alternative. Since ICA is a good tool for identifying structured noise in a data-driven manner, it can be expected that the residual voxel-wise variance will be smaller. Low variances result in a large distance between the null and the alternative distributions. Whereas the difference between ICA and the GLM may seem contradictory at first, we argue that the differing results reflect real differences between the two analysis tools.

The quantity p1 is related to the voxel-based statistical power defined by Van Horn, Ellmore, Esposito, and Berman (1998). The voxel-wise power of Van Horn et al. translates to the complement of the alternative p-value, p1, in our study. However, the use of the quantity is fundamentally different. Whereas Van Horn et al. used the voxel-wise power to visualize and interpret the results of a given study, we explicitly threshold the quantity. Moreover, the two quantities are interpreted differently. When a high power is encountered in a certain voxel, with the method of Van Horn et al., it is interpreted as follows: “If the observed effect in the voxel is used as a cutoff when testing H0, we have a high probability of rejecting H0 when H0 is indeed false.” A large voxel-wise power translates to a small p1, which is, in our study, interpreted as follows: “When the alternative hypothesis is true, there is a small probability of observing this effect,” and we interpret this as evidence against the alternative hypothesis. This interpretation is much more straightforward and usable.

This procedure has been developed in light of presurgical fMRI, where false negatives can have harmful consequences for the patient. However, a lack of power is omnipresent in fMRI analyses (Lieberman & Cunningham, 2009), and therefore, this procedure is also useful across all branches of cognitive neuroscience. For example, negative results (i.e., voxels that are not significantly related to the task) are sometimes regarded as evidence against activation. However, such conclusions are not licensed by null hypothesis significance testing. The presented procedure, on the other hand, quantifies the evidence for no activation at each voxel and is, therefore, well suited to interpreting negative results.

We would like to stress that this procedure does not abandon null hypothesis significance testing. The classical significance testing framework is still included in the procedure, represented by one layer of significance. The method is merely an extension of the thresholded statistical parametric map, providing a new layer with information on type II error rate control. Mixture modeling is similar in spirit to this method, in that both the null and the alternative distributions are used; however, mixture model applications usually focus only on controlling type I errors. With a fitted mixture model, one could also apply our method and find p0 and p1 values; however, we take pains to estimate alternative effect magnitudes a priori, from separate data, to remove any circularity.

In this procedure, control of false positives remains possible, but information on the false negative rate is also taken into account. We do not assert that our method alleviates all concerns with multiplicity, and one possible direction for future work is a multiplicity correction that adjusts both null and alternative hypothesis inferences for the number of tests.