Skip to main content

Handling Multiplicity in Neuroimaging Through Bayesian Lenses with Multilevel Modeling


Here we address the current issues of inefficiency and over-penalization in the massively univariate approach followed by the correction for multiple testing, and propose a more efficient model that pools and shares information among brain regions. Using Bayesian multilevel (BML) modeling, we control two types of error that are more relevant than the conventional false positive rate (FPR): incorrect sign (type S) and incorrect magnitude (type M). BML also aims to achieve two goals: 1) improving modeling efficiency by having one integrative model and thereby dissolving the multiple testing issue, and 2) turning the focus of conventional null hypothesis significant testing (NHST) on FPR into quality control by calibrating type S errors while maintaining a reasonable level of inference efficiency. The performance and validity of this approach are demonstrated through an application at the region of interest (ROI) level, with all the regions on an equal footing: unlike the current approaches under NHST, small regions are not disadvantaged simply because of their physical size. In addition, compared to the massively univariate approach, BML may simultaneously achieve increased spatial specificity and inference efficiency, and promote results reporting in totality and transparency. The benefits of BML are illustrated in performance and quality checking using an experimental dataset. The methodology also avoids the current practice of sharp and arbitrary thresholding in the p-value funnel to which the multidimensional data are reduced. The BML approach with its auxiliary tools is available as part of the AFNI suite for general use.

This is a preview of subscription content, access via your institution.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5


  1. In real practice, the ROIs are not randomly drawn from a hypothetical pool like recruiting experimental subjects. However, from the practical perspective it is not too far-fetched to assume that the effects at those ROIs form a distribution such as Gaussian, similar to the assumption of Gaussian distribution for cross-subject effects. It is under this assumption that we treat the cross-ROI effects as random, and the assumption can be further validated through various cross-validation methods and model comparisons later in this paper.

  2. For simplicity, here we assume that both πi and ξj being independent and identically distributed (iid). In reality, the strict iid assumption can be problematic for the cross-ROI effects when they are spatially proximate or neurologically related. Nevertheless, the assumption can be relaxed later on to exchangeability for BML.

  3. Entity effects are more popularly called group effects in the Bayesian literature. However, to avoid potential confusions with the neuroimaging terminology in which the word group refers to subject categorization (e.g., males vs. females, patients vs. controls) or the analytical step of generalization from individual subjects to the group (corresponding to the word population in the Bayesian literature) level, we adopt entity to mean each measuring unit such as subject and ROI in the current context.

  4. See for the density p(ν,μ,σ2) of folded non-standardized t-distribution, where the parameters ν,μ, and σ2 are the degrees of freedom, mean, and variance.

  5. The LKJ prior (Lewandowski et al. 2009) is a distribution over symmetric positive-definite matrices with the diagonals of 1s.

  6. Needless to say, the concept of true effect only makes sense under the current model framework at hand, and may not hold once the model is revised.

  7. A single voxel is still possible, but much less likely, to survive this correction approach.

  8. A popular cluster reporting method among the neuroimaging software packages is to simply present the investigator only with the icebergs above the water, the surviving clusters, reinforcing the illusionary either-or dichotomy under NHST.

  9. The investigator would not be able to even see such borderline clusters since the typical software implementations mechanically adopt a dichotomous results presentation.


  • Amrhein, V., & Greenland, S. (2017). Remove, rather than redefine, statistical significance. Nature Human Behavior, 1, 0224.

    Google Scholar 

  • Bates, B., Maechler, M., Bolker, B., Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1–48.

    Google Scholar 

  • Benjamin, D.J., Berger, J., Johannesson, M., Nosek, B.A., Wagenmakers, E.-J., Berk, R., Johnson, É.V. (2017). Redefine statistical significance. Nature Human Behavior, 1, 0189.

    Google Scholar 

  • Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289–300.

    Google Scholar 

  • Carp, J. (2012). On the plurality of (Methodological) worlds: estimating the analytic flexibility of fMRI experiments. Frontiers in Neuroscience, 6, 149.

    PubMed  PubMed Central  Google Scholar 

  • Chen, G., Saad, Z.S., Nath, A.R., Beauchamp, M.S., Cox, R.W. (2012). FMRI Group analysis combining effect estimates and their variances. NeuroImage, 60, 747–765.

    PubMed  Google Scholar 

  • Chen, G., Saad, Z.S., Britton, J.C., Pine, D.S., Cox, R.W. (2013). Linear mixed-effects modeling approach to FMRI group analysis. NeuroImage, 73, 176–190.

    PubMed  PubMed Central  Google Scholar 

  • Chen, G., Adleman, N.E., Saad, Z.S., Leibenluft, E., Cox, R.W. (2014). Applications of multivariate modeling to neuroimaging group analysis: a comprehensive alternative to univariate general linear model. NeuroImage, 99, 571–588.

    PubMed  PubMed Central  Google Scholar 

  • Chen, G., Taylor, P.A., Shin, Y.W., Reynolds, R.C., Cox, R.W. (2017a). Untangling the relatedness among correlations, part II: inter-subject correlation group analysis through linear mixed-effects modeling. NeuroImage, 147, 825–840.

    PubMed  Google Scholar 

  • Chen, G., Taylor, P.A., Cox, R.W. (2017b). Is the statistic value all we should care about in neuroimaging? NeuroImage, 147, 952– 959.

    PubMed  Google Scholar 

  • Chen, G., Taylor, P.A., Haller, S.P., Kircanski, K., Stoddard, J., Pine, D.S., Leibenluft, E., Brotman, M.A., Cox, R.W. (2018a). Intraclass correlation: improved modeling approaches and applications for neuroimaging. Human Brain Mapping, 39(3), 1187–1206.

    PubMed  PubMed Central  Google Scholar 

  • Chen, G., Cox, R.W., Glen, D.R., Rajendra, J.K., Reynolds, R.C., Taylor, P.A. (2018b). A tail of two sides: Artificially doubled false positive rates in neuroimaging due to the sidedness choice with t-tests. Human Brain Mapping. In press.

  • Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003.

    Google Scholar 

  • Cox, R.W. (1996). AFNI: software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomedical Research, 29, 162–173.

    CAS  PubMed  Google Scholar 

  • Cox, R.W., Chen, G., Glen, D.R., Reynolds, R.C., Taylor, P.A. (2017). FMRI clustering in AFNI: false-positive rates redux. Brain Connection, 7(3), 152–171.

    Google Scholar 

  • Cox, R.W. (2018). Equitable Thresholding and Clustering. In preparation.

  • Cox, R.W., & Taylor, P.A. (2017). Stability of Spatial Smoothness and Cluster-Size Threshold Estimates in FMRI using AFNI. arXiv:1709.07471.

  • Cremers, H.R., Wager, T.D., Yarkoni, T. (2017). The relation between statistical power and inference in fMRI. PLoS ONE, 12(11), e0184923.

    PubMed  PubMed Central  Google Scholar 

  • Eklund, A., Nichols, T.E., Knutsson, H. (2016). Cluster failure: why fMRI inferences for spatial extent have inflated false-positive rates. PNAS, 113(28), 7900–7905.

    CAS  PubMed  Google Scholar 

  • Forman, S.D., Cohen, J.D., Fitzgerald, M., Eddy, W.F., Mintun, M.A., Noll, D.C. (1995). Improved assessment of significant activation in functional magnetic resonance imaging (fMRI): use of a cluster-size threshold. Magnetic Resonance Medicine, 33, 636– 647.

    CAS  Google Scholar 

  • Gelman, A. (2015). Statistics and the crisis of scientific replication. Significance, 12(3), 23–25.

    Google Scholar 

  • Gelman, A. (2016). The problems with p-values are not just with p-values. The American Statistician, Online Discussion.

  • Gelman, A., & Carlin, J.B. (2014). Beyond power calculations: assessing type s (sign) and type m (magnitude) errors. Perspectives on Psychological Science, 1–11.

  • Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B. (2014). Bayesian data analysis, Third edition. Boca Raton: Chapman & Hall/CRC Press.

    Google Scholar 

  • Gelman, A., & Hennig, C. (2017). Beyond subjective and objective in statistics. Journal of the Royal Statistical Society: Series A (Statistics in Society), 180(4), 1–31.

    Google Scholar 

  • Gelman, A., Hill, J., Yajima, M. (2012). Why we (usually) don’t have to worry about multiple comparisons. Journal of Research on Educational Effectiveness, 5, 189–211.

    Google Scholar 

  • Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no ”fishing expedition” or ”p-hacking” and the research hypothesis was posited ahead of time.

  • Gelman, A., & Shalizi, C.R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66, 8–38.

    PubMed  Google Scholar 

  • Gelman, A., Simpson, D., Betancourt, M. (2017). The prior can generally only be understood in the context of the likelihood. arXiv:1708.07487.

  • Gelman, A., & Tuerlinckx, F. (2000). Type S error rates for classical and Bayesian single and multiple comparison procedures. Computational Statistics15, 373–390.

  • Gonzalez-Castillo, J., Saad, Z.S., Handwerker, D.A., Inati, S.J., Brenowitz, N., Bandettini, P.A. (2012). Whole-brain, time-locked activation with simple tasks revealed using massive averaging and model-free analysis. PNAS, 109(14), 5487–5492.

    CAS  PubMed  Google Scholar 

  • Gonzalez-Castillo, J., Chen, G., Nichols, T., Cox, R.W., Bandettini, P.A. (2017). Variance decomposition for single-subject task-based fMRI activity estimates across many sessions. NeuroImage, 154, 206–218.

    PubMed  Google Scholar 

  • Lazzeroni, L.C., Lu, Y., Belitskaya-Lévy, I. (2016). Solutions for quantifying P-value uncertainty and replication power. Nature Methods, 13, 107–110.

    CAS  PubMed  Google Scholar 

  • Lewandowski, D., Kurowicka, D., Joe, H. (2009). Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis, 100, 1989–2001.

    Google Scholar 

  • Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584–585.

    CAS  PubMed  Google Scholar 

  • McElreath, R. (2016). Statistical Rethinking: a Bayesian course with examples in R and Stan. Boca Raton: Chapman & Hall/CRC Press.

    Google Scholar 

  • McShane, B.B., Gal, D., Gelman, A., Robert, C., Tackett, J.L. (2017). Abandon statistical significance. arXiv:1709.07588.

  • Mejia, A., Yue, Y.R., Bolin, D., Lindren, F., Lindquist, M.A. (2017). A Bayesian general linear modeling approach to cortical surface fMRI data analysis. arXiv:1706.00959.

  • Morey, R.D., Hoekstra, R., Rouder, J.N., Lee, M.D., Wagenmakers, E.-J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin and Review, 23(1), 103–123.

    PubMed  Google Scholar 

  • Mueller, K., Lepsien, J., Möller, H.E., Lohmann, G. (2017). Commentary: cluster failure: why fMRI inferences for spatial extent have inflated false-positive rates. Frontiers in Human Neuroscience, 11, 345.

    PubMed  PubMed Central  Google Scholar 

  • Nichols, T.E., & Holmes, A.P. (2001). Nonparametric permutation tests for functional neuroimaging: a primer with examples. Human Brain Mapping, 15(1), 1–25.

    Google Scholar 

  • Olszowy, W., Aston, J., Rua, C., Williams, G.B. (2017). Accurate autocorrelation modeling substantially improves fMRI reliability. arXiv:1711.09877.

  • Poline, J.B., & Brett, M. (2012). The general linear model and fMRI: does love last forever? NeuroImage, 62 (2), 871–880.

    PubMed  Google Scholar 

  • R Core Team. (2017). R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

  • Saad, Z.S., Reynolds, R.C., Argall, B., Japee, S., Cox, R.W. (2004). SUMA: an interface for surface-based intra- and inter-subject analysis with AFNI. In Proceedings of the 2004 IEEE International Symposium on Biomedical Imaging (pp. 1510–1513).

  • Schaefer, A., Kong, R., Gordon, E.M., Zuo, X.N., Holmes, A.J., Eickhoff, S.B., Yeo, B.T. (2017). Local-global parcellation of the human cerebral cortex from intrinsic functional connectivity MRI. Cerebral Cortex. In press.

  • Simmons, J.P., Nelson, L.D., Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.

    PubMed  Google Scholar 

  • Smith, S.M., & Nichols, T.E. (2009). Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. Neuroimage, 44(1), 83–98.

    PubMed  Google Scholar 

  • Stan Development Team. (2017). Stan modeling language users guide and reference manual, Version 2.17.0.

  • Steegen, S., Tuerlinckx, F., Gelman, A., Vanpaemel, W. (2016). Increasing transparency through a multiverse Analysis. Perspectives on Psychological Science, 11(5), 702–712.

    PubMed  Google Scholar 

  • Wasserstein, R.L., & Lazar, N.A. (2016). The ASA’s statement on p-values: context, process, and purpose. The American Statistician 70, 2, 129–133.

    Google Scholar 

  • Vehtari, A., Gelman, A., Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432.

    Google Scholar 

  • Westfall, J., Nichols, T.E., Yarkoni, T. (2017). Fixing the stimulus-as-fixed-effect fallacy in task fMRI. Wellcome Open Research, 1, 23.

    PubMed Central  Google Scholar 

  • Wickham, H. (2009). Ggplot2: elegant graphics for data analysis. New York: Springer.

    Google Scholar 

  • Worsley, K.J., Marrett, S., Neelin, P., Evans, A.C. (1992). A three-dimensional statistical analysis for CBF activation studies in human brain. Journal of Cerebral Blood Flow and Metabolism, 12, 900–918.

    CAS  PubMed  Google Scholar 

  • Xiao, Y., Geng, F., Riggins, T., Chen, G., Redcay, E. (2018). Neural correlates of developing theory of mind competence in early childhood. Under review.

  • Yeung, A.W.K. (2018). An updated survey on statistical thresholding and sample size of fMRI studies. Frontiers in Human Neuroscience, 12, 16.

    PubMed  PubMed Central  Google Scholar 

Download references


The research and writing of the paper were supported (GC, PAT, and RWC) by the NIMH and NINDS Intramural Research Programs (ZICMH002888) of the NIH/HHS, USA, and by the NIH grant R01HD079518A to TR and ER. Much of the modeling work here was inspired from Andrew Gelman’s blog. We are indebted to Paul-Christian Bürkner and the Stan development team members Ben Goodrich, Daniel Simpson, Jonah Sol Gabry, Bob Carpenter, and Michael Betancourt for their help and technical support. The simulations were performed in the R language for statistical computing and the figures were generated with the R package ggplot2 (Wickham 2009).

Author information

Authors and Affiliations


Corresponding author

Correspondence to Gang Chen.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Appendix A: Pitfalls of NHST

  1. i.

    It is a common mistake by investigators and even statistical analysts to misinterpret the conditional probability under NHST as the posterior probability of the truth of the null hypothesis (or the probability of the null event conditional on the current data at hand) even though fundamentally P(data | H0)≠P(H0 | data).

  2. ii.

    One may conflate statistical significance with practical significance, and subsequently treat the failure to reach statistical significance as the nonexistence of any meaningful effect. Even though the absence of evidence is not an evidence of absence, it is common to read discussions in scientific literature wherein the authors implicitly (or even explicitly) treat statistically non-significant effects as if they were zero.

  3. iii.

    Statistic- or p-values cannot easily be compared: the difference between a statistically significant effect and another effect that fails to pass the significance level does not necessarily itself reach statistical significance.

  4. iv.

    How should the investigator handle the demarcation, due to sharp thresholding, between one effect with p = 0.05 (or a surviving cluster cutoff of 54 voxels) and another with p = 0.051 (or a cluster size of 53 voxels)Footnote 9?

  5. v.

    The focus on statistic- or p-value seems to, in practice, lead to the preponderance of reporting only statistical, instead of effect, maps in neuroimaging, losing an effective safeguard that could have filtered out potentially spurious results (Chen et al. 2017b).

  6. vi.

    One may mistakenly gain more confidence in a statistically significant result (e.g., high statistic value) in the context of data with relatively heavy noise or with a small sample size (e.g., leading to statement such as “despite the small sample size” or “despite the limited statistical power”). In fact, using statistical significance as a screener can lead researchers to make a wrong assessment about the sign of an effect or drastically overestimate the magnitude of an effect.

  7. vii.

    While the conceptual classifications of false positives and false negatives make sense in a system of discrete nature (e.g., juror decision on H0: the suspect is innocent), what are the consequences when we adopt a mechanical dichotomous approach to assessing a quantity of continuous, instead of discrete, nature?

  8. viii.

    It is usually under-appreciated that the p-value, as a function of data, is a random variable, and thus itself has a sampling distribution. In other words, p-values from experiments with identical designs can differ substantially, and statistically significant results may not necessarily be replicated (Lazzeroni et al. 2016).

Appendix B: Type S and Type M Errors

We discuss two types of error that have not been introduced in neuroimaging: type S and type M. These two types of error cannot be directly captured by the FPR concept and may become severe when the effect is small relative to the noise, which is usually the situation in BOLD neuroimaging data. In the NHST framework, we formulate a null hypothesis H0 (e.g., the effect of an easy task E is identical to a difficult one D), and then commit a type I (or false positive) error if wrongly rejecting H0 (e.g., the effect of easy is judged to be statistically different from difficult when actually their effects are the same); in contrast, we make a type II (or false negative) error when accepting H0 when H0 is in fact false (e.g., the effect of easy is assessed to be not statistically different from difficult even though their effects do differ). These are the dichotomous errors associated with NHST, and the counterbalance between these two types of error are the underpinnings of typical experimental design as well results reporting.

However, we could think about other ways of framing errors when making a statistical assessment (e.g., the easy case elicits a stronger BOLD response at some region than the difficult case) conditional on the current data. We are exposed to a risk that our decision is contrary to the truth (e.g., the BOLD response to the easy condition is actually lower than to the difficult condition). Such a risk is gauged as a type S (for “sign”) error when we incorrectly identify the sign of the effect; its values range from 0 (no chance of error) to 1 (full chance of error). Similarly, we make a type M (for “magnitude”) error when estimating the effect as small in magnitude if it is actually large, or when claiming that the effect is large in magnitude if it is in fact small (e.g., saying that the easy condition produces a much large response than the difficult one when actually the difference is tiny); its values range across the positive real numbers: [0, 1) correspond to underestimation of effect magnitude, 1 describes correct estimation, and (1, +) mean overestimation. The two error types are illustrated in Fig. 5 for inferences made under NHST. In the neuroimaging realm, effect magnitude is certainly a property of interest, therefore the corresponding type S and type M errors would be of research interest.

Geometrically speaking, if the null hypothesis H0 can be conceptualized as the point at zero, NHST aims at the real space R excluding zero with a pivot at the point of zero (e.g., DE = 0); in contrast, type S error gauges the relative chance that a result is assessed on the wrong side of the distribution between the two half spaces of R (e.g., DE > 0 or DE < 0), and type M error gauges the relative magnitude of differences along segments of R+ (e.g., the ratio of measured to actual effect is ≫ 1 or ≪ 1). Thus, we characterize type I and type II errors as “point-wise” errors, driven by judging the equality, and describe type S and type M errors as “direction-wise,” driven by the focus of inequality or directionality.

One direct application of type M error is that publication bias can lead to type M errors, as large effect estimates are more likely to filter through the dichotomous decisions in statistical inference and reviewing process. Using the type S and type M error concepts, it might be surprising for those who encounter these two error types for the first time to realize that, when the data are highly variable or noisy, or when the sample size is small with a relatively low power (e.g., 0.06), a statistically significant result at the 0.05 level is quite likely to have an incorrect sign – with a type S error rate of 24% or even higher (Gelman and Carlin 2014). In addition, such a statistically significant result would have a type M error with its effect estimate much larger (e.g., 9 times higher) than the true value. Put it another way, if the real effect is small and sampling variance is large, then a dataset that reaches statistical significance must have an exaggerated effect estimate and the sign of the effect estimate is likely to be incorrect. Due to the ramifications of type M errors and publication filtering, an effect size from the literature could be exaggerated to some extent, seriously calling into question the usefulness of power analysis under NHST in determining sample size or power, which might explain the dramatic contrast between the common practice of power analysis as a requirement for grant applications and the reproducibility crisis across various fields. Fundamentally, power analysis inherits the same problem with NHST: a narrow emphasis on statistical significance is placed as a primary focus (Gelman and Carlin 2014).

The typical effect magnitude in BOLD FMRI at 3 Tesla is usually small, less than 1% signal change in most brain regions except for areas such as motor and primary sensory cortex. Such a weak signal can be largely submerged by the overwhelming noise and distortion embedded in the FMRI data. The low power for detection of typical FMRI data analyses in typical datasets is further compounded by the modeling challenges in accurately capturing the effect. For example, even though large number of physiological confounding effects are embedded in the data, it is still difficult to properly incorporate the physiological “noises” (cardiac and respirary effects) in the model. Moreover, habituation, saturation, or attenuation across trials or within each block are usually not considered, and such fluctuations relative to the average effect would be treated as noise or fixed- instead of random-effects (Westfall et al. 2017). There are also strong indications that a large portion of BOLD activations are usually unidentified at the individual subject level due to the lack of power (Gonzalez-Castillo et al. 2012). Because of these factors, the variance due to poor modeling overwhelms all other sources (e.g., across trials, runs, and sessions) in the total data variance (Gonzalez-Castillo et al. 2017); that is, the majority (e.g., 60-80%) of the total variance in the data is not properly accounted for in statistical models.

Appendix C: Multiplicity in Neuroimaging

In general, we can classify four types of multiplicity issues that commonly occur in neuroimaging data analysis.

  1. A)

    Multiple testing. The first and major multiplicity arises when the same design (or model) matrix is applied multiple times to different values of the response or outcome variable, such as the effect estimates at the voxels within the brain. As the conventional voxel-wise neuroimaging data analysis is performed with a massively univariate approach, there are as many models as the number of voxels, which is the source of the major multiplicity issue: multiple testing. Those models can be, for instance, Student’s t-tests, AN(C)OVA, univariate or multivariate GLM, LME or Bayesian model. Regardless of the specific model, all the voxels share the same design matrix, but have different response variable values on the left-hand side of the equation. With human brain size on the order of 106 mm3, the number of voxels may range from 20,000 to 150,000 depending on the voxel dimensions. Each extra voxel adds an extra model and leads to incrementally mounting odds of pure chance or “statistically significant outcomes,” presenting the challenge to account for the occurrence of mounting FWE, while effectively holding the overall false positive rate (FPR) at a nominal level (e.g., 0.05). In the same vein, surface-based analysis is performed with 30,000 to 50,000 nodes (Saad et al. 2004), sharing a similar multiple testing issue with its volume-based counterpart. Sometimes the investigator performs analyses at a smaller number of ROIs, perhaps of order 100, but even here adjustment is still required for the multiple testing issue (though it is often not made).

  2. B)

    Double sidedness. Another occurrence of multiplicity is the widespread adoption of two separate one-sided (or one-tailed) tests in neuroimaging. For instance, the comparison between the two conditions of “easy” and “difficult” are usually analyzed twice for the whole brain: one showing whether the easy effect is higher than difficult, and the other for the possibility of the difficult effect being higher than easy. One-sided testing for one direction would be justified if prior knowledge is available regarding the sign of the test for a particular brain region. When no prior information is available for all regions in the brain, one cannot simply finesse two separate one-sided tests in place of one two-sided test, and a double sidedness practice warrants a Bonferroni correction because the two tails are independent with respect to each other (and each one-sided test is more liberal than a two-sided test at the same significance level). However, simultaneously testing both tails in tandem for whole brain analysis without correction is widely used without clear justification, and this forms a source of multiplicity issue that needs proper accounting (Chen et al. 2018b).

  3. C)

    Multiple comparisons. It rarely occurs that only one statistical test is carried out in a specific neuroimaging study, such as a single one-sample t-test. Therefore, a third source of multiplicity is directly related to the popular term, multiple comparisons, which occur when multiple tests are conducted under one model. For example, an investigator that designs an emotion experiment with three conditions (easy, difficult, and moderate) may perform several separate tests: comparing each of the three conditions to baseline, making three pairwise comparisons, or testing a linear combination of the three conditions (such as the average of easy and difficult versus moderate). However, neuroimaging publications seldom consider corrections for such separate tests.

  4. D)

    Multiple paths. The fourth multiplicity issue to affect outcome interpretation arises from the number of potential preprocessing, data dredging and analytical pipelines (Carp 2012). For instance, all common steps have a choice of procedures: outlier handling (despiking, censoring), slice timing correction (yes/no, various interpolations), head motion correction (different interpolations), different alignment methods from EPI to anatomical data plus upsampling (1 to 4 mm), different alignment methods to different standard spaces (Talairach and MNI variants), spatial smoothing (3 to 10 mm), data scaling (voxel-wise, global or grand mean), confounding effects (slow drift modeling with polynomials, high pass filtering, head motion parameters), hemodynamic response modeling (different presumed functions and multiple basis functions), serial correlation modeling (whole brain, tissue-based, voxel-wise AR or ARMA), and population modeling (univariate or multivariate GLM, treating sex as a covariate of no interest (thus no interactions with other variables) or as a typical factor (plus potential interactions with other variables)). Each choice represents a ”branching point” that could have a quantitative change to the final effect estimate and inference. Conservatively assuming three options at each step here would yield totally 310 = 59, 049 possible paths, commonly referred to as researcher degrees of freedom (Simmons et al., 2011). The impact of the choice at each individual step for this abbreviated list might be negligible, moderate, or substantial. For example, different serial correlation models may lead to substantially different effect estimate reliability (Olszowy et al. 2017); the estimate for spatial correlation of the noise could be sensitive to the voxel size to which the original data were upsampled (Mueller et al. 2017; Cox and Taylor 2017), which may lead to different cluster thresholds and poor control to the intended FPR in correcting for multiplicity. Therefore, the cumulative effect across all these multilevel branching points could be a large divergence between any two paths for the final results. A multiverse analysis (Steegen et al. 2016) has been suggested for such situations of having a “garden of forking paths” (Gelman and Loken 2013), but this seems highly impractical for neuroimaging data. Even when one specific analytical path is chosen by the investigator, it remains possible to invoke potential or implicit multiplicity in the sense that the details of the analytical steps such as data sanitation are conditional on the data (Gelman and Loken 2013). The final interpretation of significance typically ignores the number of choices or the potential branchings that may affect the final outcome, even though it would be more preferable to have the statistical significance independent of these preprocessing steps.

Appendix D: Bayesian Modeling for One-Way Random-Effects ANOVA

Here we discuss a classical framework, a hierarchical or multilevel model for a one-way random-effects ANOVA, and use it as a building block to expand to a Bayesian framework for neuroimaging group analysis. In evaluating this model, the controllability of inference errors will be focused on type S errors instead of the traditional FPR. Suppose that there are r measured entities (e.g., ROIs), with entity j measuring the effect 𝜃j from nj independent Gaussian-distributed data points yij, each of which represents a sample (e.g., trial), i = 1, 2,...,nj. The conventional statistical approach formulates r separate models,

$$ y_{ij} = \theta_{j} + \epsilon_{ij},\ i = 1,2,..., n_{j}, $$

where 𝜖ij is the residual for the j th entity and is assumed to be Gaussian \(\mathcal {N}(0, \sigma ^{2}\)), j = 1, 2,...,r. Depending on whether the sampling variance σ2 is known or not, each effect can be assessed through its sample mean \(\bar {y}_{\cdot j} = \frac {1}{n_{j}}{\sum }_{i = 1}^{n_{j}} y_{ij}\) relative to the corresponding variance \({V_{j}^{0}}=\frac {\sigma ^{2}}{n_{j}}\), resulting in a Z- or t-test.

By combining the data from the r entities and further decomposing the effect 𝜃j into an overall effect b0 across the r entities and the deviation ξj of the j th entity from the overall effect (i.e., 𝜃j = b0 + ξj,j = 1, 2,...,r), we have a conventional one-way random-effects ANOVA,

$$ y_{ij} = b_{0} + \xi_{j} + \epsilon_{ij},\ i = 1,2,..., n_{j},\ j = 1,2,..., r, $$

where b0 is conceptualized as a fixed-effects parameter, ξj codes the random fluctuation of the j th entity from the overall mean b0, with the assumption of \(\xi _{j} \sim \mathcal {N}(0, \tau ^{2})\), and the residual 𝜖ij follows a Gaussian distribution \(\mathcal {N}(0, \sigma ^{2}\)). The classical one-way random-effects ANOVA model (19) is typically formulated to examine the null hypothesis,

$$ H_{0}: \tau= 0, $$

with an F-statistic, which is constructed as the ratio of the between mean sums of squares and the within mean sums of squares. An application of this ANOVA model (19) to neuroimaging is to compute the intraclass correlation ICC(1,1) as \(\frac {\tau ^{2}}{\tau ^{2}+\sigma ^{2}}\) when the measuring entities are exchangeable (e.g., families with identical twins; Chen et al. 2018a).

Whenever multiple values (e.g., two effect estimates from two scanning sessions) from each measuring unit (e.g., subject or family) are correlated (e.g., the levels of a within-subject or repeated-measures factor), the data can be formulated using a linear mixed-effects (LME) model, sometimes referred to as a multilevel or hierarchical model. One natural ANOVA extension is simply to treat the model conceptually as LME, without the need of reformulating the model equation (19). However, LME can only provide point estimates for the overall effect b0, cross-region variance τ2 and the data variance σ2; that is, the LME (19) cannot directly provide any information regarding the individual ξj or 𝜃j values because of over-fitting due to the fact that the number of data points is less than the number of parameters that need to be estimated.

Our interest here is neither to assess the variability τ2 nor to calculate ICC, but instead to make statistical inferences about the individual effects 𝜃j. Nevertheless, the conventional NHST (20) may shed some light on potential strategies (Gelman et al. 2014) for 𝜃j. If the deviations ξj are relatively small compared to the overall mean b0, then the corresponding F-statistic value will be small as well, leading to the decision of not rejecting the null hypothesis (20) at a reasonable, predetermined significance level (e.g., 0.05); in that case, we can estimate the equal individual effects 𝜃j using the overall weighted mean \(\bar {y}_{\cdot \cdot }\) through full pooling with all the data,

$$ \hat\theta_{1} = \hat\theta_{2}=...=\hat\theta_{r}=\bar{y}_{\cdot\cdot}=\frac{{\sum}_{j = 1}^{r}\frac{1}{{\sigma_{j}^{2}}}\bar{y}_{\cdot j}}{{\sum}_{j = 1}^{r}\frac{1}{{\sigma_{j}^{2}}}}, $$

where \(\bar {y}_{\cdot j}=\frac {1}{n_{j}}{\sum }_{i = 1}^{n_{j}} y_{ij}\) and \({\sigma _{j}^{2}} = \frac {\sigma ^{2}}{n_{j}}\) are the sampling mean and variance for the j th measuring entity, and the subscript dot (⋅) notation indicates the (weighted) mean across the corresponding index(es). On the other hand, if the deviations ξj are relatively large, so is the associated F-statistic value, leading to the decision of rejecting the null hypothesis (20); similarly, we can reasonably estimate 𝜃j with no pooling across the r entities; that is, each 𝜃j is estimated using the jth measuring entity’s data separately,

$$ \hat\theta_{j} = \bar{y}_{\cdot j}=\frac{1}{n_{j}}\sum\limits_{i = 1}^{n_{j}} y_{ij},\ j = 1,2,..., r. $$

However, in estimating 𝜃j we do not have to take a dichotomous approach of choosing, based on a preset significance level, between these two extreme choices, the overall weighted mean \(\bar {y}_{\cdot \cdot }\) in Eq. 21 through full pooling and the separate means \(\bar {y}_{\cdot j}\) in Eq. 21 with no pooling; instead, we could make the assumption that a reasonable estimate to 𝜃j lies somewhere along the continuum between \(\bar {y}_{\cdot \cdot }\) and \(\bar {y}_{\cdot j}\), with its exact location derived from the data instead of by imposing an arbitrary threshold. This thinking brings us to the Bayesian methodology.

To simplify the situation, we first assume a known sampling variance σ2 for the i th data point (e.g., trial) for the j th entity; or, in Bayesian-style formulation, we build a BML about the distribution of yij conditional on 𝜃j,

$$ y_{ij} | \theta_{j} \sim \mathcal{N}(\theta_{j}, \sigma^{2}),\ i = 1,2,..., n_{j},\ j = 1,2,..., r. $$

With a prior distribution \(\mathcal {N}(b_{0}, \tau ^{2})\) for 𝜃j and a noninformative uniform hyperprior for b0 given τ (i.e., b0|τ ∼ 1), the conditional posterior distributions for 𝜃j can be derived (Gelman et al. 2014) as,

$$\begin{array}{@{}rcl@{}} &&\theta_{j} | b_{0}, \tau, y \sim \mathcal{N}(\hat{\theta}_{j}, V_{j}),\text{ where } \hat{\theta}_{j} = \frac{\frac{1}{{\sigma_{j}^{2}}} \bar{y}_{\cdot j} +\frac{1}{\tau^{2}} b_{0}}{\frac{1}{{\sigma_{j}^{2}}}+\frac{1} {\tau^{2}}},\\ &&\quad V_{j} = \frac{1}{\frac{1}{{\sigma_{j}^{2}}}+\frac{1} {\tau^{2}}},\ {\sigma_{j}^{2}} = \frac{\sigma^{2}}{n_{j}},\ j = 1,2,..,r. \end{array} $$

The analytical solution (24) indicates that \(\frac {1}{V_{j}} = \frac {1}{{\sigma _{j}^{2}}}+\frac {1} {\tau ^{2}}\), manifesting an intuitive fact that the posterior precision is the cumulative effect of the data precision and the prior precision; that is, the posterior precision is improved by the amount \(\frac {1} {\tau ^{2}}\) relative to the data precision \(\frac {1}{{\sigma _{j}^{2}}}\). Moreover, the expression for the posterior mode of \(\hat {\theta }_{j}\) in Eq. 24 shows that the estimating choice in the continuum can be expressed as a precision-weighted average between the individual sample means \(\bar {y}_{\cdot j}\) and the overall mean b0:

$$\begin{array}{@{}rcl@{}} \hat{\theta}_{j} &=& \frac{\frac{1}{{\sigma_{j}^{2}}} \bar{y}_{\cdot j} +\frac{1}{\tau^{2}} b_{0}}{\frac{1}{{\sigma_{j}^{2}}}+\frac{1} {\tau^{2}}} = w_{j} \bar{y}_{\cdot j} + (1-w_{j})b_{0}\\ &=&b_{0}+w_{j}(\bar{y}_{\cdot j}-b_{0})\\ &=&\bar{y}_{\cdot j}-(1-w_{j})(\bar{y}_{\cdot j}-b_{0}), \ j = 1,2,..,r, \end{array} $$

where the weights \(w_{j} =\frac {V_{j}}{{\sigma _{j}^{2}}}\). The precision weighting in Eq. 25 makes intuitive sense in terms of the previously described limiting cases:

  1. i.

    The full pooling (21) corresponds to wj = 0 or τ2 = 0, which means that the 𝜃j are assumed to be the same or fixed at a common value. The approach would lead to underfitting because the effect is assumed to be invariant across ROIs.

  2. ii.

    The no pooling (22) corresponds to wj = 1 or τ2 = , indicating that the r effects 𝜃j are uniformly distributed within (−,); that is, it corresponds to a noninformative uniform prior on 𝜃j. In contrast to full pooling, no pooling tends to overfit the data as the information at one ROI is not utilized to shed light on any other ROIs.

  3. iii.

    The partial pooling (24) or (25) reflects the fact that the r effects 𝜃j are a priori assumed to follow an independent and identically distribution, the prior \(\mathcal {N}(b_{0}, \tau ^{2})\). Under the Bayesian framework, we make statistical inferences about the r effects 𝜃j with a posterior distribution (24) that includes the conventional dichotomous decisions between full pooling (21) and no pooling (22) as two special and extreme cases. Moreover, as expressed in Eq. 25, the Bayesian estimate \(\hat {\theta }_{j}\) can be conceptualized as the precision-weighted average between the individual estimate \(\bar {y}_{\cdot j}\) and the overall (or prior) mean b0, the adjustment of 𝜃j from the overall mean b0 toward the observed mean \(\bar {y}_{\cdot j}\), or conversely, the observed mean \(\bar {y}_{\cdot j}\) being shrunk toward the overall mean b0. As a middle ground between full pooling and no pooling, partial pooling usually provides a better fit to the data since the information is effectively pooled and shared across ROIs.

An important concept for a Bayesian model is exchangeability. Specifically for the BML (23), the effects 𝜃j are exchangeable if their joint distribution p(𝜃1,𝜃2,...,𝜃r) is immutable or invariant to any random permutation among their indices or orders (e.g., p(𝜃1,𝜃2,...,𝜃r) is a symmetric function). Using the ROIs as an example, exchangeability means that, without any a priori knowledge about their effects, we can randomly shuffle or relabel them without reducing our knowledge about their effects. In other words, complete ignorance equals exchangeability: before poring over the data, there is no way for us to distinguish the regions from each other. When the exchangeability assumption can be assumed for 𝜃j, their joint distribution can be expressed as a mixture of independent and identical distributions (Gelman et al. 2014), which is essential in the derivation of the posterior distribution (24) from the prior distribution \(\mathcal {N}(b_{0}, \tau ^{2})\) for 𝜃j.

To complete the Bayesian inferences for the model (23), we proceed to obtain (i) p(b0,τ|y), the marginal posterior distribution of the hyperparameters (b0,τ), (ii) p(b0|τ,y), the posterior distribution of b0 given τ, and (iii) p(τ|y), the posterior distribution of τ with a prior for τ, for example, a noninformative uniform distribution p(τ) ∼ 1. In practice, the numerical solutions are achieved in a backward order, through Monte Carlo simulations of τ to get p(τ|y), simulations of b0 to get p(b0|τ,y), and, lastly, simulations of 𝜃j to get p(𝜃j|b0,τ,y) in Eq. 24.

Assessing type S Error Under BML

In addition to the advantage of information merging across the r entities between the limits of complete and no pooling, a natural question remains: how does BML perform in terms of the conventional type I error as well as type S and type M errors? With the “standard” analysis of r separate models in Eq. 18, each effect 𝜃j is assessed against the sampling variance \({V_{j}^{0}}={\sigma _{j}^{2}}\). In contrast, under the BML (23), the posterior variance, as shown in Eq. 24, is \(V_{j} = \frac {1}{\frac {1}{{\sigma _{j}^{2}}}+\frac {1} {\tau ^{2}}},\ {\sigma _{j}^{2}} = \frac {\sigma ^{2}}{n_{j}}\). As the ratio of the two variances \(\frac {{V_{j}^{0}}}{V_{j}}=\frac {\tau ^{2}}{\tau ^{2}+{\sigma _{j}^{2}}}\) is always less than 1 (except for the limiting cases of σ2 → 0 or τ2), BML generally assigns a larger uncertainty than the conventional approach with no pooling. That is, the inference for each effect 𝜃j based on the unified model (23) is more conservative than when the effect is assessed individually through the model (18). Instead of tightening the overall FPR through some kind of correction for multiplicity among the r separate models, BML addresses the multiplicity issue through precision adjustment or partial pooling under one model with a shrinking or pooling strength of \(\sqrt {\frac {{V_{j}^{0}}}{V_{j}}}=\frac {1}{\sqrt {1+{\sigma _{j}^{2}}/\tau ^{2}}}\).

Simulations (Gelman and Tuerlinckx 2000) indicate that, when making inference based on the 95% quantile interval of the posterior distribution for a single effect 𝜃j (j is fixed, e.g., j = 1), the type S error rate for the Bayesian model (23) is less than 0.025 under all circumstances. In contrast, the conventional model (18) would have a substantial type S error rate especially when the sampling variance is large relative to the cross-entities variance (e.g., \({\sigma _{j}^{2}}/\tau ^{2} > 2\)); specifically, type S error reaches 10% when \({\sigma _{j}^{2}}/\tau ^{2} = 2\), and may go up to 50% if \({\sigma _{j}^{2}}\) much larger than τ2. When multiple comparisons are performed, a similar patterns remains; that is, the type S error rate for the Bayesian model is in general below 2.5%, and is lower than the conventional model with rigorous correction (e.g., Tukey’s honestly significant difference test, wholly significant differences) for multiplicity when σ/τ > 1. The controllability of BML on type S errors is parallel to the usual focus on type I errors under NHST; however, unlike NHST in which the typical I error rate is deliberately controlled through an FPR threshold, the controllability of type S errors under BML is intrinsically embedded in the modeling mechanism without any explicit imposition.

The model (23) is typically seen in Bayesian statistics textbooks as an intuitive introduction to BML (e.g., Gelman et al. 2014). With the indices i and j coding the task trials and ROIs, respectively, the ANOVA model (19) or its Bayesian counterpart (23) can be utilized to make inferences on an ensemble of ROIs at the individual subject level. The conventional analysis would have to deal with the multiplicity issue because of separate inferences at each ROI (i.e., entity). In contrast, there is only one integrated model (23) that leverages the information among the r entities, and the resulting partial pooling effectively dissolves the multiple testing concern. However, the modeling framework can only be applied for single subject analysis, and it is not suitable at the population level; nevertheless, it serves as an intuitive tool for us to extend to more sophisticated scenarios.

Appendix E: Derivation of Posterior Distribution for BML (5)

We start with the BML system (5) with a known sampling variance σ2,

$$y_{ij} | \pi_{i}, \theta_{j} \sim \mathcal{N}(\pi_{i}+\theta_{j}, \sigma^{2}),\ i = 1,2,...,n,\ j = 1,2,..., r. $$

Conditional on 𝜃j and prior πiN(0,λ2), the variance for the sampling mean at the j th ROI, \(\bar {y}_{\cdot j} = \frac {1}{n}{\sum }_{i = 1}^{n} y_{ij}=\theta _{j}+\frac {1}{n}{\sum }_{i = 1}^{n} \pi _{i}+\frac {1}{n}{\sum }_{i = 1}^{n} \epsilon _{ij}\), is \(\frac {\lambda ^{2}+\sigma ^{2}}{n}\); that is,

$$\bar{y}_{\cdot j}| \theta_{j}, \lambda^{2} \sim N(\theta_{j}, \frac{\lambda^{2}+\sigma^{2}}{n}),\ j = 1,2,..., r. $$

With priors πiN(0,λ2) and 𝜃jN(μ,τ2), we follow the same derivation as in the likelihood (23), and obtain the posterior distribution,

$$\begin{array}{@{}rcl@{}} &&\theta_{j}|\mu, \tau, \lambda, \boldsymbol{y} \sim \mathcal{N}(\hat{\theta}_{j}, V),\text{ where } \boldsymbol{y}=\{y_{ij}\}, \hat{\theta}_{j}\\ &&\quad= \frac{\frac{n}{\lambda^{2}+\sigma^{2}} \bar{y}_{\cdot j} +\frac{1}{\tau^{2}} \mu}{\frac{n}{\lambda^{2}+\sigma^{2}}+\frac{1} {\tau^{2}}},\ V = \frac{1}{\frac{n}{\lambda^{2}+\sigma^{2}}+\frac{1} {\tau^{2}}},\ j = 1,2,..,r. \end{array} $$

When the sampling variance σ2 is unknown, we can solve the LME counterpart in Eq. 4,

$$y_{ij}= \mu + \pi_{i} + \xi_{j} + \epsilon_{ij},\ i = 1,2,...,n,\ j = 1,2,..., r. $$

We then plug the estimated variances \(\hat {\lambda }^{2}\), \(\hat {\tau }^{2}\) and \(\hat {\sigma }^{2}\) into the above posterior distribution formulas, and obtain the posterior mean and variance through an approximate Bayesian approach.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Chen, G., Xiao, Y., Taylor, P.A. et al. Handling Multiplicity in Neuroimaging Through Bayesian Lenses with Multilevel Modeling. Neuroinform 17, 515–545 (2019).

Download citation

  • Published:

  • Issue Date:

  • DOI:


  • Null Hypothesis Significance Testing (NHST)
  • False Positive Rate (FPR)
  • Type S and type M errors
  • Regions of Interest (ROIs)
  • General Linear Model (GLM)
  • Linear Mixed-Effects (LME) modeling
  • Bayesian Multilevel (BML) modeling
  • Markov Chain Monte Carlo (MCMC)
  • Stan
  • Priors
  • Leave-one-out (LOO) cross-validation