Abstract
Here we address the current issues of inefficiency and overpenalization in the massively univariate approach followed by correction for multiple testing, and propose a more efficient model that pools and shares information among brain regions. Using Bayesian multilevel (BML) modeling, we control two types of error that are more relevant than the conventional false positive rate (FPR): incorrect sign (type S) and incorrect magnitude (type M). BML also aims to achieve two goals: 1) improving modeling efficiency by having one integrative model, thereby dissolving the multiple testing issue, and 2) turning the focus of conventional null hypothesis significance testing (NHST) on the FPR into quality control, by calibrating type S errors while maintaining a reasonable level of inference efficiency. The performance and validity of this approach are demonstrated through an application at the region of interest (ROI) level, with all the regions on an equal footing: unlike the current approaches under NHST, small regions are not disadvantaged simply because of their physical size. In addition, compared to the massively univariate approach, BML may simultaneously achieve increased spatial specificity and inference efficiency, and promote results reporting in totality and transparency. The benefits of BML are illustrated in performance and quality checking using an experimental dataset. The methodology also avoids the current practice of sharp and arbitrary thresholding in the p-value funnel to which the multidimensional data are reduced. The BML approach with its auxiliary tools is available as part of the AFNI suite for general use.
Notes
In real practice, the ROIs are not randomly drawn from a hypothetical pool in the way experimental subjects are recruited. However, from a practical perspective it is not too far-fetched to assume that the effects at those ROIs form a distribution such as a Gaussian, similar to the Gaussian assumption for cross-subject effects. It is under this assumption that we treat the cross-ROI effects as random, and the assumption can be further validated through various cross-validation methods and model comparisons later in this paper.
For simplicity, here we assume that both π_{i} and ξ_{j} are independent and identically distributed (iid). In reality, the strict iid assumption can be problematic for the cross-ROI effects when the regions are spatially proximate or neurologically related. Nevertheless, the assumption can later be relaxed to exchangeability for BML.
Entity effects are more commonly called group effects in the Bayesian literature. However, to avoid potential confusion with the neuroimaging terminology, in which the word group refers to subject categorization (e.g., males vs. females, patients vs. controls) or to the analytical step of generalizing from individual subjects to the group level (corresponding to the word population in the Bayesian literature), we adopt entity to mean each measuring unit, such as a subject or an ROI, in the current context.
See https://en.wikipedia.org/wiki/Folded-t_and_half-t_distributions for the density p(ν, μ, σ²) of the folded non-standardized t-distribution, where the parameters ν, μ, and σ² are the degrees of freedom, mean, and variance.
The LKJ prior (Lewandowski et al. 2009) is a distribution over symmetric positive-definite matrices with 1s on the diagonal.
Needless to say, the concept of true effect only makes sense under the current model framework at hand, and may not hold once the model is revised.
It is still possible, though much less likely, for a single voxel to survive this correction approach.
A popular cluster reporting method among the neuroimaging software packages is to present the investigator with only the icebergs above the water, the surviving clusters, reinforcing the illusory either-or dichotomy under NHST.
The investigator would not even be able to see such borderline clusters, since typical software implementations mechanically adopt a dichotomous presentation of results.
References
Amrhein, V., & Greenland, S. (2017). Remove, rather than redefine, statistical significance. Nature Human Behaviour, 1, 0224.
Bates, D., Maechler, M., Bolker, B., Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.
Benjamin, D.J., Berger, J., Johannesson, M., Nosek, B.A., Wagenmakers, E.-J., Berk, R., Johnson, V.E. (2017). Redefine statistical significance. Nature Human Behaviour, 1, 0189.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57, 289–300.
Carp, J. (2012). On the plurality of (methodological) worlds: estimating the analytic flexibility of fMRI experiments. Frontiers in Neuroscience, 6, 149.
Chen, G., Saad, Z.S., Nath, A.R., Beauchamp, M.S., Cox, R.W. (2012). FMRI group analysis combining effect estimates and their variances. NeuroImage, 60, 747–765.
Chen, G., Saad, Z.S., Britton, J.C., Pine, D.S., Cox, R.W. (2013). Linear mixed-effects modeling approach to FMRI group analysis. NeuroImage, 73, 176–190.
Chen, G., Adleman, N.E., Saad, Z.S., Leibenluft, E., Cox, R.W. (2014). Applications of multivariate modeling to neuroimaging group analysis: a comprehensive alternative to univariate general linear model. NeuroImage, 99, 571–588.
Chen, G., Taylor, P.A., Shin, Y.W., Reynolds, R.C., Cox, R.W. (2017a). Untangling the relatedness among correlations, part II: inter-subject correlation group analysis through linear mixed-effects modeling. NeuroImage, 147, 825–840.
Chen, G., Taylor, P.A., Cox, R.W. (2017b). Is the statistic value all we should care about in neuroimaging? NeuroImage, 147, 952–959.
Chen, G., Taylor, P.A., Haller, S.P., Kircanski, K., Stoddard, J., Pine, D.S., Leibenluft, E., Brotman, M.A., Cox, R.W. (2018a). Intraclass correlation: improved modeling approaches and applications for neuroimaging. Human Brain Mapping, 39(3), 1187–1206. https://doi.org/10.1002/hbm.23909.
Chen, G., Cox, R.W., Glen, D.R., Rajendra, J.K., Reynolds, R.C., Taylor, P.A. (2018b). A tail of two sides: artificially doubled false positive rates in neuroimaging due to the sidedness choice with t-tests. Human Brain Mapping. In press.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003.
Cox, R.W. (1996). AFNI: software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomedical Research, 29, 162–173. http://afni.nimh.nih.gov.
Cox, R.W., Chen, G., Glen, D.R., Reynolds, R.C., Taylor, P.A. (2017). FMRI clustering in AFNI: false-positive rates redux. Brain Connectivity, 7(3), 152–171.
Cox, R.W. (2018). Equitable thresholding and clustering. In preparation.
Cox, R.W., & Taylor, P.A. (2017). Stability of spatial smoothness and cluster-size threshold estimates in FMRI using AFNI. arXiv:1709.07471.
Cremers, H.R., Wager, T.D., Yarkoni, T. (2017). The relation between statistical power and inference in fMRI. PLoS ONE, 12(11), e0184923.
Eklund, A., Nichols, T.E., Knutsson, H. (2016). Cluster failure: why fMRI inferences for spatial extent have inflated false-positive rates. PNAS, 113(28), 7900–7905.
Forman, S.D., Cohen, J.D., Fitzgerald, M., Eddy, W.F., Mintun, M.A., Noll, D.C. (1995). Improved assessment of significant activation in functional magnetic resonance imaging (fMRI): use of a cluster-size threshold. Magnetic Resonance in Medicine, 33, 636–647.
Gelman, A. (2015). Statistics and the crisis of scientific replication. Significance, 12(3), 23–25.
Gelman, A. (2016). The problems with p-values are not just with p-values. The American Statistician, Online Discussion.
Gelman, A., & Carlin, J.B. (2014). Beyond power calculations: assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 1–11.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B. (2014). Bayesian data analysis, Third edition. Boca Raton: Chapman & Hall/CRC Press.
Gelman, A., & Hennig, C. (2017). Beyond subjective and objective in statistics. Journal of the Royal Statistical Society: Series A (Statistics in Society), 180(4), 1–31.
Gelman, A., Hill, J., Yajima, M. (2012). Why we (usually) don't have to worry about multiple comparisons. Journal of Research on Educational Effectiveness, 5, 189–211.
Gelman, A., & Loken, E. (2013). The garden of forking paths: why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf.
Gelman, A., & Shalizi, C.R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66, 8–38.
Gelman, A., Simpson, D., Betancourt, M. (2017). The prior can generally only be understood in the context of the likelihood. arXiv:1708.07487.
Gelman, A., & Tuerlinckx, F. (2000). Type S error rates for classical and Bayesian single and multiple comparison procedures. Computational Statistics, 15, 373–390.
Gonzalez-Castillo, J., Saad, Z.S., Handwerker, D.A., Inati, S.J., Brenowitz, N., Bandettini, P.A. (2012). Whole-brain, time-locked activation with simple tasks revealed using massive averaging and model-free analysis. PNAS, 109(14), 5487–5492.
Gonzalez-Castillo, J., Chen, G., Nichols, T., Cox, R.W., Bandettini, P.A. (2017). Variance decomposition for single-subject task-based fMRI activity estimates across many sessions. NeuroImage, 154, 206–218.
Lazzeroni, L.C., Lu, Y., Belitskaya-Lévy, I. (2016). Solutions for quantifying P-value uncertainty and replication power. Nature Methods, 13, 107–110.
Lewandowski, D., Kurowicka, D., Joe, H. (2009). Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis, 100, 1989–2001.
Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584–585.
McElreath, R. (2016). Statistical rethinking: a Bayesian course with examples in R and Stan. Boca Raton: Chapman & Hall/CRC Press.
McShane, B.B., Gal, D., Gelman, A., Robert, C., Tackett, J.L. (2017). Abandon statistical significance. arXiv:1709.07588.
Mejia, A., Yue, Y.R., Bolin, D., Lindgren, F., Lindquist, M.A. (2017). A Bayesian general linear modeling approach to cortical surface fMRI data analysis. arXiv:1706.00959.
Morey, R.D., Hoekstra, R., Rouder, J.N., Lee, M.D., Wagenmakers, E.-J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin and Review, 23(1), 103–123.
Mueller, K., Lepsien, J., Möller, H.E., Lohmann, G. (2017). Commentary: cluster failure: why fMRI inferences for spatial extent have inflated false-positive rates. Frontiers in Human Neuroscience, 11, 345.
Nichols, T.E., & Holmes, A.P. (2001). Nonparametric permutation tests for functional neuroimaging: a primer with examples. Human Brain Mapping, 15(1), 1–25.
Olszowy, W., Aston, J., Rua, C., Williams, G.B. (2017). Accurate autocorrelation modeling substantially improves fMRI reliability. arXiv:1711.09877.
Poline, J.B., & Brett, M. (2012). The general linear model and fMRI: does love last forever? NeuroImage, 62(2), 871–880.
R Core Team. (2017). R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Saad, Z.S., Reynolds, R.C., Argall, B., Japee, S., Cox, R.W. (2004). SUMA: an interface for surface-based intra- and inter-subject analysis with AFNI. In Proceedings of the 2004 IEEE International Symposium on Biomedical Imaging (pp. 1510–1513).
Schaefer, A., Kong, R., Gordon, E.M., Zuo, X.N., Holmes, A.J., Eickhoff, S.B., Yeo, B.T. (2017). Local-global parcellation of the human cerebral cortex from intrinsic functional connectivity MRI. Cerebral Cortex. In press.
Simmons, J.P., Nelson, L.D., Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.
Smith, S.M., & Nichols, T.E. (2009). Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. NeuroImage, 44(1), 83–98.
Stan Development Team. (2017). Stan modeling language users guide and reference manual, Version 2.17.0. http://mc-stan.org.
Steegen, S., Tuerlinckx, F., Gelman, A., Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5), 702–712.
Wasserstein, R.L., & Lazar, N.A. (2016). The ASA's statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129–133.
Vehtari, A., Gelman, A., Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432.
Westfall, J., Nichols, T.E., Yarkoni, T. (2017). Fixing the stimulus-as-fixed-effect fallacy in task fMRI. Wellcome Open Research, 1, 23.
Wickham, H. (2009). ggplot2: elegant graphics for data analysis. New York: Springer.
Worsley, K.J., Marrett, S., Neelin, P., Evans, A.C. (1992). A three-dimensional statistical analysis for CBF activation studies in human brain. Journal of Cerebral Blood Flow and Metabolism, 12, 900–918.
Xiao, Y., Geng, F., Riggins, T., Chen, G., Redcay, E. (2018). Neural correlates of developing theory of mind competence in early childhood. Under review.
Yeung, A.W.K. (2018). An updated survey on statistical thresholding and sample size of fMRI studies. Frontiers in Human Neuroscience, 12, 16.
Acknowledgments
The research and writing of the paper were supported (GC, PAT, and RWC) by the NIMH and NINDS Intramural Research Programs (ZICMH002888) of the NIH/HHS, USA, and by the NIH grant R01HD079518A to TR and ER. Much of the modeling work here was inspired by Andrew Gelman's blog. We are indebted to Paul-Christian Bürkner and the Stan development team members Ben Goodrich, Daniel Simpson, Jonah Sol Gabry, Bob Carpenter, and Michael Betancourt for their help and technical support. The simulations were performed in the R language for statistical computing and the figures were generated with the R package ggplot2 (Wickham 2009).
Appendices
Appendix A: Pitfalls of NHST

i.
It is a common mistake for investigators and even statistical analysts to misinterpret the conditional probability under NHST as the posterior probability that the null hypothesis is true (i.e., the probability of the null conditional on the data at hand), even though fundamentally P(data | H_{0}) ≠ P(H_{0} | data).

ii.
One may conflate statistical significance with practical significance, and subsequently treat the failure to reach statistical significance as the nonexistence of any meaningful effect. Even though absence of evidence is not evidence of absence, it is common to read discussions in the scientific literature wherein the authors implicitly (or even explicitly) treat statistically non-significant effects as if they were zero.

iii.
Statistics or p-values cannot easily be compared: the difference between a statistically significant effect and one that fails to pass the significance level does not necessarily itself reach statistical significance.

iv.
How should the investigator handle the demarcation, due to sharp thresholding, between one effect with p = 0.05 (or a surviving cluster cutoff of 54 voxels) and another with p = 0.051 (or a cluster size of 53 voxels) (Footnote 9)?

v.
The focus on statistics or p-values seems, in practice, to lead to the preponderance of reporting only statistical maps, instead of effect maps, in neuroimaging, losing an effective safeguard that could have filtered out potentially spurious results (Chen et al. 2017b).

vi.
One may mistakenly gain more confidence in a statistically significant result (e.g., a high statistic value) in the context of data with relatively heavy noise or a small sample size (e.g., leading to statements such as "despite the small sample size" or "despite the limited statistical power"). In fact, using statistical significance as a screener can lead researchers to make a wrong assessment about the sign of an effect or to drastically overestimate its magnitude.

vii.
While the conceptual classifications of false positives and false negatives make sense in a system of discrete nature (e.g., juror decision on H_{0}: the suspect is innocent), what are the consequences when we adopt a mechanical dichotomous approach to assessing a quantity of continuous, instead of discrete, nature?

viii.
It is usually underappreciated that the p-value, as a function of the data, is itself a random variable and thus has a sampling distribution. In other words, p-values from experiments with identical designs can differ substantially, and statistically significant results may not necessarily be replicated (Lazzeroni et al. 2016).
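This last point is easy to demonstrate by simulation: exact replications of the same design, with the same true effect, yield wildly different p-values. Below is a minimal Python sketch (illustrative only; the design parameters, a true effect of 0.5 standard deviations with 20 samples, are arbitrary assumptions):

```python
import math
import random

def z_test_p(sample):
    """Two-sided z-test p-value against H0: mean = 0, assuming unit variance."""
    n = len(sample)
    z = sum(sample) / math.sqrt(n)            # equals sqrt(n) * sample mean when sd = 1
    return math.erfc(abs(z) / math.sqrt(2))   # two-sided Gaussian tail probability

def replicate_pvalues(n_rep=1000, n=20, true_effect=0.5, seed=1):
    """p-values from n_rep exact replications of the same design."""
    rng = random.Random(seed)
    ps = []
    for _ in range(n_rep):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        ps.append(z_test_p(sample))
    return sorted(ps)

ps = replicate_pvalues()
print("10th, 50th, 90th percentile p-value:", ps[100], ps[500], ps[900])
print("fraction significant at 0.05:", sum(p < 0.05 for p in ps) / len(ps))
```

With these settings, replications of one and the same experiment span p-values from well below 0.001 to well above 0.3, and only roughly 60% of them cross the 0.05 threshold.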
Appendix B: Type S and Type M Errors
We discuss two types of error that have not been introduced in neuroimaging: type S and type M. These two types of error cannot be directly captured by the FPR concept and may become severe when the effect is small relative to the noise, which is usually the situation for BOLD neuroimaging data. In the NHST framework, we formulate a null hypothesis H_{0} (e.g., the effect of an easy task E is identical to that of a difficult one D), and then commit a type I (or false positive) error if we wrongly reject H_{0} (e.g., the effect of easy is judged to be statistically different from difficult when their effects are actually the same); in contrast, we make a type II (or false negative) error when we accept H_{0} even though H_{0} is in fact false (e.g., the effect of easy is assessed to be not statistically different from difficult even though their effects do differ). These are the dichotomous errors associated with NHST, and the counterbalance between these two types of error is the underpinning of typical experimental design as well as results reporting.
However, we could think about other ways of framing errors when making a statistical assessment (e.g., the easy case elicits a stronger BOLD response at some region than the difficult case) conditional on the current data. We are exposed to the risk that our decision is contrary to the truth (e.g., the BOLD response to the easy condition is actually lower than to the difficult condition). Such a risk is gauged as a type S (for "sign") error when we incorrectly identify the sign of the effect; its values range from 0 (no chance of error) to 1 (full chance of error). Similarly, we make a type M (for "magnitude") error when estimating the effect as small in magnitude when it is actually large, or when claiming that the effect is large in magnitude when it is in fact small (e.g., saying that the easy condition produces a much larger response than the difficult one when the actual difference is tiny); its values range across the positive real numbers: [0, 1) corresponds to underestimation of the effect magnitude, 1 to correct estimation, and (1, +∞) to overestimation. The two error types are illustrated in Fig. 5 for inferences made under NHST. In the neuroimaging realm, effect magnitude is certainly a property of interest; therefore, the corresponding type S and type M errors are of research interest.
Geometrically speaking, if the null hypothesis H_{0} can be conceptualized as the point at zero, NHST aims at the real space R excluding zero, with a pivot at that point (e.g., D − E = 0); in contrast, a type S error gauges the relative chance that a result is assessed on the wrong side of zero, between the two half spaces of R (e.g., D − E > 0 or D − E < 0), and a type M error gauges the relative magnitude of differences along segments of R^{+} (e.g., whether the ratio of measured to actual effect is ≫ 1 or ≪ 1). Thus, we characterize type I and type II errors as "point-wise" errors, driven by judging equality, and describe type S and type M errors as "direction-wise," driven by a focus on inequality or directionality.
Publication bias offers one direct application of the type M concept: large effect estimates are more likely to filter through the dichotomous decisions of statistical inference and the reviewing process. Using the type S and type M error concepts, it might be surprising for those encountering these two error types for the first time to realize that, when the data are highly variable or noisy, or when the sample size is small with relatively low power (e.g., 0.06), a statistically significant result at the 0.05 level is quite likely to have an incorrect sign, with a type S error rate of 24% or even higher (Gelman and Carlin 2014). In addition, such a statistically significant result would carry a type M error, with its effect estimate much larger (e.g., 9 times higher) than the true value. Put another way, if the real effect is small and the sampling variance is large, then a dataset that reaches statistical significance must have an exaggerated effect estimate, and the sign of that estimate may well be incorrect. Owing to the ramifications of type M errors and publication filtering, an effect size taken from the literature could be exaggerated to some extent, seriously calling into question the usefulness of power analysis under NHST in determining sample size or power, which might explain the dramatic contrast between the common practice of power analysis as a grant-application requirement and the reproducibility crisis across various fields. Fundamentally, power analysis inherits the same problem as NHST: a narrow emphasis on statistical significance as the primary focus (Gelman and Carlin 2014).
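The figures quoted above (a type S rate around 24% and a roughly ninefold type M exaggeration at power ≈ 0.06) can be reproduced with a small Monte Carlo sketch in the spirit of the "retrodesign" calculation of Gelman and Carlin (2014). The true effect of 2 and standard error of 8.1 follow their low-power example; Python is used here purely for illustration:

```python
import random

def retrodesign(true_effect, se, z_crit=1.959964, n_sim=200_000, seed=7):
    """Monte Carlo estimate of power, type S rate, and type M (exaggeration)
    ratio for a z-test, given the true effect and its standard error."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sim):
        est = rng.gauss(true_effect, se)     # one hypothetical replication
        if abs(est) > z_crit * se:           # survives the 0.05 threshold
            significant.append(est)
    power = len(significant) / n_sim
    # type S: among significant results, the fraction with the wrong sign
    type_s = sum(est * true_effect < 0 for est in significant) / len(significant)
    # type M: among significant results, average |estimate| / |true effect|
    type_m = sum(abs(est) for est in significant) / len(significant) / abs(true_effect)
    return power, type_s, type_m

# low-power scenario from Gelman and Carlin (2014): true effect 2, se 8.1
power, type_s, type_m = retrodesign(2.0, 8.1)
print(f"power={power:.2f}  type S={type_s:.2f}  type M={type_m:.1f}")
```

Conditioning on statistical significance is what drives both error types: only the extreme tails of the sampling distribution cross the threshold, and those tails are far from (and sometimes on the wrong side of) the true effect.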
The typical effect magnitude in BOLD FMRI at 3 Tesla is usually small, less than 1% signal change in most brain regions except for areas such as the motor and primary sensory cortices. Such a weak signal can be largely submerged by the overwhelming noise and distortion embedded in FMRI data. The low detection power of typical FMRI data analyses is further compounded by the modeling challenges in accurately capturing the effect. For example, even though a large number of physiological confounding effects are embedded in the data, it is still difficult to properly incorporate the physiological "noises" (cardiac and respiratory effects) in the model. Moreover, habituation, saturation, or attenuation across trials or within each block is usually not considered, and such fluctuations relative to the average effect are treated as noise or as fixed- instead of random-effects (Westfall et al. 2017). There are also strong indications that a large portion of BOLD activations go unidentified at the individual subject level due to the lack of power (Gonzalez-Castillo et al. 2012). Because of these factors, the variance due to poor modeling overwhelms all other sources (e.g., across trials, runs, and sessions) in the total data variance (Gonzalez-Castillo et al. 2017); that is, the majority (e.g., 60–80%) of the total variance in the data is not properly accounted for in statistical models.
Appendix C: Multiplicity in Neuroimaging
In general, we can classify four types of multiplicity issues that commonly occur in neuroimaging data analysis.
 A)
Multiple testing. The first and major multiplicity arises when the same design (or model) matrix is applied multiple times to different values of the response or outcome variable, such as the effect estimates at the voxels within the brain. As conventional voxelwise neuroimaging data analysis is performed with a massively univariate approach, there are as many models as voxels, which is the source of the major multiplicity issue: multiple testing. Those models can be, for instance, Student's t-tests, AN(C)OVA, univariate or multivariate GLM, LME, or Bayesian models. Regardless of the specific model, all the voxels share the same design matrix but have different response variable values on the left-hand side of the equation. With the human brain size on the order of 10^{6} mm^{3}, the number of voxels may range from 20,000 to 150,000 depending on the voxel dimensions. Each extra voxel adds an extra model and incrementally mounts the odds of pure-chance "statistically significant outcomes," presenting the challenge of accounting for the mounting familywise error (FWE) while effectively holding the overall false positive rate (FPR) at a nominal level (e.g., 0.05). In the same vein, surface-based analysis is performed with 30,000 to 50,000 nodes (Saad et al. 2004), sharing a similar multiple testing issue with its volume-based counterpart. Sometimes the investigator performs analyses at a smaller number of ROIs, perhaps of order 100, but even here adjustment is still required for multiple testing (though it is often not made).
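The way the odds mount can be sketched directly: under the null, each p-value is uniform on [0, 1], so across m independent tests the chance of at least one false positive is 1 − (1 − α)^m. A minimal Python illustration (independence is an assumption here; spatially correlated voxels would change the exact numbers but not the trend):

```python
import random

def familywise_error(n_tests, alpha=0.05, n_sim=20_000, seed=3):
    """Estimated chance of at least one false positive among n_tests
    independent tests when every null hypothesis is true."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(n_tests))  # p-values uniform under H0
        for _ in range(n_sim)
    )
    return hits / n_sim

for m in (1, 10, 100):
    print(f"{m:>3} tests: simulated FWE = {familywise_error(m):.3f}, "
          f"analytic 1-(1-0.05)^{m} = {1 - 0.95 ** m:.3f}")
```

Even at 100 tests (let alone tens of thousands of voxels) the familywise error is essentially 1, which is why uncorrected voxelwise thresholding is untenable.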
 B)
Double sidedness. Another occurrence of multiplicity is the widespread adoption of two separate one-sided (or one-tailed) tests in neuroimaging. For instance, the comparison between the two conditions "easy" and "difficult" is usually analyzed twice for the whole brain: once to show whether the easy effect is higher than difficult, and once for the possibility of the difficult effect being higher than easy. One-sided testing in a single direction would be justified if prior knowledge were available regarding the sign of the test for a particular brain region. When no such prior information is available for all regions in the brain, one cannot simply substitute two separate one-sided tests for one two-sided test, and the double-sided practice warrants a Bonferroni correction because the two tails are disjoint (and each one-sided test is more liberal than a two-sided test at the same significance level). However, simultaneously testing both tails in tandem for whole brain analysis without correction is widely practiced without clear justification, and this forms a source of multiplicity that needs proper accounting (Chen et al. 2018b).
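A quick simulation (Python sketch, illustrative only) shows why the two separate one-sided tests warrant correction: under the null, their disjoint tails combine to a 0.10 false positive rate, double that of a single two-sided test at the same nominal level:

```python
import random

Z_ONE = 1.644854   # one-sided 5% critical value for a z-test
Z_TWO = 1.959964   # two-sided 5% critical value for a z-test

def false_positive_rates(n_sim=200_000, seed=11):
    """Under H0, compare one two-sided test at 0.05 with the common practice
    of running two separate one-sided tests, each at 0.05."""
    rng = random.Random(seed)
    pair_hits = single_hits = 0
    for _ in range(n_sim):
        z = rng.gauss(0.0, 1.0)          # test statistic under the null
        if z > Z_ONE or z < -Z_ONE:      # 'easy > difficult' OR 'difficult > easy'
            pair_hits += 1
        if abs(z) > Z_TWO:               # one two-sided test
            single_hits += 1
    return pair_hits / n_sim, single_hits / n_sim

pair_fpr, single_fpr = false_positive_rates()
print(f"two one-sided tests: {pair_fpr:.3f}; one two-sided test: {single_fpr:.3f}")
```

Because the two rejection regions do not overlap, their probabilities simply add, hence the artificially doubled FPR discussed in Chen et al. (2018b).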
 C)
Multiple comparisons. It rarely occurs that only one statistical test is carried out in a specific neuroimaging study, such as a single one-sample t-test. Therefore, a third source of multiplicity is directly related to the popular term multiple comparisons, which refers to multiple tests conducted under one model. For example, an investigator who designs an emotion experiment with three conditions (easy, difficult, and moderate) may perform several separate tests: comparing each of the three conditions to baseline, making three pairwise comparisons, or testing a linear combination of the three conditions (such as the average of easy and difficult versus moderate). However, neuroimaging publications seldom consider corrections for such separate tests.
 D)
 Multiple paths. The fourth multiplicity issue to affect outcome interpretation arises from the number of potential preprocessing, data-dredging, and analytical pipelines (Carp 2012). For instance, all common steps have a choice of procedures: outlier handling (despiking, censoring), slice timing correction (yes/no, various interpolations), head motion correction (different interpolations), different alignment methods from EPI to anatomical data plus upsampling (1 to 4 mm), different alignment methods to different standard spaces (Talairach and MNI variants), spatial smoothing (3 to 10 mm), data scaling (voxelwise, global, or grand mean), confounding effects (slow drift modeling with polynomials, high-pass filtering, head motion parameters), hemodynamic response modeling (different presumed functions and multiple basis functions), serial correlation modeling (whole brain, tissue-based, voxelwise AR or ARMA), and population modeling (univariate or multivariate GLM, treating sex as a covariate of no interest (thus no interactions with other variables) or as a typical factor (plus potential interactions with other variables)). Each choice represents a "branching point" that could quantitatively change the final effect estimate and inference. Conservatively assuming three options at each step would yield a total of 3^{10} = 59,049 possible paths, commonly referred to as the researcher degrees of freedom (Simmons et al. 2011). The impact of the choice at each individual step in this abbreviated list might be negligible, moderate, or substantial. For example, different serial correlation models may lead to substantially different reliability of effect estimates (Olszowy et al. 2017); the estimate for the spatial correlation of the noise could be sensitive to the voxel size to which the original data were upsampled (Mueller et al. 2017; Cox and Taylor 2017), which may lead to different cluster thresholds and poor control of the intended FPR when correcting for multiplicity.
Therefore, the cumulative effect across all these branching points could be a large divergence in the final results between any two paths. A multiverse analysis (Steegen et al. 2016) has been suggested for such situations of having a "garden of forking paths" (Gelman and Loken 2013), but this seems highly impractical for neuroimaging data. Even when one specific analytical path is chosen by the investigator, it remains possible to invoke potential or implicit multiplicity in the sense that the details of the analytical steps, such as data sanitation, are conditional on the data (Gelman and Loken 2013). The final interpretation of significance typically ignores the number of choices or potential branchings that may affect the final outcome, even though it would be preferable for statistical significance to be independent of these preprocessing steps.
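The branching arithmetic above can be made concrete. In the sketch below, the step names and options are drawn from the examples in the text but are otherwise illustrative placeholders; the conservative assumption of three options per step is the same one used in the 3^{10} count:

```python
import itertools
import math

# Hypothetical, abbreviated menu: ten pipeline steps, three options each
# (labels are illustrative stand-ins for the choices listed in the text)
pipeline_steps = {
    "outlier handling":         ["none", "despiking", "censoring"],
    "slice timing":             ["none", "linear", "sinc"],
    "motion correction":        ["trilinear", "sinc", "spline"],
    "EPI-anatomical alignment": ["method A", "method B", "method C"],
    "template space":           ["Talairach", "MNI", "MNI variant"],
    "spatial smoothing (mm)":   ["3", "6", "10"],
    "data scaling":             ["voxelwise", "global", "grand mean"],
    "drift modeling":           ["polynomial", "high-pass", "motion params"],
    "HRF modeling":             ["canonical", "gamma variate", "basis set"],
    "serial correlation":       ["whole brain", "tissue-based", "voxelwise"],
}

# total number of distinct analysis paths: product of options per step
n_paths = math.prod(len(options) for options in pipeline_steps.values())
print(n_paths)  # 3**10 = 59049

# each element of the product is one complete pipeline specification
first_paths = list(itertools.islice(itertools.product(*pipeline_steps.values()), 3))
```

Every full assignment of one option per step is a distinct path through the garden of forking paths, which is why the count multiplies rather than adds.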
Appendix D: Bayesian Modeling for OneWay RandomEffects ANOVA
Here we discuss a classical framework, a hierarchical or multilevel model for a one-way random-effects ANOVA, and use it as a building block to expand to a Bayesian framework for neuroimaging group analysis. In evaluating this model, we focus on controlling type S errors instead of the traditional FPR. Suppose that there are r measured entities (e.g., ROIs), with entity j measuring the effect θ_{j} from n_{j} independent Gaussian-distributed data points y_{ij}, each of which represents a sample (e.g., trial), i = 1, 2,...,n_{j}. The conventional statistical approach formulates r separate models,

\(y_{ij} = \theta_{j} + \epsilon_{ij}, \quad i = 1, 2, \ldots, n_{j}, \qquad (18)\)
where 𝜖_{ij} is the residual for the jth entity and is assumed to be Gaussian \(\mathcal{N}(0, \sigma^{2})\), j = 1, 2,...,r. Depending on whether the sampling variance σ^{2} is known or not, each effect can be assessed through its sample mean \(\bar{y}_{\cdot j} = \frac{1}{n_{j}}{\sum}_{i = 1}^{n_{j}} y_{ij}\) relative to the corresponding variance \({V_{j}^{0}}=\frac{\sigma^{2}}{n_{j}}\), resulting in a Z- or t-test.
By combining the data from the r entities and further decomposing the effect θ_{j} into an overall effect b_{0} across the r entities and the deviation ξ_{j} of the jth entity from the overall effect (i.e., θ_{j} = b_{0} + ξ_{j}, j = 1, 2,...,r), we have a conventional one-way random-effects ANOVA,

\(y_{ij} = b_{0} + \xi_{j} + \epsilon_{ij}, \quad i = 1, 2, \ldots, n_{j}, \; j = 1, 2, \ldots, r, \qquad (19)\)
where b_{0} is conceptualized as a fixed-effects parameter, ξ_{j} codes the random fluctuation of the jth entity from the overall mean b_{0} with the assumption \(\xi_{j} \sim \mathcal{N}(0, \tau^{2})\), and the residual 𝜖_{ij} follows a Gaussian distribution \(\mathcal{N}(0, \sigma^{2})\). The classical one-way random-effects ANOVA model (19) is typically formulated to examine the null hypothesis,

\(H_{0}: \theta_{1} = \theta_{2} = \cdots = \theta_{r} \; (\text{equivalently, } \tau = 0), \qquad (20)\)
with an F-statistic, which is constructed as the ratio of the between-entity mean sums of squares to the within-entity mean sums of squares. One application of this ANOVA model (19) in neuroimaging is to compute the intraclass correlation ICC(1,1) as \(\frac{\tau^{2}}{\tau^{2}+\sigma^{2}}\) when the measuring entities are exchangeable (e.g., families with identical twins; Chen et al. 2018a).
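As a concrete illustration of the F-statistic and ICC(1,1) computations just described (a Python sketch on made-up balanced data; the paper's own analyses use R):

```python
def oneway_anova(groups):
    """Between/within mean squares, the F-statistic, and the method-of-moments
    ICC(1,1) estimate for balanced one-way random-effects data."""
    r, n = len(groups), len(groups[0])
    grand = sum(sum(g) for g in groups) / (r * n)
    means = [sum(g) / n for g in groups]
    ms_between = n * sum((m - grand) ** 2 for m in means) / (r - 1)
    ms_within = sum((y - m) ** 2
                    for g, m in zip(groups, means) for y in g) / (r * (n - 1))
    f_stat = ms_between / ms_within
    tau2 = max(0.0, (ms_between - ms_within) / n)   # cross-entity variance estimate
    icc = tau2 / (tau2 + ms_within)                 # ICC(1,1) = tau^2 / (tau^2 + sigma^2)
    return f_stat, icc

# hypothetical effect estimates: r = 3 entities with n = 3 samples each
f_stat, icc = oneway_anova([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(f"F = {f_stat:.1f}, ICC(1,1) = {icc:.2f}")  # F = 21.0, ICC(1,1) = 0.87
```

A large F (big between-entity spread relative to within-entity noise) corresponds to a high ICC, i.e., most of the total variance is attributable to genuine cross-entity differences.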
Whenever multiple values (e.g., two effect estimates from two scanning sessions) from each measuring unit (e.g., subject or family) are correlated (e.g., the levels of a within-subject or repeated-measures factor), the data can be formulated with a linear mixed-effects (LME) model, sometimes referred to as a multilevel or hierarchical model. One natural extension of the ANOVA is simply to treat the model conceptually as LME, without reformulating the model equation (19). However, LME can only provide point estimates for the overall effect b_{0}, the cross-region variance τ^{2}, and the data variance σ^{2}; that is, the LME (19) cannot directly provide any information about the individual ξ_{j} or θ_{j} values, because estimating them separately would overfit: the number of parameters to be estimated would exceed the number of data points available to constrain them.
Our interest here is neither to assess the variability τ^{2} nor to calculate the ICC, but instead to make statistical inferences about the individual effects θ_{j}. Nevertheless, the conventional NHST (20) may shed some light on potential strategies (Gelman et al. 2014) for θ_{j}. If the deviations ξ_{j} are relatively small compared to the overall mean b_{0}, then the corresponding F-statistic will be small as well, leading to the decision of not rejecting the null hypothesis (20) at a reasonable, predetermined significance level (e.g., 0.05); in that case, we can estimate the (equal) individual effects θ_{j} using the overall weighted mean \(\bar {y}_{\cdot \cdot }\) through full pooling of all the data,
where \(\bar {y}_{\cdot j}=\frac {1}{n_{j}}{\sum }_{i = 1}^{n_{j}} y_{ij}\) and \({\sigma _{j}^{2}} = \frac {\sigma ^{2}}{n_{j}}\) are the sample mean and sampling variance for the j-th measuring entity, and the subscript dot (·) notation indicates the (weighted) mean across the corresponding index(es). On the other hand, if the deviations ξ_{j} are relatively large, so is the associated F-statistic, leading to the decision of rejecting the null hypothesis (20); in that case, we can reasonably estimate θ_{j} with no pooling across the r entities; that is, each θ_{j} is estimated from the j-th measuring entity's data separately,
However, in estimating θ_{j} we do not have to take a dichotomous approach of choosing, based on a preset significance level, between these two extreme options: the overall weighted mean \(\bar {y}_{\cdot \cdot }\) in Eq. 21 through full pooling, and the separate means \(\bar {y}_{\cdot j}\) in Eq. 22 with no pooling. Instead, we can assume that a reasonable estimate of θ_{j} lies somewhere along the continuum between \(\bar {y}_{\cdot \cdot }\) and \(\bar {y}_{\cdot j}\), with its exact location derived from the data rather than imposed by an arbitrary threshold. This line of thinking brings us to the Bayesian methodology.
To simplify the situation, we first assume a known sampling variance σ^{2} for the i-th data point (e.g., trial) of the j-th entity; or, in Bayesian-style formulation, we build a BML for the distribution of y_{ij} conditional on θ_{j},
With a prior distribution \(\mathcal {N}(b_{0}, \tau ^{2})\) for θ_{j} and a noninformative uniform hyperprior for b_{0} given τ (i.e., p(b_{0}|τ) ∝ 1), the conditional posterior distributions for θ_{j} can be derived (Gelman et al. 2014) as,
The analytical solution (24) indicates that \(\frac {1}{V_{j}} = \frac {1}{{\sigma _{j}^{2}}}+\frac {1} {\tau ^{2}}\), manifesting an intuitive fact: the posterior precision is the cumulative effect of the data precision and the prior precision; that is, the posterior precision improves on the data precision \(\frac {1}{{\sigma _{j}^{2}}}\) by the amount \(\frac {1} {\tau ^{2}}\). Moreover, the expression for the posterior mode \(\hat {\theta }_{j}\) in Eq. 24 shows that the estimating choice along the continuum can be expressed as a precision-weighted average between the individual sample mean \(\bar {y}_{\cdot j}\) and the overall mean b_{0}:
where the weights are \(w_{j} =\frac {V_{j}}{{\sigma _{j}^{2}}}\). The precision weighting in Eq. 25 makes intuitive sense in terms of the previously described limiting cases:
 i.
The full pooling (21) corresponds to w_{j} = 0 or τ^{2} = 0, which means that the θ_{j} are assumed to be the same, fixed at a common value. This approach would lead to underfitting because the effect is assumed to be invariant across ROIs.
 ii.
The no pooling (22) corresponds to w_{j} = 1 or τ^{2} = ∞, indicating that the r effects θ_{j} are uniformly distributed within (−∞, ∞); that is, it corresponds to a noninformative uniform prior on θ_{j}. In contrast to full pooling, no pooling tends to overfit the data, as the information at one ROI is not utilized to shed light on any other ROI.
 iii.
The partial pooling (24) or (25) reflects the fact that the r effects θ_{j} are a priori assumed to be independently and identically distributed with the prior \(\mathcal {N}(b_{0}, \tau ^{2})\). Under the Bayesian framework, we make statistical inferences about the r effects θ_{j} through a posterior distribution (24) that includes the conventional dichotomous decisions between full pooling (21) and no pooling (22) as two special and extreme cases. Moreover, as expressed in Eq. 25, the Bayesian estimate \(\hat {\theta }_{j}\) can be conceptualized as the precision-weighted average between the individual estimate \(\bar {y}_{\cdot j}\) and the overall (or prior) mean b_{0}: an adjustment of θ_{j} from the overall mean b_{0} toward the observed mean \(\bar {y}_{\cdot j}\), or conversely, the observed mean \(\bar {y}_{\cdot j}\) being shrunk toward the overall mean b_{0}. As a middle ground between full pooling and no pooling, partial pooling usually provides a better fit to the data, since information is effectively pooled and shared across ROIs.
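The three pooling strategies above can be contrasted numerically. The sketch below uses simulated data and, purely for illustration, takes the hyperparameters b_{0} and τ as known (the full BML would instead estimate them from the data); it computes the full-pooling, no-pooling, and precision-weighted partial-pooling estimates of Eq. 25 and compares their accuracy against the simulated truth:

```python
import numpy as np

rng = np.random.default_rng(1)
r, n, sigma, b0, tau = 50, 10, 1.0, 0.0, 0.5   # illustrative dimensions/values

rmse = {"full": [], "none": [], "partial": []}
for _ in range(200):
    theta = rng.normal(b0, tau, r)                     # true effects theta_j
    y = theta[:, None] + rng.normal(0, sigma, (r, n))  # trial-level data y_ij
    ybar = y.mean(axis=1)                              # per-ROI sample means
    s2 = sigma**2 / n                                  # sampling variance sigma_j^2
    V = 1.0 / (1.0 / s2 + 1.0 / tau**2)                # posterior variance, Eq. 24
    w = V / s2                                         # weight w_j, Eq. 25
    est = {
        "full": np.full(r, ybar.mean()),               # one common value for all ROIs
        "none": ybar,                                  # each ROI estimated separately
        "partial": w * ybar + (1.0 - w) * b0,          # precision-weighted average
    }
    for k, e in est.items():
        rmse[k].append(np.sqrt(np.mean((e - theta) ** 2)))

avg = {k: float(np.mean(v)) for k, v in rmse.items()}
print(avg)  # partial pooling attains the smallest average error of the three
```

The ordering of the errors (partial < none < full in this regime) is the numerical face of the underfitting/overfitting trade-off described in items i-iii.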
An important concept for a Bayesian model is exchangeability. Specifically, for the BML (23), the effects θ_{j} are exchangeable if their joint distribution p(θ_{1},θ_{2},...,θ_{r}) is invariant under any permutation of their indices (i.e., p(θ_{1},θ_{2},...,θ_{r}) is a symmetric function). Using the ROIs as an example, exchangeability means that, without any a priori knowledge about their effects, we can randomly shuffle or relabel them without reducing our knowledge about their effects. In other words, complete ignorance equals exchangeability: before poring over the data, there is no way for us to distinguish the regions from one another. When exchangeability can be assumed for the θ_{j}, their joint distribution can be expressed as a mixture of independent and identical distributions (Gelman et al. 2014), which is essential in deriving the posterior distribution (24) from the prior distribution \(\mathcal {N}(b_{0}, \tau ^{2})\) for θ_{j}.
To complete the Bayesian inferences for the model (23), we proceed to obtain (i) p(b_{0},τ|y), the marginal posterior distribution of the hyperparameters (b_{0},τ); (ii) p(b_{0}|τ,y), the posterior distribution of b_{0} given τ; and (iii) p(τ|y), the posterior distribution of τ with a prior for τ, for example, a noninformative uniform distribution p(τ) ∝ 1. In practice, the numerical solutions are achieved in a backward order, through Monte Carlo simulations of τ to get p(τ|y), simulations of b_{0} to get p(b_{0}|τ,y), and, lastly, simulations of θ_{j} to get p(θ_{j}|b_{0},τ,y) in Eq. 24.
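A minimal numerical sketch of this backward scheme follows, using a grid approximation for p(τ|y) and the standard conjugate formulas for the known-variance hierarchical Gaussian model (Gelman et al. 2014). The data values and grid are hypothetical, chosen only to illustrate the three-step simulation order:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical summary data: sample means ybar_j with known sampling variances s2_j.
ybar = np.array([0.28, 0.10, 0.45, 0.31, 0.02, 0.37, 0.21, 0.50])
s2 = np.full(ybar.size, 0.04)

def log_p_tau(tau):
    """log p(tau|y) up to a constant, flat prior on tau; b0 integrated out."""
    vj = s2 + tau**2
    Vmu = 1.0 / np.sum(1.0 / vj)          # posterior variance of b0 | tau, y
    mu_hat = Vmu * np.sum(ybar / vj)      # posterior mean of b0 | tau, y
    return (0.5 * np.log(Vmu)
            - 0.5 * np.sum(np.log(vj))
            - 0.5 * np.sum((ybar - mu_hat) ** 2 / vj))

# Backward order: (1) tau from p(tau|y) on a grid, (2) b0 from p(b0|tau,y),
# (3) theta_j from p(theta_j|b0,tau,y) as in Eq. 24.
grid = np.linspace(1e-3, 1.0, 400)
lp = np.array([log_p_tau(t) for t in grid])
p = np.exp(lp - lp.max()); p /= p.sum()
tau_draws = rng.choice(grid, size=1000, p=p)

theta_draws = np.empty((1000, ybar.size))
for k, tau in enumerate(tau_draws):
    vj = s2 + tau**2
    Vmu = 1.0 / np.sum(1.0 / vj)
    b0 = rng.normal(Vmu * np.sum(ybar / vj), np.sqrt(Vmu))
    V = 1.0 / (1.0 / s2 + 1.0 / tau**2)
    theta_hat = V * (ybar / s2 + b0 / tau**2)
    theta_draws[k] = rng.normal(theta_hat, np.sqrt(V))

print(theta_draws.mean(axis=0).round(2))  # posterior means, shrunk toward b0
```

The grid approximation stands in for the Monte Carlo machinery of a full Bayesian fit; it is adequate here only because τ is one-dimensional.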
Assessing Type S Errors Under BML
In addition to the advantage of pooling information across the r entities between the limits of complete and no pooling, a natural question remains: how does BML perform in terms of the conventional type I error as well as type S and type M errors? With the "standard" analysis of r separate models in Eq. 18, each effect θ_{j} is assessed against the sampling variance \({V_{j}^{0}}={\sigma _{j}^{2}}\). In contrast, under the BML (23), the posterior variance, as shown in Eq. 24, is \(V_{j} = \frac {1}{\frac {1}{{\sigma _{j}^{2}}}+\frac {1} {\tau ^{2}}},\ {\sigma _{j}^{2}} = \frac {\sigma ^{2}}{n_{j}}\). As the ratio of the two variances \(\frac {V_{j}}{{V_{j}^{0}}}=\frac {\tau ^{2}}{\tau ^{2}+{\sigma _{j}^{2}}}\) is always less than 1 (except for the limiting cases of \({\sigma _{j}^{2}} \to 0\) or \(\tau ^{2} \to \infty \)), the posterior mean under BML is shrunk toward the overall mean, attenuating the statistical evidence for each individual effect. That is, the inference for each effect θ_{j} based on the unified model (23) is more conservative than when the effect is assessed individually through the model (18). Instead of tightening the overall FPR through some correction for multiplicity among the r separate models, BML addresses the multiplicity issue through precision adjustment, or partial pooling, under one model, with a shrinking or pooling strength of \(\sqrt {\frac {V_{j}}{{V_{j}^{0}}}}=\frac {1}{\sqrt {1+{\sigma _{j}^{2}}/\tau ^{2}}}\).
Simulations (Gelman and Tuerlinckx 2000) indicate that, when making inferences based on the 95% quantile interval of the posterior distribution for a single effect θ_{j} (j fixed, e.g., j = 1), the type S error rate for the Bayesian model (23) is less than 0.025 under all circumstances. In contrast, the conventional model (18) can have a substantial type S error rate, especially when the sampling variance is large relative to the cross-entity variance (e.g., \({\sigma _{j}^{2}}/\tau ^{2} > 2\)): the type S error rate reaches 10% when \({\sigma _{j}^{2}}/\tau ^{2} = 2\), and may go up to 50% when \({\sigma _{j}^{2}}\) is much larger than τ^{2}. When multiple comparisons are performed, a similar pattern remains; that is, the type S error rate for the Bayesian model is in general below 2.5%, and is lower than that of the conventional model with rigorous correction for multiplicity (e.g., Tukey's honestly significant difference test, wholly significant differences) when σ_{j}/τ > 1. The controllability of BML over type S errors parallels the usual focus on type I errors under NHST; however, unlike NHST, in which the type I error rate is deliberately controlled through an FPR threshold, the control of type S errors under BML is intrinsically embedded in the modeling mechanism, without any explicit imposition.
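The single-effect comparison can be reproduced in miniature. The simulation below is illustrative only: it sets b_{0} = 0 and \({\sigma _{j}^{2}}/\tau ^{2} = 2\), and uses the known-hyperparameter posterior of Eq. 24 in place of a full Bayesian fit. A type S error is counted when a result declared "significant" (interval excluding zero) has the wrong sign:

```python
import numpy as np

rng = np.random.default_rng(3)

tau, sigma_j = 1.0, np.sqrt(2.0)     # so that sigma_j^2 / tau^2 = 2
N = 200_000

theta = rng.normal(0.0, tau, N)               # true effects, with b0 = 0
ybar = theta + rng.normal(0.0, sigma_j, N)    # observed per-entity estimates

# Conventional model (18): report when the 95% CI excludes zero;
# a type S error occurs when the reported sign is wrong.
sig = np.abs(ybar) > 1.96 * sigma_j
typeS_conv = float(np.mean(np.sign(ybar[sig]) != np.sign(theta[sig])))

# BML (23): posterior theta_j | ybar_j ~ N(w*ybar_j, V), with
# w = tau^2/(tau^2 + sigma_j^2) and V = w*sigma_j^2 (Eq. 24);
# report when the 95% posterior interval excludes zero.
w = tau**2 / (tau**2 + sigma_j**2)
V = w * sigma_j**2
post_mean = w * ybar
sig_b = np.abs(post_mean) > 1.96 * np.sqrt(V)
typeS_bml = float(np.mean(np.sign(post_mean[sig_b]) != np.sign(theta[sig_b])))

print(typeS_conv, typeS_bml)  # the Bayesian rate stays well below the conventional one
```

Under these settings the conventional type S rate lands near the 10% figure cited above, while the Bayesian rate stays below 2.5%, at the price of declaring far fewer results "significant".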
The model (23) is typically seen in Bayesian statistics textbooks as an intuitive introduction to BML (e.g., Gelman et al. 2014). With the indices i and j coding the task trials and ROIs, respectively, the ANOVA model (19) or its Bayesian counterpart (23) can be utilized to make inferences about an ensemble of ROIs at the individual subject level. The conventional analysis would have to deal with the multiplicity issue because of the separate inferences at each ROI (i.e., entity). In contrast, there is only one integrated model (23), which leverages the information among the r entities, and the resulting partial pooling effectively dissolves the multiple testing concern. However, this modeling framework can only be applied to single-subject analysis and is not suitable at the population level; nevertheless, it serves as an intuitive stepping stone to more sophisticated scenarios.
Appendix E: Derivation of Posterior Distribution for BML (5)
We start with the BML system (5) with a known sampling variance σ^{2},
Conditional on θ_{j} and the prior π_{i} ∼ N(0,λ^{2}), the variance of the sample mean at the j-th ROI, \(\bar {y}_{\cdot j} = \frac {1}{n}{\sum }_{i = 1}^{n} y_{ij}=\theta _{j}+\frac {1}{n}{\sum }_{i = 1}^{n} \pi _{i}+\frac {1}{n}{\sum }_{i = 1}^{n} \epsilon _{ij}\), is \(\frac {\lambda ^{2}+\sigma ^{2}}{n}\); that is,
With the priors π_{i} ∼ N(0,λ^{2}) and θ_{j} ∼ N(μ,τ^{2}), we follow the same derivation as for the likelihood (23) and obtain the posterior distribution,
When the sampling variance σ^{2} is unknown, we can solve the LME counterpart in Eq. 4,
We then plug the estimated variances \(\hat {\lambda }^{2}\), \(\hat {\tau }^{2}\) and \(\hat {\sigma }^{2}\) into the above posterior distribution formulas, and obtain the posterior mean and variance through an approximate Bayesian approach.
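As an illustration of this plug-in step, the sketch below assumes the LME fit of Eq. 4 has already supplied variance estimates and an overall mean (all numbers hypothetical), and then applies the posterior formulas with \({\sigma _{j}^{2}}\) replaced by \((\hat {\lambda }^{2}+\hat {\sigma }^{2})/n\):

```python
import numpy as np

# Variance estimates and overall mean, as if returned by the LME fit of Eq. 4
# (all values hypothetical):
lam2_hat, tau2_hat, sigma2_hat = 0.25, 0.16, 1.00
mu_hat, n = 0.30, 40                               # overall mean; trials per ROI
ybar = np.array([0.12, 0.55, 0.31, 0.02])          # per-ROI sample means

# Sampling variance of ybar_j under model (5): (lambda^2 + sigma^2)/n.
s2 = (lam2_hat + sigma2_hat) / n

# Plug-in ("approximate Bayesian") posterior for theta_j, mirroring Eq. 24:
V = 1.0 / (1.0 / s2 + 1.0 / tau2_hat)              # posterior variance
theta_post = V * (ybar / s2 + mu_hat / tau2_hat)   # posterior mean

print(theta_post.round(3))  # each value is shrunk from ybar toward mu_hat
```

Because the variance estimates are treated as known, this plug-in approach ignores their uncertainty, which is why the text describes it as approximate rather than fully Bayesian.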
Chen, G., Xiao, Y., Taylor, P.A. et al. Handling Multiplicity in Neuroimaging Through Bayesian Lenses with Multilevel Modeling. Neuroinform 17, 515–545 (2019). https://doi.org/10.1007/s12021-018-9409-6