# Handling Multiplicity in Neuroimaging Through Bayesian Lenses with Multilevel Modeling

## Abstract

Here we address the inefficiency and over-penalization of the massively univariate approach followed by correction for multiple testing, and propose a more efficient model that pools and shares information among brain regions. Using Bayesian multilevel (BML) modeling, we control two types of error that are more relevant than the conventional false positive rate (FPR): incorrect sign (type S) and incorrect magnitude (type M). BML aims to achieve two goals: 1) improving modeling efficiency by using one integrative model, thereby dissolving the multiple testing issue, and 2) shifting the focus of conventional null hypothesis significance testing (NHST) from FPR to quality control by calibrating type S errors while maintaining a reasonable level of inference efficiency. The performance and validity of this approach are demonstrated through an application at the region of interest (ROI) level, with all regions on an equal footing: unlike current approaches under NHST, small regions are not disadvantaged simply because of their physical size. In addition, compared to the massively univariate approach, BML may simultaneously achieve increased spatial specificity and inference efficiency, and promote reporting results in their totality and with transparency. The benefits of BML are illustrated through performance and quality checks using an experimental dataset. The methodology also avoids the current practice of sharp and arbitrary thresholding in the *p*-value funnel to which the multidimensional data are reduced. The BML approach with its auxiliary tools is available as part of the AFNI suite for general use.
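To make the two error types concrete, the following is a minimal sketch in the spirit of the design analysis of Gelman & Carlin (2014). It is not the BML model itself. It assumes a single effect estimate that is normally distributed around a known true effect and is tested with a two-sided z-test; the function name `type_s_m` and all of its parameters are hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist


def type_s_m(true_effect, se, alpha=0.05, n_sims=100_000, seed=0):
    """Return (power, type_s_rate, exaggeration_ratio) for a two-sided z-test.

    Assumption: the estimate is normal with mean true_effect (nonzero)
    and standard deviation se. All names are illustrative.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = true_effect / se

    # Closed-form probability of a significant result in each direction.
    p_pos = 1 - std_normal.cdf(z_crit - shift)  # significant, estimate > 0
    p_neg = std_normal.cdf(-z_crit - shift)     # significant, estimate < 0
    power = p_pos + p_neg

    # Type S: among significant results, the share with the wrong sign.
    p_wrong = p_neg if true_effect > 0 else p_pos
    type_s = p_wrong / power

    # Type M: mean |estimate| among significant results, relative to the
    # true magnitude, approximated here by seeded simulation.
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        est = rng.gauss(true_effect, se)
        if abs(est) > z_crit * se:
            significant.append(abs(est))
    exaggeration = sum(significant) / len(significant) / abs(true_effect)
    return power, type_s, exaggeration


# A small effect relative to its standard error (illustrative values).
power, type_s, exaggeration = type_s_m(true_effect=0.5, se=1.0)
print(f"power={power:.3f}, type S={type_s:.3f}, exaggeration={exaggeration:.1f}")
```

Because every significant estimate must exceed `z_crit * se` in absolute value, the exaggeration ratio in a low-power setting is necessarily larger than `z_crit * se / abs(true_effect)`. This shows why a hard significance threshold tends to overstate small effects.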

## Keywords

Null Hypothesis Significance Testing (NHST); False Positive Rate (FPR); Type S and type M errors; Regions of Interest (ROIs); General Linear Model (GLM); Linear Mixed-Effects (LME) modeling; Bayesian Multilevel (BML) modeling; Markov Chain Monte Carlo (MCMC); Stan; Priors; Leave-one-out (LOO) cross-validation

## Notes

### Acknowledgments

The research and writing of the paper were supported (GC, PAT, and RWC) by the NIMH and NINDS Intramural Research Programs (ZICMH002888) of the NIH/HHS, USA, and by the NIH grant R01HD079518A to TR and ER. Much of the modeling work here was inspired by Andrew Gelman’s blog. We are indebted to Paul-Christian Bürkner and the Stan development team members Ben Goodrich, Daniel Simpson, Jonah Sol Gabry, Bob Carpenter, and Michael Betancourt for their help and technical support. The simulations were performed in the R language for statistical computing and the figures were generated with the R package ggplot2 (Wickham 2009).

## References

- Amrhein, V., & Greenland, S. (2017). Remove, rather than redefine, statistical significance. *Nature Human Behaviour*, *1*, 0224.
- Bates, D., Maechler, M., Bolker, B., Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. *Journal of Statistical Software*, *67*(1), 1–48.
- Benjamin, D.J., Berger, J., Johannesson, M., Nosek, B.A., Wagenmakers, E.-J., Berk, R., Johnson, V.E. (2017). Redefine statistical significance. *Nature Human Behaviour*, *1*, 0189.
- Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. *Journal of the Royal Statistical Society Series B*, *57*, 289–300.
- Carp, J. (2012). On the plurality of (Methodological) worlds: estimating the analytic flexibility of fMRI experiments. *Frontiers in Neuroscience*, *6*, 149.
- Chen, G., Saad, Z.S., Nath, A.R., Beauchamp, M.S., Cox, R.W. (2012). FMRI Group analysis combining effect estimates and their variances. *NeuroImage*, *60*, 747–765.
- Chen, G., Saad, Z.S., Britton, J.C., Pine, D.S., Cox, R.W. (2013). Linear mixed-effects modeling approach to FMRI group analysis. *NeuroImage*, *73*, 176–190.
- Chen, G., Adleman, N.E., Saad, Z.S., Leibenluft, E., Cox, R.W. (2014). Applications of multivariate modeling to neuroimaging group analysis: a comprehensive alternative to univariate general linear model. *NeuroImage*, *99*, 571–588.
- Chen, G., Taylor, P.A., Shin, Y.W., Reynolds, R.C., Cox, R.W. (2017a). Untangling the relatedness among correlations, part II: inter-subject correlation group analysis through linear mixed-effects modeling. *NeuroImage*, *147*, 825–840.
- Chen, G., Taylor, P.A., Cox, R.W. (2017b). Is the statistic value all we should care about in neuroimaging? *NeuroImage*, *147*, 952–959.
- Chen, G., Taylor, P.A., Haller, S.P., Kircanski, K., Stoddard, J., Pine, D.S., Leibenluft, E., Brotman, M.A., Cox, R.W. (2018a). Intraclass correlation: improved modeling approaches and applications for neuroimaging. *Human Brain Mapping*, *39*(3), 1187–1206. https://doi.org/10.1002/hbm.23909.
- Chen, G., Cox, R.W., Glen, D.R., Rajendra, J.K., Reynolds, R.C., Taylor, P.A. (2018b). A tail of two sides: Artificially doubled false positive rates in neuroimaging due to the sidedness choice with t-tests. *Human Brain Mapping*. In press.
- Cohen, J. (1994). The earth is round (*p* < .05). *American Psychologist*, *49*(12), 997–1003.
- Cox, R.W. (1996). AFNI: software for analysis and visualization of functional magnetic resonance neuroimages. *Computers and Biomedical Research*, *29*, 162–173. http://afni.nimh.nih.gov.
- Cox, R.W., Chen, G., Glen, D.R., Reynolds, R.C., Taylor, P.A. (2017). FMRI clustering in AFNI: false-positive rates redux. *Brain Connectivity*, *7*(3), 152–171.
- Cox, R.W. (2018). Equitable Thresholding and Clustering. In preparation.
- Cox, R.W., & Taylor, P.A. (2017). Stability of Spatial Smoothness and Cluster-Size Threshold Estimates in FMRI using AFNI. arXiv:1709.07471.
- Cremers, H.R., Wager, T.D., Yarkoni, T. (2017). The relation between statistical power and inference in fMRI. *PLoS ONE*, *12*(11), e0184923.
- Eklund, A., Nichols, T.E., Knutsson, H. (2016). Cluster failure: why fMRI inferences for spatial extent have inflated false-positive rates. *PNAS*, *113*(28), 7900–7905.
- Forman, S.D., Cohen, J.D., Fitzgerald, M., Eddy, W.F., Mintun, M.A., Noll, D.C. (1995). Improved assessment of significant activation in functional magnetic resonance imaging (fMRI): use of a cluster-size threshold. *Magnetic Resonance in Medicine*, *33*, 636–647.
- Gelman, A. (2015). Statistics and the crisis of scientific replication. *Significance*, *12*(3), 23–25.
- Gelman, A. (2016). The problems with *p*-values are not just with *p*-values. *The American Statistician*, Online Discussion.
- Gelman, A., & Carlin, J.B. (2014). Beyond power calculations: assessing type S (sign) and type M (magnitude) errors. *Perspectives on Psychological Science*, 1–11.
- Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B. (2014). *Bayesian data analysis*, Third edition. Boca Raton: Chapman & Hall/CRC Press.
- Gelman, A., & Hennig, C. (2017). Beyond subjective and objective in statistics. *Journal of the Royal Statistical Society: Series A (Statistics in Society)*, *180*(4), 1–31.
- Gelman, A., Hill, J., Yajima, M. (2012). Why we (usually) don’t have to worry about multiple comparisons. *Journal of Research on Educational Effectiveness*, *5*, 189–211.
- Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. http://www.stat.columbia.edu/gelman/research/unpublished/p_hacking.pdf.
- Gelman, A., & Shalizi, C.R. (2013). Philosophy and the practice of Bayesian statistics. *British Journal of Mathematical and Statistical Psychology*, *66*, 8–38.
- Gelman, A., Simpson, D., Betancourt, M. (2017). The prior can generally only be understood in the context of the likelihood. arXiv:1708.07487.
- Gelman, A., & Tuerlinckx, F. (2000). Type S error rates for classical and Bayesian single and multiple comparison procedures. *Computational Statistics*, *15*, 373–390.
- Gonzalez-Castillo, J., Saad, Z.S., Handwerker, D.A., Inati, S.J., Brenowitz, N., Bandettini, P.A. (2012). Whole-brain, time-locked activation with simple tasks revealed using massive averaging and model-free analysis. *PNAS*, *109*(14), 5487–5492.
- Gonzalez-Castillo, J., Chen, G., Nichols, T., Cox, R.W., Bandettini, P.A. (2017). Variance decomposition for single-subject task-based fMRI activity estimates across many sessions. *NeuroImage*, *154*, 206–218.
- Lazzeroni, L.C., Lu, Y., Belitskaya-Lévy, I. (2016). Solutions for quantifying P-value uncertainty and replication power. *Nature Methods*, *13*, 107–110.
- Lewandowski, D., Kurowicka, D., Joe, H. (2009). Generating random correlation matrices based on vines and extended onion method. *Journal of Multivariate Analysis*, *100*, 1989–2001.
- Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. *Science*, *355*(6325), 584–585.
- McElreath, R. (2016). *Statistical Rethinking: a Bayesian course with examples in R and Stan*. Boca Raton: Chapman & Hall/CRC Press.
- McShane, B.B., Gal, D., Gelman, A., Robert, C., Tackett, J.L. (2017). Abandon statistical significance. arXiv:1709.07588.
- Mejia, A., Yue, Y.R., Bolin, D., Lindgren, F., Lindquist, M.A. (2017). A Bayesian general linear modeling approach to cortical surface fMRI data analysis. arXiv:1706.00959.
- Morey, R.D., Hoekstra, R., Rouder, J.N., Lee, M.D., Wagenmakers, E.-J. (2016). The fallacy of placing confidence in confidence intervals. *Psychonomic Bulletin and Review*, *23*(1), 103–123.
- Mueller, K., Lepsien, J., Möller, H.E., Lohmann, G. (2017). Commentary: cluster failure: why fMRI inferences for spatial extent have inflated false-positive rates. *Frontiers in Human Neuroscience*, *11*, 345.
- Nichols, T.E., & Holmes, A.P. (2001). Nonparametric permutation tests for functional neuroimaging: a primer with examples. *Human Brain Mapping*, *15*(1), 1–25.
- Olszowy, W., Aston, J., Rua, C., Williams, G.B. (2017). Accurate autocorrelation modeling substantially improves fMRI reliability. arXiv:1711.09877.
- Poline, J.B., & Brett, M. (2012). The general linear model and fMRI: does love last forever? *NeuroImage*, *62*(2), 871–880.
- R Core Team. (2017). R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
- Saad, Z.S., Reynolds, R.C., Argall, B., Japee, S., Cox, R.W. (2004). SUMA: an interface for surface-based intra- and inter-subject analysis with AFNI. In *Proceedings of the 2004 IEEE International Symposium on Biomedical Imaging* (pp. 1510–1513).
- Schaefer, A., Kong, R., Gordon, E.M., Zuo, X.N., Holmes, A.J., Eickhoff, S.B., Yeo, B.T. (2017). Local-global parcellation of the human cerebral cortex from intrinsic functional connectivity MRI. *Cerebral Cortex*. In press.
- Simmons, J.P., Nelson, L.D., Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. *Psychological Science*, *22*, 1359–1366.
- Smith, S.M., & Nichols, T.E. (2009). Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. *NeuroImage*, *44*(1), 83–98.
- Stan Development Team. (2017). Stan modeling language users guide and reference manual, Version 2.17.0. http://mc-stan.org.
- Steegen, S., Tuerlinckx, F., Gelman, A., Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. *Perspectives on Psychological Science*, *11*(5), 702–712.
- Wasserstein, R.L., & Lazar, N.A. (2016). The ASA’s statement on *p*-values: context, process, and purpose. *The American Statistician*, *70*(2), 129–133.
- Vehtari, A., Gelman, A., Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. *Statistics and Computing*, *27*(5), 1413–1432.
- Westfall, J., Nichols, T.E., Yarkoni, T. (2017). Fixing the stimulus-as-fixed-effect fallacy in task fMRI. *Wellcome Open Research*, *1*, 23.
- Wickham, H. (2009). *ggplot2: elegant graphics for data analysis*. New York: Springer.
- Worsley, K.J., Marrett, S., Neelin, P., Evans, A.C. (1992). A three-dimensional statistical analysis for CBF activation studies in human brain. *Journal of Cerebral Blood Flow and Metabolism*, *12*, 900–918.
- Xiao, Y., Geng, F., Riggins, T., Chen, G., Redcay, E. (2018). Neural correlates of developing theory of mind competence in early childhood. Under review.
- Yeung, A.W.K. (2018). An updated survey on statistical thresholding and sample size of fMRI studies. *Frontiers in Human Neuroscience*, *12*, 16.