# On the Importance of Power Analyses for Cognitive Modeling

## Abstract

The high prevalence of underpowered empirical studies has been identified as a centerpiece of the current crisis in psychological research. Accordingly, the need for proper analyses of statistical power and sample size determination before data collection has been emphasized repeatedly. In this commentary, we argue that, contrary to the opinions expressed in this special issue’s target article, cognitive modeling research will similarly depend on the implementation of power analyses and the use of appropriate sample sizes if it aspires to robustness. In particular, the increased desire to include cognitive modeling results in clinical and brain research raises the demand for assessing and ensuring the reliability of parameter estimates and model predictions. We discuss the specific complexity of estimating statistical power for modeling studies and suggest simulation-based power analyses as a solution to this challenge.

## Keywords

Cognitive modeling, Power analysis, Sample size, Cognitive neuroscience, Computational psychiatry, Simulations

## Introduction

In their target article of the current special issue, Lee et al. (2019) take the current crisis in psychological research as a starting point and motivation to propose a catalog of good research practices for cognitive modeling. This catalog features many plausible and laudable methods such as preregistering models and evaluation criteria, testing the generalizability of a model, and keeping track of all changes made during iterative model development (“postregistration”). Surprisingly, Lee et al. (2019) dismiss the potential need for a priori power analyses and sample size determination rather swiftly, stating that the respective recommendations developed for other psychological research would “not carry over to the standard practices of cognitive modelers” (p. 4). They concede that in a specific case, that is, when conducting null hypothesis significance tests (NHST) and collecting data non-iteratively, power analyses are required, but for other instances of cognitive modeling the results of classic power analyses are considered negligible.

In this commentary, we argue that analyses of power (the probability of detecting an effect, given that it exists) are as critical for cognitive modeling as they are for other research areas within and beyond psychology. We comment on the specific intricacies of performing a power analysis in cognitive modeling, stress the importance of developing new techniques to meet these challenges, and provide examples of simulation-based solutions. Finally, we claim that this issue will become even more critical, as insights from cognitive modeling are increasingly used in neuroscientific research and are even being considered for clinical applications.

## Power Analyses—a Necessary Part of the Toolbox for Good Scientific Practices

Multiple sources have contributed to the current crisis of confidence in psychology and related fields, including underpowered studies, flexibility in data analysis, and publication bias (Munafò et al. 2017; Poldrack et al. 2017). We believe a systematic a priori power analysis belongs in the general toolbox of good scientific practice for cognitive modeling. Power in cognitive modeling can mean the ability to identify a data-generating model or a set of parameters, or to find an effect. Notably, ignoring power considerations can diminish the benefits of other recommendable practices, such as preregistration.

For example, suppose cognitive modelers wish to demonstrate that their favorite model A provides a better account of a cognitive phenomenon than an alternative model B. They preregister both the “players of the game” (i.e., models A and B) as well as the “rules of the game” (i.e., the evaluation criterion), but do not conduct an a priori power analysis to determine their sample size. This leaves the researchers the freedom to test only a small number of participants or to check if the data favor model A after each participant, engaging in “optional stopping.” In statistical inference, both these questionable research practices are known to increase the probability of false positives (Button et al. 2013; Wagenmakers 2007). Consequently, we argue that the practice of preregistering cognitive models alone will not ensure improved replicability as long as a power analysis is not part of the preregistration.
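The inflation of false positives under optional stopping can be made concrete with a small simulation. The following is a minimal sketch under assumptions that are ours and not part of the scenario above: each participant's data are summarized as a binary "favors A" outcome, models A and B are in truth equally good (so any significant advantage for A is a false positive), and the evaluation criterion is a one-sided exact sign test at α = .05. The function names are hypothetical.

```python
import math
import random

def sign_test_p(k, n):
    """One-sided exact sign test: P(X >= k) for X ~ Binomial(n, 0.5)."""
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n

def critical_counts(n_max, alpha):
    """For each n, the smallest number of wins for A that yields p < alpha.

    If no count is significant at a given n, n + 1 is stored (unreachable).
    """
    return {n: next((k for k in range(n + 1) if sign_test_p(k, n) < alpha), n + 1)
            for n in range(1, n_max + 1)}

def false_positive_rate(n_max, optional_stopping, alpha=0.05, n_sims=2000, seed=1):
    """Share of simulated studies that wrongly declare model A the winner."""
    rng = random.Random(seed)
    crit = critical_counts(n_max, alpha)
    hits = 0
    for _ in range(n_sims):
        # Assumed null scenario: each participant is equally likely to favor A or B.
        favors_a = [rng.random() < 0.5 for _ in range(n_max)]
        wins_a, significant = 0, False
        for n, win in enumerate(favors_a, start=1):
            wins_a += win
            # Fixed design: test once at n_max. Optional stopping: test after everyone.
            looks_now = optional_stopping or n == n_max
            if looks_now and wins_a >= crit[n]:
                significant = True
                break
        hits += significant
    return hits / n_sims

if __name__ == "__main__":
    print("fixed N:          ", false_positive_rate(40, optional_stopping=False))
    print("optional stopping:", false_positive_rate(40, optional_stopping=True))
```

Because both calls use the same seed, they evaluate identical simulated data sets, so the rate under optional stopping can never be lower than the rate under the fixed design; how much higher it is depends on the assumed maximum sample size and test.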

## Are Concerns of Sample Size Less Relevant for Cognitive Modelers?

Power plays a critical role in sample size planning. An insufficient sample size in cognitive modeling can lead to imprecise estimates of model parameters and biased inference about models themselves. Regarding model parameters, a larger sample size can increase the precision of parameter estimates (e.g., Wagenmakers et al. 2007), and more data points facilitate parameter discriminability (e.g., Broomell and Bhatia 2014). This also affects NHST among model parameters. But even in projects not comparing estimated parameters by NHST, sample size influences inferences drawn about models themselves. Consider model comparison and the need to account for model complexity: Penalties for model complexity are not sample-size invariant. In small samples, some flexibility-penalizing goodness-of-fit measures favor relatively simple models (e.g., AIC; Busemeyer and Wang 2000), whereas other measures favor relatively complex models (normalized maximum likelihood/minimum description length). In small samples, nested models can sometimes be treated as more complex than full models, although nested models are simpler than the models they are nested in (Navarro 2004). Even for non-nested models, the complexity rank of two models can switch as sample size increases (Heck et al. 2014; Wu et al. 2010). In fact, Heck et al. (2014) put forward minimum sample size requirements for the specific case of multinomial processing tree models, akin to model-dependent rules of thumb for sample sizes. Also, sample size affects the accuracy of model recovery: Small samples decrease the proportion of correctly identified models in model comparisons across a range of fit indices, as shown in simulations (Pitt et al. 2002). The interplay between sample size, model flexibility, and task design therefore gives power analyses an important but complex role in cognitive modeling.

These issues concerning sample size may suffice to illustrate why we believe that cognitive modeling benefits from power analyses. As Lee et al. (2019) state, simplistic NHST-based recipes for power may not carry over to cognitive modeling unless NHST is key; yet, it is because of the immense degrees of freedom and the rise in complexity that come with cognitive modeling that modelers need to be even more careful in justifying their design and analysis decisions. Power analyses, particularly for determining sample size, matter in more than just corner cases. Rather, the issue with power analyses for cognitive modeling seems to be that quantifying a cognitive model’s power is a particularly hard computational problem. One crucial step is to develop methods and tools to address this very problem.

Power analysis in cognitive modeling is complicated by challenges that warrant further theoretical work. The concept of “power,” for instance, is not well defined in model-based studies; model recoverability and task design affect power when using cognitive models. Given these challenges, we agree with Lee et al. (2019) that it might be impossible to develop rules of thumb for sample sizes in cognitive modeling studies. However, this does not absolve modelers of the responsibility to perform power analyses. Instead, we argue that these analyses can be based on simulating planned studies.

## Example: Power Analysis for Detecting Differences in Cognitive Strategies

*N* = 10, 20, …, 200; and assign half of each sample to each condition. Then, we run a simulation that generates classification decisions across a range of model parameters and sample sizes, given the task, the condition, the specified effect, and the noise level. Next, models X and Y are fit to the simulated data separately (without mixing), and the winning model is determined using Akaike weights (Wagenmakers and Farrell 2004). Finally, we test (via NHST) whether more participants are classified as adhering to X in condition one than in condition two, as assumed by the true effect. The procedure is repeated 1000 times. The relative frequency of true-positive results is plotted in Fig. 1, which yields a standard power curve and suggests that we need a minimum of *N* = 60 participants to achieve at least 80% statistical power, in expectation, to detect the hypothesized strategy shift given our design.
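The logic of such a simulation-based power analysis can be sketched in a few lines of Python. The sketch below is a toy illustration under assumptions of our own and is not the simulation behind Fig. 1, so it will not reproduce the reported *N* = 60. Models X and Y are stood in for by two hypothetical one-parameter strategies that each follow one of two binary cues, and the proportions of X-users per condition (`p_x_cond1`, `p_x_cond2`), the number of trials, the response `consistency`, and the one-sided Fisher exact test are all illustrative choices.

```python
import math
import random

def simulate_participant(uses_x, n_trials, consistency, rng):
    """Return (k_x, k_y): responses in line with cue 1 (X) and with cue 2 (Y)."""
    k_x = k_y = 0
    for _ in range(n_trials):
        cue1, cue2 = rng.random() < 0.5, rng.random() < 0.5
        relevant = cue1 if uses_x else cue2
        # Follow the relevant cue with probability `consistency` (noise otherwise).
        response = relevant if rng.random() < consistency else not relevant
        k_x += response == cue1
        k_y += response == cue2
    return k_x, k_y

def aic(k, n_trials):
    """AIC of a one-parameter cue-following model fitted by maximum likelihood."""
    p = min(max(k / n_trials, 0.5), 1 - 1e-6)   # MLE under 0.5 <= p < 1
    log_lik = k * math.log(p) + (n_trials - k) * math.log(1 - p)
    return 2 * 1 - 2 * log_lik

def akaike_weight_x(aic_x, aic_y):
    """Akaike weight of model X relative to model Y."""
    best = min(aic_x, aic_y)
    rel_x = math.exp(-0.5 * (aic_x - best))
    rel_y = math.exp(-0.5 * (aic_y - best))
    return rel_x / (rel_x + rel_y)

def fisher_one_sided(a, n1, c, n2):
    """P(at least `a` X-classifications in condition one | table margins)."""
    total_x, n = a + c, n1 + n2
    tail = sum(math.comb(total_x, i) * math.comb(n - total_x, n1 - i)
               for i in range(a, min(total_x, n1) + 1))
    return tail / math.comb(n, n1)

def estimate_power(n_total, p_x_cond1=0.7, p_x_cond2=0.4, n_trials=40,
                   consistency=0.8, alpha=0.05, n_sims=200, seed=0):
    """Relative frequency of significant results across simulated studies."""
    rng = random.Random(seed)
    n_per_cond = n_total // 2          # half of the sample in each condition
    significant = 0
    for _ in range(n_sims):
        counts = []
        for p_x in (p_x_cond1, p_x_cond2):
            n_classified_x = 0
            for _ in range(n_per_cond):
                uses_x = rng.random() < p_x
                k_x, k_y = simulate_participant(uses_x, n_trials, consistency, rng)
                w_x = akaike_weight_x(aic(k_x, n_trials), aic(k_y, n_trials))
                n_classified_x += w_x > 0.5      # winning model per participant
            counts.append(n_classified_x)
        p_value = fisher_one_sided(counts[0], n_per_cond, counts[1], n_per_cond)
        significant += p_value < alpha
    return significant / n_sims
```

Calling `estimate_power(n)` for each candidate sample size traces out a power curve, from which the smallest *N* that reaches the desired power can be read off. Because both toy strategies have one free parameter, the AIC penalty cancels in this sketch; with models of unequal complexity, the penalty and hence the sample size would matter more.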

## The Increasing Role of Cognitive Modeling in Neuroscience and Clinical Research

The target article’s comment on power analysis leaves the impression that cognitive modelers engage much less in NHST than other scientists. Although modelers may indeed have a comparatively high affinity for Bayesian statistics, and simulation studies often do not require inferential statistics, we still believe that the prevalence of NHST in the cognitive modeling community is high. For instance, when qualitative predictions of cognitive models, such as context effects in decision-making (Busemeyer et al. 2019), are validated by testing whether experimental data meet them, these tests are often performed with frequentist statistics (e.g., Gluth et al. 2017; Trueblood et al. 2014; Tsetsos et al. 2012; but see Evans et al. 2019 for an example of using Bayesian statistics). Furthermore, it should be noted that even a Bayesian data analysis approach does not exempt researchers from outlining the study design before data acquisition (Schönbrodt and Wagenmakers 2018). More specifically, Schönbrodt and Wagenmakers criticize the absence of systematic design planning for experiments that rely on Bayes factors and introduce three possible design classes: a fixed-*n* design, an open-ended sequential sampling design, and a hybrid design with sequential sampling and maximum *n*. They argue that, similar to the case of NHST, the probability of making a false-positive error is highest when sequential designs terminate early. To address this problem, they recommend defining a minimum sample size and applying the (Bayesian) inferential statistics only after this sample size has been reached. Similar to our proposal, Schönbrodt and Wagenmakers’ approach relies on simulating hypothetical experiments, given that the complexity of the matter precludes analytical solutions.
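The effect of a minimum sample size in a sequential Bayes factor design can be illustrated with a toy simulation. This is a minimal sketch under our own assumptions and not the procedure of Schönbrodt and Wagenmakers (2018): the data are binary, H0 fixes the success rate at 0.5, H1 places a uniform prior on it, sampling stops once the Bayes factor reaches an assumed evidence threshold of 6 in either direction, and the maximum sample size is 100. All function names are hypothetical.

```python
import math
import random

def bf10_binomial(k, n):
    """Bayes factor for H1: theta ~ Uniform(0, 1) against H0: theta = 0.5.

    The marginal likelihood of k successes in n trials is 1 / (n + 1) under H1
    and C(n, k) * 0.5**n under H0.
    """
    return 2 ** n / ((n + 1) * math.comb(n, k))

def sequential_outcome(data, n_min, threshold):
    """Return 'H1', 'H0', or 'undecided' for one simulated data sequence."""
    k = 0
    for n, x in enumerate(data, start=1):
        k += x
        if n < n_min:
            continue                  # no inference before the minimum sample size
        bf10 = bf10_binomial(k, n)
        if bf10 >= threshold:
            return "H1"
        if bf10 <= 1 / threshold:
            return "H0"
    return "undecided"                # maximum n reached without a decision

def misleading_evidence_rate(n_min, n_max=100, threshold=6.0, n_sims=2000, seed=3):
    """Share of studies that stop with evidence for H1 although H0 is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        data = [rng.random() < 0.5 for _ in range(n_max)]   # H0 generates the data
        hits += sequential_outcome(data, n_min, threshold) == "H1"
    return hits / n_sims

if __name__ == "__main__":
    for n_min in (1, 30):
        print(n_min, misleading_evidence_rate(n_min))
```

With these settings, evidence for H0 cannot reach the threshold before roughly 56 observations, so every early stop is a stop in favor of H1; postponing the first look therefore removes exactly those early false positives. Other priors, thresholds, or maximum sample sizes would need to be simulated separately.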

With respect to model-based neuroimaging studies, conducting power analyses may appear to be a particularly challenging endeavor, as one needs to simulate not only cognitive models and experimental designs but also brain data. However, Gluth and Meiran (2019) provide an example power analysis to detect correlations between single-trial parameter values and brain activation, assuming that averaged neural signals (e.g., mean fMRI responses averaged over multiple voxels in a region of interest) are normally distributed. They show that increasing the sample size, in terms of either the number of participants or the number of trials per participant, reduces the variance of the expected brain-behavior correlations, which increases the probability of obtaining statistically significant effects (Fig. 2).

On a more general note, a new age of “computational psychiatry” has recently been put forward, according to which cognitive but also neurobiological and biophysical modeling should be embraced as a promising method to improve diagnoses and therapies in psychiatry (Huys et al. 2016; Montague et al. 2012). To fulfill this promise, cognitive modelers must provide robust and reliable tools to infer latent cognitive mechanisms from (potentially aberrant) overt behavior. Among other things, this will require understanding how sample sizes, task designs, and model-fitting procedures affect the recoverability of model parameters.

## Concluding Remarks

Simply applying the power-analytic recommendations from statistical inference to cognitive modeling may indeed not help improve the robustness of model-based cognitive research. But sample size profoundly impacts the inferences we draw from formal models, such as those in computational psychiatry, which can harm the robustness of scientific findings despite pre- and postregistering formal models. If we subscribe to Lee and colleagues’ goal of achieving robustness in cognitive modeling, we believe it is necessary to begin studying how, when, and to what degree sample size affects the inferences we draw from and about models (beyond but also in the cases in which NHST is essential). Then we might achieve the degree of robustness we wish for in modeling in cognitive science. We may also find that cognitive modeling needs new power analysis tools and discover new and challenging avenues of methodological research on power analyses.

## Notes

### Acknowledgments

We thank the members of the Decision Neuroscience and Economic Psychology groups at the University of Basel for critical discussions of the target article in our journal club. We thank Florian Seitz for his work on the power simulation.

### Funding Information

S.G. was supported by a grant from the Swiss National Science Foundation (SNSF Grant 100014_172761).

## References

- Broomell, S. B., & Bhatia, S. (2014). Parameter recovery for decision modeling using choice data. *Decision, 1*, 252–274.
- Busemeyer, J. R., Gluth, S., Rieskamp, J., & Turner, B. M. (2019). Cognitive and neural bases of multi-attribute, multi-alternative, value-based decisions. *Trends in Cognitive Sciences, 23*, 251–263.
- Busemeyer, J., & Wang, Y.-M. (2000). Model comparisons and model selections based on generalization criterion methodology. *Journal of Mathematical Psychology, 44*, 171–189.
- Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. *Nature Reviews Neuroscience, 14*, 365–376.
- Culbreth, A. J., Westbrook, A., Daw, N. D., Botvinick, M., & Barch, D. M. (2016). Reduced model-based decision-making in schizophrenia. *Journal of Abnormal Psychology, 125*, 777–787.
- Evans, N. J., Holmes, W. R., & Trueblood, J. S. (2019). Response-time data provide critical constraints on dynamic models of multi-alternative, multi-attribute choice. *Psychonomic Bulletin & Review, 26*, 901–933.
- Gluth, S., Hotaling, J. M., & Rieskamp, J. (2017). The attraction effect modulates reward prediction errors and intertemporal choices. *Journal of Neuroscience, 37*, 371–382.
- Gluth, S., & Meiran, N. (2019). Leave-one-trial-out, LOTO, a general approach to link single-trial parameters of cognitive models to neural data. *eLife, 8*, e42607.
- Gluth, S., Rieskamp, J., & Büchel, C. (2013). Deciding not to decide: computational and neural evidence for hidden behavior in sequential choice. *PLoS Computational Biology, 9*, e1003309.
- Heck, D. W., Moshagen, M., & Erdfelder, E. (2014). Model selection by minimum description length: lower-bound sample sizes for the Fisher information approximation. *Journal of Mathematical Psychology, 60*, 29–34.
- Huys, Q. J. M., Maia, T. V., & Frank, M. J. (2016). Computational psychiatry as a bridge from neuroscience to clinical applications. *Nature Neuroscience, 19*, 404–413.
- Lee, M. D., Criss, A. H., Devezer, B., Donkin, C., Etz, A., Leite, F. P., et al. (2019). Robust modeling in cognitive science. *ArXiv*. https://doi.org/10.31234/osf.io/dmfhk.
- Lefebvre, G., Lebreton, M., Meyniel, F., Bourgeois-Gironde, S., & Palminteri, S. (2017). Behavioural and neural characterization of optimistic reinforcement learning. *Nature Human Behaviour, 1*, 0067.
- Montague, P. R., Dolan, R. J., Friston, K. J., & Dayan, P. (2012). Computational psychiatry. *Trends in Cognitive Sciences, 16*, 72–80.
- Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E. J., Ware, J. J., & Ioannidis, J. P. A. (2017). A manifesto for reproducible science. *Nature Human Behaviour, 1*, 0021.
- Myung, J. I., & Pitt, M. A. (2009). Optimal experimental design for model discrimination. *Psychological Review, 116*, 499–518.
- Navarro, D. J. (2004). A note on the applied use of MDL approximations. *Neural Computation, 16*, 1763–1768.
- Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. *Journal of Experimental Psychology. General, 115*, 39–61.
- Pitt, M. A., Myung, I. J., & Zhang, S. (2002). Toward a method of selecting among computational models of cognition. *Psychological Review, 109*, 472–491.
- Poldrack, R. A., Baker, C. I., Durnez, J., Gorgolewski, K. J., Matthews, P. M., Munafò, M. R., Nichols, T. E., Poline, J. B., Vul, E., & Yarkoni, T. (2017). Scanning the horizon: towards transparent and reproducible neuroimaging research. *Nature Reviews Neuroscience, 18*, 115–126.
- Schönbrodt, F. D., & Wagenmakers, E.-J. (2018). Bayes factor design analysis: planning for compelling evidence. *Psychonomic Bulletin & Review, 25*, 128–142.
- Trueblood, J. S., Brown, S. D., & Heathcote, A. (2014). The multiattribute linear ballistic accumulator model of context effects in multialternative choice. *Psychological Review, 121*, 179–205.
- Tsetsos, K., Chater, N., & Usher, M. (2012). Salience driven value integration explains decision biases and preference reversal. *Proceedings of the National Academy of Sciences of the United States of America, 109*, 9659–9664.
- Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of *p* values. *Psychonomic Bulletin & Review, 14*, 779–804.
- Wagenmakers, E.-J., & Farrell, S. (2004). AIC model selection using Akaike weights. *Psychonomic Bulletin & Review, 11*, 192–196.
- Wagenmakers, E.-J., Van Der Maas, H. L. J., & Grasman, R. P. P. P. (2007). An EZ-diffusion model for response time and accuracy. *Psychonomic Bulletin & Review, 14*, 3–22.
- Wu, H., Myung, J. I., & Batchelder, W. H. (2010). On the minimum description length complexity of multinomial processing tree models. *Journal of Mathematical Psychology, 54*, 291–303.