Introduction

A central goal of science is to develop theories that describe causal mechanismsFootnote 1 in sufficient detail to derive predictions but with sufficient generality to apply across different related scenarios. For example, we would like a theory of memory to explain why increased study time results in improved recognition performance. To advance this goal, we might build a formal model that connects independent variables like study time to dependent variables like accuracy via theoretical constructs like trace strength and processes of encoding and retrieval, describing mechanisms by which study time is causally related to accuracy.Footnote 2 By constructing models based on theory, we put ourselves in a position to help explain effects on other dependent variables (e.g., response time), to understand the action of other related independent variables (e.g., repetition or spacing), and to suggest new avenues of research (e.g., what factors affect trace strength). In contrast, when models are constructed based on “default” assumptions that are unconnected to theory, these scientific insights are forfeit. Trying to make inferences from atheoretical modeling is, in effect, the tail (modeling) wagging the dog (theory).

Formal models are integral to drawing inferences from data in order to achieve scientific goals, contributing in a number of different ways (Cox and Shiffrin, in press; Luce, 1995; Navarro, 2021; Kellen, 2019). An important dimension along which modeling can vary is between descriptive and causal (Fig. 1). This dimension does not necessarily refer to any particular property of the models themselves, but rather to the purpose the model is intended to serve. As illustrated in the example above, the purpose of causal modeling is to represent causal mechanisms in a theory in a way that enables quantitative predictions to be derived and tested. While causal modeling can thus be extremely productive in terms of making scientific inferences, it is not always possible if there is limited knowledge of the phenomena to be explained in a given domain. In such a circumstance, descriptive modeling can be used to identify important quantitative features of a set of data—“signals” in the presence of “noise”. Statistical modeling typically has a descriptive purpose in this sense. But to yield useful scientific insights, the “signals” detected by descriptive modeling should be indicative of the action of potential causal mechanisms or their interactions. For this to be true, even models with a primarily descriptive purpose should be designed so as to be meaningfully linked to potential causal mechanisms. Any particular modeling endeavor is likely to have a mixture of descriptive and causal purposes. This ambiguity can lead to confusion about how to best evaluate a model: Causal modeling is successful when the model provides an insightful and productive explanation of patterns in data. In turn, descriptive modeling is successful when the model “fits”, achieving a close match between predicted and observed data patterns. Descriptive modeling serves science when the description it offers provides information about potential causal factors that are relevant to explaining data patterns.

Fig. 1. Modeling in science has different purposes that lie on a continuum between descriptive and causal. Considerations about model complexity (is it warranted in light of the data?) are tacitly assumed in this illustration.

It is especially important to consider causal mechanisms—even in otherwise descriptive modeling—in the psychological sciences. Here, data exhibit a large degree of “noise” because they reflect variability from many sources, only some of which are related to the scientific questions at hand. To address this state of affairs, increasingly sophisticated statistical models are brought to bear in order to extract relevant signals from these data. One example is the general purpose statistical framework discussed by van Doorn et al. (this issue), henceforth vDAHSW, that of Linear Mixed Models (LMMs). They intend this framework to be able to describe data patterns across myriad substantive applications—to identify “signals” that represent effects and interactions regardless of the specific experimental manipulations or scales of measurement. In accord with this intent, they formulate models and compare them using a variety of “default” assumptions, including assumptions about model structure and the way prior distributions should be defined. But by adopting these defaults in order to expand the scope of their model’s descriptive abilities, they end up breaking the link between the formal representations in the model and the causal mechanisms that may be theorized to yield the observed data patterns. Admittedly, making that link is hard, but it is precisely that link that determines whether a descriptive statistical model serves its intended scientific purpose: to identify “signals” (e.g., a difference in average accuracy between conditions) that potentially reflect the action of a theorized causal mechanism. This issue is not specific to the LMMs explored by vDAHSW - it arises in any analysis entailing a complex web of assumptions. If these assumptions are chosen by “default”, then there is no guarantee that they will be at all related to the scientific theories the model is meant to embody or inform. Regardless of the goal of the model, irreflective adoption of default assumptions renders a model scientifically meaningless.

Statistical methodology that keeps science on the path toward ‘progress’ needs to be based on an appreciation of how causal mechanisms are linked to data. As such, we believe that the call by vDAHSW for general guidelines and defaults that are divorced from this connection is, at best, problematic. In the first part of our paper, we directly address the questions posed by vDAHSW, showing among other things that the differences in Bayes Factors obtained under different aggregation regimes result from the tacit adoption of “default Bayes Factors”. These rely on the specification of priors over standardized effects, but the reported differences in conclusions dissipate when using non-standardized effects. What appeared to be a deep inferential paradox can be avoided by merely introducing subject-matter considerations—namely, the scale of measurement—into a descriptive model. In the second part of this paper, we discuss how giving careful thought to different kinds of causal considerations can help descriptive models better serve scientific goals. One consideration is the “coordination function” that specifies how the components of a model are linked to theoretical constructs and observable quantities; another is “generalization”, which focuses on how a model and/or its components are intended to be applied to other related but non-identical scenarios. Models, whether intended for primarily descriptive or causal purposes, should be tailored to the empirical substrate and theoretical statements at hand, keeping “default” assumptions to a minimum.

Part I: The Perils of Letting the Tail Wag the Dog

LMMs provide a widely applicable descriptive modeling approach for data with complex dependency structures, such as those that often arise in the psychological sciences. LMMs distinguish between “fixed effects” due to experimental condition or group membership and “random effects” due to, for example, variability between individuals or items. Fixed effects are represented by regression coefficients that are constant (i.e., fixed) across all observations within a group or condition (just like the coefficients in ordinary regression models). Random effects are represented by regression coefficients that can differ across different clusters or groups of observations (e.g., individuals, items).Footnote 3 By explicitly assigning different representations to different types of variability, LMMs already represent some assumed aspects of a causal structure. But as illustrated in the examples by vDAHSW, many other choices must be made when specifying LMMs and these too should be guided by causal considerations.

The examples provided by vDAHSW rely on a representation of fixed and random effects that is commonly adopted when applying LMMs to typical experimental designs: Fixed effects are represented by an intercept representing the overall mean and a slope representing the average difference between conditions. Random effects are represented by allowing individuals to have different intercepts and/or slopes which are normally distributed around the group average. In a frequentist setting, there is a wealth of theoretical and practical knowledge on how to apply these kinds of LMMs, most of which is built into the lme4 R package (Bates et al., 2015). Much of this knowledge can be boiled down to a general recommendation of using the most complete random-effects structure that can be justified by the experimental design (Baayen et al., 2008; Barr et al., 2013; Judd et al., 2012; Westfall et al., 2014).Footnote 4 Behind this recommendation is the idea that failing to represent a random effect that is in fact present—in other words, failing to consider as many causal factors as possible—will lead to erroneous inferences.
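To make this recommendation concrete, the sketch below simulates a simple one-factor within-subjects data set and fits an LMM with the maximal by-participant random-effects structure (random intercepts and random slopes for condition). It is only a minimal illustration: we use Python's statsmodels rather than lme4, and the variable names (rt, condition, subject) and simulation settings are our own placeholder choices.

```python
# Minimal sketch: a one-factor within-subjects design with by-participant
# random intercepts and random slopes (the lme4 analogue would be
# rt ~ condition + (condition | subject)). All settings are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_subj, n_trials = 20, 100
subject = np.repeat(np.arange(n_subj), 2 * n_trials)
condition = np.tile(np.repeat([0.0, 1.0], n_trials), n_subj)

u_intercept = rng.normal(0, 0.5, n_subj)[subject]   # by-participant intercepts
u_slope = rng.normal(0, 0.3, n_subj)[subject]       # by-participant slopes
rt = 1.0 + u_intercept + (0.5 + u_slope) * condition + rng.normal(0, 1.0, subject.size)

df = pd.DataFrame({"rt": rt, "condition": condition, "subject": subject})

# Fixed effect of condition; random intercept and random slope by participant.
model = smf.mixedlm("rt ~ condition", df, groups=df["subject"],
                    re_formula="~condition")
print(model.fit().summary())
```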

As exemplified by vDAHSW, there is growing interest in applying Bayesian inferential methods (Jeffreys, 1961), such as Bayes Factors, in the context of LMMs. Consider data y along with two models \({\mathscr{M}}_{m}\) and \({\mathscr{M}}_{n}\) from which they are assumed to have arisen. The Bayes Factor captures the updating of the relative prior probability (prior odds) of the two models in light of the data:

$$ \overbrace{\frac{P(\mathcal{M}_{m} \mid {\boldsymbol{y}})}{P(\mathcal{M}_{n} \mid {\boldsymbol{y}})}}^{\text{Posterior Odds}} = \overbrace{\frac{P({\boldsymbol{y}} \mid \mathcal{M}_{m})}{P({\boldsymbol{y}} \mid \mathcal{M}_{n})}}^{\text{Bayes Factor}} \times \overbrace{\frac{P(\mathcal{M}_{m})}{P(\mathcal{M}_{n})}}^{\text{Prior Odds}}. $$
(1)

The Bayes Factor (BF) is a ratio of marginal likelihoods, which together quantify the relative support for each model that is provided by the data. BF\(_{m,n}\) values larger than 1 indicate support for model \({\mathscr{M}}_{m}\), whereas values between 0 and 1 indicate support for model \({\mathscr{M}}_{n}\) (on a logarithmic scale, positive and negative \(\log (\text {BF}_{m,n})\) indicate support for models \({\mathscr{M}}_{m}\) and \({\mathscr{M}}_{n}\), respectively). For any model \({\mathscr{M}}\), the marginal likelihood corresponds to a weighted average of the likelihood of the data, with weights given by the parameter priors g(𝜃):

$$ P({\boldsymbol{y}} \mid \mathcal{M}) = \int f_{\mathcal{M}}({\boldsymbol{y}} \mid {\boldsymbol{\theta}})g({\boldsymbol{\theta}}) \text{d}{\boldsymbol{\theta}}. $$
(2)

Spelling out the complete definition of a Bayes Factor allows us to highlight two important points: First, the Bayes Factor represents an “adjustment” to a set of prior beliefs based on data; just like a p value does not represent the probability of the null hypothesis, a Bayes Factor does not tell us the posterior odds in favor of one model over another, except in the special case that the prior odds were one (i.e., the two models were believed equally probable a priori). Second, the Bayes Factor marginalizes over the prior distributions on the parameters of each model 𝜃. As a consequence, the specification of these priors is not arbitrary—it is a core feature of the models being compared.
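As a toy illustration of Eq. 2 (deliberately simpler than an LMM), the sketch below approximates a marginal likelihood by averaging the likelihood over draws from the parameter prior, and forms a Bayes Factor as a ratio of two such quantities. The two models, the prior, and the data are invented for demonstration only.

```python
# Toy Monte Carlo illustration of Eq. 2 and the resulting Bayes Factor.
# M0: y ~ Normal(0, 1); M1: y ~ Normal(mu, 1) with prior mu ~ Normal(0, 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(0.4, 1.0, size=30)                     # hypothetical data

def marginal_likelihood(y, prior_draws):
    """Average the likelihood of y over draws from the prior on mu."""
    liks = np.array([np.prod(stats.norm.pdf(y, loc=mu, scale=1.0))
                     for mu in prior_draws])
    return liks.mean()

ml_0 = np.prod(stats.norm.pdf(y, loc=0.0, scale=1.0))       # no free parameter
ml_1 = marginal_likelihood(y, rng.normal(0.0, 1.0, 20000))  # averaged over prior

print(f"BF_10 = {ml_1 / ml_0:.2f}")   # >1 favors M1, <1 favors M0
```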

In the context of LMMs, Bayes Factors can be used to contrast nested models that assume different sets of fixed and/or random effects, as illustrated in vDAHSW. They computed “default Bayes Factors” using different LMMs and using data constructed under different aggregation regimes. These Bayes Factors were computed for two different kinds of model comparisons:

  • Strict Null Comparison: The full or nesting model \({\mathscr{M}}_{A}\) included both fixed and random slopes whereas the null or nested model \({\mathscr{M}}_{0}\) included neither (Rouder et al., 2016). Both models included fixed and random intercepts.

  • Balanced Null Comparison: Equivalent to the comparison above, with the exception that the null model \({\mathscr{M}}_{0}\) included the random slopes (but not the fixed slope; see Barr et al., 2013).

Figure 2 illustrates the results obtained by vDAHSW with one of their data sets. The data in question were simulated from a model in which an effect of condition is present. This fact is reflected in the positive \(\log (\text {BF}_{A,0})\) values obtained across the board, all supporting the alternative model \({\mathscr{M}}_{A}\). However, we also see rather drastic differences between the two types of null comparisons. In the case of the strict null comparisons, in which the null model includes neither the fixed nor random slopes, the Bayes Factors are generally more extreme and become even more so as the number of observations being aggregated decreases. In contrast, aggregation has a negligible impact on the balanced null comparisons, in which the null model does include the random slopes (but not the fixed slope). Based on this pattern of results (which holds for the other examples they considered), vDAHSW raised a number of questions (see their Table 3), which we distill into the two questions below:Footnote 5

  • Which models should be compared? That is, should a hypothesis test compare models that differ both in terms of fixed effects and random effects, or only in terms of fixed effects?

  • Should the data be aggregated if there are multiple observations per participant and cell of the design? And if so, how?

Fig. 2. Illustration of results from vDAHSW. Different \(\log (\text {BF}_{A,0})\) for the condition effect of the simulated data from vDAHSW (Example 2), which show both a condition effect and non-zero random slopes. The two models for the “strict null comparison” differ in both fixed effects and random slopes (dashed line); the two models in the “balanced null comparison” differ only in their fixed effects (solid line). The x-axis shows the different aggregation regimes in terms of number of trials entered into the analysis (i.e., 100 trials refers to the full, disaggregated data; 50 trials refers to an analysis in which pairs of trials are aggregated, etc.). Model \({\mathscr{M}}_{A}\) refers to the alternative (i.e., full) model, model \({\mathscr{M}}_{0}\) refers to the null (i.e., reduced) model.

As shown below, finding good answers to these questions is not a matter of finding a good “default”, but of carefully considering the meaning of the model structure and parameters in the context of a specific research application. In fact, some of the issues raised below strongly suggest that a generic default approach is ultimately untenable.

Question 1: Which Models Should Be Compared?

Experimental Designs Involving One Factor

When using an LMM to analyze experimental data, a researcher is often hoping to answer the question, “did an experimental factor have an effect?” For example, the data underlying Fig. 2 show an “effect” of condition on the dependent variable, and ideally a statistical model should detect this effect. As discussed above, the full model for this design includes a fixed intercept, a fixed slope representing the effect of condition, as well as the corresponding by-participant random effect terms—the random intercept and the random slope. In a frequentist setting, Barr et al. (2013) showed that biased results are obtained if the individual differences in the effect of condition are unaccounted for by the LMM (i.e., if the model does not include the corresponding random slopes) (see also Schielzeth and Forstmeier, 2009). This general message also holds within a Bayesian setting (see Veríssimo, this issue). Therefore, it seems generally appropriate to designate the model with the maximal random-effects structure as the alternative model \({\mathscr{M}}_{A}\). The remaining question then is which null model \({\mathscr{M}}_{0}\) is most appropriate when evaluating an effect of condition. vDAHSW discuss the two different variants introduced earlier, strict and balanced null comparisons, which differ in terms of whether or not they include by-participant random slopes for condition. Most experimental researchers appear to take the presence of individual differences as a given—they are just interested in the average effect (i.e., the fixed-effect slope). Generally speaking, in such circumstances, the balanced null comparison strikes us as the most reasonable one.

Contrary to this view, vDAHSW argue that the null model used in the balanced null comparison “seems implausible” (p. 12). In response, we raise two points: First, the support for \({\mathscr{M}}_{A}\) in strict null comparisons is always underqualified. After all, this support might be due to (i) a non-zero fixed effect, (ii) the presence of individual differences (captured by the random slopes), or (iii) both. A “strict” null conflates these three different scenarios. Second, a null model including random slopes is often a plausible contender. For example, consider a study investigating ego depletion in a within-subjects design (e.g., Lin et al., 2020). In such a study, it seems possible to observe no overall effect of ego depletion (given the known difficulties in replicating this effect; e.g., Vohs et al., in press), but individual differences among the responses to the ego depletion manipulation. For instance, some participants might have heard of ego depletion and therefore act in line with the ego depletion hypothesis, whereas other participants might believe the opposite of the ego depletion hypothesis, namely that initial exertion energizes (Savani & Job, 2017). If the proportion of participants for which these two opposing effects hold is approximately equal, then we should observe non-zero random slopes along with a mean difference of (approximately) zero. Such a scenario doesn’t strike us as being beyond the pale.Footnote 6 Of course, one could disagree with our assessment by pointing out the implausibility of random effects cancelling each other out exactly. The problem with such an argument is that it can also be used to render virtually any (point) null hypothesis implausible, for point nulls state that some kind of effect attributed to a given manipulation is exactly zero (see Meehl, 1978). It is not clear to us how one could still defend any kind of statistical evaluation/testing of null hypotheses—the kind proposed by vDAHSW for instance—under these circumstances.

Despite our defense of balanced nulls, we can also think of a number of specific circumstances in which a strict null is perfectly sensible, as in the case of ESP (Bem, 2011). What is “plausible” depends on considering possible causal factors that may be at work in any specific application, and how best to represent these factors in a model that makes a number of abstractions (deliberate omissions) and idealizations (deliberate distortions; see Kellen, 2019). Contra vDAHSW and Rouder et al. (2016), we argue that the plausibility of such null models isn’t something that can be adjudicated in the abstract—a case-by-case assessment is necessary (for a related discussion, see Heathcote and Matzke, this issue). At the end of the day, researchers should have the ability to make a call on the null models that they consider to be plausible, as long as they are able to convincingly substantiate their decisions.

In cases where the appropriate null model is somehow a point of contention, the research question being posed might be better served by a model-selection effort involving more than two candidate models. In a Bayes Factor application, this would involve designating one of the models as the reference model and then calculating the Bayes Factor comparing it with all other candidates. For example, we might designate the model without fixed and random slopes for condition as the reference model and then calculate Bayes Factors with three further models: (i) a model that only includes the fixed slope of condition but not the corresponding random slopes, (ii) a model including the random slopes but no corresponding fixed slope, and (iii) a model that includes both fixed and random slopes. By computing the Bayes Factors associated with the three possible model pairings, one is able to get a fuller picture of how condition affects the dependent variable.
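The sketch below illustrates the bookkeeping involved: once each candidate has been compared against the common reference model, any pairwise Bayes Factor follows by transitivity. The numerical values are arbitrary placeholders, not results computed from the vDAHSW data.

```python
# Hypothetical Bayes Factors of three candidate models against the reference
# model (no fixed or random slopes for condition). Values are placeholders.
bf_vs_reference = {
    "fixed slope only":       12.0,
    "random slopes only":      3.5,
    "fixed + random slopes":  40.0,
}

def pairwise_bf(model_m, model_n, bf):
    """BF_{m,n} = BF_{m,ref} / BF_{n,ref} (transitivity of Bayes Factors)."""
    return bf[model_m] / bf[model_n]

# e.g., how much better is the full model than the random-slopes-only model?
print(pairwise_bf("fixed + random slopes", "random slopes only", bf_vs_reference))
```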

Experimental Designs Involving Multiple Factors

The utility of comparing more than two models is amplified by Example 3 from vDAHSW. This example is concerned with an experimental design involving two crossed within-subjects factors. The maximal model includes (i) a fixed intercept, (ii) fixed main effects for the two factors as well as their interaction, (iii) a by-participant random intercept, and (iv) by-participant random slopes for both factors as well as their interaction.

A balanced null comparison of the interaction would involve a null model in which the fixed effect for the interaction is omitted. In the case of main effects, however, balanced null comparisons can be made in two different ways: On one hand, we can specify a null model in which everything but the fixed main effect being targeted is omitted (i.e., the fixed interaction remains). On the other, we could specify a null model in which the fixed main effect being targeted as well as the fixed interaction are omitted.Footnote 7 Deciding between these possible representations is made even more difficult by the fact that, in factorial designs, what constitutes a main effect and what constitutes an interaction is itself often an arbitrary choice (e.g., Brauer & Judd, 2000).

Once again, rather than commit to a single pair of models under consideration, the researcher may be better served by a model-selection exercise. In the case of two factors for which the mapping of factors is not arbitrary, this could result in five different fixed-effect structures (all including the fixed intercept): (i) intercept only, (ii) main effect A, (iii) main effect B, (iv) both main effects, and (v) both main effects plus interaction. If the maximal model supports random slopes for all fixed effects, we could combine all possible fixed-effect structures with all possible random-effect structures, resulting in a total of 5 × 5 = 25 models. For example, Singmann et al. (2014, Table 2) used this approach in a design involving crossed random effects for participants and items. In line with the notion that it is sometimes reasonable to allow for random slopes in the absence of their corresponding fixed effects, the model that was most strongly supported included random slopes for one factor but not the corresponding fixed effect.
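A sketch of the resulting candidate set is given below; the model structures are written as lme4-style formula strings purely for bookkeeping (factor and grouping names are placeholders), and the enumeration simply crosses the five fixed-effect structures with the five analogous random-effect structures.

```python
# Enumerate the 5 x 5 = 25 candidate LMM structures described above.
from itertools import product

fixed_structures = ["1", "A", "B", "A + B", "A * B"]            # fixed effects
random_structures = ["(1 | subject)", "(A | subject)", "(B | subject)",
                     "(A + B | subject)", "(A * B | subject)"]  # random effects

candidate_models = [f"y ~ {fixed} + {random}"
                    for fixed, random in product(fixed_structures, random_structures)]

assert len(candidate_models) == 25
print(candidate_models[0])    # 'y ~ 1 + (1 | subject)'
print(candidate_models[-1])   # 'y ~ A * B + (A * B | subject)'
```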

Summary

In the context of LMMs, inferences are less prone to bias when models attempt to respect the potential causal factors at work by representing them as random effects, leading to so-called “balanced” null comparisons. This is the standard approach in frequentist implementations of LMMs. However, the specific model structure should be decided on a case-by-case basis. In addition, Bayes Factors can be used to evaluate more than two models, providing us with the means to conveniently investigate which candidate in a set provides a better description of the data. When establishing a set of candidates, we urge researchers to disregard blanket statements regarding which models are plausible versus not. For example, it is possible for an experimental manipulation to produce individual-level effects that cancel each other out, resulting in a virtually zero fixed effect. Finally, in multi-factor designs, what stands for a main effect or an interaction is often arbitrary, and deciding how to represent these terms requires recourse to a (hypothesized) causal structure.

Question 2: Should Data Be Aggregated?

vDAHSW give examples in which aggregating data prior to modeling apparently leads to different inferences than when modeling the unaggregated data. A visual inspection of Fig. 2 might suggest that only the Bayes Factors for strict null comparisons are sensitive to aggregation, but as shown in Fig. 3, even the balanced null comparison exhibits a non-monotonic pattern. This figure also shows the Bayes Factors obtained with an alternative LMM (see dashed line). Importantly, these alternative Bayes Factors are invariant to aggregation.Footnote 8

Fig. 3. Two Bayes Factor variants for models differing in fixed effects. Different \(\log (\text {BF}_{A,0})\) for the condition effect of the simulated data from vDAHSW (Example 2) in which the only difference between both models is the presence or absence of the fixed effect (i.e., both models contain the corresponding random slopes). The solid line shows the default Bayes Factor (Rouder et al., 2012) used by van Doorn et al., in which the observed effect is standardized by the residual SD and which is affected by the level of aggregation in a non-monotonic way. The dashed line shows a Bayes Factor in which the effect is not standardized and which is not affected by the level of aggregation.

When asking about the effect of aggregation, vDAHSW overlook a fundamental question: why is aggregation impacting Bayes Factors in the first place? This omission implies that one should somehow try to remedy or solve the issue at hand without first knowing what is behind it. Fortunately, we are not dealing with a “black box” problem but transparent formal models (Ulrich, 2009). When investigating the inner workings of the LMMs employed by vDAHSW, one can clearly identify the culprit: the specification of Bayes Factors at the level of standardized effects—default Bayes Factors (Ly et al., 2016). For when Bayes Factors are specified at the level of unstandardized effects, they are no longer affected by aggregation biases, as demonstrated by our reanalysis of vDAHSW’s data (see Fig. 3, dashed line). In the case of default Bayes Factors, aggregation implicitly affects the priors embedded in the models being compared, effectively changing the meaning of the model parameters. vDAHSW appear to think that these model specifications can be treated as equivalent, as evidenced by the fact that they formally describe models in terms of unstandardized effects (see their Eqs. 1–6) but end up employing their standardized counterparts in the actual analyses. As shown in Fig. 3, these alternative specifications are not equivalent.

The Standardized Model: Default Bayes Factors

As described in Eq. 2, the marginal likelihood—the basis of the Bayes Factor—is the average likelihood weighted by the parameter priors g(𝜃). One consequence of this weighting is that different priors can lead to radically different Bayes Factors. This feature can complicate things in practice, given that it introduces a considerable degree of “subjectivity”. vDAHSW attempt to remedy this issue by using default Bayes Factors (Ly et al., 2016). In order to compute default Bayes Factors, models are specified in such a way that parameters representing effects—and their associated priors—are cast on a standardized scale (Rouder et al., 2012). What this means is that the effect-parameter priors included in g(𝜃) do not speak directly to the scale of the “raw” dependent variable. This detachment allows default priors to be used regardless of the original measurement scale, ostensibly removing some of the subjectivity that is characteristic of Bayes Factor applications.

To enable this specific model parameterization, it is necessary to define the scaling factor for the observed effects. Rouder et al. (2012) adopted the residual standard deviation σε for this role. Specifically, consider the case of a mixed model with fixed-effects model matrix X, fixed-effects parameter vector 𝜃, random-effects model matrix Z, and random-effects parameter vector b. The linear prediction for the vector of the dependent variable y is then given byFootnote 9

$$ \begin{array}{@{}rcl@{}} \boldsymbol{y} = \mu + \sigma_{\varepsilon} (\boldsymbol{X} {\boldsymbol{\theta}} + \boldsymbol{Z}\boldsymbol{b}) + \boldsymbol{\varepsilon}. \end{array} $$
(3)

The main difference from the way in which LMMs are typically formulated is that σε is part of the linear prediction of the model (e.g., Singmann and Kellen, 2019).

The problem with these default Bayes Factors is that σε is only one of the many different variance components being estimated. In addition to the residual standard deviation, every random effect term in the model (i.e., each column in Z) has its own variance component (see also Rights & Sterba, 2019). In the example data considered here, the maximal LMMs estimate three variance components: (i) the residual standard deviation, (ii) the standard deviation of the random intercept, and (iii) the standard deviation of the random slopes. The default Bayes Factor approach proposed by Rouder et al. (2012) ignores these additional variance components, standardizing the observed effect solely on the basis of the residual standard deviation σε.

The Unstandardized Model

As an alternative to default Bayes Factors, we calculated Bayes Factors using a pair of models in which the parameter priors are specified on an unstandardized scale matching the original scale of measurement (i.e., the linear prediction is given by Eq. 3 without σε). In these models, we followed Rouder et al.’s (2012) contrast coding of the fixed effects—that is, we used orthonormal contrasts with contrast weights of \(\pm 1/\sqrt {2}\) for the condition effect. The prior on this effect was a zero-centered t distribution with df = 3 and scale = \(1/\sqrt {2}\). This implies a t-distribution prior for the difference between the two conditions with df = 3 and scale = 1. In other words, under the alternative hypothesis we expected the difference between the two conditions to be between − 1 and 1 in units of the dependent variable with ≈ 60% probability.
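The following snippet verifies this implication numerically; the only assumptions are the ones just stated (orthonormal contrast weights of ±1/√2 and a t prior with df = 3 and scale = 1/√2 on the coefficient).

```python
# Check: the prior on the coefficient implies a t(df = 3, scale = 1) prior on
# the raw difference between conditions, with ~60% mass between -1 and 1.
import numpy as np
from scipy import stats

coef_scale = 1 / np.sqrt(2)                        # prior scale on the coefficient
contrast_span = 1 / np.sqrt(2) - (-1 / np.sqrt(2)) # difference of contrast weights
diff_scale = coef_scale * contrast_span            # = 1, scale of the difference

prob = (stats.t.cdf(1 / diff_scale, df=3)
        - stats.t.cdf(-1 / diff_scale, df=3))
print(f"P(-1 < difference < 1) = {prob:.2f}")      # approximately 0.61
```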

Model Posteriors

Figure 4 shows the mean and 95% credibility intervals of the posterior distributions of different model components for the alternative model \({\mathscr{M}}_{A}\) under the standardized model (i.e., the default Bayes Factor, solid line) and the unstandardized model (dashed line). These posterior distributions explain why the Bayes Factor of the standardized model is affected by different levels of aggregation and also why this is not the case for the unstandardized model.

Fig. 4. Parameter estimates of the standardized and unstandardized model. Data points show the posterior means and error bars the 95% credibility intervals of the posterior distributions of different model components for the standardized (i.e., Rouder et al.’s default Bayes Factor) and unstandardized model. A: Estimated difference between both conditions on the original scale. B: Estimated standardized difference between the conditions, which is only part of the standardized model. C: Estimates of the three variance components estimated for this data set (residual = residual SD, intercept = random intercept SD, slope = random slope SD).

As shown in Fig. 4C (leftmost panel), aggregation has exactly the same effect on σε in both models. Under an intense data aggregation regime (i.e., 2 trials), σε takes on a lower value. This value increases when there is no aggregation and all 100 trials per condition are considered. Aggregation has no effect on any other model component of the unstandardized model. Both additional variance components—the standard deviations of the random intercepts and slopes—as well as the estimated mean difference between the two conditions (see Fig. 4A) are unaffected by aggregation. This pattern is to be expected, given that the observed mean difference and the individual differences in the overall mean and condition effects are the same across the different levels of aggregation, which is exactly what the estimates of the unstandardized model show.

In contrast, all components of the standardized model are affected by aggregation. To understand why, note that in the formulation for the linear predictor of the standardized model (3), all estimated coefficients (with the exception of the fixed intercept) are scaled by σε. This means that, when σε becomes smaller, the estimated coefficients need to become larger in order for the model to preserve its predictions. Given that σε scales all the random components being estimated (i.e., including the random intercept), the estimates of their respective standard deviations will be larger when σε is reduced.

Figure 4B shows the standardized difference—the standardized effect size estimate—which is only part of the standardized model and is given by the “raw” difference divided by σε. As the raw difference is only mildly affected by the number of trials compared to σε, we see the same pattern: the estimated standardized effect size is largest when σε is smallest. The estimates of the standardized difference also explain why the estimated “raw” differences in Fig. 4A are affected by the aggregation regime. The reason is that the prior is specified on the standardized effect-size scale and fixed across the different aggregation levels (it is a zero-centered Cauchy distribution with scale = 0.5). Therefore, the larger the estimated standardized difference, the larger the shrinkage due to the prior. Note that the model is closest to the “true” mean difference of 0.5 only when there is no aggregation.
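A small simulation makes the mechanism explicit. It is not a refit of the vDAHSW models; we simply generate trial-level data with by-participant intercepts and slopes, average pairs of trials, and use the within-cell SD as a stand-in for the residual SD. The condition means (and hence the raw difference) are untouched by aggregation, while the trial-level SD shrinks by roughly √2, inflating any effect standardized by it.

```python
# Sketch: aggregating pairs of trials leaves the raw condition difference
# unchanged but shrinks the trial-level SD, inflating the standardized effect.
import numpy as np

rng = np.random.default_rng(2)
n_subj, n_trials, sigma_eps = 20, 100, 1.0
intercepts = rng.normal(1.0, 0.5, n_subj)         # by-participant intercepts
slopes = rng.normal(0.5, 0.3, n_subj)             # by-participant condition effects
cell_mu = intercepts[:, None] + slopes[:, None] * np.array([0.0, 1.0])
trials = cell_mu[:, :, None] + rng.normal(0, sigma_eps, (n_subj, 2, n_trials))

pairs = trials.reshape(n_subj, 2, n_trials // 2, 2).mean(axis=3)  # pair-means

for label, data in [("100 trials", trials), ("50 pair-means", pairs)]:
    cell_means = data.mean(axis=2)                            # unchanged
    raw_diff = (cell_means[:, 1] - cell_means[:, 0]).mean()
    within_sd = data.std(axis=2).mean()                       # ~1.0 vs ~0.71
    print(f"{label}: raw = {raw_diff:.2f}, SD = {within_sd:.2f}, "
          f"standardized = {raw_diff / within_sd:.2f}")
```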

Explaining the Non-monotonic Bayes Factors

The only result left unexplained is the non-monotonic pattern observed in Fig. 3. The reason behind this pattern is that aggregation affects two features of the posterior of the standardized mean difference, each of which influences the estimated Bayes Factor in its own way; their interplay leads to the observed non-monotonicity. As shown in Fig. 4B, the level of aggregation affects the mean of the standardized difference distribution, as well as its width.

In the present case, the computed Bayes Factor is equivalent to the Bayes Factor obtained through the Savage-Dickey method. That is, it is the ratio of prior density to posterior density at an estimated standardized mean difference of 0; i.e., BF\(_{A,0} = \frac {f_{\text {prior}}(0)}{f_{\text {posterior}}(0)}\). This Bayes Factor will increase along with the size of the standardized difference, for a fixed posterior-distribution width. Conversely, for a fixed standardized mean difference, increasing the width of the posterior distribution will decrease the Bayes Factor. Figure 5 plots both the prior and posterior distribution of the standardized mean difference, and shows how these two components can conspire to produce a non-monotonic pattern. When the aggregation is maximal (number of trials = 2), the width of the posterior is largest, with the consequence that the density at 0 is comparatively large even though the mean difference is also quite extreme. Decreasing the level of aggregation leads to a tighter posterior with a lower density at 0. This lower density is counteracted by a decrease in the standardized mean difference, which ultimately results in the non-monotonic pattern shown in Fig. 3.
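A sketch of this computation is given below. The prior is the zero-centered Cauchy with scale 0.5 mentioned above; the posterior samples are hypothetical stand-ins for MCMC output, with the density at 0 approximated by a kernel density estimate.

```python
# Savage-Dickey sketch: BF_A0 = prior density at 0 / posterior density at 0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
posterior_samples = rng.normal(0.45, 0.12, 20000)   # placeholder MCMC draws

prior_at_0 = stats.cauchy.pdf(0.0, loc=0.0, scale=0.5)
posterior_at_0 = stats.gaussian_kde(posterior_samples)(0.0)[0]

print(f"BF_A0 = {prior_at_0 / posterior_at_0:.1f}")
```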

Fig. 5. Prior (line) and posterior (histogram) of the standardized mean difference across different numbers of trials. The standardized Bayes Factor is given by the ratio of prior density to posterior density at 0. To show this more clearly, the x-axis is restricted in range.

The Bayes Factors for the unstandardized models can also be obtained through the Savage-Dickey method. In this case, the relevant posterior refers to the raw mean difference in Fig. 4A. As this posterior is invariant to the level of aggregation, the resulting Bayes Factor is also constant. This invariance to aggregation is not something that is unique to the unstandardized Bayes Factor. When estimating a frequentist LMM and computing p-values across the different levels of aggregation, we observe the same inferential result in each case, F(1,19) = 16.27, p < .001. Likewise, the fixed-effect coefficient representing the difference, its standard error, and the random-effect variance component estimates are also unaffected by aggregation—only the residual standard deviation is affected. Parameterizing the Bayesian LMMs in the same way as their frequentist counterparts leads to similar results.
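The invariance is easy to demonstrate with a simplified stand-in for the LMM analysis: because averaging pairs of trials leaves each participant's condition means untouched, a repeated-measures test computed on those means (here a paired t-test, whose squared t equals the F of the corresponding one-factor within-subjects ANOVA) is numerically identical at every level of aggregation. The simulated data are again placeholders.

```python
# Sketch: a test on per-participant condition means is identical before and
# after aggregating trials into pair-means, because the means do not change.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_subj, n_trials = 20, 100
cell_mu = (rng.normal(1.0, 0.5, (n_subj, 1))        # by-participant intercepts
           + np.array([[0.0, 0.5]])                 # condition effect
           + rng.normal(0, 0.3, (n_subj, 2)))       # by-participant deviations
trials = cell_mu[:, :, None] + rng.normal(0, 1.0, (n_subj, 2, n_trials))
pairs = trials.reshape(n_subj, 2, n_trials // 2, 2).mean(axis=3)

def paired_test(data):
    cell_means = data.mean(axis=2)                  # unaffected by aggregation
    return stats.ttest_rel(cell_means[:, 1], cell_means[:, 0])

print(paired_test(trials))   # identical statistic and p value
print(paired_test(pairs))
```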

An Answer to van Doorn et al.’s Question

The analysis above explicates the relationship between the standardized model and the default Bayes Factor, and how they are affected by different levels of data aggregation. One might infer from this analysis, together with vDAHSW’s work, that the use of default Bayes Factors is generally recommended when dealing with disaggregated data. We wish to push back on this interpretation: The extent to which this approach is reasonable ultimately depends on how sensible (i.e., meaningful or interpretable) the standard effect size is.

For the disaggregated data of vDAHSW, the LMM has both by-participant random intercepts and random slopes, with the residual variance representing the within-participant trial-by-trial variance (which is assumed to be the same for all participants). It follows that the standardized effect size and the corresponding prior reflect a difference between conditions that is standardized by the level of within-participant variability. These standardized differences are therefore represented in terms of within-participant variance units. It is perfectly reasonable to assume that researchers can have enough substantive knowledge to specify sensible priors in terms of such units. Aggregating trials changes the interpretation of the residual variance and thereby the interpretation of the standardized effect size. For instance, in the example case in which two trials are aggregated (number of trials = 50), the residual variance reflects the within-participant variability that is associated with the mean of trial pairs. We cannot see how researchers could ever be expected to specify sensible priors in terms of these scale units, and therefore cannot recommend the use of default Bayes Factors here.

However, it is important to understand that the use of default Bayes Factors on disaggregated data is not immune to interpretability problems. After all, there are numerous circumstances in which the residual variance terms are difficult to interpret. For instance, difficulties arise when the other random effects variances differ in terms of how well they can be estimated: In an LMM, hierarchical shrinkage is imposed over the individual-level deviations, which in turn reduces the variance estimates (e.g., Baayen, 2008). This means that if the data is relatively sparse, and therefore shrinkage is considerable, then the residual variance will not exclusively capture within-participant trial-by-trial variability. It will also capture some of the other variance components, which were not estimated with enough precision. In such situations, the residual variance estimate will be biased upwards, which in turn leads to an underestimation of the standardized effect size. In other words, whenever the random-effect terms are affected by non-negligible shrinkage, the scale of the standardized effect size will be “contaminated” by other variance components, which affects their interpretability and therefore the status of any default Bayes Factor obtained with it.

A second problematic scenario involves LMMs with more than one random-effect grouping factor. In the simulated data used by vDAHSW, all random-effects terms are by-participant terms, which makes the interpretation of the residual variance term clear. However, one of the main reasons for employing LMMs is their ability to simultaneously account for multiple sources of stochasticity using so-called crossed random effects (Baayen, 2008; Judd et al., 2012, 2017). In a model with crossed random effects, the residual variance term reflects all the variability that is not accounted for by the other random-effects terms. For example, in a model with by-participant and by-item random effects terms, the residual variance can be interpreted as the within-participant variability after accounting for all item-related variability. At first glance, this residual variance estimate can strike one as being a more principled option in the sense that the standardization obtained with it is “more pure” by virtue of excluding item-related variability. The problem is that this standardization, if used responsibly, demands an amount of background knowledge that researchers rarely have. Specifically, it requires researchers to have a strong grasp of the expected magnitude of item-related variability, which is something that can only be achieved if one has extensive background knowledge of a given experimental design. Otherwise, it becomes quite difficult, if not impossible, to assess the plausibility of any prior over standardized effects. Although it is certainly possible for researchers to obtain such background knowledge, its requirement flies in the face of the main motivation behind standardized effect sizes—specifying effects in a general way that can be easily deployed.

A similar problem arises when including additional fixed-effects covariates: Their inclusion will reduce the residual variance and therefore lead to inflated standardized effect sizes that are difficult to compare across studies that differ in terms of the covariates considered. This problem is well documented in the technical literature on effect sizes, in which we find methods that try to remedy it (Olejnik and Algina, 2003). Unfortunately, current methods for computing default Bayes Factors do not address this issue, which means that one risks obtaining inflated effects that are not easily generalizable. Again, this problem could be sidestepped if one has a strong grasp of the relative impact that different covariates have in terms of the overall variability, a requirement that flies in the face of the notion of default Bayes Factors as a widely applicable go-to procedure.

Our answer, finally: The only scenario in which we find the use of default Bayes Factors appropriate is when dealing with LMMs with only by-participant random effect terms, a maximal random effects structure (Barr et al., 2013) that can be estimated relatively precisely, disaggregated data, and no additional fixed-effect covariates. Only in this scenario does the residual variance term reflect the within-participant trial-by-trial variability—a sensible unit for a standardized effect size estimate. As soon as any one of the above-stated conditions is violated, the standardized effect size estimates lose their clear interpretation, leading to Bayes Factors of questionable status. Because of these issues, we generally recommend computing Bayes Factors for unstandardized LMMs. Although this is far less convenient, as it requires researchers to specify priors that speak directly to the “raw” dependent variable scales, it is the only sure way to avoid the problems we discussed above.Footnote 10

Before moving on, one aspect that we want to stress is that there is no “easy fix” for default Bayes Factors. For instance, one might be tempted to try to fix the problem by changing the standardization factor—replace the residual variance with some other variance term. In the simulated data of vDAHSW, it would be tempting to standardize the difference between conditions by the corresponding random-slope estimate. In other words, standardize the within-subject effect in terms of the across-participant variability of the effect. The problem with this approach is that all but the residual variance term are reduced by the hierarchical shrinkage, which means that the effect size estimates will be inflated. In fact, it is not uncommon for random-slope estimates to approach zero, which would result in absurdly large standardized effect size estimates. It is useful to demonstrate this problem with a concrete example: Let us consider the data reported by Freeman et al. (2010), which included a lexical-decision task with high- and low-frequency words (N = 25). In this example, we focus on the response times for word stimuli and the effect of word frequency. As in the example of vDAHSW, the effect of word frequency is a within-subjects variable. Descriptively, the standardized effect size is 0.84 and the observed effect of word frequency is 0.074 (i.e., RTs for low-frequency words are around 75 ms slower than RTs for high-frequency words). The standard deviation for the observed difference is 0.088. If we apply an LMM to this data, the estimated effect is equal to the observed effect (i.e., no shrinkage at the level of fixed effects). However, due to the hierarchical shrinkage, the standard deviation (i.e., the random slope estimate of the frequency effect) is estimated to be only 0.027. Adopting this estimate as the standard would result in a standardized effect size of 2.69, which is absurdly large.

Summary

Default Bayes Factors were developed with the purpose of avoiding the apparent subjectivity that is inherent to the specification of priors on the models being compared. In the case of LMMs, this approach can have unintended consequences for aggregation. By deciding whether or not to aggregate the individual data points, researchers can (strategically or unwittingly) affect the estimate of the residual standard deviation, and in turn the resulting Bayes Factor. In contrast, Bayes Factors or p-values obtained with unstandardized effects are invariant to aggregation. This illustrates how the adoption of a “default” assumption fundamentally changes the meaning of a model parameter, ultimately compromising the ability of the model to provide a clear, unambiguous characterization.

Part II: Letting the Dog Wag Its Tail

vDAHSW’s general concern was how to best use a particular class of models, LMMs, to detect experimental effects and interactions within a mixed model design. On the one hand, the structure of LMMs provides a convenient way to characterize the data that disentangles the general consequences or “fixed effects” of varying conditions within an experimental design from the variability or “random effects” associated with different observational units such as people and items. On the other hand, the LMM characterization relies on components—linear regression coefficients—that may or may not refer to any theoretical construct of interest or how it would manifest in data. And when using Bayes Factors, assumptions must be made in order to arrive at sensible priors over these model components (i.e., parameters). The more these assumptions are adopted by “default”, the less likely they are to be meaningfully related to the constructs they are intended to represent. The adoption of “default null model comparisons” and “default priors” can lead to misleading inferences and (apparent) paradoxes.

To understand how models can move beyond defaults and contribute to broader scientific goals, we return to the continuum between causal and descriptive models (Fig. 1). The focus of vDAHSW was on descriptive modeling that typically falls under the purview of applied statistics. Statistics has been called the “science of defaults” (Gelman, 2014) and so it is not surprising that statistical models should try to find default assumptions that work well enough much of the time. But like the default settings on any piece of technology (e.g., audio balance or chair heights), the defaults are meant to be a starting point rather than an end in themselves. Even for models that are meant to be primarily descriptive, moving beyond defaults requires considering the links between theoretical constructs and observables, and between model components and the general causal mechanisms that they are meant to represent.

Consider an experiment in which participants are shown a list of words of different frequencies and then asked to discriminate between members of this set and other words. The dependent variable may be the proportion of correct classifications (i.e., accuracy). The goal of applying a descriptive model like a LMM to this experiment would be to identify whether accuracy changes as a function of frequency. Any “effects” on accuracy that are identified (e.g., by a Bayes Factor supporting inclusion of a parameter representing a difference) are of interest only to the extent that they measure the actions of causal mechanisms related to the theoretical construct of “memory”.Footnote 11 It is the purpose of causal models to represent these mechanisms and how they generate patterns in data. For example, a causal model might represent the processes by which words are encoded in memory, such that words of different frequencies are recognized at different rates (e.g., the mirror effect; see Glanzer and Adams, 1985; Shiffrin & Steyvers, 1997). The scientific value of a causal model arises from how well it explains the effects and its ability to generalize to other situations in which the same mechanisms are presumed to be operating. Meanwhile, the scientific value of a descriptive model lies in its ability to inform the development of causal models by detecting effects that are indicative of the actions of more general causal mechanisms. This is why it is crucial that descriptive models be built and applied with an appreciation of the causal models they are meant to inform.Footnote 12

In the remainder of Part II, we discuss two important considerations in the development of models that better serve a scientific purpose. First, we delve more deeply into how model components are related to observable quantities in terms of the model’s coordination functions (Kellen et al., 2021; Chang, 2004; Van Fraassen, 2008). A coordination function is just as much a part of a scientific model as the model structure or its priors and therefore should be carefully scrutinized. Second, an important factor when evaluating a causal model is its generalizability to related scenarios. Measures of relative fit, which are designed to distinguish between models based on their descriptive ability, are often not suited to assess or promote the generalizability of models with a causal purpose. By appreciating the connections between models and theory and how we can assess them, we help ensure that the dog wags its tail, rather than the other way around.

Coordination Functions

Measurement requires the (conditional) acceptance of a theory from which a coordination function is derived. In this way, the theory, data, and experimental design are tightly interwoven and the interpretation of effects and interactions is dependent upon this relationship. Consider the illustration in Fig. 6, in which an observable variable X is linked to a quantitative theoretical construct Y by means of a coordination function f (and its inverse \(f^{-1}\)) such that Y = f(X). The coordination function, which can take on many different forms, cannot be established empirically, as, by definition, we cannot observe instances of both the observable variable and the theoretical construct. The host of challenges surrounding the establishment of a functional relationship between observables and theoretical constructs has been referred to as “the problem of coordination” (Kellen et al., 2021; see also Chang, 2004; Van Fraassen, 2008).

Fig. 6. Illustration of the problem of coordination. In the left panel, we see an individual with theoretical attribute Y generating a behavioral outcome X. In the right panel, we illustrate some of the different possible shapes that the functional relationship between X and Y might take.

We highlight two problems that arise when “the problem of coordination” is ignored: (1) that different dependent variables or different levels of an independent variable may have different coordination functions, and (2) that even if there is only one coordination function, it need not be linear as assumed by LMMs. We illustrate both problems using the example of a pendulum. The period of a pendulum is the time taken for a complete swing from one position and back again. The amplitude of a swing is defined as the maximum angle subtended by the arm with respect to the pivot point. Because the period is largely independent of the amplitude when the amplitude is small (i.e., it is isochronous), a pendulum can be used to measure time and the coordination function is linear. Let N be the number of swings of a pendulum in time t. Let T be the period. Then \(N = f(t) = \left (\frac {k}{T} \right ) t\), where k is a constant that depends upon the length of the pendulum arm and the strength of gravity.Footnote 13 The function f is the coordination function linking the count of the number of pendulum swings (the dependent variable) with time (the theoretical construct).

To illustrate the first problem, suppose we conduct an experiment involving two pendula with different arm lengths, keeping in mind that the true coordination is a function of pendulum arm length. On one occasion, we set them swinging for, say, 100 beats of our pulseFootnote 14 and on another occasion, they swing for 200 beats. We then count the number of completed swings and plot the results in Fig. 7A. This shows main effects of both independent variables (arm length and number of pulse beats) and their interaction. We can assume that these results are supported by very large Bayes Factors. If we ignore the coordination and naïvely identify the number of swings with the passage of time, we would interpret these results to mean that more time has passed over 200 pulse beats than over 100, which is a sensible conclusion. We would also conclude that more time has passed for the short pendulum than the long one under each pulse condition, which is absurd. Further, this increase in “time” is greater for longer pulse sequences. Either we have created a clock that can manipulate time itself or we have made the mistake of ascribing the differences between the short and long pendula to the theoretical construct and not to the fact that they have different coordination functions.

Fig. 7. A) The number of pendulum swings as a function of pendulum length and number of pulse beats. B) A transformation of the number of pendulum swings (50 minus the number of swings) used as a measure of the total length of the pendulum arm, as an apparent function of the length of the pendulum arm and the length of an extension added to the top of the arm.
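The pattern in Fig. 7A can be reproduced numerically, assuming the standard small-amplitude period T = 2π√(L/g) and treating one pulse beat as one second; the arm lengths below are arbitrary illustrative values.

```python
# Number of swings for two arm lengths and two swinging durations.
import numpy as np

g = 9.81                                        # m/s^2
arm_lengths = {"short": 0.25, "long": 1.0}      # meters (illustrative)
durations = {"100 beats": 100.0, "200 beats": 200.0}   # treated as seconds

for arm, L in arm_lengths.items():
    T = 2 * np.pi * np.sqrt(L / g)              # small-amplitude period
    for beats, t in durations.items():
        print(f"{arm} pendulum, {beats}: {t / T:.1f} swings")

# Reading "number of swings" directly as "time passed" yields the absurd
# conclusions in the text: more "time" for the short pendulum, and an
# interaction between arm length and duration.
```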

To illustrate the second problem, suppose we have learned from our earlier experiences that there is a negative relationship between the number of swings of a pendulum over a time interval and the length of its arm (see Fig. 7A) but, retaining some residual naivety, we choose as our dependent variable the difference between the number of swings and a large value (50 in this case). We then decide to investigate the effect of two independent variables on what we presume to be a measure of pendula length. We attach either a short or long extension to the top of the pendulum arm (where it adjoins the pivot point) and either a short or long extension to the bottom of the pendulum arm (where it swings freely). What is the effect of these variables on our measure of the total length of the arm?

Examining Fig. 7B, we see that there is an effect of adding a length to both the top and bottom ends of the arm as well as an interaction. Apparently, the effect on total length of adding an extension to the bottom of the arm is systematically reduced by the length of the extension added to the top of the arm. Publication awaits! The absurdity of the conclusion rests on the fact that we have ignored the possibility of a nonlinear coordination function. In fact, the number of swings (N) is a function of the reciprocal of the square root of the arm length (L). That is, \(N = K / \sqrt {L}\), where K is a constant that depends on the total length of time and the strength of gravity.
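The same kind of sketch reproduces the spurious interaction in Fig. 7B: the true total length is a simple sum of the base arm and the two extensions, but the measure of 50 minus the number of swings is a nonlinear function of that length. The base length, extension lengths, and swinging time below are arbitrary illustrative values.

```python
# The naive "length" measure (50 - number of swings) produces an apparent
# interaction between top and bottom extensions, even though the true total
# length is purely additive.
import numpy as np

g, t, L0 = 9.81, 100.0, 1.0        # gravity, swinging time (s), base arm length (m)

def length_measure(top_ext, bottom_ext):
    L = L0 + top_ext + bottom_ext                   # true total length (additive)
    n_swings = t / (2 * np.pi * np.sqrt(L / g))     # N = K / sqrt(L)
    return 50 - n_swings

for top in (0.1, 0.4):
    effect_of_bottom = length_measure(top, 0.4) - length_measure(top, 0.1)
    print(f"top extension {top} m: apparent effect of bottom extension "
          f"= {effect_of_bottom:.2f}")
```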

As these simple examples illustrate, a purely descriptive statistical framework predicated on the identification of main effects and interactions, such as the one laid out by vDAHSW, glosses over the problem of coordination (cf. Loftus, 1978; Wagenmakers et al., 2012). This creates the risk of the components of LMMs being illegitimately reified as the “things” that researchers are ultimately targeting. The only way to avoid this risk is to place the understanding of theoretical constructs and their presumed coordinations at the forefront, a move that runs counter to the adoption of general defaults.Footnote 15 Doing so amounts to letting causal modeling considerations inform descriptive statistical models.Footnote 16

Assessing Model Generalizability (in Its Different Senses)

Modeling in science, whether primarily causal or descriptive, is ultimately in the service of producing theories that can explain patterns of data using causal mechanisms that are expected to generalize across similar scenarios. Bayes Factors are often promoted as a method for assessing model generalizability in terms of the ability to predict new, unseen data from the same generating process.Footnote 17 This form of generalizability is, however, inextricably linked to the goals of descriptive rather than causal modeling. That is, the Bayes Factor assesses the relative ability of two models to predict (assign high likelihood to) data y obtained from a particular design/population, when considering the range of predictions associated with the parameter values deemed probable a priori (i.e., g(𝜃)). In other words, Bayes Factors refer to generalizability in the specific sense of how generally each model predicts the data we just observed across its a priori plausible parameter values. For any given model, the more likely y is on average—when sampling parameter values from g(𝜃)—the more generalizable that model will be deemed. This is a restrictive form of generalizability that does not encompass the many other ways in which generalizations are made in science (for a related discussion, see Liu & Aitkin, 2008).
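
As a minimal illustration of this sense of generalizability, the following sketch (with hypothetical data, likelihoods, and priors; not an analysis from vDAHSW) estimates each model’s marginal likelihood by averaging the likelihood of y over draws from g(𝜃) and then takes their ratio:

```python
import numpy as np
from scipy import stats

# Minimal sketch: the Bayes Factor compares two models by how well each
# predicts the observed data y *on average*, with the parameter theta drawn
# from its prior g(theta). Data, models, and priors here are hypothetical.
rng = np.random.default_rng(0)
y = rng.normal(loc=0.3, scale=1.0, size=20)        # hypothetical observations

# Model M1: y ~ Normal(theta, 1) with prior theta ~ Normal(0, 1).
theta_draws = rng.normal(0.0, 1.0, size=100_000)   # draws from g(theta)
lik_per_draw = np.exp(stats.norm.logpdf(y[:, None], loc=theta_draws).sum(axis=0))
p_y_m1 = lik_per_draw.mean()                       # p(y | M1) = E_g[f(y | theta)]

# Model M0: the point-null model, theta fixed at 0 (no averaging needed).
p_y_m0 = np.exp(stats.norm.logpdf(y, loc=0.0).sum())

print("BF10 =", p_y_m1 / p_y_m0)
```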

Any time we design an experiment, we engage in an act of simplification. We isolate and manipulate a small set of variables that may be relevant to a phenomenon, we control the stimuli presented to participants, and we restrict the manner in which those participants may respond. We do so to make our scientific task tractable and/or to facilitate access to a phenomenon of interest. Our hope is that, by designing many such experiments, each simplifying the phenomenon in different ways, we can construct causal models that simultaneously accommodate all experiments and, by doing so, provide a better theory-based characterization. But for most causal models in science, we are less interested in how well a model accounts for the data that we have just observed than in how well it might accommodate new data from a different experiment that relates to the same phenomenon, one that perhaps measures or simplifies it in a distinct way (Busemeyer & Wang, 2000; Schat et al., 2020) or engages with it through a different set of experimental manipulations (Baribault et al., 2018). In this sense, every meaningful scientific inference is necessarily an act of generalization. Even within the same experiment, for a given dataset, we may be seeking good generalization and predictive performance at different levels. For instance, we may care about generalization with respect to an individual for a particular session and a particular task, or we may be interested in generalization among particular groups of participants across sessions and tasks, or we may be interested in making estimates at the population level, and so on.Footnote 18 Figure 8 illustrates a couple of examples. Depending on the context, we may adjust the way we compare model predictions with data, which can range from coarse qualitative patterns to fine-grained quantitative predictions (Shiffrin & Steyvers, 1997). To fully engage with generalizability in all its relevant senses, researchers need a large methodological toolbox as well as the substantive background knowledge that allows for sensible decisions to be made—nothing that could ever be replaced by some default application of Bayes Factors.

Fig. 8

Two examples of generalization. In the example of “within-experiment generalization”, a model characterization obtained with a portion of the data observed in Experiment 1 (Exp. 1) is used to predict the remainder of the data. In the example of “between-experiment generalization”, a model characterization obtained with the (complete) data observed in Experiment 1 is used to predict the data observed in Experiment 2 (Exp. 2)
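
The two senses of generalization in Fig. 8 can be sketched with a toy example (hypothetical data and a deliberately simple model using plug-in point estimates rather than a full Bayesian treatment):

```python
import numpy as np
from scipy import stats

# Minimal sketch of the two senses of generalization from Fig. 8, using a
# deliberately simple model (a Normal with unknown mean and variance).
# The two "experiments" below are hypothetical.
rng = np.random.default_rng(2)
exp1 = rng.normal(loc=0.4, scale=1.0, size=60)   # hypothetical Experiment 1
exp2 = rng.normal(loc=0.3, scale=1.2, size=60)   # hypothetical Experiment 2

# "Within-experiment generalization": characterize the model with a portion of
# Exp. 1 and predict the remainder.
fit_half, holdout_half = exp1[:30], exp1[30:]
mu_hat, sd_hat = fit_half.mean(), fit_half.std(ddof=1)
within_score = stats.norm.logpdf(holdout_half, mu_hat, sd_hat).mean()

# "Between-experiment generalization": characterize the model with all of
# Exp. 1 and predict Exp. 2.
mu_all, sd_all = exp1.mean(), exp1.std(ddof=1)
between_score = stats.norm.logpdf(exp2, mu_all, sd_all).mean()

print("mean held-out log density (within-experiment): ", round(within_score, 3))
print("mean held-out log density (between-experiment):", round(between_score, 3))
```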

But even when restricting ourselves to generalizability as understood in the context of Bayes Factors, it is important to understand the specific circumstances in which they are most informative. As is well known, Bayes Factors work best when at least one of the models under consideration closely approximates the data-generating process (often referred to as the “\({\mathscr{M}}\)-closed” setting; Navarro, 2018).Footnote 19 In scenarios where substantial misspecification is likely for all of the models under consideration (the “\({\mathscr{M}}\)-open” setting), the Bayes Factor often behaves poorly. In the \({\mathscr{M}}\)-open setting, the large-sample behavior of the Bayes Factor is to select, with absolute certainty, the model that is closest to the truth in terms of Kullback-Leibler divergence, and, as noted by Navarro (2018), this is not always well matched to the practical goals of the scientist. In the pre-asymptotic case, recent results by Oelrich et al. (2020) highlight how this “overconfidence” can be dangerous: a Bayes Factor analysis can often appear to lend overwhelming support to a given model (when other model selection criteria do not), yet this extreme confidence can rapidly shift with only a small amount of new data. More precisely, they derive the expected value and variance of the Bayes Factor in the linear regression setting (results which, admittedly, may differ somewhat in the LMM setting), along with the conditions under which the Bayes Factor is most prone to overconfidence. Of particular note to psychologists who hope to make robust inferences from data, Theorem 1 of Oelrich et al. (2020) suggests that the Bayes Factor has the highest variance (and is thus most prone to overconfidence) in situations where the models under consideration have little shared complexity and explain the same phenomenon in vastly different ways.

From a theoretical perspective, this property of the Bayes Factor is dangerous. As scientists, we want our causal models to be highly distinguishable from one another and consequently design our experiments to make them distinguishable. That is, we choose experiments that render model predictions as different as possible and minimize shared complexity in the measurement context. We do so because we are hoping to end up in a situation in which one model is clearly better than the others. Sometimes we succeed, and the data are so overwhelmingly dispositive that every model-selection criterion gives similar answers (and accordingly we have no practical reason to prefer Bayes Factors over other criteria). However, when our efforts end up producing ambiguous results, the Bayes Factor can be prone to overconfidence, leading the researcher to falsely conclude that the results are decisive. As noted by Oelrich et al. (2020), this risk is most severe in situations where both models provide only gross mischaracterizations and yet each contains a grain of truth: when the two models are vastly different from each other but both capture something relevant to the phenomenon. Arguably, this situation is the norm in psychological research, and yet the formal results reported by Oelrich et al. (2020) suggest that this is exactly when the Bayes Factor is most overconfident, yielding extreme preferences for a specific model candidate (the one with minimal Kullback-Leibler divergence) that can swing wildly under minor changes to the experimental design or data.
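
A small simulation can convey the flavor of this instability (the generating process, candidate models, and priors below are hypothetical choices for illustration, not the setting analyzed by Oelrich et al., 2020): when the data come from a process that neither model captures well, the log Bayes Factor between two quite different models can vary substantially across replications of the same small experiment.

```python
import numpy as np
from scipy import stats

# Minimal sketch: both candidate models (Normal vs. Laplace likelihoods with a
# common prior on the location) are misspecified for the true, heavy-tailed
# process, yet each captures part of it. All settings here are hypothetical.
rng = np.random.default_rng(1)

def log_marginal(loglik_matrix):
    """Stable Monte Carlo estimate of log E_g[f(y | theta)] from a matrix of
    pointwise log-likelihoods whose columns index prior draws."""
    ll = loglik_matrix.sum(axis=0)            # log f(y | theta_j) for each draw
    m = ll.max()
    return m + np.log(np.exp(ll - m).mean())

log_bfs = []
for _ in range(50):                           # 50 replications of the "experiment"
    y = rng.standard_t(df=3, size=25) + 0.5   # true process: shifted, heavy-tailed
    theta = rng.normal(0.0, 1.0, size=5_000)  # prior draws for the location
    lm_norm = log_marginal(stats.norm.logpdf(y[:, None], loc=theta))
    lm_lap = log_marginal(stats.laplace.logpdf(y[:, None], loc=theta))
    log_bfs.append(lm_norm - lm_lap)          # Model A (Normal) vs. Model B (Laplace)

log_bfs = np.array(log_bfs)
print("mean log BF(A vs B):", log_bfs.mean().round(2))
print("sd of log BF across replications:", log_bfs.std().round(2))
```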

Endeavors in mathematical psychology, and science more broadly, are concerned with building increasingly better models of the causal mechanisms underlying phenomena of interest. When trying to distinguish two equally well-performing models, there are two options that we can pursue: The first option is to expand the domain of application of the models by adding new variables to be explained (e.g., confidence as well as response times and accuracy) and/or new tasks to which they can be applied; we keep expanding the domain until the models become distinguishable once more. The second option is to restrict the domain of application by focusing on portions of the data for which the models do make diverging qualitative predictions that do not hinge on auxiliary assumptions (Birnbaum, 2010; Kellen et al., in press; Stephens et al., 2018). In either case, Bayes Factors—and quantitative model comparisons generally—are not well suited to such a workflow. In addition to the illusion of confidence that a Bayes Factor can convey when dealing with equally good models, its handling becomes increasingly cumbersome as the models under consideration become more complex—not only does the computation of the Bayes Factor become increasingly hard, it also becomes increasingly sensitive to priors over additional parameters, as noted above. Treating the Bayes Factor as a “gold standard” for model selection can, in the worst case, discourage the use of more complex models, which would in turn discourage researchers from expanding the domains of application. As a consequence, psychologists would be relegated to dealing exclusively with toy-like problems and simplistic studies rather than the ecological contexts where models can be most impactful. As important as generalizability is to a successful causal model in science, many forms of generalizability cannot be assessed via quantitative model comparisons based on criteria like the Bayes Factor, which are tuned to discriminating between descriptive models.

Final Thoughts

Ultimately, our message is simple: Don’t let the tail wag the dog. If we want descriptive modeling to serve the goals of science, critical thinking cannot be outsourced to statistical methods based on default assumptions. A main goal of science is to explain phenomena in terms of causal mechanisms that are expected to generalize across related scenarios. In service of this goal, we construct formal models, which exist on a continuum from descriptive to causal. While causal models explicitly represent mechanisms and illustrate how they yield patterns in data, descriptive models are designed to identify those very same patterns. Statistical modeling, such as the LMMs illustrated by vDAHSW, serves a primarily descriptive function, but for inferences from descriptive models to aid scientific goals, their assumptions must be aligned with potential causal mechanisms. Adopting default assumptions impedes this alignment, leading to erroneous inferences and apparent paradoxes that can only be resolved by considering how a model instantiates particular causal mechanisms. This instantiation requires one to think, among other things, about the coordination function that relates model components to observables. Finally, although statistical methods like Bayes Factors can distinguish between models in a descriptive sense, they are often not suited to distinguishing between causal models, which should be assessed based on broader criteria of generalizability.

In the last decade or so, there has been a tremendous shift toward the adoption of Bayesian methods. This shift is in part motivated by a perceived association between the use of frequentist methods and the prevalence of questionable research practices that lead to unreplicable and ungeneralizable results (e.g., Wagenmakers et al., 2011). However, there is a lingering concern that this shift will not bring an end to mindless “statistical rituals” (Gigerenzer, 2018), only that these rituals will be moved to another temple. This would not be surprising at all, considering the numerous historical surveys arguing that psychology as a field has long been bound together by methodological standardization, and by the belief that the latter is sufficient for the establishment of scientific enterprises (e.g., Danziger, 1994; Flis and van Eck, 1999; Mackenzie, 1977). One response to this concern is that Bayesian modeling requires a careful specification of one’s prior beliefs, and that the transition between prior and posterior beliefs can occur in a transparent manner. The hope is that, under the aegis of Bayes, the researcher will be strongly encouraged to think more carefully about models (e.g., their relationship with theoretical statements) and the evidence that can be derived from their contact with data. As much as we may hope that this shift in attitudes will come to pass, the reliance on default methods and standardized effects that has accompanied the recent surge in Bayesian modeling prevents us from being optimistic. The different issues raised by vDAHSW have provided us with a valuable opportunity to articulate some of our concerns in more concrete terms.

It would be a mistake to interpret our critique as an attack on Bayesian modeling in general, the use of Bayes Factors as an inferential tool, or vDAHSW for that matter. We believe that, with informed priors, Bayes Factors can help adjudicate questions about the relative descriptive value of different models. Moreover, we believe that Bayesian methods are ideally suited to representing and propagating uncertainty in a principled way; this is especially useful in causal modeling, where joint posterior distributions over model parameters can be helpful in understanding the relationships between components of a potentially complex model and data. Rather, our target here is the often-tacit understanding of modeling as something that can be practiced fruitfully when distanced from the data and bound by constraints that are divorced from substantive theoretical considerations. Progress cannot be achieved by a global, one-size-fits-all application of methods, regardless of their particular merits: stricter p-values, standardized effect sizes, \(p_{\text{rep}}\), Bayes Factors, multiverse analyses, pre-registrations, etc. Regardless of the statistical framework adopted, progress can only come from a careful (and often piecemeal) scrutiny of the relationship between the different theoretical components, experiments, and data, along with clear statements of goals, values, and compromises (see Mayo, 1996; Mayo, 2018; Navarro, 2018; Kellen, 2019). W. W. Rozeboom, one of the earliest critics of p-values and proponents of Bayesian reasoning (Rozeboom, 1960), summarized it best in a related discussion:

... it’s important for those who advance to graduate science to learn that holistic accept/reject statistical-test appraisals of a target theory is no more creative science than warming a frozen supermarket dinner is gourmet cooking. An alpha-grade theoretical scientist should be able to search out and devise diagnostics for basic features in a probated theory’s conceptual structure that haven’t yet been linked to observable consequences, and to rejoice when new procedures of data production disclose hitherto unknown data patternings that deepen our abductive access to the explanatory sources of these phenomena. [...] astute evidence appraisal focuses on select features of the hypothesis at issue with only secondary confidence adjustments, if any, in its remainder. Holistic acceptance/rejection is for amateurs. Rozeboom (2008, p. 1123).