Many fields in psychology, including social, developmental, clinical, and organizational psychology, describe phenomena that are dynamic, meaning that they evolve over time (Kozlowski et al., 2013; Lord et al., 2010; Wang et al., 2016). Examples include the C-Escalation and D-Escalation Theory of Conflict Escalation (Collins, 2012), Self-Regulation Theory (Carver and Scheier, 1998), Conservation of Resources Theory (Hobfoll, 1989), and the Recurring Phase Model of team processes (Marks and Mathieu, 2001). However, the dynamic processes described by these theories can be difficult to observe or measure directly. They may consist of a mixture of different processes that operate in different ways in different contexts or for different people. To further complicate matters, different processes may act at different levels and unfold over different time scales.

To test a dynamic theory, a researcher needs a temporal design that allows the theoretical processes to be observed as they unfold over time, and a statistical model that operationalizes the processes described by that theory (Collins, 2006). Intensive longitudinal designs are becoming more widely used, and a range of statistical models have been used to analyze the data collected using these designs. Common examples include multilevel regression, latent growth curve models, and latent change score models. However, it can be difficult to test a dynamic theory using these models. As noted by DeShon (2012), multilevel and latent growth models are poorly suited for testing dynamic theory, because they do not describe the process by which the variables change over time. Latent change score models can be used to test certain types of dynamic theories, but their implementation is complex, and the complexity increases dramatically as the number of variables and time points increases. This can make even a relatively simple conceptual model prohibitively difficult to specify (Wang and Meng, 2016; DeShon, 2012).

In this paper, we describe a Bayesian approach to modeling dynamic processes that is more flexible than common extensions of the Structural Equation Modeling (SEM) framework. The approach that we illustrate is based on methods often used in cognitive psychology. However, dynamic theories are relevant to a range of fields in psychology, including organizational, social, developmental, and clinical psychology. We believe there is a potential for these methods to have a broader impact. Tutorial papers outlining similar methods exist but are primarily aimed at a cognitive psychology audience (Lee and Wagenmakers, 2013). Our aim is to present a tutorial that is accessible and applicable to researchers working in a range of sub-disciplines, particularly those who are new to Bayesian analysis. Under the Bayesian approach that we describe, there is virtually no limit to the form of the model that can be specified. The flexibility of this approach makes it possible to construct statistical models that map more directly onto psychological theory. This enables a closer coupling between theory and model, and allows for a clearer test of the theory. In the next sections, we describe some of the challenges associated with analyzing dynamic processes, and discuss how these can be overcome using the Bayesian approach. After that, we demonstrate the Bayesian approach to modeling dynamic processes using data from a recent study to facilitate understanding. We begin by showing how this framework can be used to model the change process at the level of the sample, where one assumes the same process is operating for all participants. We then show how this framework can be extended to model variability across people, known groups (e.g., experimental conditions), or latent subgroups of participants.

Throughout the paper, we have aimed to keep the level of exposition accessible to the researcher who has some experience using methods such as multilevel or structural equation modeling, but who does not necessarily have any experience with platforms such as R or Stan, or with Bayesian analysis more generally. We have kept the mathematical equations to a minimum, using them only where we believe they are required. We strongly believe that for a paper such as this to be useful to a non-method-oriented reader, there must be enough practical content for the reader to apply this framework to their own work. We have therefore presented the computer code required to specify every model that we demonstrate. The models are implemented in Stan, a program with which we expect many readers will be unfamiliar, so we have provided comprehensive descriptions of the code in the text. Stan also has a user guide available online (http://mc-stan.org/users/documentation/). We have also included all the data and code required to implement the models presented in the paper, and many additional models, on the OSF (https://osf.io/nx2fc/).

Understanding Dynamic Processes

Dynamic processes are typically represented in a closed loop, whereby the outputs of the system influence the inputs, causing the components of the system to change over time (Neal et al., 2017; Wang et al., 2016). A simple example of a dynamic process is a team raising money for a charity. The input to the system is the amount of money raised each month. These inputs accumulate over time and the cumulative effect of these inputs is reflected by the total amount of money raised. One output of the system is the amount of interest earned from the bank each month on the total amount. This output feeds back into the system because it is added to the total amount, which increases the interest earned in the future, and makes the total amount of money earned by the charity grow even faster.

An important feature of a dynamic process is that at least one variable has a memory of its value at previous time steps. This type of variable, often referred to as a “dynamic”, “stock”, or “level” variable, retains its value over time. It can be added to or subtracted from, but it cannot take on new values that are implausible given its previous state (Weinhardt and Vancouver, 2012; Vancouver et al., 2008). In the example above, the amount of money in the bank is a dynamic variable. This variable might increase (e.g., after more donations are deposited) or decrease (e.g., after account fees are paid), but its current value always depends on its previous value.

Theorizing about dynamic processes requires a theory of change. A theory of change describes how the system evolves from one time point to the next. Here, we model the change process as unfolding over discrete time points. Modeling such a process involves identifying the dynamic variables in the system and explaining the rules governing how these variables change over time. There are many highly influential theories in psychology that do just that (Carver and Scheier, 1998; Collins, 2012; Hobfoll, 1989; Marks and Mathieu, 2001; Powers, 2005; Hatch, 1993), and computational models are increasingly being used to formalize these theories and generate quantitative predictions (Grand et al., 2016; Luzardo et al., 2017; McClelland, 2014; Powers et al., 2011; Vancouver et al., 2010). Bayesian models describing dynamic processes are often used in cognitive psychology. These include reinforcement learning models that capture how people act on their environment to maximize cumulative reward (Steingroever et al., 2014) and diffusion models representing how people make rapid decisions (Miletić et al., 2021).

However, researchers typically wish to use their theories not only as predictive tools, but also as the basis for making inferences from a set of observations. For example, having observed the growth in the amount of money in the bank over time, one might wish to estimate the monthly donations that a team member can be expected to solicit. Making such an inference requires translating the theory into a statistical model, and estimating the parameters of the model based on data. However, this process can be complicated by a number of factors. First, there is uncertainty in the quantity being estimated, arising from the fact that the inference is based on observations of a limited number of team members over a limited number of months. Second, there are likely to be individual differences in the abilities of the team members to solicit donations (e.g., some team members may be more experienced in the role than others). Third, there may also be group differences in donation solicitation (e.g., there may be differences between subgroups within distributed teams due to their locations). Finally, dynamic variables are difficult to implement using standard statistical models. These complications can make it difficult to implement statistical models that accurately represent the process described by the theory, which ultimately compromises the ability of the model to test the theory.

Opportunities for Broadening Applicability

In this section, we highlight avenues that we believe have a great deal of untapped potential to add value to the way researchers model longitudinal data.

Bayesian Inference

Bayesian methods have become more widely used in recent years, in part due to the emergence of Bayesian analogues of standard statistical tests, and open-source software that makes them easy to implement (e.g., JASP; JASP Team 2018). However, the models required to analyze longitudinal data are typically more complex than these standard tests, and usually cannot be applied in an off-the-shelf manner. This makes it difficult to develop standard Bayesian implementations for models of longitudinal data. As a result, the uptake of Bayesian methods for modeling longitudinal data within psychology has been slower.

Bayesian analysis has some useful features that make it an attractive option for modeling dynamic processes. At its core, Bayesian inference is based on probability theory (Jeffreys, 1939; Jaynes, 2003). This connection to probability theory allows one to make probabilistic inferences (e.g., “there is a 99% probability that the constant change in a variable is positive”) which are not possible within the classical frequentist approach. Another useful feature is the incorporation of prior information into the analysis, which provides a natural way of developing hierarchical models and for reducing measurement error (Boehm et al., 2018). Bayesian analysis also allows access to information regarding uncertainty in a parameter estimate by providing the full posterior distribution on each parameter. Further, the sophisticated way in which model complexity is defined in Bayesian analysis provides a powerful tool for comparing alternative models.

Hierarchical and/or Mixture Modeling

Multilevel modeling has become commonplace in psychology, due to the frequent need to account for the nesting of individuals within groups. However, most existing applications of latent change score models are implemented as single-level analyses (Wang et al., 2016). This approach assumes that participants are independent from each other and not nested within groups.

The single-level framework generally restricts the inferences that can be made to those involving group-level effects. The single-level approach limits the options a researcher has for exploring the higher level structure of the dynamic process under investigation. For example, one might wish to obtain parameter estimates for each individual to clarify the nature of the between-participant variation in different components of the process. This requires the use of a hierarchical model. One might also wish to examine whether participants cluster into distinct subgroups, which requires the use of a mixture model. Hierarchical and/or mixture models are straightforward to implement within a Bayesian framework, and offer a number of advantages over single-level models (Rouder and Lu, 2005; Kruschke, 2010; Lee, 2011).

By using hierarchical Bayesian models, parameters can be estimated with greater accuracy than with single-level models (Boehm et al., 2018; Rouder and Lu, 2005). This is because individual-level and higher level (i.e., population-level) parameters are estimated simultaneously (Kruschke and Liddell, 2018a). Data from individuals inform the population-level parameters, which in turn inform and constrain the individual-level parameters. This causes “shrinkage” of the individual-level parameters toward the mode(s) of the group, meaning that outliers are more strongly pulled toward the group mean. Essentially, the population-level parameters act as priors that inform the individual-level parameters (Farrell and Ludwig, 2008; Rouder and Lu, 2005). This process not only improves the estimation of individual participants' parameters, but also reduces the influence of outliers and random sampling noise on population-level estimates, thereby allowing researchers to make more accurate inferences about the population from which the sample was drawn (Boehm et al., 2018; Driver and Voelkle, 2018; Kruschke, 2014). In a number of studies, hierarchical Bayesian methods have been found to produce more accurate parameter estimates than single-level analyses, particularly when there are only a limited number of trials per participant (Ahn et al., 2011; Rouder and Lu, 2005).

With hierarchical Bayesian analyses, these parameter estimates are also more informative than those from single-level analyses, as the posterior distributions provide the entire probability distribution of the parameters given the data, and tell us how much certainty we can have in the parameter estimates (Haines et al., 2020; Baldwin and Fellingham, 2013). Finally, Bayesian mixture models also offer advantages over single-level analyses. Mixture models are implemented by assuming that observations are the result of a combination of data-generating processes (e.g., multiple group memberships). This analysis enables one to examine between-group differences in group-level effects and to identify the influence of each group on each individual. In turn, inferences can be made about the existence of latent subgroups. Bayesian hierarchical and/or mixture modeling also allows researchers to better understand individual differences than single-level analyses in which participant data are aggregated (Bartlema et al., 2014; Farrell and Lewandowsky, 2018).

Increased Flexibility

Although the latent change score model (LCSM) is currently regarded as a general approach for implementing longitudinal models, applications of this framework thus far have been limited to a relatively narrow set of research questions. These research questions typically center around identifying factors that covary with subsequent change in a given variable (e.g., Taylor et al. 2017; Sanford 2014). We argue that the restricted range of questions that these models have been used to answer may in part be due to how these models are implemented.

Models such as the LCSM must be constructed within the SEM framework, which can be implemented using programs such as LISREL or Mplus, or in R via packages such as lavaan (Rosseel, 2012), sem (Fox, 2006), and OpenMx (Neale et al., 2016). Although some Bayesian analyses can be implemented in programs such as Mplus (Kaplan and Depaoli, 2012), there are many challenges associated with using SEM for longitudinal data modeling. This framework requires the specification of a unique variable for each observation in the time series. This limits the number of observations that can be analyzed to a relatively small number (around 20; Wang et al., 2016) before the model becomes cumbersome to code and parameter estimation becomes less efficient. Even if the number of observations being analyzed is small, the models implemented within the SEM framework can still be difficult to code. In many cases, applying this type of analysis requires a complex model specification that involves “tricking” the program into estimating the model (Rabe-Hesketh et al., 2007). Multiple parameters must be specified for each observation being analyzed, and many of those parameters must be fixed to particular values in order for the model to be identified. The complexity of implementing these models within the SEM framework increases the likelihood of misspecification, which can invalidate results (Clark et al., 2018). These practical challenges make it difficult for researchers to venture outside the small set of commonly used models.

Model specification issues aside, the questions that can be answered within the LCSM framework represent only the tip of the iceberg when it comes to theorizing about dynamic processes. There are a range of interesting questions that cannot easily be answered using this approach. For example, one might want to test theoretical assumptions that involve conditional relationships (e.g., a salesperson might raise their sales target by some amount if the target is achieved, but otherwise leave it unchanged). Additionally, one might wish to “close the loop” by modeling the process by which changes in the system feed back and influence performance at later time points. These types of questions are often outside the scope of what can be examined using conventional methods.

Until recently, the methods available for theorizing about such nuanced aspects of a dynamic process were limited to simulation studies (e.g., Davis et al. 2007; Harrison et al. 2007; Ballard et al. 2017; Vancouver et al. 2008). A simulation study involves instantiating a theory in the form of a computational model, selecting parameter values for the model, and then simulating the model using the selected parameter values to examine how the theorized process plays out over time. While simulation studies are useful for understanding and developing the predictions of a theory, we can go a step further by testing the theory. This involves fitting the model to data and estimating the model’s parameters. In the following sections we demonstrate this process, using a framework that is flexible enough to implement not only commonly used longitudinal models, but also to estimate the parameters of computational models that make relatively specific assumptions.

Modeling a Single Group

In the sections that follow, we demonstrate an alternative, Bayesian framework for modeling change that provides the opportunities described above. The aim of a Bayesian analysis is to determine the inferences that can be made in light of some empirical observation (Kruschke et al., 2012; Etz et al., 2018; Lee and Wagenmakers, 2013). The researcher’s a priori beliefs regarding each parameter, referred to as its prior distribution (henceforth, prior), are combined with information about the likelihood of the data given the model. This produces a posterior distribution (henceforth, posterior) on each parameter. The posterior distribution represents the credible range of parameter values given the researcher’s prior and the observed data under the assumptions of the model. The posterior may be highly dispersed and cover a broad range of possible values. In this case, the true parameter value is highly uncertain. On the other hand, the posterior may be narrower and cover only a small range of values, meaning the researcher can be more certain about the true parameter value.
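Formally, the posterior is proportional to the likelihood multiplied by the prior,

$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta),$$

where \(\theta\) denotes the model parameters and \(y\) denotes the observed data.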

While the Bayesian approach can be understood conceptually, implementing Bayesian methods can often be complex. For most models, especially those likely to be of interest in psychology, there is no closed-form analytic solution for calculating the posterior. Therefore, the posterior must be approximated using Markov chain Monte Carlo (MCMC) methods. MCMC methods refer to a class of algorithms that can be used to generate a large number of representative samples from the posterior distribution (see Kruschke 2010; Van Ravenzwaaij et al. 2018). These methods produce sequences of sampled parameter values that, given a sufficient number of samples, approximate the posterior. Bayesian models can be implemented using MCMC methods with many open-source platforms, including WinBUGS (Lunn et al., 2000), JAGS (Plummer, 2003), OpenBUGS (Thomas et al., 2006), and Stan (Carpenter et al., 2017). These programs allow the user a great deal of flexibility in model specification, so the user can focus on the development of the model itself while the MCMC algorithm is implemented under the hood. We use the Stan platform in this paper because its MCMC algorithm, the No-U-Turn Sampler (Hoffman and Gelman, 2014), is particularly efficient for implementing more complex models with many parameters (e.g., hierarchical models; Stan Development Team 2017).

To facilitate understanding, we use an example of goal regulation, which is relevant to many areas within psychology. Goal regulation is a dynamic process whereby people assess and revise their goals over time based on previous performance. To illustrate the dynamic process of goal regulation, we use real data from an air traffic control simulation study by Gee et al. (2018). In the study, 60 participants each completed ten 10-minute trials in which they classified aircraft pairs as conflicts or non-conflicts based on their minimum distance of separation. Before each trial, participants set a goal regarding the number of aircraft pairs correctly or incorrectly classified. The aim of the study was to examine how people revised their goals over time.

Gee et al. (2018) proposed a formal model of goal revision, which describes the dynamic process by which people revise their goals in response to previous performance. The model is discrete, meaning that the process is assumed to change at set points in time. Throughout the remainder of the paper, we use a version of this model as a running example to facilitate understanding. The model predicts that a person changes their goal based on the discrepancy between their previous goal and previous performance. The model has an auto-regressive component whereby it retains a “memory” of the previous time step. This process can be expressed formally as follows:

$$G_{t} = G_{t-1} + \alpha (P_{t-1} - G_{t-1}) + \beta , \qquad (1)$$

where \(G\) is the goal level and \(P\) is the actual performance. The \(\alpha \) parameter represents the learning rate: higher values of \(\alpha \) mean that goal revision is more responsive to the discrepancy between the previous goal and previous performance. The \(\beta \) parameter represents the component of goal revision that occurs irrespective of the discrepancy between goal and performance.
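As a concrete illustration, if a person's previous goal was \(G_{t-1} = 10\), their previous performance was \(P_{t-1} = 14\), \(\alpha = 0.5\), and \(\beta = 0.2\), the model predicts a revised goal of \(G_{t} = 10 + 0.5(14 - 10) + 0.2 = 12.2\) (these parameter values are chosen purely for illustration).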

Testing this goal revision model requires a highly flexible statistical framework. To adequately represent this model, goal level must be treated as a dynamic variable that retains a “memory” of its previous state. This type of dynamic process is difficult to represent using standard statistical approaches, but it is straightforward to implement within the Bayesian framework we demonstrate below. In the following subsection, we demonstrate how the goal revision model can be implemented as a sample-level model, in which a single set of parameters is estimated for the entire sample. We then extend this framework to construct more sophisticated models of inter-individual variation.

Sample-Level Model

In this section, we show how to implement the goal revision model in Stan. The model has two theoretically meaningful parameters: \(\alpha \) and \(\beta \). To implement the goal revision model as a statistical model, a third parameter is required that represents the residual standard deviation of the goal level (\(\sigma \)). We begin by estimating these parameters at the sample level. In other words, we describe the behavior of the entire sample using a single set of parameters. The sample-level model assumes that the goal revision process plays out in the same way for each participant. The top-left panel in Fig. 1 shows the model depicted as a graphical plate model.

Graphical plate models, depicted in Fig. 1, are useful for illustrating the dependencies between the model variables. The variables are represented as nodes. Shaded nodes reflect variables observed from the data, whereas unshaded nodes reflect latent variables and parameters estimated by the model. The rectangular plates represent the sources of variability in the model. The observed goal and performance variables (denoted G and P, respectively) are inside both plates, which indicates that they vary across people and time. The three estimated parameters are outside both plates, which indicates that they do not vary. The arrows drawn between the nodes highlight the dependencies between variables. The precise nature of these relationships is summarized in Eq. 1, but the arrows provide a useful summary. The straight arrows from P, \(\alpha \), \(\beta \), and \(\sigma \) indicate that G is partly determined by these variables. The arrow that loops from G back into itself indicates that G is also influenced by its own previous value: the predicted value of G at a given time feeds back to influence the predicted value at the next time point. In other words, G is a dynamic variable.

Fig. 1: Graphical and mathematical representation of the Bayesian group-level model

Like programs such as Mplus or Stata, Stan is a syntax-based platform, which means that the model must be expressed via computer code. The code required to specify the sample-level model is presented below. For simplicity, the code presented throughout this paper treats the observed_goal variable as continuous, despite it being measured on a discrete rating scale. However, we have included code for a sample-level model where the observed_goal variable is treated as ordinal in the supplementary material. The two models produce similar patterns of parameter estimates.

As can be seen, the program requires three unique blocks of code. The first is the data block. Here, the user must declare the data to be used as inputs to the model. The Ntotal variable is a single value that indicates the total number of data points. In this dataset, Ntotal \(=600\) (60 participants \(\times \) 10 trials). The trial, observed_goal, and performance variables are columns of values representing the trial number, the goal set by the participant at the start of that trial, and the performance on the trial, respectively. The [Ntotal] after each variable name indicates that the number of data points in the variable is equal to Ntotal. Note that Stan requires the user to declare an object type for each input variable. In the above, the int identifier is used to declare Ntotal as an integer. The other variables are declared as real, indicating that they can take on any real value.

The second block is the parameters block. Here, the user must declare the parameters that are to be estimated. In this model, alpha, beta, and sigma represent \(\alpha \), \(\beta \), and \(\sigma \), respectively. All three parameters are declared as real, which means they can take on any real value. The sigma parameter is declared with the \(\mathtt {<lower=0>}\) constraint, which imposes a lower bound of 0 on this parameter (necessary because a standard deviation, by definition, must be positive).

The third block is the model block. This is where the details of the model implementation are specified. Line 16 initializes a new variable called predicted_goal, which will be used to store the goal level predicted by the model. Lines 19–21 specify the priors for the three parameters. The \(\sim \) operator is used to define a variable or parameter that has a distribution, and can be read as “has a distribution of”. When an unknown parameter (as opposed to a known variable) is specified on the left side of the \(\sim \) operator, the distribution given on the right side becomes the prior for that parameter. The alpha and beta parameters are assigned normally distributed priors with a mean of 0 and standard deviation of 1. In the context of this dataset, these are uninformative priors that do not place a high degree of prior belief on any particular region of the parameter space. The sigma parameter is also assigned a normally distributed prior. However, because we imposed a lower bound of 0 on this parameter in the parameters block, the algorithm will only sample positive values from this distribution. The process of choosing and assessing priors is discussed further on page 24 and by Rouder et al. (2009), Wagenmakers et al. (2011), and Vanpaemel (2010).

[Figure a: Stan code for the sample-level model]
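The published listing appears in the original figure and on the OSF. The sketch below is a reconstruction based on the description in the text, so the exact line numbering, prior settings, and handling of the lagged goal (here the discrepancy is computed from the model's previous prediction) may differ from the original code.

data {
  int Ntotal;                          // total number of data points (60 participants x 10 trials = 600)
  real trial[Ntotal];                  // trial number (1-10), resetting at 1 for each participant
  real observed_goal[Ntotal];          // goal reported at the start of each trial
  real performance[Ntotal];            // performance on each trial
}
parameters {
  real alpha;                          // learning rate
  real beta;                           // constant change component
  real<lower=0> sigma;                 // residual standard deviation
}
model {
  real predicted_goal;                 // dynamic variable holding the model's predicted goal

  // priors
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 1);

  // likelihood
  for (i in 1:Ntotal) {
    if (trial[i] == 1) {
      predicted_goal = observed_goal[i];     // first trial: prediction equals the observed goal
    }
    if (trial[i] > 1) {
      // Eq. 1; the original listing may compute the discrepancy from the
      // observed rather than the predicted previous goal
      predicted_goal += alpha * (performance[i - 1] - predicted_goal) + beta;
    }
    observed_goal[i] ~ normal(predicted_goal, sigma);
  }
}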

The rest of the model block is where the model itself is defined. By “model”, we mean the sequence of operations that are required to determine the likelihood of the data given the sampled parameter values. As can be seen, the model is constructed using a for-loop. The line for(i in 1:Ntotal) initializes a for-loop that iterates through the series of consecutive integers from 1 to Ntotal. The looping variable, i, is reassigned with each execution of the loop. In the first execution, i \(=1\); in the second, i \(=2\); and so on. The for-loop therefore iterates through the data points, performing a series of operations on each one.

For each data point, the model first calculates the predicted goal level and assigns the prediction to the predicted_goal variable. The trial variable resets to 1 for each participant’s first trial, and because there is no trial preceding the first one, the predicted goal on the first trial is assumed to be equal to the observed goal level (Gee et al., 2018). The predicted goal level on trial 1 is specified on lines 25–27. Line 25 contains an if statement, which carries out a particular operation only if a condition is met. In this case, if the value of trial for data point i is equal to 1, the code on line 26 will be executed. If the condition is not met, line 26 will be skipped.

On lines 28–30, a separate if statement is used to specify the predicted goal level for trials 2 through 10. In this case, the code on line 29 will be executed only if the value of trial for data point i is greater than 1. Line 29 specifies the change in the predicted goal level to be determined according to Eq. 1. The += operator treats predicted_goal as a dynamic variable. This operator changes the current value of the variable by the amount given on the right-hand side of the expression (e.g., if \(x = 2\), then x += 3 will change the value of x to 5).

As a final step, the model compares the predicted goal level to the observed goal level. On line 31, observed_goal is treated as a dependent variable. When a known variable (as opposed to an estimated parameter) is specified on the left side of the \(\sim \) operator, it is treated as an outcome variable, with the arguments on the right side representing the predicted distribution of that variable. On line 31, we assume that the value of observed_goal for data point i comes from a normal distribution, with a mean equal to predicted_goal and standard deviation equal to sigma. The algorithm will therefore evaluate the likelihood of the observed goal given the current values of these variables, and update the posterior accordingly. In other words, we are effectively regressing the observed goal on the predicted goal while fixing the intercept to 0 and the slope to 1, with sigma representing the residual standard deviation of the observed goal.

To run the model, Stan can be called from R, Stata, MATLAB, Python, Julia, or the command line. In this paper, we run Stan via the RStan package (Stan Development Team, 2016), which provides an R interface to Stan (see the supplementary materials on the OSF for the R code used to run the model). Stan returns an object that contains samples from the posterior distribution for each parameter in the model. As shown below, the RStan package provides a summary of each posterior, including the posterior mean, the standard error of that mean, and the standard deviation of the posterior. It also displays the 2.5%, 25%, 50%, 75%, and 97.5% quantiles of the distribution.
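For readers who want a starting point, a minimal R sketch for fitting the sample-level model via RStan is shown below. The data frame dat and the file name sample_level_model.stan are assumed names used for illustration; the exact call used in the study is provided in the OSF materials.

library(rstan)

# dat: assumed data frame with one row per trial and columns
# trial, observed_goal, and performance
stan_data <- list(Ntotal        = nrow(dat),
                  trial         = dat$trial,
                  observed_goal = dat$observed_goal,
                  performance   = dat$performance)

fit <- stan(file = "sample_level_model.stan",  # assumed file containing the Stan code above
            data = stan_data)                  # RStan defaults: 4 chains, 2000 iterations (1000 warmup)
print(fit)                                     # posterior summaries, n_eff, and Rhat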

The output also provides some information for assessing model convergence. The number of effective samples (n_eff) is a measure of the information contained in the posterior samples that accounts for the autocorrelation between neighboring samples in each MCMC chain. Specifically, the number of effective samples represents the number of independent samples that would produce an equivalent amount of information to the set of samples obtained. If there is no autocorrelation and samples are effectively independent, the number of effective samples will be equal to the total number of samples (4000 in this example, which is the RStan default). For parameters that are more difficult to estimate, samples may be autocorrelated. Autocorrelation reduces the information value of each individual sample because it creates dependencies between consecutive samples. In this case, the number of effective samples will be fewer than the actual number of samples, indicating that the information contained in the posterior is less than it would be if the samples were completely independent. It is important to make sure the effective sample size is large enough that the approximated posterior will be representative of the underlying distribution. Here, the effective sample sizes are sufficiently large that we can be confident in the obtained posteriors, with n_eff for alpha and beta close to 4000 and for sigma equal to 4000. Finally, the output also provides the potential scale reduction factor (Rhat; Gelman and Rubin, 1992; Brooks and Gelman, 1998), which gives an indication of the extent to which the chains have converged on the same region of the parameter space. As the chains approach convergence, Rhat approaches one. It is commonly accepted that Rhat should be less than 1.1 before any inferences are made from the posteriors (Kruschke et al., 2012).

Fig. 2: Posterior distributions on the constant and proportional change parameters for the group-level, person-level, and hierarchical models

[Figure b: RStan posterior summary output for the sample-level model]

The information regarding the parameter posteriors can easily be broken down for more detailed exploration. It is good practice to examine the full posterior distribution on each parameter, as well as the joint posterior distribution between pairs of parameters. For example, the first row of Fig. 2 contains the posteriors on the alpha and beta parameters. The first two columns show an approximation of the full posterior distribution for the two parameters. As can be seen in the top-left panel, the most probable values for alpha lie within the range of 0.4 to 0.6. As can be seen in the top-middle panel, the most probable values for beta lie within the range of about 0.05 to 0.3. This suggests that people increase their goal by an extra (although likely small) amount regardless of the discrepancy between previous goal and previous performance. The top-right panel in Fig. 2 shows the joint posterior on alpha and beta. Here, the innermost contours represent the most probable parameter values.

Once the researcher is satisfied that the model has converged and that the parameters are well estimated (which can be further assessed through a visual inspection of model fit, explained below), they will likely wish to systematically determine the most probable values for each parameter. To do so, researchers can calculate the 95% credible interval (CI). The 95% CI spans the 2.5% and 97.5% quantiles of the posterior distribution. In the above model, the CI on alpha is 0.45 to 0.57, and the CI on beta is 0.11 to 0.24. The 95% highest density interval (HDI) is another commonly used interval for summarizing the posterior distribution. It represents the shortest possible interval containing 95% of the posterior density, and so will always include the most probable parameter values (Kruschke et al., 2012). The CI and HDI both have strengths and limitations (e.g., see Kruschke 2010), but for symmetrical distributions the two intervals will be very similar.

The CI or HDI is often used to decide whether to accept or reject parameter values. For example, if the interval excludes zero, this can be taken as evidence that the parameter value is different from zero (Kruschke et al., 2012). To quantify the evidence for a parameter being either greater than or less than a particular value, researchers can also examine the percentage of the posterior density that is above or below that value (Kruschke, 2010). For example, 100% of the posterior distributions on both alpha and beta are above 0, which provides strong evidence that both of these parameters are positive. It is worth noting, however, that some researchers have argued that making inferences based only on a parameter’s posterior distribution is not sufficient (e.g., Rouder et al. 2016), and that model comparison should be conducted instead. We return to the issue of model comparison later in the paper.
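These summaries can be computed directly from the posterior samples. A minimal sketch using RStan is shown below, where fit is assumed to be the object returned when fitting the sample-level model.

post <- rstan::extract(fit)               # posterior samples for each parameter
quantile(post$alpha, c(0.025, 0.975))     # 95% credible interval on alpha
quantile(post$beta,  c(0.025, 0.975))     # 95% credible interval on beta
mean(post$beta > 0)                       # proportion of the posterior on beta that is above zero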

Although estimating parameters at the sample level can be informative, this approach can only be used to examine average effects. In this model, the alpha and beta parameters represent average parameter values across participants in the sample. These parameters tell us nothing about how these components vary across individuals. To examine such variation, we require a model in which unique parameters are estimated for each person. We present such a model in the next section.

Person-Level Model

There are likely to be instances where one wishes to examine effects at the level of the individual participant. For example, one might assume that a process plays out differently for different individuals. Fortunately, the model presented above can be easily extended to model processes at the person level. To illustrate, the top-right panel of Fig. 1 shows the graphical representation of a model that estimates unique alpha and beta parameters for each participant (while still estimating sigma at the sample level). As can be seen, the \(\alpha \) and \(\beta \) nodes are inside the person plate and subscripted with i, which indicates that these parameters vary across individuals.

The code below shows the specification of the person-level model. Only three simple changes are required. First, two new variables have been added to the data block. Nsubj is a single value representing the number of participants (in this example, \(Nsubj=60\)). subject is a column of integers representing the participant number. Second, in the parameters block, the clause [Nsubj] has been added after the declarations of the alpha and beta parameters (lines 11 and 12). This indicates that alpha and beta are now parameter arrays with lengths equal to the number of participants (as opposed to scalar values). Each element of the array corresponds to one participant, so a unique parameter will be estimated for each one. The third change is in the model block on line 31. As can be seen, [subject[i]] has been added after alpha and beta. This index means that the predicted goal will be calculated based on the alpha and beta parameters associated with the participant whose data are located in the i-th row of the dataset.

[Figure c: Stan code for the person-level model]
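As with the previous listing, the sketch below is a reconstruction of the person-level model based on the description in the text; the original code on the OSF may differ in line numbering and other details.

data {
  int Ntotal;                          // total number of data points
  int Nsubj;                           // number of participants (60)
  int subject[Ntotal];                 // participant number for each data point
  real trial[Ntotal];
  real observed_goal[Ntotal];
  real performance[Ntotal];
}
parameters {
  real alpha[Nsubj];                   // one learning rate per participant
  real beta[Nsubj];                    // one constant change parameter per participant
  real<lower=0> sigma;                 // residual SD, still estimated at the sample level
}
model {
  real predicted_goal;

  alpha ~ normal(0, 1);                // vectorized priors over all participants
  beta ~ normal(0, 1);
  sigma ~ normal(0, 1);

  for (i in 1:Ntotal) {
    if (trial[i] == 1) {
      predicted_goal = observed_goal[i];
    }
    if (trial[i] > 1) {
      predicted_goal += alpha[subject[i]] * (performance[i - 1] - predicted_goal)
                        + beta[subject[i]];
    }
    observed_goal[i] ~ normal(predicted_goal, sigma);
  }
}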

This model produces a set of posterior alpha and beta samples for each participant. The middle row of Fig. 2 shows the results for the alpha and beta parameters. The first two plots show the 95% credible intervals on alpha and beta for each participant. Note that it is possible to access the full posterior for each participant, but the credible interval is more visually compact and makes it easier to compare a larger number of participants. As can be seen, there is heterogeneity between participants in both \(\alpha \) and \(\beta \). The right panel shows the \(\alpha \) and \(\beta \) parameters for each participant plotted against each other (with the crosses representing the credible intervals for each parameter).

Hierarchical Model

While parameter estimation at the person level has advantages over the sample-level model, one problem with the person-level model is that it can be difficult to make inferences about the population from which participants are drawn. Fitting a person-level model is akin to running a separate, single-level analysis on each participant in a sample. This approach has less power because it only considers data from one participant at a time, and ignores the rest of the sample. It also fails to capture commonalities between individuals that may arise from participants being members of the same population.

It is quite often the case that researchers wish to examine variation between participants, but also to make inferences about the population as a whole. A hierarchical modeling approach is extremely useful for this purpose (Turner et al., 2013; Vincent, 2016; Kruschke and Vanpaemel, 2015). Hierarchical Bayesian models allow researchers to “have their cake and eat it too” by modeling the individual and population levels simultaneously (Lewandowsky and Farrell, 2011). As with the person-level model, the hierarchical model estimates unique parameters for each individual. However, unlike the person-level model, the hierarchical approach also models the distribution of the person-level parameters at the population level. As a result, person-level parameters are informed not only by the data for that one individual, but also by the population-level distribution on the relevant parameter, which lessens the influence of outliers on parameter estimates (Boehm et al., 2018; Rouder and Lu, 2005).

The bottom-left panel of Fig. 1 shows the graphical representation of a model that estimates the alpha and beta parameters hierarchically. As can be seen, this model assumes that \(\alpha \) and \(\beta \) are each informed by two higher level parameters—\(\mu \) and \(\sigma \)—which together describe the nature of the variability in these parameters at the population level.

The code below shows the parameters and model blocks required to implement this model (the data block does not need to be changed from the previous example).

[Figure d: Stan parameters and model blocks for the hierarchical model]
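A sketch of these two blocks, reconstructed from the description in the text, is shown below. The data block is the same as in the person-level model, and the exact hyperprior settings and line numbering in the original listing may differ.

parameters {
  real alpha[Nsubj];                   // person-level learning rates
  real beta[Nsubj];                    // person-level constant change parameters
  real alpha_mean;                     // population-level mean of alpha
  real<lower=0> alpha_sd;              // population-level SD of alpha
  real beta_mean;                      // population-level mean of beta
  real<lower=0> beta_sd;               // population-level SD of beta
  real<lower=0> sigma;
}
model {
  real predicted_goal;

  // person-level parameters are drawn from population-level distributions
  alpha ~ normal(alpha_mean, alpha_sd);
  beta ~ normal(beta_mean, beta_sd);

  // hyperpriors on the population-level parameters
  alpha_mean ~ normal(0, 1);
  alpha_sd ~ normal(0, 1);
  beta_mean ~ normal(0, 1);
  beta_sd ~ normal(0, 1);
  sigma ~ normal(0, 1);

  for (i in 1:Ntotal) {
    if (trial[i] == 1) {
      predicted_goal = observed_goal[i];
    }
    if (trial[i] > 1) {
      predicted_goal += alpha[subject[i]] * (performance[i - 1] - predicted_goal)
                        + beta[subject[i]];
    }
    observed_goal[i] ~ normal(predicted_goal, sigma);
  }
}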

As can be seen, the hierarchical model requires four new parameters. The first two are alpha_mean and alpha_sd (lines 14 and 15), which represent the mean and standard deviation of alpha. These parameters characterize the distribution of alpha at the population level. The beta_mean and beta_sd parameters (lines 16 and 17) represent the mean and standard deviation of beta. These parameters characterize the distribution of beta at the population level.

As can be seen on lines 25–26, the priors on alpha and beta differ in the hierarchical model compared to the other models we have demonstrated thus far. Whereas in the sample-level and person-level models these priors are set using fixed values that produce broad, uninformative distributions, in the hierarchical model these priors are set using the population-level means and standard deviations. In other words, the hierarchical model uses information about the population-level distribution as the prior for the person-level parameters. The parameters that define the population-level distributions (alpha_mean, alpha_sd, beta_mean, and beta_sd) are then given their own priors on lines 27–32. The priors on parameters that define higher level distributions are commonly referred to as hyperpriors or parent distributions.

The above hierarchical model produces unique posteriors on alpha and beta for each participant, as well as population-level posteriors for alpha_mean, alpha_sd, beta_mean, beta_sd, and sigma. The bottom row in Fig. 2 shows a breakdown of the posteriors on the \(\alpha \) and \(\beta \) parameters. In the first two columns, the red lines represent the credible intervals on alpha and beta for each participant. The blue densities represent the full posteriors on alpha_mean and beta_mean.

As can be seen, the posteriors on the population-level means in the hierarchical model occupy a similar region of the parameter space as the posteriors in the sample-level model (shown in the top row of the figure). The reader may note, however, that the credible intervals on the person-level parameters are less dispersed in the hierarchical model than in the person-level model (shown in the middle row of the figure). This is because in the hierarchical model, the population-level distribution imposes an additional constraint on the person-level parameters, which pulls the person-level parameters closer to the group mean. This process is called shrinkage, and it can also be seen in the bivariate plot in the bottom-right panel. Shrinkage reduces measurement error because it means that outlying parameter values are considered to be less likely compared to values closer to the population mean, causing parameter estimates to be pulled toward that mean (Boehm et al., 2018).

Having demonstrated the advantages of a hierarchical model compared to a person-level model, it should be noted that a hierarchical model is not always the more desirable choice. For example, a hierarchical model requires some knowledge regarding the variation in person-level parameters (e.g., whether they are normally distributed). Without any knowledge about this variation, it would be difficult to know whether the model is appropriately specified. In this case, it may be useful to first implement a person-level model in order to ascertain the nature of the variation in the parameters, and then to use these results to inform the specification of a hierarchical model.

Modeling Multiple Groups

The models we have addressed thus far examine variation within a single group or population. However, researchers are often interested in variation among multiple groups, such as whether a dynamic process differs across experimental conditions. In this case, they need to examine whether the parameters of interest differ between two or more predefined groups. Alternatively, a researcher may wish to examine whether participants naturally cluster into distinct subgroups. For example, perhaps some participants have a relatively high learning rate whereas others have a relatively low rate.

In this section, we describe two approaches that can help to answer the types of research questions described above. The first is a multiple-group model, in which the aim is to examine whether parameters of interest differ among known groups. The second is a mixture model, in which the aim is to identify naturally emerging latent subgroups into which participants cluster. For simplicity, we demonstrate sample-level models only, where parameters vary across conditions or subgroups but not across people. However, both of the models we present can also be implemented as hierarchical models, where person-level parameters are drawn from different group-level distributions that each represent a unique condition or subgroup. We include code to implement such models in the OSF supplementary material.

Known Group Membership

In the study that generated the example dataset, the researchers manipulated goal framing between participants. Half of the participants had their goal framed in approach terms, depicted as the minimum number of correct decisions the participant needed to achieve. The other half had their goal framed in avoidance terms, depicted as the maximum number of incorrect decisions the participant was allowed to make. In the next example model, we examine whether the goal revision processes differed between these groups by estimating unique \(\alpha \) and \(\beta \) parameters for each group and then comparing them.

The code that specifies this model is shown below. As can be seen, there are three changes required to convert the sample-level model described above to the multiple-group model. First, one new variable needs to be added to the data block. The condition variable (line 6) indicates the condition for each participant (\(1=\) Approach, \(2=\) Avoidance). Second, in the parameters block, the alpha and beta parameters are declared as arrays that each contain two elements. This means that two alpha and beta parameters will be estimated, one for each condition. The final change is on line 32. As can be seen, the indexing statement [condition[i]] has been added to the alpha and beta parameters. This means that the change in predicted goal level will be calculated using the parameter values associated with the experimental condition represented by the i-th row of the dataset.

[Figure e: Stan code for the multiple-group model]
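A reconstruction of this listing, based on the description in the text, might look as follows. The original code on the OSF may differ in detail; here the condition variable is coded for each row of the dataset.

data {
  int Ntotal;
  int condition[Ntotal];               // condition for the participant providing row i (1 = approach, 2 = avoidance)
  real trial[Ntotal];
  real observed_goal[Ntotal];
  real performance[Ntotal];
}
parameters {
  real alpha[2];                       // one learning rate per condition
  real beta[2];                        // one constant change parameter per condition
  real<lower=0> sigma;
}
model {
  real predicted_goal;

  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 1);

  for (i in 1:Ntotal) {
    if (trial[i] == 1) {
      predicted_goal = observed_goal[i];
    }
    if (trial[i] > 1) {
      predicted_goal += alpha[condition[i]] * (performance[i - 1] - predicted_goal)
                        + beta[condition[i]];
    }
    observed_goal[i] ~ normal(predicted_goal, sigma);
  }
}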

The top row of Fig. 3 shows the posteriors on the \(\alpha \) (left panel) and \(\beta \) (middle panel) parameters for the approach and avoidance conditions. As can be seen in the top-left panel, there is separation in the posterior distributions on \(\alpha \) between the approach and avoidance conditions. This result suggests that people are more responsive to the discrepancy between previous goal and previous performance when pursuing approach goals compared to avoidance goals. As can be seen in the top-middle panel, there is also separation in the posteriors on \(\beta \) between the two conditions. This suggests that the component of goal revision that is independent of the discrepancy is larger when pursuing approach goals compared to avoidance goals.

To determine how parameter values differ between conditions, we assess the posterior distribution on the difference between the parameters in the approach and avoidance conditions. For each posterior sample, we calculate a variable that is equal to the sampled parameter value for the approach condition minus the sampled value for the avoidance condition. This generates a difference score for each sample, which can be used to approximate the posterior distribution of the difference score. The posteriors on the difference scores for each parameter are shown in the top-right panel. As can be seen, the difference scores for both parameters are predominantly positive. The credible interval on the difference in \(\alpha \) between approach and avoidance ranges from 0.09 to 0.33, and approximately 99.9% of this distribution is greater than 0. The credible interval on the difference in \(\beta \) between approach and avoidance ranges from 0.12 to 0.36, and approximately 100% of this distribution is greater than 0. These results provide strong evidence that both \(\alpha \) and \(\beta \) are larger in the approach condition than in the avoidance condition.
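In R, these difference scores can be computed directly from the posterior samples. The sketch below assumes fit_groups is the stanfit object for the multiple-group model, with index 1 corresponding to the approach condition and index 2 to the avoidance condition.

post <- rstan::extract(fit_groups)
alpha_diff <- post$alpha[, 1] - post$alpha[, 2]   # approach minus avoidance, per posterior sample
beta_diff  <- post$beta[, 1]  - post$beta[, 2]
quantile(alpha_diff, c(0.025, 0.975))             # 95% credible interval on the difference in alpha
mean(beta_diff > 0)                               # posterior probability that beta is larger for approach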

Fig. 3: Posterior distributions on the constant and proportional change parameters for the known group and unknown group models

Unknown Group Membership

In the above example, the researcher knew beforehand the group to which each participant belonged. However, this is not always the case. Sometimes, the researcher may wish to identify subgroups of participants who behave similarly, without any prior knowledge of group membership. To do so, they can use a mixture model: a model that is used to capture behavior that results from a combination of different processes (Bartlema et al., 2014). For example, there may be a subgroup of participants who behave according to one process, and another subgroup whose behavior is governed by a different process. These different processes are referred to as “mixtures”. The goal of the analysis is to determine the parameters that best characterize each mixture, and to make inferences about the relative influence of each mixture on each participant’s behavior.

The code below specifies a mixture model in which mixtures differ in their \(\alpha \) and \(\beta \) parameters. For simplicity, we model only two mixtures in this example. The supplementary materials on the OSF contain an example of how this model can be generalized to any number of mixtures.

[Figure f: Stan code for the two-mixture model]
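A reconstruction of the two-mixture listing, based on the description that follows, is sketched below; the original code on the OSF may differ in line numbering and other details.

data {
  int Ntotal;
  int Nsubj;
  int subject[Ntotal];
  real trial[Ntotal];
  real observed_goal[Ntotal];
  real performance[Ntotal];
}
parameters {
  real alpha[2];                             // learning rate for each mixture
  ordered[2] beta;                           // constant change for each mixture (ordered for identifiability)
  real<lower=0> sigma;
  real<lower=0, upper=1> mix_weight[Nsubj];  // influence of mixture 1 on each participant
}
model {
  real predicted_goal_1;
  real predicted_goal_2;

  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 1);
  // mix_weight is given Stan's default uniform prior over its [0, 1] support

  for (i in 1:Ntotal) {
    if (trial[i] == 1) {
      predicted_goal_1 = observed_goal[i];
      predicted_goal_2 = observed_goal[i];
    }
    if (trial[i] > 1) {
      predicted_goal_1 += alpha[1] * (performance[i - 1] - predicted_goal_1) + beta[1];
      predicted_goal_2 += alpha[2] * (performance[i - 1] - predicted_goal_2) + beta[2];
    }
    // marginalize over mixture membership: weighted combination of each mixture's likelihood
    target += log_mix(mix_weight[subject[i]],
                      normal_lpdf(observed_goal[i] | predicted_goal_1, sigma),
                      normal_lpdf(observed_goal[i] | predicted_goal_2, sigma));
  }
}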

As can be seen, the data block is identical to that of the person-level and hierarchical models. Two changes are required to the parameters block. As with the multiple-group model, this mixture model estimates two unique alpha and beta parameters. However, in the mixture model, one of these parameters must be declared as an ordered vector (see line 12). The ordered vector is a special Stan object type that constrains the values within it to be in ascending order. This is needed for model identifiability reasons. In this model, we specify beta as an ordered vector while leaving alpha unchanged. The mixture model also includes an additional parameter (line 14): the mixture weight (mix_weight), which indicates the relative influence of each mixture on the behavior of each participant. This weight ranges from 0 to 1, where higher values indicate a stronger influence of mixture 1. The model estimates a unique mixture weight for each participant.

As can be seen in the model block, no changes to the priors are required to implement the mixture model. The reader may note that we have not explicitly assigned a prior to mix_weight. This is because, by default, Stan imposes a uniform prior, which is appropriate for mix_weight because the parameter is bounded between 0 and 1. As can be seen, there are some key differences in the likelihood component of the model block compared to the previous models we have demonstrated. First, the model now calculates two separate predicted goals. The first is calculated based on alpha[1] and beta[1], which are the parameters associated with mixture 1 (lines 29 and 33). The second is calculated based on the parameters associated with mixture 2 (lines 30 and 35).

As can be seen on lines 37–39, the expression of the likelihood itself has also changed. The Stan syntax required to define the likelihood for a mixture model is more complex than the syntax used in the previously demonstrated models. To implement a mixture model in Stan, the mix_weight parameter must be marginalized out of the likelihood (Stan Development Team, 2017). This means that the likelihood of the observation given the model must be calculated based on the weighted sum of the likelihood of the data under each separate mixture.

The bottom row of Fig. 3 displays the results for the above mixture model. The left and middle panels show the differences in \(\alpha \) and \(\beta \) between the two mixtures identified by the model. As can be seen, the posteriors on \(\alpha \) overlap considerably, suggesting that this parameter is very similar between the two mixtures. However, the posteriors on \(\beta \) differ a great deal between the two mixtures. For mixture 1, the most probable \(\beta \) values are just below 0, whereas for mixture 2, the most probable \(\beta \) values are just below 1. The bottom-right panel shows the credible intervals on the mixture weight for each participant. These results suggest that about half of the participants are most strongly influenced by mixture 1 (i.e., have mixture weights that are clearly above 0.5), and about a quarter are most strongly influenced by mixture 2 (i.e., have mixture weights that are clearly below 0.5). For the remaining participants, the relative influence of each mixture is fairly balanced.

Before making inferences based on a mixture model such as the above, it is advisable to test it against alternative models that specify different numbers of mixtures (Bartlema et al., 2014). For example, we might compare the two-group model above to one-group and three-group models based on which provides the best description of the data. The goal of the model comparison is to determine the number of mixtures that is most strongly favored by the evidence. We address the issue of model comparison in the next section.

Model Evaluation

Once the researcher has specified an appropriate model and confirmed that the model has converged, they can evaluate whether the model provides an adequate account of the data. Parameter estimates are only meaningful to the extent that the model is a good description of the phenomenon being investigated. If the model is a poor approximation of the data, then information contained in the parameter estimates will not be representative of the process the researcher is trying to investigate. In such cases, the researcher may need to re-specify the model in order to improve its ability to account for the empirical observations.

In this section, we discuss tools that researchers can use to help determine whether a model provides a satisfactory description of the data. We must emphasize that there is no one-size-fits-all approach to model evaluation, and there are often strong theoretical reasons to prefer one model over another that need to be considered. Here, we simply address a few tools that researchers have at their disposal to facilitate the evaluation process.

Visual Inspection of Model Fit

One of the simplest, yet most powerful methods of evaluating the extent to which a model adequately describes the empirical trends is by visualizing the predictions of the model in the context of the data (Heathcote et al., 2015). In doing so, researchers can usually see any mismatches between model and data quite clearly, which can help to rule out inadequate models. In a Bayesian model, the model predictions are delivered in the form of a distribution, known as the posterior predictive distribution. The posterior predictive distribution represents the predicted distribution of the outcome variable(s) that is implied by the posterior distribution on the model parameters. The posterior predictive distributions of different models can be compared with the observed data to see which best predicts the data. In our case, we compare the two-mixture model with a one-mixture model that estimates a single set of parameters for all participants (the one-mixture model is identical to the sample-level model described earlier in the paper).

The code below generates a set of goal levels from the distribution of the predicted goal; these are later sampled to generate the posterior predictive distributions. This generated quantities block is from the two-mixture model described above; the corresponding code for other models, including the sample-level model, can be found in the OSF supplementary materials. First, the block initializes objects to store the goal level predicted by the model for each mixture (line 2) and the set of goal-level samples (i.e., the posterior predictives; line 3). The code then loops through all trials in the dataset, generating predictions in the same way as in the likelihood block for the two-mixture model (lines 5–16).

Finally, rather than evaluating the likelihood of the observed goal, as is done in the likelihood blocks explained above, the code samples a goal level from the distribution of the predicted goal (lines 18–19). This distribution is a mixture of two normal distributions, with a mean equal to the predicted goal for that mixture and a standard deviation of sigma. The sampled goal level is a weighted sum of the samples from each mixture. The weight applied to the sample from mixture 1 is the mixture weight for the relevant subject, and the weight applied to the sample from mixture 2 is one minus the mixture weight for the relevant subject.

[figure g: Stan code listing of the generated quantities block for the two-mixture model]
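Although the tutorial implements this simulation within Stan's generated quantities block, the same logic can be sketched in R once the relevant quantities have been extracted. The sketch below is illustrative only: pred_goal_m1, pred_goal_m2, sigma and theta are assumed names for the per-trial predicted goals of each mixture, the residual standard deviation, and the participant's mixture weight.

```r
# Minimal R sketch of the posterior predictive step described above (illustrative
# names only; the actual Stan code is shown in figure g and the OSF materials).
simulate_goal <- function(pred_goal_m1, pred_goal_m2, sigma, theta) {
  draw_m1 <- rnorm(length(pred_goal_m1), mean = pred_goal_m1, sd = sigma)  # sample from mixture 1
  draw_m2 <- rnorm(length(pred_goal_m2), mean = pred_goal_m2, sd = sigma)  # sample from mixture 2
  theta * draw_m1 + (1 - theta) * draw_m2  # weighted sum of the two samples
}
```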

To create the posterior predictive distributions for the two-mixture and one-mixture models, the code below samples from the generated quantities blocks of those models. First, R is used to extract the parameters and classify participants as belonging to mixture 1 or mixture 2. The parameters from the two-mixture model (line 1) and the one-mixture model (line 2) are extracted. The mean mixture weight is then computed for each participant (lines 4–5), and these values are added to the observed dataset. A column is also added that classifies each participant as belonging to Mixture 1 (mean mixture weight greater than 0.5) or Mixture 2 (mean mixture weight less than 0.5) (lines 7–9). The new dataset is labelled dat2.

Next, using the Stan output, a list labelled pp_list is created, in which the dat2 dataset is replicated 100 times and predicted goal levels are drawn from the generated quantities block of each model for every replication. Although up to 4,000 samples can be extracted from the generated quantities block, we take 100 (lines 11–12), which is sufficient to generate reliable predictions efficiently. For each of the 100 samples, new variables are added to the replicated dat2 datasets: predicted_goal_1 contains the sampled predicted goals from the generated quantities block of the two-mixture model, and predicted_goal_2 contains those from the one-mixture model. Within each of the 100 new datasets, the code then groups by class, condition and trial, and computes the mean predicted goal for each model (lines 18–19) and the mean observed goal (line 20). For consistency with the remainder of the tutorial, we conducted the simulation in Stan, but it can also be conducted in R.

[figure h: code listing for constructing the posterior predictive datasets (pp_list) from the generated quantities of the two models]
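As a rough sketch of what this step could look like, the code below assumes that each fitted model stores its sampled goal levels in a generated quantities variable called goal_pred (an iterations-by-trials matrix), that the fitted objects are named fit_two_mix and fit_one_mix, and that dat2 contains goal, condition and trial columns; all of these names are assumptions for illustration rather than the exact names used in the tutorial code.

```r
library(dplyr)
library(purrr)
library(rstan)

# Illustrative sketch: pair 100 posterior draws of the sampled goal levels with
# the observed dataset (dat2) and summarise by class, condition and trial.
pred_two <- rstan::extract(fit_two_mix)$goal_pred   # iterations x trials
pred_one <- rstan::extract(fit_one_mix)$goal_pred

draws <- sample(nrow(pred_two), 100)   # take 100 of the available posterior draws
pp_list <- map(draws, function(i) {
  dat2 %>%
    mutate(predicted_goal_1 = pred_two[i, ],      # two-mixture model predictions
           predicted_goal_2 = pred_one[i, ]) %>%  # one-mixture model predictions
    group_by(class, condition, trial) %>%
    summarise(pred_goal_two_mix = mean(predicted_goal_1),
              pred_goal_one_mix = mean(predicted_goal_2),
              observed_goal     = mean(goal),
              .groups = "drop")
})
```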

From this point, the researcher can take pp_list and extract descriptive statistics across the 100 sampled datasets, including the mean of the posterior predictive distribution for each model and condition (approach or avoidance), the 95% credible intervals on the posterior predictive distributions, and the standard error of the observed mean goal for each trial. The plots shown in Fig. 4 can then be generated from these statistics. The code required for extracting the descriptive statistics and creating the plots can be found in the OSF supplementary materials.
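Continuing the illustrative sketch above (and keeping its assumed column names), the posterior predictive means and 95% credible intervals can be computed across the 100 replicated datasets, while the observed means and their standard errors are computed from the raw data:

```r
# Posterior predictive mean and 95% credible interval across the replicated datasets
pp_summary <- bind_rows(pp_list, .id = "draw") %>%
  group_by(class, condition, trial) %>%
  summarise(pred_two_mean  = mean(pred_goal_two_mix),
            pred_two_lower = quantile(pred_goal_two_mix, 0.025),
            pred_two_upper = quantile(pred_goal_two_mix, 0.975),
            pred_one_mean  = mean(pred_goal_one_mix),
            pred_one_lower = quantile(pred_goal_one_mix, 0.025),
            pred_one_upper = quantile(pred_goal_one_mix, 0.975),
            .groups = "drop")

# Observed mean goal and its standard error, computed from the raw data
obs_summary <- dat2 %>%
  group_by(class, condition, trial) %>%
  summarise(obs_mean = mean(goal),
            obs_se   = sd(goal) / sqrt(n()),
            .groups = "drop")
```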

Figure 4 shows summary data from the example dataset superimposed over the posterior predictive distributions from the two-mixture and one-mixture models. This visualization makes it easy to compare each model’s predictions to the empirical trends. As can be seen in the left column, both models reproduce the positive trend in goal level over time. However, the two-mixture model matches the data more closely than the one-mixture model, particularly for participants who the two-mixture model suggested were more strongly influenced by mixture 2 (displayed in the right column). The 95% credible interval associated with the two-mixture model almost always contains the observed mean, suggesting that this model does a good job of describing the data. By contrast, there are many cases where the observed mean falls outside the 95% credible interval associated with the one-mixture model, suggesting that it provides a poorer description of the data.

Fig. 4

Summary data from the example dataset and posterior predictive distributions from the one- and two-mixture models. The red dots represent the observed means. The green line and ribbon represent the mean and 95% credible interval of the posterior predictive distribution from the one-mixture model. The blue line and ribbon represent the same for the two-mixture model. Columns are divided according to participants’ mixture weights in the two-mixture model. The left column displays data from participants who the two-mixture model suggested were most strongly influenced by mixture 1 (i.e., had a mean mixture weight greater than 0.5). The right column displays data from participants who the two-mixture model suggested were most strongly influenced by mixture 2 (i.e., had a mean mixture weight less than 0.5).

There are a few points worth considering when coding and conducting visual inspections of model fit. First, the process of sampling, compiling and plotting posterior predictives can be quite involved, and the code required can differ depending on the types of models being compared. For less complex models, such as regressions, there are R packages available that can make this process easier. These include rstanarm (Goodrich et al., 2020), bayesplot (Gabry and Mahr, 2022) and shinystan (Gabry, 2017). rstanarm is a Stan-based package that can be used to fit single-level or hierarchical regression models in a more succinct manner. Taking the model fits from rstanarm, bayesplot can be used to plot the posterior predictives against the observed data. shinystan provides a Shiny-based graphical user interface, launched from R, that researchers can use to assess model fit and to generate and plot posterior predictives. Muth et al. (2018) provide a useful tutorial on the rstanarm and shinystan packages, Alexander (2020) uses an example to illustrate how rstanarm and bayesplot can be used for model fitting and checking, and Gabry et al. (2019) explain how bayesplot can be used for data visualization.
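To give a flavour of how these packages fit together, the sketch below fits a simple, purely hypothetical hierarchical regression of goal level on trial with rstanarm, overlays posterior predictive draws on the data with pp_check, and opens the shinystan interface; the formula and variable names are illustrative and are not part of the models developed in this tutorial.

```r
library(rstanarm)
library(bayesplot)
library(shinystan)

# Hypothetical example: hierarchical regression of goal level on trial,
# with random intercepts for participants
fit <- stan_glmer(goal ~ trial + (1 | subject), data = dat2)

pp_check(fit)           # overlay posterior predictive draws on the observed outcome
launch_shinystan(fit)   # open the shinystan GUI for interactive model checking
```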

Second, the standards for model-data correspondence are likely to be highly context-dependent. In basic laboratory research, where there is a high degree of experimental control, model predictions are often expected to fit the data extremely closely. In such contexts, even minor discrepancies between the posterior predictive distributions and the experimental data may be a sign that the model is not an adequate explanation of the phenomenon being investigated. By contrast, in observational or field studies, where there is more noise, discrepancies between model and data may be more tolerable. In general, however, researchers should be particularly wary of cases where the model produces a qualitatively different trend than the one observed in the data. For example, if a model predicted that people decrease their goals over time when the observed goals actually increase, this would be a very strong indication that the model does not provide a good explanation and should be rejected.

The final point we wish to make regarding model visualization is that this tool can be used for evaluating any model that makes predictions, not just Bayesian models. In fact, we argue that visually inspecting the fit of a model to the data should be an essential practice when modeling longitudinal data. Without knowing whether a model adequately characterizes the process that is being investigated, it is impossible to know whether parameters from that model can be interpreted meaningfully. Failure to conduct a visual inspection may therefore lead to inappropriate conclusions.

Quantitative Model Comparison

Although visual inspection of model-data fit can be helpful for comparing models, it is usually not sufficient for discriminating between competing models. If more than one model produces a close fit to the data, the differences in model-data fit may be difficult to determine visually. Moreover, evaluation based on visual fit alone ignores the other key question that needs to be considered: model parsimony. If two models fit the data to a similar extent, but differ in terms of complexity, the simpler model should be preferred on the grounds of parsimony (Myung and Pitt, 1997; Vandekerckhove et al., 2015).

Several approaches to Bayesian model comparison have been proposed that quantify the trade-off between fit and parsimony. The standard solution is the Bayes factor (Jeffreys, 1935; Kass and Raftery, 1995): the ratio of the probabilities of the data under each model (known as the marginal likelihoods). It provides a useful index of the relative evidence delivered by the data (and the priors) for one model over an alternative, taking into account both fit and model complexity (Kruschke and Liddell, 2018a). If more than two models are being compared, researchers may examine the marginal likelihoods of all competing models, with a higher marginal likelihood indicating that the observed data are more probable under that model; Bayes factors can then be used to quantify the relative evidence for any pair of models (Fong and Holmes, 2020).
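In notation, for observed data \(D\) and two competing models \(M_1\) and \(M_2\), the Bayes factor is the ratio of their marginal likelihoods,

\(BF_{12} = \frac{p(D \mid M_1)}{p(D \mid M_2)}\), where \(p(D \mid M_k) = \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k\),

so that a \(BF_{12}\) greater than 1 indicates evidence for \(M_1\) over \(M_2\), and its reciprocal quantifies the evidence in the opposite direction.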

However, researchers may face some challenges when using the Bayes factor (Vandekerckhove et al., 2015). First, if models are complex and contain many parameters, calculating the Bayes factor can be difficult, because the likelihood of the data under each model must be integrated across the entire parameter space, which can be computationally demanding. This is especially true for models that are estimated using MCMC methods. In recent years, methods have been introduced for approximating the Bayes factor for models estimated via MCMC, including bridge sampling and importance sampling squared (e.g., Wang and Meng 2016; Evans and Brown 2018; Gronau 2017; Tran et al. 2014), but these can also be computationally demanding.
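As an illustration of how bridge sampling can be applied in practice, the sketch below uses the bridgesampling R package with the (assumed) stanfit objects from the earlier sketches. Note that bridge sampling generally requires the Stan model to retain normalizing constants (e.g., by writing the likelihood with the target += syntax rather than the ~ syntax).

```r
library(bridgesampling)

# Illustrative sketch: approximate each model's marginal likelihood via bridge
# sampling, then form the Bayes factor for the two-mixture model over the
# one-mixture model. fit_two_mix and fit_one_mix are assumed stanfit objects.
ml_two <- bridge_sampler(fit_two_mix)
ml_one <- bridge_sampler(fit_one_mix)
bf(ml_two, ml_one)   # relative evidence for the two-mixture model
```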

Another challenge researchers may face when using the Bayes factor is prior sensitivity. The Bayes factor is often influenced by the researcher’s choice of priors (Rouder et al., 2009). This is less problematic for standard, off-the-shelf statistical tests with established “default” priors (e.g., see JASP, 2018). However, if models are novel and/or more complex, there is unlikely to be a default prior specification, meaning the researcher may be less confident in their choice of priors. In such cases, one can assess the robustness of the results by systematically examining the impact of different priors on the Bayes factor obtained. This process is referred to as prior sensitivity analysis (e.g., Wagenmakers et al., 2011). It is worth noting that prior sensitivity is not necessarily a bad thing. It has been argued that priors are often an important aspect of the underlying theory being tested and are therefore an important consideration when evaluating a model (Vanpaemel, 2010). If the priors are informative, researchers can make strong inferences with the Bayes factor even from small datasets. Even if the priors are relatively uninformative, the robustness of the results can be assessed using prior sensitivity analysis (Kruschke and Liddell, 2018b; Vehtari and Ojanen, 2012; Vanpaemel, 2010; Wagenmakers et al., 2011).
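In practice, a prior sensitivity analysis can be as simple as refitting the competing models under several prior settings and checking whether the Bayes factor changes materially. The sketch below assumes that the prior standard deviation on \(\beta \) is passed to the compiled Stan models as a data value called prior_sd_beta; this is a hypothetical design choice for illustration, not a feature of the models described above.

```r
library(rstan)
library(bridgesampling)

# Hypothetical prior sensitivity analysis: refit both (compiled) models under
# several prior widths for beta and recompute the Bayes factor each time.
# model_two_mix, model_one_mix and stan_data are assumed objects.
prior_sds <- c(0.5, 1, 2)
bf_by_prior <- sapply(prior_sds, function(s) {
  data_s <- c(stan_data, list(prior_sd_beta = s))
  fit_a  <- sampling(model_two_mix, data = data_s)
  fit_b  <- sampling(model_one_mix, data = data_s)
  bf(bridge_sampler(fit_a), bridge_sampler(fit_b))$bf
})
bf_by_prior   # broadly similar values across prior settings suggest a robust result
```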

Several other approaches to quantitative model comparison have been proposed (Gelman et al., 2014; Piironen et al., 2017). These include the Deviance Information Criterion (DIC; Spiegelhalter et al. 2002), the Bayesian Predictive Information Criterion (BPIC; Ando 2018), leave-one-out cross-validation (LOO-CV; Geisser and Eddy 1979) and the Watanabe-Akaike Information Criterion (WAIC, also known as the “widely applicable information criterion”; Watanabe 2010). Similar to the Bayes factor, the DIC and BPIC assess the trade-off between fit and parsimony during model selection, but rather than assessing the relative evidence for two competing models, the DIC and BPIC are computed separately for each model. LOO-CV and WAIC, which are also computed separately for each model, quantify the ability of a model to predict new data, with lower values indicating better predictive accuracy.

When the dataset is large and only a select few models are being compared, LOO-CV and WAIC can be particularly useful because they give a nearly unbiased estimate of the predictive ability of a given model (Watanabe, 2010; Piironen et al., 2017; Gelman et al., 2014). However, as with the Bayes factor, there are challenges associated with LOO-CV and WAIC, and there has been considerable debate surrounding the strengths and limitations of LOO-CV in the literature (Gronau and Wagenmakers, 2018; Chandramouli and Shiffrin, 2018; Vehtari et al., 2018; Gronau and Wagenmakers, 2019). Both LOO-CV and WAIC contain a stochastic error term whose variance can be substantial when the dataset is small or when a large number of models are being compared. This variance can lead to over-fitting during the model selection process and thus to biased estimates of the predictive performance of the models being compared (Watanabe, 2010; Piironen et al., 2017; Gelman et al., 2014; Reunanen, 2003; Cawley and Talbot, 2010; Gronau and Wagenmakers, 2018). Further, if the dataset is large, these approaches can also be computationally demanding (Vehtari and Ojanen, 2012). Finally, LOO-CV can be biased for time series models, because it involves using future observations to predict earlier observations. An alternative version, known as leave-future-out cross-validation (LFO-CV), has therefore been proposed for time series data, with the methods detailed in Bürkner et al. (2020).
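For completeness, the sketch below shows how LOO-CV and WAIC could be computed with the loo R package, assuming each Stan model has been extended to store its pointwise log-likelihoods in a generated quantities variable named log_lik (an assumption; the models above would need to be modified accordingly).

```r
library(loo)

# Illustrative sketch: compare the predictive accuracy of the two fitted models.
ll_two <- extract_log_lik(fit_two_mix, merge_chains = FALSE)
ll_one <- extract_log_lik(fit_one_mix, merge_chains = FALSE)

loo_two <- loo(ll_two, r_eff = relative_eff(exp(ll_two)))
loo_one <- loo(ll_one, r_eff = relative_eff(exp(ll_one)))

loo_compare(loo_two, loo_one)   # difference in expected log predictive density
waic(ll_two)                    # WAIC for the two-mixture model
```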

Parameter Recovery

Along with evaluating and comparing the fit of models, it is important that researchers are confident in the reliability of the parameter estimates their models generate. The parameters of a model should be identifiable, meaning that each combination of parameter values produces unique model predictions; parameters that are identifiable can be estimated reliably. To test this, particularly with new or modified models, researchers can run a parameter recovery analysis. Heathcote et al. (2015), Ballard et al. (2021) and Ballard et al. (2023) all discuss this process. Parameter recovery involves first simulating a new dataset from the model using known parameter values (for example, the estimates obtained from the real data). The model is then fitted to the simulated data, and new parameter estimates are obtained. These recovered parameters are then compared with the original values used to generate the simulated dataset. The comparison can be done visually, by creating a figure (e.g., a scatter plot) that plots the generating parameter values (on the x-axis) against the recovered parameter values (on the y-axis). If the recovered parameters closely match the generating parameters (i.e., the points fall along the diagonal), this indicates good parameter recovery, meaning that the researcher can be confident that the parameters are uniquely identified by the data.
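A schematic version of this workflow is sketched below; simulate_goals() and fit_goal_model() are hypothetical placeholder functions standing in for the researcher's own data-generating and model-fitting code.

```r
# Schematic parameter recovery check (placeholder functions; illustration only)
set.seed(1)
true_beta <- runif(60, min = 0, max = 1)        # generating values, one per participant
sim_data  <- simulate_goals(beta = true_beta)   # simulate a dataset from known parameters
fit_sim   <- fit_goal_model(sim_data)           # refit the model to the simulated data
recovered <- fit_sim$beta_mean                  # posterior mean of beta per participant

plot(true_beta, recovered,
     xlab = "Generating beta", ylab = "Recovered beta")
abline(0, 1)   # points close to the diagonal indicate good recovery
```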

Another step researchers can take to assess how strongly their parameters are informed by the data is to visually compare the prior and posterior distributions of all parameters. If the posterior distributions are similar to the priors, this may suggest that the data have had a relatively small influence on the parameter estimates, perhaps because there were too few, or insufficiently informative, data (Smid and Winter, 2020). If, however, the posterior distributions look very different from the priors, this indicates that the data have had a relatively strong influence on the posteriors. Ultimately, both parameter recovery and prior-posterior comparison are worthwhile steps for understanding the information reflected in a model’s parameter estimates, especially with new or modified models.
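As a simple sketch of such a check, the posterior samples of \(\beta \) for one mixture can be plotted against an assumed prior density (here a standard normal, chosen purely for illustration):

```r
# Overlay an assumed prior and the posterior for beta (mixture 1); names illustrative
post_beta <- rstan::extract(fit_two_mix)$beta[, 1]
hist(post_beta, freq = FALSE, xlab = "beta",
     main = "Prior vs. posterior for beta (mixture 1)")
curve(dnorm(x, mean = 0, sd = 1), add = TRUE, lty = 2)   # assumed prior density
```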

Discussion

Dynamic theories, while influential in psychology, can be difficult to test. Testing a dynamic theory requires a statistical model that accurately reflects the processes described by the theory (Collins, 2006). Although a wide range of statistical approaches have been used to examine dynamic phenomena, most do not have the flexibility to enable the direct implementation of a specific theory. As a result, researchers are often forced to distill rich, dynamic theories into simple predictions regarding the direction of relationships between variables, which creates misalignment between the theory and the statistical test. Ultimately, this hinders theoretical progress because it makes theories difficult to falsify.

In this paper, we have demonstrated a Bayesian approach to modeling dynamic processes that has the flexibility required to directly test dynamic theory. To illustrate the flexibility of this approach, we considered a model of goal revision recently published by Gee et al. (2018). An important feature of this model that is typical of many dynamic theories is the assumption that certain variables have “memory”. The ability to represent dynamic variables such as these is critical for testing theories that assume recursive feedback processes, which are a signature of many highly influential theories. However, dynamic variables are difficult to implement using standard statistical methods.

The Bayesian approach that we have demonstrated addresses this issue, and provides an intuitive way of implementing sophisticated models of dynamic processes. To illustrate, we began by demonstrating a sample-level model that quantifies uncertainty in the average parameter values of the goal revision model across participants in the sample. We then showed how this framework could be extended to form more sophisticated models that captured variability in individuals and groups. The person-level model quantifies the components of the goal revision process separately for each individual. The hierarchical model combines the sample- and person-level models into a single framework, enabling components to be quantified separately for each individual, but also allowing inferences to be made at the population level. The multiple-group model quantifies differences in the components of the process between known groups (e.g., levels of an experimental manipulation). Finally, the mixture model allows the researcher to identify distinct, latent subgroups of participants for whom the dynamic process plays out in a similar way.

A More Flexible Approach

Importantly, the models that we have demonstrated in this paper represent just a small slice of the large space of possible models that can be implemented using this framework. The flexibility of this approach makes it possible to implement models with virtually any functional form. This opens the door for researchers to develop customized models that represent the theory being tested more accurately than generic, off-the-shelf models would. For example, consider the learning literature, where decades of work have focused on understanding the precise nature of the relationship between practice and skill acquisition (e.g., Thurstone 1919; Estes 1994). The dominant models in this literature describe the relationship as obeying either exponential or power laws.

Another example is the intertemporal choice and motivation literatures, where a long history of research has demonstrated that people temporally discount future rewards or deadlines, treating them as less valuable or motivating than they would be if they were more immediate (Ainslie, 1975; Steel and König, 2006). Theories of temporal discounting often describe this relationship as hyperbolic or exponential. Both groups of theories are difficult to test using standard statistical models, because they make specific assumptions about the form of the relationship that cannot easily be implemented in a linear modeling framework. By contrast, non-standard models such as these can be readily implemented using the Bayesian approach that we have demonstrated.
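To make this concrete, commonly used formulations include exponential and power-law learning curves and a hyperbolic discounting function, for example \(E(t) = a + b\,e^{-ct}\), \(E(t) = a + b\,t^{-c}\), and \(V = \frac{A}{1 + kD}\), where \(E(t)\) is expected performance after \(t\) practice trials, \(V\) is the subjective value of a reward of amount \(A\) delivered after delay \(D\), and \(a\), \(b\), \(c\) and \(k\) are free parameters. None of these forms reduces to a standard linear model, but each can be written directly into the model block of a Stan program.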

Another class of phenomena that is notoriously difficult to examine using conventional methods is the bottom-up process (Kozlowski et al., 2013; Kozlowski and Klein, 2000). In a bottom-up process, a higher level phenomenon results from the dynamic behavior of a lower level process as it plays out over time. Bottom-up processes can operate between the person and group levels (e.g., team knowledge emerging as the result of individual members learning and sharing; Grand et al. 2016). They can also operate within individuals over time (e.g., a behavioral set point emerging from a recursive goal regulation process; Ballard et al. 2017).

Bottom-up processes are often assumed by theory, particularly in psychology. However, these processes have typically been examined qualitatively, and can be difficult to examine quantitatively. Some researchers have begun to adopt computational modeling to generate predictions about bottom-up processes (e.g., Grand et al. 2016), but the predictions can be difficult to test. The Bayesian approach that we have demonstrated offers a way to test these bottom-up processes quantitatively. For example, Ballard et al. (2021) used this approach to analyze a model in which the lower-level process by which higher goals inspire greater effort influenced the higher-level goal regulation process over time.

Challenges

The Bayesian approach we have demonstrated in this paper offers many advantages for testing dynamic theories. However, there are some challenges worth noting. First, the programming required in Stan is more involved than in programs often used by researchers in psychology, such as SPSS, AMOS or JASP, and it may take researchers some time to become accustomed to it. Stan has the functionality needed to implement MCMC built in, but the model itself must be built from the ground up. However, building the model from scratch is also what gives the researcher the flexibility to translate a theory directly into a statistical model. This is what makes Stan ideal for building, testing and comparing dynamic models.

Another challenge worth noting is that MCMC sampling can be computationally demanding and take a long time to run, especially for more complex models with many parameters. In our case, none of the models presented in this paper required more than about 5 min to run, but this is because our dataset was fairly small (60 participants \(\times \) 10 observations per participant). For models that are more complex, have more parameters, or are fitted to larger datasets, the sampling can take considerably longer, sometimes several hours. This can be especially burdensome when comparing many models. To lessen this burden, Stan offers features such as parallelization (e.g., running chains in parallel), which can help to speed up the process; nevertheless, researchers should budget extra time when fitting models via MCMC.

Conclusion

Dynamic theories, which are common in psychology, can be difficult to test because the processes behind them are often too complex to be adequately captured by standard statistical models. However, it is important that these dynamic processes be tested rigorously for theoretical progress to be made. The Bayesian approach that we have demonstrated in this paper can overcome these challenges by providing a flexible way to develop, test, and compare dynamic models. It allows researchers to better represent dynamic or hierarchical phenomena, in which processes may differ across time, levels, contexts, and people, which we believe can accelerate theoretical progress.