Participants and procedure
Forty healthy participants (20 male) aged 19 to 38 years (M = 26.45, SD = 3.88) completed a probabilistic reversal learning task (PRLT; Boehme et al., 2015; Deserno et al., 2020; Reiter et al., 2016, 2017) twice, with a one-week gap between sessions. Two versions of the task were counterbalanced across sessions, such that each participant played a different version on each of the two test days (Fig. 1a).
In each trial of the task (160 in total), participants had 1.5 seconds to choose between two cards that were associated with different probabilities of winning or losing 10 cents (80/20% and 20/80%, respectively). After making their choice, they were shown a feedback screen (a picture of a 10-cent coin for wins, a picture of a crossed-out 10-cent coin for losses) for 0.5 seconds. Feedback was drawn at each trial with replacement, i.e., if the card with 80% win probability was chosen, a random number between 0 and 1 was drawn from a uniform distribution; if it fell between 0 and 0.8, the participant received win feedback, and if it fell between 0.8 and 1, the participant received loss feedback. This procedure meant that the actual ratio of wins to losses associated with the stimuli varied across individuals and sessions (from 73.13% to 86.25%). The feedback screen was followed by a variable inter-trial interval with a mean of 2.5 seconds, during which participants were shown a fixation cross. After an initial acquisition phase (trials 1–55), the cards’ reward contingencies flipped five times (after the 55th, 70th, 90th, 105th, and 125th trial), such that the previously more lucrative stimulus became the more frequently losing one, and vice versa. For details, see Fig. 1b.
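To make the feedback procedure concrete, the following minimal R sketch reproduces the probabilistic draw described above (illustrative only, not the original task code):

```r
# Probabilistic feedback draw: a uniform number decides win vs. loss.
# p_win is the chosen card's win probability (.8 or .2 in the task).
draw_feedback <- function(p_win) {
  u <- runif(1)                  # uniform draw between 0 and 1
  if (u < p_win) 10 else -10     # win or loss of 10 cents
}

set.seed(1)
outcomes <- replicate(160, draw_feedback(0.8))
mean(outcomes > 0)  # the realized win rate varies around .8 across runs
```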
Analysis
In our analysis, we systematically investigated the retest reliability of the measures produced by several approaches to processing the task data. In addition, we calculated within-session (split-half) reliabilities of the raw behavioral measures.
Behavioral metrics
Accuracy was calculated as the probability of choosing the more lucrative stimulus, regardless of actual feedback, estimated by mixed-effects logistic regression, and as the proportion of correct trials per person and session. Similarly, stay–switch behavior was calculated as the probability of switching, overall and after wins and losses, respectively, and as the respective proportions per person and session. Perseveration was calculated as the probability of choosing the same incorrect stimulus after two consecutive losses, estimated by mixed-effects logistic regression, and as the respective proportions per person and session. Reaction times, both overall and after wins and losses, respectively, were calculated as predicted values estimated by mixed-effects linear regression, and 𝛥RT accordingly, i.e., as the difference in predicted values between win RTs and loss RTs. In addition, we calculated simple averages for overall RTs, win and loss RTs, and 𝛥RT (win RT − loss RT). For metrics derived from mixed-effects models, we compared estimates from separate models for each session with estimates derived from a single model that accounts for the data from both sessions jointly and includes session as a grouping variable nested within subject (Brown et al., 2020). Concretely, this means that the joint regression models took the general form Dependent variable ~ Intercept[*factor] + (Intercept[*factor] | Subject/Session). Because we employed maximal random-effects structures in all our models (Barr et al., 2013), covariances among the random effects are taken into account.
We computed mixed-effects logistic and linear regressions in R (version 3.6.1), using the lme4 package (version 1.1-21). Results were considered significant at p ≤ .05.
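To illustrate the joint specification, a minimal sketch in lme4 syntax on a hypothetical trial-level data frame (variable names are ours, not the original analysis code):

```r
library(lme4)

set.seed(1)
d <- data.frame(
  subject      = factor(rep(1:38, each = 320)),
  session      = factor(rep(rep(1:2, each = 160), times = 38)),
  correct      = rbinom(38 * 320, 1, 0.7),
  prev_outcome = factor(sample(c("win", "loss"), 38 * 320, replace = TRUE)),
  rt           = rlnorm(38 * 320, meanlog = log(0.6), sdlog = 0.2)
)

# Accuracy, both sessions jointly, with session nested within subject:
m_acc <- glmer(correct ~ 1 + (1 | subject/session), data = d, family = binomial)

# RTs after wins vs. losses, with a maximal random-effects structure:
m_rt <- lmer(rt ~ prev_outcome + (prev_outcome | subject/session), data = d)

# Per-person-and-session predicted values:
head(predict(m_acc, type = "response"))
```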
Computational models
In order to identify individual differences in the processes thought to underlie behavior on this task, we fitted different reinforcement learning (RL) models from two model families. The first family included models based on Q-learning (Watkins & Dayan, 1992):
$${Q}_{a,t+1}={Q}_{a,t}+\upalpha \left(r-{Q}_{a,t}\right)$$
Here, \(Q_{a,t}\) refers to the expected value of an action \(a\) at trial \(t\). It is updated at each trial based on the prediction error, i.e., the difference between the feedback just obtained after performing this action, \(r\), and the previous expected value \(Q_{a,t}\), to form the new expected value \(Q_{a,t+1}\). The learning rate 𝛼 determines how much recent feedback is weighted over the integrated feedback from previous trials (i.e., a learning rate of 1 would only take the last trial into account). The unchosen option is not updated (single update):
$${Q}_{a_{unchosen},t+1}={Q}_{a_{unchosen},t}$$
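As a minimal illustration (variable names and feedback coding are ours, not the emfit implementation), the single-update rule amounts to:

```r
# Single-update Q-learning: only the chosen option's value changes.
q_update_su <- function(Q, choice, r, alpha) {
  Q[choice] <- Q[choice] + alpha * (r - Q[choice])  # prediction-error update
  Q                                                 # unchosen value carried over
}

Q <- c(0, 0)                                        # initial expected values
Q <- q_update_su(Q, choice = 1, r = 1, alpha = 0.3) # e.g., a win coded as 1
```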
Learning might be differentially sensitive to wins and losses, resulting in different degrees of value updating after wins versus losses. This can be captured in models with different learning rates for wins and losses.
$${Q}_{a,t+1}={Q}_{a,t}+{\upalpha}_{{win}/{loss}}\left(r-{Q}_{a,t}\right)$$
Because the task has an anti-correlated structure, such that if the chosen stimulus yields a win, the other would have invariably yielded a loss, individuals can use this information to simultaneously update the expected values of both the chosen and the unchosen option. This can be captured in a double update (DU) model:
$${Q}_{a_{unchosen},t+1}={Q}_{a_{unchosen},t}+\upalpha \left(\left(-r\right)-{Q}_{a_{unchosen},t}\right)$$
Like the original single update (SU) model, this can be extended with separate learning rates for wins and losses. Finally, it is conceivable that individuals use their knowledge of the task structure and perform double updating but do not update the unchosen option as much as the chosen one. This can be captured using a discount weight 𝜅, which attenuates updating of the unchosen option.
$${Q}_{a_{unchosen},t+1}={Q}_{a_{unchosen},t}+\upkappa \upalpha \left(\left(-r\right)-{Q}_{a_{unchosen},t}\right)$$
𝜅 can be added to all DU models, changing only the equations for the unchosen option. In each model, we use a softmax response model to transform values to choice probabilities for each option:
$${p}\left({a}_i\right)=\frac{{\exp}\left(\upbeta {Q}_{a_i}\right)}{\sum_{j=1}^K{\exp}\left(\upbeta {Q}_{a_j}\right)}$$
The parameter β, the softmax inverse temperature or choice sensitivity, influences the extent to which a difference in values translates into a difference in choice probability by determining the steepness of the softmax sigmoid. If β is large, choices are more deterministic or exploitative; if it is small, choices are more stochastic or explorative, such that differences in values exert less influence on action selection. Like learning, choice stochasticity may be differentially sensitive to wins and losses, resulting in asymmetric staying and switching (e.g., with a higher win β, a person’s tendency to stay after a win would be stronger than their tendency to switch after a loss). This can be captured in separate softmax temperature parameters for trials after receiving wins or losses:
$${p}\left({a}_i\right)=\frac{{\exp}\left({\upbeta}_{{win}/{loss}}{Q}_{a_i}\right)}{\sum_{j=1}^K{\exp}\left({\upbeta}_{{win}/{loss}}{Q}_{a_j}\right)}$$
The different combinations of these components (SU, DU, or 𝜅-weighted DU, each with single or separate learning rates and temperature parameters for wins and losses) yield a total of 12 models.
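For concreteness, a sketch of a single trial of the most general model of this family (𝜅-weighted DU with separate win/loss learning rates and temperatures); names and feedback coding are illustrative, not the emfit implementation, and applying the feedback-selected learning rate to both updates is one possible convention:

```r
softmax_prob <- function(Q, beta) exp(beta * Q) / sum(exp(beta * Q))

# One trial: choice probabilities, then value updates for both options.
model_trial <- function(Q, r_prev, choice, r,
                        alpha_win, alpha_loss, beta_win, beta_loss, kappa) {
  beta  <- if (r_prev > 0) beta_win  else beta_loss   # temperature set by last feedback
  p     <- softmax_prob(Q, beta)                      # choice probabilities
  alpha <- if (r > 0)      alpha_win else alpha_loss  # learning rate set by feedback
  other <- 3 - choice                                 # index of the unchosen option
  Q[choice] <- Q[choice] + alpha * (r - Q[choice])          # chosen: standard update
  Q[other]  <- Q[other] + kappa * alpha * (-r - Q[other])   # unchosen: discounted DU
  list(p = p, Q = Q)
}
```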
In the second model family, we use a reinforcement sensitivity parameter ρ instead of a softmax temperature parameter (dropped in this family) to quantify choice stochasticity:
$${Q}_{a,t+1}={Q}_{a,t}+\upalpha \left(\uprho r-{Q}_{a,t}\right)$$
Unlike the inverse temperature parameter, which influences choice stochasticity by determining the steepness of the softmax, the reinforcement sensitivity ρ does so by determining the maximum difference between expected values, thus placing a lower bound on choice stochasticity. The effect on choice probabilities is essentially the same, but the models can have different estimation properties and may differ in their interpretation under certain circumstances (Huys et al., 2013; Katahira, 2015). As with the softmax temperature models, we fit 12 models in the reinforcement sensitivity family, iteratively including separate learning rates for wins and losses, separate reinforcement sensitivities for wins and losses, double updating, and weighted double updating.
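Under the same illustrative conventions, the reinforcement-sensitivity variant rescales the feedback inside the update; reading the dropped β as a softmax with fixed unit slope is our assumption:

```r
# Reinforcement-sensitivity update: rho scales the feedback r.
q_update_rho <- function(Q, choice, r, alpha, rho) {
  Q[choice] <- Q[choice] + alpha * (rho * r - Q[choice])
  Q
}

# With beta dropped, the softmax is applied with a fixed slope of 1.
p_choice <- function(Q) exp(Q) / sum(exp(Q))
```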
Model fitting
We inverse-logit-transformed the learning rates and DU weights (𝛼, 𝜅) in both model families in order to constrain them to their natural range (0 to 1). For models with a single reinforcement sensitivity (𝜌), and for the softmax temperatures (𝛽), we used an exponential transform to ensure that they were positive; for models with separate reinforcement sensitivities for wins and losses, those parameters were left in native space. Parameter estimation was performed in MATLAB R2020b using the emfit toolbox (Huys et al., 2011, 2012; Huys & Schad, 2015). We applied and compared three different approaches to parameter estimation (maximum likelihood [ML], maximum a posteriori estimation with uninformative priors [MAP0], and maximum a posteriori estimation with empirical priors [EM-MAP]). In standard ML estimation, the quantity to be maximized is log(p(data| θ)). In MAP estimation, a regularizing prior on 𝜃 is provided, such that the quantity to be maximized becomes log(p(data| θ) ∗ p(θ)), which equals the log posterior up to an additive constant. For MAP0 estimation, we defined an uninformative Gaussian prior with a mean of 0 and a variance of 10 (the default in the emfit toolbox). For EM-MAP, we used empirical Gaussian priors on our parameters (p(θ| μ, σ)), inferred from the multivariate distribution of the estimates across subjects in an expectation-maximization procedure (Huys et al., 2012).
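Schematically, these transforms map unconstrained estimates back to the parameters’ native ranges (values illustrative):

```r
inv_logit <- function(x) 1 / (1 + exp(-x))

theta_raw <- c(alpha = 0.5, kappa = -2, beta = -1)  # unconstrained space
alpha <- inv_logit(theta_raw["alpha"])              # learning rate in (0, 1)
kappa <- inv_logit(theta_raw["kappa"])              # DU weight in (0, 1)
beta  <- exp(theta_raw["beta"])                     # single beta or rho: positive
```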
We used all three estimation methods to fit the sessions separately, i.e., maximizing the (posterior) likelihood of the data from each session one at a time, as well as jointly, i.e., maximizing the overall (posterior) likelihood of the data pooled across the two sessions. For the joint estimation, we concatenated the data from both sessions but fitted separate parameters for each session. Concretely, this meant that we fitted one set of parameters for the first 160 trials (from session one) and another set for the second 160 trials (from session two), resetting Q-values at the first trial of session two. Thus, both approaches yield separate parameters for the first and second session. However, when the sessions are estimated jointly with EM, covariances between parameters across sessions are taken into account in a multivariate prior. Note that this does not mean that, for example, the learning rates for sessions one and two shared a prior; instead, each parameter is accounted for by its own mean and variance, in addition to the covariances between parameters.
To minimize the risk of local minima, we restarted the optimization 10 times (for EM at each M step) at different random starting points, taking the best iteration forward in the case of EM. In addition, we repeated the estimation procedure 10 times, and used the final results with the maximum (posterior) likelihood for reliability analysis. We performed model selection on the estimated models based on the integrated Bayesian information criterion (Huys et al., 2012).
MATLAB’s fminunc function, which the emfit toolbox we employed utilizes, allows users to choose between a quasi-Newton and a trust-region algorithm for optimization. The latter requires the user to supply analytically calculated gradients to guide the search of the optimizer but can help improve the optimizer’s performance and the robustness of the results (Daw, 2011). All results reported in the main text are based on trust-region estimates. However, as a supplementary analysis, we also performed model fitting without supplying gradients, i.e., using the quasi-Newton algorithm, for comparison. Detailed results are reported in Supplementary Fig. 1 and Supplementary Table 2.
Parameter recoverability
In order to assess model fit on a qualitative level, we extracted the parameters of the best-fitting models from each family and generated 100 simulated datasets for 38 participants based on the respective algorithms. We then plotted the behavioral metrics derived from the generated data (averaged across simulations) against those derived from the original data for visual comparison. Further, we probed the recoverability of the parameter estimates by refitting the models to the generated data and computing the average correlation between the resulting estimates and the underlying true values.
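The logic of this recovery check can be sketched as follows; here the simulate-and-refit step is stubbed with noisy copies of the true values purely for illustration:

```r
set.seed(1)
n_sub <- 38; n_sim <- 100
true_alpha <- runif(n_sub)                          # generating ("true") parameters
recovery_r <- replicate(n_sim, {
  est_alpha <- true_alpha + rnorm(n_sub, sd = 0.1)  # stand-in for simulate-and-refit
  cor(true_alpha, est_alpha)
})
mean(recovery_r)                                    # average recovery correlation
```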
In order to show recoverability across the model space, we further extracted the parameters of eight models of varying complexity (four from each model family), and simulated 10 datasets for 38 participants for each model based on the respective algorithms. We then refitted the same eight models to each dataset to probe model and parameter recoverability. In the interest of space, the results of this analysis are reported in the supplement.
Reliability assessment
In order to assess the retest reliability of the behavioral metrics and model-derived parameters, we computed intra-class correlations (ICCs), more specifically ICCs(A,1) in McGraw and Wong’s (1996) notation, between the metrics across time points (McGraw & Wong, 1996; Qin et al., 2019). As Qin et al. (2019) note, this type of ICC (i.e., a two-way mixed, single-measure, absolute-agreement ICC) is appropriate for estimations of retest reliability in which time is a design factor, the interval between time points is identical across subjects, there is only one observation per subject and time point, and parameter values are assumed to be constant across time points. Calculations were performed using the irr package (version 0.84.1) in R (version 4.1.0) for the raw behavioral metrics and the ICC toolbox (version 1.3.1.0) in MATLAB 2019b for the indices derived from computational modeling. This was done for convenience; the packages employ the same formulae and produce equivalent results.
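In the irr package, this corresponds to the following call (data illustrative):

```r
library(irr)

set.seed(1)
scores <- cbind(t1 = rnorm(38), t2 = rnorm(38))  # one metric, subjects x sessions

# ICC(A,1): two-way model, absolute agreement, single measure
icc(scores, model = "twoway", type = "agreement", unit = "single")
```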
We calculated ICCs(A,1) for means, predicted values, and fitted parameters (i.e., point estimates); however, certain ICCs can also be obtained directly from variances estimated as part of the model (e.g., Brown et al., 2020). The advantage of this approach is that variances estimated as part of a model contain information about the precision of the predicted values or fitted parameters. Specifically, model-calculated variances reflect the sum of the variance of the point estimates and the mean standard error around them. This latter term is absent from variances calculated on the predicted values. To our knowledge, the variance components estimated as part of logistic and linear regressions using lme4 and as part of the computational model fitting using emfit do not permit the calculation of ICCs(A,1); however, the former allow the calculation of ICCs(1) and the latter of Pearson correlations. We therefore report model-calculated ICCs(1) for the behavioral metrics and model-calculated Pearson correlations for the EM-MAP-estimated parameters, alongside the respective metrics based on point estimates for comparability. We computed model-derived ICCs(1) in McGraw and Wong’s (1996) notation for the behavioral metrics on the basis of the variance components accessed via the get_variance function of the insight package for lme4 model fits. Specifically, we took the ratio of the variance explained by the random effect of subject (between-subject variance) to the sum of that variance and the variance explained by session within subject (within-subject variance). Model-derived Pearson correlations between parameters were calculated by dividing the covariance of the equivalent parameters from each session by the product of the square roots of the variances of the individual parameters.
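For the lme4 fits, this computation can be sketched as follows, reusing the hypothetical model m_rt from the earlier sketch (the names of the variance components may differ across model specifications; inspect the returned list):

```r
library(insight)

v <- get_variance(m_rt)
v_subject <- v$var.intercept["subject"]           # between-subject variance
v_session <- v$var.intercept["session:subject"]   # session-within-subject variance
icc1 <- unname(v_subject / (v_subject + v_session))
```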
We interpreted the ICC coefficients according to Cicchetti’s (1994) guidelines: “[W]hen the reliability coefficient is below .40, the level of clinical significance is poor; when it is between .40 and .59, the level of clinical significance is fair; when it is between .60 and .74, the level of clinical significance is good; and when it is between .75 and 1.00, the level of clinical significance is excellent.”
In order to assess internal consistency of the behavioral metrics, we re-estimated them in separate logistic and linear regressions for the odd and even trials of each session and calculated their correlation. We report the correlations with and without Spearman–Brown correction (\({r}_{SB}=\frac{2r}{1+r}\)), which accounts for deflated correlations due to the reduced number of observations.
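Schematically, with illustrative per-person estimates from odd and even trials:

```r
set.seed(1)
acc_odd  <- rnorm(38)                      # metric re-estimated on odd trials
acc_even <- acc_odd + rnorm(38, sd = 0.5)  # ...and on even trials

r    <- cor(acc_odd, acc_even)             # uncorrected split-half correlation
r_sb <- 2 * r / (1 + r)                    # Spearman-Brown corrected
```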
Simulations: recoverability of reliability
In order to probe different approaches to index estimation in terms of their ability to recover true reliabilities, we simulated correlated binary (choice-like) and continuous (reaction-time-like) data (500 datasets with 160 trials for 38 subjects in two sessions each). Specifically, for each dataset, we simulated normally distributed indices for 38 subjects that were correlated at r = .3, r = .5, r = .7, and r = .9 across sessions. We then took those indices forward and simulated trial-by-trial data. For binary data, we drew a random number between 0 and 1 at each trial and compared it to the simulated index; if it was smaller, the trial was coded 1, and if it was larger, the trial was coded 0. For continuous data, we sampled, at each trial, from a normal distribution around the index. For each dataset, we then computed ICCs based on means, on predicted values from logistic/linear regressions for each session separately, on predicted values from logistic/linear regressions for both sessions simultaneously, and on variance components extracted from the joint model, as we did for the original data. We then compared the resulting ICCs to the true correlations.
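A condensed sketch of the binary branch of this simulation; mapping the normal indices through the standard-normal CDF to obtain probabilities in (0, 1) is our assumption for illustration:

```r
library(MASS)  # for mvrnorm

set.seed(1)
n_sub <- 38; n_trial <- 160; r_true <- 0.5
Sigma <- matrix(c(1, r_true, r_true, 1), nrow = 2)
idx <- pnorm(MASS::mvrnorm(n_sub, mu = c(0, 0), Sigma = Sigma))  # (0, 1) indices

# Trial-by-trial binary data: 1 if a uniform draw falls below the index.
choices <- lapply(1:2, function(s)
  sapply(1:n_trial, function(t) as.numeric(runif(n_sub) < idx[, s])))

# Mean-based per-subject estimates, compared against the true correlation:
obs <- sapply(choices, rowMeans)
cor(obs[, 1], obs[, 2])  # ICCs were computed analogously (see above)
```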
Data and code availability
The data and the scripts underlying the analyses in this article are available on the Open Science Framework (osf.io/4ng3e).