Decomposing the effects of context valence and feedback information on speed and accuracy during reinforcement learning: a meta-analytical approach using diffusion decision modeling

Reinforcement learning (RL) models describe how humans and animals learn by trial-and-error to select actions that maximize rewards and minimize punishments. Traditional RL models focus exclusively on choices, thereby ignoring the interactions between choice preference and response time (RT), or how these interactions are influenced by contextual factors. However, in the field of perceptual decision-making, such interactions have proven to be important to dissociate between different underlying cognitive processes. Here, we investigated such interactions to shed new light on overlooked differences between learning to seek rewards and learning to avoid losses. We leveraged behavioral data from four RL experiments, which feature manipulations of two factors: outcome valence (gains vs. losses) and feedback information (partial vs. complete feedback). A Bayesian meta-analysis revealed that these contextual factors differently affect RTs and accuracy: While valence only affects RTs, feedback information affects both RTs and accuracy. To dissociate between the latent cognitive processes, we jointly fitted choices and RTs across all experiments with a Bayesian, hierarchical diffusion decision model (DDM). We found that the feedback manipulation affected drift rate, threshold, and non-decision time, suggesting that it was not a mere difficulty effect. Moreover, valence affected non-decision time and threshold, suggesting a motor inhibition in punishing contexts. To better understand the learning dynamics, we finally fitted a combination of RL and DDM (RLDDM). We found that while the threshold was modulated by trial-specific decision conflict, the non-decision time was modulated by the learned context valence. Overall, our results illustrate the benefits of jointly modeling RTs and choice data during RL, to reveal subtle mechanistic differences underlying decisions in different learning contexts. 
Electronic supplementary material The online version of this article (10.3758/s13415-019-00723-1) contains supplementary material, which is available to authorized users.

Appendix A Bayesian mixed model ANOVA

In this section, we detail the results of the two-step model-comparison approach that we used for the Bayesian mixed model ANOVA.
In the first step, to determine the base model for the subsequent analyses, we compared two models: the first (M0) included only participants as a random effect, and the second (M1) additionally included experiment as a fixed effect. In the case of accuracy, the model without experiment as a fixed effect was preferred (BF_M0/BF_M1 = 3.6), indicating that mean accuracy was mostly stable across experiments. In the case of RTs, the model with experiment as a fixed effect was preferred (BF_M1/BF_M0 = 1.4e9), indicating that mean RTs differed across experiments.
In the second step, we tested different combinations of models in which we varied the experimental manipulations themselves and their possible interactions with experiment. In the case of accuracy, all models were tested against M0 from the previous step, while, in the case of RTs, they were tested against M1. The results are summarized in Tables A1 and A2.
Finally, the two models with the highest BF were compared to each other, to provide a simple assessment of how strong the evidence in favor of the best model is. There was substantial evidence for the winning model in the accuracy analyses, M3, compared to its runner-up, M8 (BF_M3/BF_M8 = 8.6). There was anecdotal evidence for the winning model in the RT analyses, M5, compared to its runner-up, M10 (BF_M5/BF_M10 = 1.76).

Note. In Tables A1 and A2, the preferred model is marked with an asterisk.
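Because both candidate models were compared against the same base model, their head-to-head Bayes factor is simply the ratio of the two base-model BFs. A minimal sketch of this arithmetic (the BF values below are hypothetical placeholders, not the ones reported above):

```python
# Sketch: pairwise Bayes factor from BFs against a common base model.
# The values below are hypothetical placeholders chosen for illustration.
bf_vs_base = {"M3": 86.0, "M8": 10.0}  # hypothetical BF(model vs. M0)

# Both BFs share the same denominator (the base model M0), so their
# ratio gives the Bayes factor of one model against the other.
bf_m3_vs_m8 = bf_vs_base["M3"] / bf_vs_base["M8"]
print(bf_m3_vs_m8)
```

The same identity applies to the RT comparison against M1.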

Appendix B Diffusion decision model analyses
In this section, we report some details about the diffusion decision model analyses.
The following prior distributions were assumed for the parameter intercepts: Cauchy distributions, where Cauchy denotes the Cauchy distribution with parameters location and scale. The parameter coefficients (with the coefficients corresponding to the main effects and the interaction given the same priors) were also given Cauchy priors. Individual and experiment-level parameters were expressed as deviations from the group mean: z_i represents the deviation of a participant's parameter from that parameter's mean in the experiment, while z_j represents the deviation of an experiment's parameter mean from the parameter mean across the overall dataset, and both were given their own prior distributions. The standard deviations σ respectively account for the within- and across-experiment variances and were given Half-Cauchy priors, where the Half-Cauchy is a strictly positive Cauchy distribution.
To constrain the threshold and non-decision time parameters to be positive, we exponentially transformed them at the trial level.

Figure B1. Posterior distributions of the group parameters of the hierarchical diffusion decision model. Posterior distributions of the DDM parameters across experiments (beige areas) and within experiments (grey lines). Because valence was coded as 0 = reward/1 = punishment, feedback was coded as 0 = partial/1 = complete, and the interaction was the product of the two, the intercepts (first row) correspond to the parameters in the reward-partial condition.

To assess how well the model fits the observed behavioral patterns, posterior predictive measures were calculated separately across experiments and experimental conditions. The shaded areas represent the 95% Bayesian credible intervals, while the crosses represent the summary of the data.
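The hierarchical (non-centered) decomposition and the positivity constraint described above can be sketched in code. All numbers, array sizes, and names below are illustrative assumptions, not the fitted values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical group-level values (illustrative, not the fitted ones).
mu_threshold = 0.5   # group-level intercept (on the log scale)
sigma_exp = 0.2      # across-experiment standard deviation
sigma_subj = 0.3     # within-experiment standard deviation

# Non-centered parameterization: standardized deviations z_j (one per
# experiment) and z_i (one per participant), scaled by their standard
# deviations and added to the group-level intercept.
z_j = rng.standard_normal(4)             # experiment-level deviations
z_i = rng.standard_normal(20)            # participant-level deviations
experiment_of = rng.integers(0, 4, 20)   # participant-to-experiment map

log_threshold = (mu_threshold
                 + sigma_exp * z_j[experiment_of]
                 + sigma_subj * z_i)

# The exponential transform guarantees a strictly positive threshold.
threshold = np.exp(log_threshold)
assert (threshold > 0).all()
```

The same construction applies to the non-decision time; only the intercept and scale values differ.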

Appendix C Diffusion decision model parameter recovery
We performed parameter recovery for the Bayesian hierarchical diffusion decision model (DDM) used in the main analyses of this study. We generated data for four experiments using a simple DDM (with no across-trial variability), with the same numbers of participants and trials as in our study.
The generating group parameters (Table C1) were selected to produce performance similar to that observed across the experiments (Figure C1). Participants' parameters were sampled from the group distributions, and the NDT and threshold intercepts were lowered in Experiment 4 by 0.4 and 0.5, respectively.
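Data generation from a simple DDM (no across-trial variability) can be sketched with Euler steps; this is a minimal sketch assuming symmetric boundaries at ±threshold and an unbiased starting point, with illustrative parameter values rather than the generating ones in Table C1:

```python
import numpy as np

def simulate_ddm(drift, threshold, ndt, n_trials, dt=0.001, noise=1.0, seed=0):
    """Simulate a simple diffusion decision model (no across-trial
    variability) via Euler steps. Parameter values are illustrative."""
    rng = np.random.default_rng(seed)
    choices = np.empty(n_trials, dtype=int)
    rts = np.empty(n_trials)
    for t in range(n_trials):
        x, elapsed = 0.0, 0.0
        # Accumulate noisy evidence until a boundary is crossed.
        while abs(x) < threshold:
            x += drift * dt + noise * np.sqrt(dt) * rng.standard_normal()
            elapsed += dt
        choices[t] = int(x > 0)   # 1 = upper (correct) boundary
        rts[t] = elapsed + ndt    # add the non-decision time
    return choices, rts

choices, rts = simulate_ddm(drift=1.5, threshold=1.0, ndt=0.3, n_trials=100)
```

With a positive drift, most simulated choices terminate at the upper boundary, and all RTs exceed the non-decision time.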
We fitted the DDM following the same procedure used to fit the real data collected in the four experiments, as described in the Methods section. To assess the quality of the parameter recovery, we plotted the generating parameter values against a summary (mean and mode) of the estimated posterior distributions of the 89 participants (Figure C2). In general, all group parameters were well recovered, although for some parameters we observed shrinkage towards the group mean (e.g., for the interaction coefficient of the threshold), which is a typical feature of hierarchical models. Individual drift-rate parameter estimates were also more dispersed than the generating ones.

Figure C1. Simulated data. Mean accuracy and response times (RTs) of the simulated data, separately by experiment and by context.

Figure C2. True against recovered diffusion decision model individual parameters. The dotted grey lines represent the identity lines, while the red dotted lines are the group mean parameters. We also calculated correlations between the true and the mean recovered individual parameters, indicated by the Pearson's ρ statistics.
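The recovery check reduces to correlating generating values with recovered posterior summaries; a minimal sketch with synthetic numbers (not the actual parameter estimates), including a shrinkage of the "recovered" values toward the group mean like the one noted above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical generating parameters for 89 simulated participants.
true_params = rng.normal(loc=2.0, scale=0.5, size=89)

# Hypothetical "recovered" posterior means: shrunk toward the group
# mean (factor 0.8) with some estimation noise added.
recovered = (0.8 * (true_params - true_params.mean())
             + true_params.mean()
             + rng.normal(scale=0.1, size=89))

# Pearson correlation between true and recovered individual parameters.
r = np.corrcoef(true_params, recovered)[0, 1]
```

A correlation near 1 with a regression slope below 1 is the signature of good recovery with hierarchical shrinkage.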
Appendix D Reinforcement learning model analyses

In this section, we report some details about the reinforcement learning modeling procedure.
The learning-rate parameters were given the following prior distributions: µ_α is the group-level mean, σ_α is the group-level standard deviation, and α is the individual learning rate. N is the normal distribution (with parameters mean and standard deviation), HN is the half-normal distribution, and φ is the cumulative density function of the standard normal distribution, which transforms α so that 0 ≤ α ≤ 1. The decision parameters were given their own priors. While the threshold could not become negative (because of the specific parameterization of Equation 8 and because a_int was exponentially transformed), the non-decision time was exponentially transformed at the trial level to prevent it from becoming negative.

Figure D2. Posterior predictives of the hierarchical reinforcement learning diffusion decision model (RLDDM). Posterior predictives for mean accuracy (top row) and mean RTs (bottom row) in binned trials, separately for each learning context. Each bin corresponds to 12 trials, that is, 3 trials per choice context. Mean accuracy was calculated separately across experiments, contexts, and bins. The shaded areas represent the 95% Bayesian credible intervals of the posterior predictive distributions. The solid lines represent the mean data.
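The probit transform of the learning rate described above can be sketched as follows; the group-level values are hypothetical, chosen only to illustrate how φ maps the unbounded scale onto [0, 1]:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Hypothetical group-level mean and standard deviation on the
# unbounded (probit) scale, not the fitted values.
mu_alpha, sigma_alpha = -0.5, 0.4

# Sample individual parameters on the unbounded scale, then map them
# through phi (the standard-normal CDF) so that 0 <= alpha <= 1.
alpha_raw = rng.normal(mu_alpha, sigma_alpha, size=89)
alpha = norm.cdf(alpha_raw)

assert ((alpha >= 0) & (alpha <= 1)).all()
```

The same construction keeps every individual learning rate in the valid range regardless of how dispersed the group-level distribution is.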