Influences of Reinforcement and Choice Histories on Choice Behavior in Actor-Critic Learning

Reinforcement learning models have been used in many studies in the fields of neuroscience and psychology to model choice behavior and underlying computational processes. Models based on action values, which represent the expected reward from actions (e.g., Q-learning model), have been commonly used for this purpose. Meanwhile, the actor-critic learning model, in which the policy update and evaluation of an expected reward for a given state are performed in separate systems (actor and critic, respectively), has attracted attention due to its ability to explain the characteristics of various behaviors of living systems. However, the statistical property of the model behavior (i.e., how the choice depends on past rewards and choices) remains elusive. In this study, we examine the history dependence of the actor-critic model based on theoretical considerations and numerical simulations while considering the similarities with and differences from Q-learning models. We show that in actor-critic learning, a specific interaction between past reward and choice, which differs from Q-learning, influences the current choice. We also show that actor-critic learning predicts qualitatively different behavior from Q-learning, as the higher the expectation is, the less likely the behavior will be chosen afterwards. This study provides useful information for inferring computational and psychological principles from behavior by clarifying how actor-critic learning manifests in choice behavior.


Introduction
Reinforcement learning (RL) models have been used to infer the computational processes underlying action selection from behavioral data (Yechiam et al. 2005;Corrado & Doya 2007;Daw et al. 2011;Maia & Frank 2011). Most often, RL models based on "action values," which represent the expected reward obtained from an action (e.g., Q-learning models), have been used for this purpose (Samejima et al. 2005;Daw 2011;Wilson & Collins 2019). In such models, the action values are directly translated into weights for corresponding actions such that the higher the action value, the more likely the action is to be chosen. However, it has been noted that many psychological and neuroscientific findings are concisely explained by policy-based RL models in which the preference for each action is represented independently of the reward expectations (for a review, see Mongillo et al. 2014;Bennett et al. 2021).
A purely policy-based framework that does not represent any values (i.e., expected reward), however, is inconsistent with the fact that the neural correlates of value (expected reward) have been extensively reported (O'Doherty 2014). Thus, models with both value (expected reward) and valueindependent preference representations should be reasonable candidates for a plausible model of RL in the brain. Actor-critic learning or the actor-critic model (Barto et al. 1983;Barto 1995) is a representative model that has such an architecture. Actor-critic learning can explain how the choice behavior of animals obeys the matching law, which is the fundamental law of choice behavior (Sakai & Fukai 2008a). The matching law states that the ratio of an individual's choice (across options) equals the ratio of the rewards obtained. Including actor-critic learning, the learning rules by which the choice probability converges to what follows the matching law as a steady state have been theoretically studied (Loewenstein & Seung 2006;Sakai & Fukai 2008a, b). Actor-critic learning can also explain conditioned avoidance, which is difficult to explain by action value-based models (Maia 2010). In addition, the actor-critic architecture is well aligned with the neural architecture of the basal ganglia (Houk & Adams 1995;Joel et al. 2002;Collins & Frank 2014). Since the actor-critic architecture separately represents preference and value, it has been used to represent the processes of pathological decision-making in which behavior is maintained even though no further reward is expected, such as in drug addiction (Redish 2004).
While the actor-critic model has these desirable properties, action value-based models, such as Q-learning, have been commonly used in the computational modeling of learning and decision-making in which models are fitted to experimentally observed behavioral data (Daw 2011;Wilson & Collins 2019). If the actorcritic model better captures the true computational process, are the findings reported in previous studies using valuebased models incorrect or do they capture a part of the truth? In general, a model misspecification can lead to the erroneous interpretation of data (Nassar & Gold 2013;Katahira 2018;Wilson & Collins 2019;Toyama et al. 2019;Katahira & Toyama 2021). However, the impact of fitting an action value-based model to choice data actually generated from processes that are well described by actor-critic learning is not understood.
To discuss this issue, first, we aim to understand the statistical properties of the actor-critic model. Specifically, we investigate how the past history of experiences (rewards and choices) leads to behavioral choices and how each model parameter modifies them. In general, an RL model can be regarded as a statistical model that determines the probability of choice as a function of the history of past events, which include one or more hedonically valenced events (i.e., rewards or punishments). Thus, by analyzing how past rewards and choices affect current choices, we can understand the statistical properties of RL models. Such analyses have been conducted specifically by comparing RL models with a logistic regression model that can explicitly model the influence of past events on future choices (Katahira 2015(Katahira , 2018. In the present study, we perform such an analysis of the actor-critic model and clarify the differences and similarities between this model and Q-learning models. In Q-learning models with certain settings, it has been shown that statistical interactions exist between past rewards and choices that influence subsequent choices (Katahira 2015(Katahira , 2018. The present study addresses whether interactions also exist among past events in actor-critic models and, if such interactions exist, how the model parameters (e.g., learning rates) determine the interactions. The regression coefficients of the interaction terms in logistic regression models are analytically derived as a function of the model parameters. This allows us to quantitatively evaluate how the RL model parameters affect the interactions (Katahira 2018). Although the steady state of RL models, including actor-critic leaning and Q-leaning, has been analyzed theoretically, our analysis is not of the steady state but of the transient behavior of the model (i.e., how past rewards and choices affect subsequent choices immediately). This analysis will allow us to distinguish the actor-critic model from other models based on not only the steady-state properties of behavior as described by the matching-law but also the trial-by-trial dynamics.
Next, we examine when and the extent to which the predictions of the actor-critic and the Q-learning models agree or deviate. In particular, we examine how closely the Q-learning model can predict a choice when actor-critic is the true model. This issue is also related to identifiability between the models. If the actor-critic and Q-learning models can always provide an equivalent prediction of choices, these models would be indistinguishable from the data. In situations in which the predictions differ, the two models can be distinguished. Subsequently, we investigate the relationship between the parameters of actor-critic learning and those of Q-learning. Specifically, we check how the parameter estimates of the Q-learning model vary when fitted to simulated data generated by actor-critic models with varied parameters.
Finally, based on the nature of the actor-critic model we examine in this study, we consider a model of habit formation as an example where the actor-critic model yields qualitatively different predictions than Q-learning. In our model setting, the actor-critic model predicts that if the initial expectations are too high, the habit strength (preference for the target action) is attenuated due to the negative RPE, and the target behavior is not maintained. This prediction qualitatively differs from that of Q-learning, whereby the higher the expectation is, the more likely the behavior is to be maintained. This prediction may have implications for behavior change interventions.
As a specific paradigm of decision-making tasks, we mainly consider a two-alternative choice task called the twoarmed bandit task, which is typically used in studies using RL models (O'Doherty et al. 2004;Pessiglione et al. 2006;Daw 2011). In this task, a single choice is made for each discrete trial, resulting in the presence or absence of a rewarding outcome. The probabilities of reward are unknown to the agent (subject) and change during the task without notice. Thus, the probabilities should be learned by trial-and-error. In addition, although the "value" in general RL refers to the estimated amount of future cumulative reward, in such a task, each trial is considered an independent episode. Therefore, the value in this study represents the expected value of the reward in each trial.

3
The remainder of this article is organized as follows. The "Models and Task Settings" section introduces the models, including the actor-critic models, Q-learning models, and logistic regression models, and the behavioral task settings considered in this study. The "History Dependence of the Actor-Critic Model" section discusses the statistical properties (i.e., history dependence) of the actor-critic models while examining the relationship between the actor-critic models and the logistic regression models. The "Mapping Actor-Critic Learning to Q-Learning" section evaluates how each variant of the Q-learning model is mapped to the choice behavior produced by the actor-critic models. In the "Implications for Habit Formation: the Effect of the Initial Expectation" section, we consider the implications of the properties of the actor-critic model for human behavior in daily life using the model of habit formation. Finally, the "Discussion" section discusses the significance and limitations of this study.

Models and Task Settings
First, we introduce a basic function used to determine the probability of choice for each option called a softmax function, which is commonly used in RL models and logistic regression models (in the "Choice Probability Function" section). The models differ in how they determine the preference for each option given to this softmax function. Next, we introduce the actor-critic leaning (in the "Actor-Critic Learning" section), which is the main focus of this study, followed by Q-leaning models (in the "Q-Learning Models" section) and then logistic regression models (in the "Logistic Regression Models" section). Then, we introduce the two-alternative choice task (in the "Task Settings" section). Readers familiar with these models and the task may skip to the "History Dependence of the Actor-Critic Model" section.

Choice Probability Function
Here, we denote the chosen option at trial t by a t ( a t ∈ {1, 2} in the two-alternative case). Let W t (i) denote the preference for (or weight of) specific action i at trial t. Based on the set of preferences, the model assigns the probability of choosing option i by using the softmax function: where K is the number of possible actions. In our binary choice case ( K = 2 ), the probability of selecting action 1 can be written as follows: The RL models considered in the present study determine the preference W t (i) depending on past events (rewards and choices) as specified below. In the two-alternative case, it is convenient to consider the relative preference between the two options defined as a logit 1 The logistic regression model can directly model this logit as a linear combination of the effects of past rewards and choices (and their interactions in some model settings).

Actor-Critic Learning
Basic RL models learn the values of each state (e.g., in actorcritic learning) or each state-action pair (e.g., in Q-learning). However, the present study basically considers a situation in which no state transition explicitly occurs (Sakai & Fukai 2008a). Thus, we consider RL models without state variables (i.e., the state variable is a constant). We also assume that the choice at each trial does not affect the outcome at subsequent trials.
We introduce the actor-critic model based on these settings. The critic estimates the expected reward of the (single) state. The estimate of the value of the state at trial t, denoted as V t , is computed by using the update rule as follows: where denotes the critic learning rate, which determines the extent to which the reward prediction error (RPE), (r t − V t ), is reflected in the value. The initial value, V 1 , is set to 0.5 in the present study. The outcome (reward) at trial t is denoted by r t .
The preferences (or policy parameters) are directly calculated based on the RPE obtained from the critic as follows: where the decay parameter ∈ [0, 1] is included to adapt to a nonstationary reward contingency (Li & Daw 2011;Spiegler et al. 2020). ∈ [0, ∞) is called the policy learning rate. The initial values of W 1 (1) and W 1 (2) are set to 0.5 in the present study.
Notably, we did not include the inverse temperature parameter , which is often included in the choice rule of RL models (Sakai & Fukai 2008a) because the policy if action i is unchosen at trial t learning rate is statistically indistinguishable from the inverse temperature; the transformation → × c and W 1 (i) → W 1 (i) × c has an equivalent effect on the choice probability as the transformation → × c. Figure 1 shows the typical behavior of an actor-critic model in which = 0.7, = 2.0 , and = 0.2 . The time evolution of the probability of choosing option 1 in the actor-critic model is depicted in panel A (blue lines). The preference W t (i) and the state value V t are depicted in panel B. Notably, the preference for option 1 and the choice probability decrease even when the same choice persists and the reward continues to be given as observed in the first five to ten trials. The mechanism underlying this phenomenon is discussed later.

Q-Learning Models
Here, we provide the formulation of the full version of the Q-learning model considered in the present study. This model includes various additional components (in addition to the standard Q-learning model), such as an asymmetric learning rate, forgetting rate, and choice-autocorrelation factor. Specific reduced models are derived from this general formulation by decreasing or fixing the parameters. The model formulations presented here basically follow Katahira (2018).
The Q-learning model assigns each option i an action value Q t (i) at trial t. The action value of the chosen option i is updated as follows: where + L ∈ [0, 1] and − L ∈ [0, 1] are the learning rates that determine how much the model updates the action value depending on the sign of the RPE, r t − Q t (i) . The initial action values are basically set to zero (i.e., Q 1 (1) = Q 1 (2) = 0 ). For unchosen option j ( i ≠ j ), the action value is updated as follows: where F ∈ [0, 1] is the forgetting rate. Several studies have reported that a non-zero forgetting rate improves the goodness of fit to behavioral data (e.g., Ito & Doya 2009;Toyama et al. 2017;Katahira et al. 2017), whereas the forgetting rate F has often been set to zero (i.e., the action value of the unchosen option is not updated).
To model the effects of choice history (choice hysteresis), the choice trace (or choice kernel; Wilson & Collins 2019) C t (i) , which quantifies how frequently option i has been chosen recently, is computed as follows: where the indicator function I(⋅) is 1 if the statement is true and 0 if the statement is false. The initial values are set to zero, i.e., C 1 (1) = C 1 (2) = 0 . The parameter ∈ [0, 1] is the decay rate of the choice trace.
Using the action values and the choice traces, the preference used in the softmax function (Eqs. (1) and (2)) is given by W t (i) = Q t (i) + C t (i) , where the inverse temperature parameter ∈ [0, ∞) controls how sensitive the choice probabilities are to the value difference between actions, and the choice trace weight ∈ (−∞, ∞) controls the tendency to repeat (when > 0 ) or avoid (when < 0 ) recently chosen options. In the two-alternative choice case, Fig. 1 Sample behaviors of RL models considered in the present study. The simulation of the two-armed bandit task was performed by using an actorcritic model (as a true model), and a standard Q-learning model ("Q"-model) was fitted to the data. A The probability of choosing option 1 calculated from the actor-critic model (blue line) and fitted Q-learning model (orange line). B Preferences W t (i) and state value V of the actor-critic model. C Action values (Q values) of the Q-learning. The first 160 trials of the session are depicted where C 1 = 0 , and the logit is h t = (Q t (1) − Q t (2)) + C t .
We consider four variants of Q-learning models, termed the "Q," "Q+C," "Q+A," and "Q+CA" models. All models have the forgetting rate ( F ) as a free parameter. The Q model has symmetric learning rates ( + L = − L ) and no choice-autocorrelation factor ( = 0 ). The Q+C model has symmetric learning rates ( + L = − L ) and the choice-autocorrelation factor ( and are free parameters). The Q+A model has asymmetric learning rates (both + L and − L are free parameters) and no choice-autocorrelation factor ( = 0 ). The Q+CA model is the full model in which all parameters are free parameters.
Q-learning is similar to actor-critic in that it changes preferences in the direction that minimizes the RPE. Thus, the overall behaviors of these two model types are expected to be similar. The orange line in Fig. 1A shows the time course of the choice probability of the Q model (with the symmetric learning rate and no choice-autocorrelation factor) fitted to the data generated by the actor-critic model. The pattern of change in the choice probability is approximately similar to that of actor-critic in Q-learning as expected, but the two models differ in some trials. The nature of these differences between the two models is discussed later.
A model recovery analysis shows that the actor-critic model considered in this study is statistically distinguishable from the four variants of Q-learning models (Appendix 6). This means that the actor-critic model is not flexible enough to fit the data even when the Q-learning model is the true model. A parameter recovery analysis also confirms that the parameters of the actor-critic model can be recovered by model fitting (Appendix 7).

Logistic Regression Models
As mentioned above, the logistic regression model directly models logit h t (Eq. (3)) as a linear combination of the history of rewards and choices (e.g., Lau & Glimcher 2005;Seymour et al. 2012;Kovach et al. 2012) or their interactions.
Following the convention in Lau and Glimcher (2005) and Corrado et al. (2005), we represent the reward history R t as 1 if option 1 is chosen and a reward is given at trial t −1 if option 2 is chosen and a reward is given at trial t 0 if no reward is given at trial t , and the choice history C t as This coding of histories is naturally obtained by assuming that the effects of past rewards and choices are symmetric around options, i.e., rewards or choices from option 1 have exactly the opposite effect from that of a reward or choice from option 2 (Lau & Glimcher 2005).
Using these history variables, the logistic regression model with only main effects (with no interaction term) is constructed as follows: where b (m) r and b (m) c are the regression coefficients of the reward and choice at trial t − m , respectively. The constants M r and M c are the history lengths of the reward history and the choice history, respectively.
We also consider a logistic model with interaction terms comprising rewards and choices at same/different trials. The (full) model with the past two trials includes interaction terms, such as

Task Settings
In the two-alternative choice task used for the simulations, one option was associated with a high reward probability, 0.7, while the other option was associated with a low reward probability, 0.3. With the probability of the chosen option, the reward was given ( r t = 1 ); otherwise, no reward was given ( r t = 0). 2 After each 100-trial block, the reward probabilities of the two options were reversed.

History Dependence of the Actor-Critic Model
In this section, we examine the history dependence of the actor-critic model by expanding the model equations and mapping it to the logistic regression model as previously applied to Q-learning models (Katahira 2015(Katahira , 2018.

3
The update rule for the actor (Eq. (5)) can be expanded as follows: From Eq. (12), we can evaluate the logit h t for the actorcritic model as follows: where we defined Δ t = I(a t = 1) − I(a t = 2) . When the initial preferences are identical between actions (i.e., W 1 (1) = W 1 (2) ), the first term vanishes. Thus, we basically neglect this initial value effect.

When the State Value Can Be Regarded as a Constant
First, we consider a comprehensible case in which the state value V t is constant ( V t = V const for all t). This situation is an approximation of a case in which the critic learning rate is so small that the change in V is sufficiently slow and can be regarded as a constant over a short period. From Eq. (13), in this case, the logit h t becomes Notably, C t = Δ t , R t = Δ t r t in the logistic regression model formulation (see Eqs. (9) and (10)), from which Eq. (14) can be rewritten as follows: Since the coefficients of both R's and C's are constant, this model is equivalent to a logistic regression with only the main effect with the following relationships: if is sufficiently small so that any terms beyond the history length that are not included in the model are negligible. Notably, when the constant state value V const is positive, i.e., V const > 0 , the coefficients of the choice history, b c 's, are negative because > 0 and > 0 . This effect results from the assumption that the update of preferences is proportional to the RPE and the fact that the expectation (here V const ) has a negative sign in the RPE. Thus, the higher the expectation is, the lower the increase in the preference for the chosen option becomes. This effect reduces the probability of repeating the same choice with higher expectations. More specifically, the preference for a chosen action decreases by V const (in addition to the increment due to rewards; see Eq. (5)), while the preference for the unchosen option is only subject to the decay effect when < 1. Figure 2 plots the regression coefficients of the logistic models with main effects only applied to the data from the actor-critic with a constant state value (symbols). The corresponding analytical results given by Eqs. (16) and (17) are also plotted (solid lines). We observe that the analytical result coincides with the regression coefficients of the simulated data. It should be noted that this coincidence occurs only when the decay rate is not close to 1; when the decay rate is close to 1, the effect of reward does not Fig. 2 The effects of the reward and choice history of actor-critic learning when the state value is constant. The symbols represent the regression coefficients of the logistic regression model fitted to simulated data generated by actor-critic models with a varied constant value of V (see legend). Left, the reward history. Right, the choice history. Error bars representing the standard error of the mean (SEM) are plotted but almost invisible due to the small SEM. The solid lines represent the analytical predictions obtained using Eqs. (16) and (17) Reward Choice decay, and the history in the distant past (beyond what can be included in the logistic regression) influences the current choice. As the theory predicted, the constant state value V const does not affect the dependence on the reward history (left panel) while it does influence the dependence on the choice history (right panel); the larger V const is, the larger the negative choice history dependence becomes. This finding suggests that if the expectation of the reward (i.e., V const ) is large, the choice is easily switched. When the state value is negative, the positive dependence on the choice history appears (gray line); the greater the agent's negative expectations, the more likely the agent is to choose the same option. Although this is an unrealistic assumption in the current setting in which outcome ( r t ) has a non-negative value, such a situation can occur when the outcome has a negative value (i.e., when an aversive outcome, such as an electric shock, is given), such as in an avoidance task (Maia 2010).

General Case
Next, we consider a more general case in which the learning rate is not close to zero, and the variation in the state value V t is not negligible. First, note that the update rule of V (Eq. (4)) can be rewritten as from which we have In this case, since is assumed not to be small, the effect of the initial value of V, which is represented by the term (1 − ) t−1 V 1 , can be neglected. By inserting this result into Eq. (13) (with W 1 (1) = W 1 (2) ), we obtain If we consider only the effect of a reward at trials t − 1 , t − 2 , and t − 3 , we obtain the following (see Appendix 3): A schematic illustrating how the impact of a reward changes as the trial proceeds in this situation is shown in Fig. 3. These graphs show how the impact of the reward (contribution to the logit) given at trial t changes at trial t + 1 , trial t + 2 , and trial t + 3 . Specifically, they plot the coefficient of reward r t in the logits h t+1 (for lag 1), h t+2 (for lag 2), and h t+3 (for lag 3). Due to the effect of the decay of preference ( ), the impact basically decays, but when is non-zero, the effect changes after trial t + 1 depending on which choice was made previously. Interestingly, when is large ( = 0.5 , panel B), depending on the subsequent choice, the effect of the reward may become stronger over time (at trial t + 2 , when + > 1 ) and may even become negative at trial t + 3.
This pattern of history dependence can be captured by a logistic regression model with interaction terms. For simplicity, considering only the effect of a reward at trials t − 1 and t − 2 in Eq. (20), we have where we used the expression given in Eqs. (9) and (10). The general form of the impact of past rewards is given in Appendix 4. Equation (21) clarifies that when has a nonzero value, a third-order interaction between the choices and reward, C t−1 C t−2 R t−2 , exists because the effect of the reward at trial t − 2 is included in the state value of next trial V t−1 , and the next choice determines how V t−1 will influence subsequent choices. The reward at trial t − 2 increases expectation V t−1 and thus decreases the preference for the action chosen at trial t − 1 . This property arises from the peculiarity of actor-critic learning whereby the reward expectation (i.e., the state value) is shared by different actions and modified by any action chosen at the same state. Therefore, the effects of past rewards are also modified by which actions are subsequently chosen. This causes an interaction between choice history and reward history that is specific to actorcritic learning. Figure 4 shows the results of the simulation examining how the characteristics of the actor-critic model are captured by the logistic regression analysis. Here, the critic learning rate varied among 0, 0.5, and 0.8. Panel A shows the effect of the logistic regression with the main effect only. The effect of the critic learning rate is not very pronounced in this case. As the above theoretical considerations indicate, the effect of the critic learning rate is observed in the interactions (panel B). The bars represent the estimates of the regression coefficients. The analytical predictions of Eq. (21) are indicated by the red line. When has a non-zero value, the interaction among C t−1 , C t−2 , and R t−2 becomes negative. When is zero, V t is constant, and this interaction has a zero coefficient. These results highlight the fact that a model with only a linear combination of past events does not capture the characteristics of the actorcritic model well and that it is important to consider interactions.
Next, we consider the similarities and differences in these results with the history dependence of the Q-learning model ("Q model"). In Q-learning models in which the 1 3 learning rate is denoted by L and the forgetting rate is denoted by F , the impact of a reward at trial t − m on trial t is given by Katahira (2015) where n same is the number of trials where the same choice was made as at trial t − m (from trial t − m + 1 to trial t − 1 ), and n diff is the number of trials in which a different choice was made from that at trial t − m ( n diff = m − 1 − n same ). If the learning rate equals the forgetting rate ( L = F ), the impact of a reward at trial m becomes L (1 − L ) m−1 , which is independent of the subsequent choices (Katahira 2015). If the learning rate is higher than the forgetting rate (i.e., L > F ), the impact of the reward decreases the more often the same option is chosen later. In this respect, it is similar to the actor-critic model, but the effect is multiplicative, whereas in the actor-critic model, the effect of subsequent choices is subtractive. Thus, in Q-learning, in contrast to the actor-critic model, the impact of rewards does not increase or become negative. Therefore, the actor-critic model and Q-learning qualitatively differ in their history dependence. This finding indicates that these models can be statistically distinguishable based on their history dependence. Thus, to what extent can Q-learning models mimic the choice behavior of the actor-critic model? If such models can mimic the choice behavior to some extent, how are the parameters of the two models related? We discuss these issues next.

Mapping Actor-Critic Learning to Q-Learning
We evaluate how each variant of Q-learning models can mimic the choice behavior produced by the actor-critic model. Specifically, we generated choice data by using the actor-critic model (as the "true model") when performing the two-armed bandit task. Then, four variants of Q-learning models (see the "Q-Learning Models" section) were fitted to the simulated data. As a reference model, the actor-critic model, which was the same structure as the true model, was also fitted to the simulated data. The detailed procedures of the simulation and model fitting are shown in Appendices 1 and 2, respectively.
First, we varied the critic learning rate in the true model. Figure 5 compares the predictive performance of the fitted models. Here, the predictive performance was measured using the normalized likelihood of the test data (Ito & Doya 2009), i.e., the geometric mean of the likelihood per trial of the test data, which were generated using the same actor-critic model with the same conditions. Our theory suggests that the larger the , the larger the interaction between choice and reward history (Eq. (21)), leading to a large deviation from the Q-learning model. Indeed, when = 0 , the Q-learning model with the choice autocorrelation factor (Q+C model) produced a normalized likelihood almost identical to that of the actor-critic model (Fig. 5). However, as the critic learning rate increased, the likelihood of the Q-learning models decreased compared to those of the actorcritic model. Among the Q-learning models considered in this study, the Q+C and Q+CA models produce predictions that are closest to those of the actor-critic model. Thus, the asymmetry ("A") of the learning rate in Q-learning is not necessary for mimicking the behavior of the actor-critic model, and including the effect of the choice history ("C") is sufficient. Therefore, we examined the correspondence between the parameters of the Q+C model and those of the actor-critic model. Figure 6 shows the correlation between the parameter estimates of the Q-learning model (Q+C model) and the true parameter values of the actor-critic model. The parameters of the true model (the critic learning rate in panel A and the policy learning rate in panel B) were randomly sampled from uniform distributions, while the other parameters were fixed. Overall, varying a single parameter of the actor-critic model influences multiple parameters of the Q-learning models. Thus, there is a one-to-many correspondence from the actor-critic model to Q-learning models. When the critic learning rate was varied (panel A), the learning rate of Q-learning ( L ) increased correspondingly. In contrast, the inverse temperature, , decreased as increased. The other parameters (forgetting rate F and those related to choice autocorrelation factors, , ) are also moderately correlated with . When the policy learning rate, , was randomly varied (panel B), the inverse temperature and learning rate L in the Q-learning model were highly positively correlated with .
Here we found that changes in the critic and policy learning rates manifest as changes across a wide variety of parameters in a Q-learning model. One might be interested in knowing what the converse is, or how changes in the Q-learning parameters manifest in the parameters of the actor-critic model. The simulation addressing this issue is reported in Appendix 8. The results show that the changes in Q-learning parameters induce changes in almost all of the parameters of an actor-critic model (we also find one-to-many mapping from Q-learning to actorcritic learning).
In Appendix 5, as an extension of the actor-critic model, we consider the asymmetries in learning that have been considered mainly in action value-based models in the previous literature (Frank et al. 2007;Niv et al. 2012;Gershman 2015;Lefebvre et al. 2017;Ohta et al. 2021).
Here, we show that the asymmetry in the Q-learning model directly corresponds to the asymmetry of the actor (rather than Fig. 5 Predictive accuracy of different models fitted to simulated data generated from the actor-critic learning model with different critic learning rates = 0, 0.2, 0.4, 0.8 . The predictive accuracy was measured by the normalized likelihood of the test data. The other parameters of the actor-critic model were set to = 0.8, = 1.5 . The error bars show the mean ± SEM across 50 simulations the critic) in the actor-critic model. However, the mapping is not one-to-one.

Implications for Habit Formation: the Effect of the Initial Expectation
The purpose of this section is to demonstrate a situation in which the properties of actor-critic learning revealed in this study (specifically that a high state value induces a negative choice history effect) clearly manifest in behavior. As an example, we consider a situation in which an agent is trying to continue a particular action (habit formation). Even when humans want to continue a healthy habit, such as exercising, or a self-improvement activity, such as learning a foreign language, they may sustain the behavior because of other temptations. We model such a process with RL models.
Let action 1 be the behavior an agent wants to continue. First, we assume that what reinforces the action is an intrinsic (subjective) reward, such as the satisfaction obtained as a result of the action. Furthermore, we assumed that the reward value, r, in the RL models represents the net value, which is the reward obtained from the outcome minus the Until an agent becomes accustomed to the behavior to some extent, the mental cost is high, and the (net) reward value is assumed to be small. Moreover, even once people are accustomed to the behavior, the mental cost increases if the behavior is not performed for a long time, and the reward value is assumed to diminish. Based on these assumptions, we assume that the reward resulting from performing action 1 is initially r = 0.2 , and each time action 1 is performed, the reward value is updated as r ← r + 0.2 × (1 − r) . If action 1 is not performed at a discrete time step, the reward value is updated as r ← 0.8 × r , and the reward value decays toward zero. Let action 2 be an action that provides some degree of satisfaction without a cost, such as watching a video on a smartphone, and assume that choosing action 2 always yields a constant reward value r = 0.5 . Therefore, action 1 will have a higher reward than action 2 if it is chosen with some persistence. However, if action 1 is not selected for a certain period, action 2 will yield a higher reward than action 1. Note that these reward settings are designed to emphasize the property of actor-critic learning. We do not argue that this is a valid model for this situation. It is necessary to examine the validity of this model setting in light of empirical data.
Initially, when the agent decides to maintain the action, the agent intends to perform action 1; thus, in the actor-critic model, it would be appropriate to set the initial value of the preference for action 1 high and the preference for action 2 low. Here, we set W 1 (1) = 5 and W 1 (2) = 0 . This setting was applied to ensure that action 1 is strongly preferred at the beginning.
Here, we consider the effects of the initial expectation and the learning rate. In the actor-critic model, the initial expectation corresponds to the initial value of the state value, V 1 ; in Q-learning, it is represented per action, where Q 1 (1) is the expectation for action 1. Figure 7A shows an example behavior of the actor-critic model when the state value remains unchanged and the initial value is kept low ( V 1 = 0.2 , = 0 ). The other parameters were set as = 0.9 and = 2.0 , which were the same as the Fig. 7 Examples of simulations of habit formation with an actorcritic model. A An example with a low initial expectation, V 1 = 0.2 , where the expectation is not updated (critic learning rate = 0 ). B Case with a high initial expectation,V 1 = 0.8 , where the expectation is not updated. C An example with a high initial expectation, V 1 = 0.8 , where the expectation is updated relatively slowly (critic learning rate = 0.05). D An example with a high initial expectation, V 1 = 0.8 , where expectation is updated relatively quickly ( V 1 = 0.8 , critic learning rate = 0.2 ). In each example, the upper panel shows the reward (orange), state value (i.e., expectation; yellow line), and the probability of choosing action 1 (blue line). The gray dot on the top indicates that action 1 was selected, while the gray dot on the bottom indicates that action 2 was selected. The bottom panel shows the preference for action 1 ( W t (1) , magenta line) and the preference for action 2 ( W t (2) , green line) other conditions. In this example, action 1 was selected first, and since the reward obtained was as expected or did not greatly differ from the initial expectation ( r ≈ 0.2 ), the RPE was small even if it was positive for a short duration, and the preference decreased due to the effect of decay. However, once the reward obtained by action 1 and the positive RPE increased, W t (1) increased until it was balanced by the effect of the decay, and action 1 was continuously selected. Figure 7B shows the behavior when the expectation for reward remains high and unchanged ( V 1 = 0.8 and = 0 ). In this example, action 1 was initially selected, and since the reward was less than expected ( r < 0.8 ), the preference for action 1 decreased due to negative RPE. The reward of 0.5 obtained by action 2 was also less than V = 0.8 ; thus, the effect of the negative selection history dominated, and action switching was repeated. Action 1 was selected with some frequency, but because it was intermittent, the reward value obtained by action 1 did not increase to exceed the reward value obtained by action 2.
The results of the cases in which the expectation, V t , is updated ( > 0 ) are shown in Fig. 7C-D. Even when the initial expectation was high ( V 1 = 0.8 ), if V t converges to the stable reward obtained by action 2, action 2 can become dominant (Fig. 7C). However, if V t decreases to the reward value obtained by action 1 while action 1 is still frequently selected, the negative RPE will not be experienced as much, and the reward obtained by action 1 will increase such that action 1 is selected at a high frequency (Fig. 7D). The latter case is more likely to occur when V t is more likely to change quickly, i.e., when is larger.
Next, we systematically examined the effects of initial expectations and the critic learning rate in actor-critic learning. Figure 8A shows the fraction of selecting action 1 from the 100th trial to the 200th trial as a function of the initial expectation for several critic learning rates. One hundred simulations were performed under each condition, and the averages are connected by a line. The overall selection rate of action 1 tends to decrease as the initial expectation increases. When V t does not change ( = 0 ) or changes slowly ( = 0.01 ), the selection rate of action 1 is concentrated at 1 and 0.5 because, as we saw in Fig. 7B, the preference for action 2 does not increase, and therefore, action 2 does not become dominant. If is larger than that, the decrease in the probability of choosing action 1 due to the increase in the initial expectation tends to be suppressed.
In Q-learning, the effect of the initial expectation is the opposite of that in actor-critic learning (Fig. 8B). In Q-learning, we assumed that the initial action value of action 1, Q 1 (1) , corresponds to the initial expectation (of the reward obtained from action 1) and was varied in the simulation. The learning rate L and the forgetting rate F were varied among 0, 0.05, 0.1, and 0.2 ( L = F ). The initial action value of action 2 was set to Q 2 (1) = 0 . The inverse temperature parameter was set to = 5.0 , which produces the same initial choice probability as actor-critic when Q 1 (1) = 1 . Figure 8B shows that the probability of choosing action 1 tends to increase as the initial expectation increases in Q-learning. This occurs because although the amplitude of the negative RPE increases as the action value increases, the order relation is preserved as follows. With the same learning rate, the higher the Q value is, the higher the next Q value becomes. That is, from the update rule, Q t+1 ← (1 − )Q t + r t , we can confirm that if Q t < Q ′ t , then Q t+1 < Q � t+1 is always true for the given and r t . Therefore, if the learning rate is the same, it is unlikely that there will be a reversal phenomenon such that the larger the initial action value is, the smaller the subsequent action value becomes (reversal may occur in some situations due to a change in action selection depending on the initial action value).

Discussion
The present study investigated the statistical, transient property of actor-critic learning. Specifically, we explored how recent reward and choice histories influence subsequent choices rather than focusing on the long-term asymptotic behavior of the model. We emphasized the commonality and differences from the Q-learning model, which has commonly been used to model learning and decision-making in value-based choice tasks. The actor-critic and Q-learning models share the basic property that causes repetition of the choice that leads to reward. Therefore, the overall trend in their choice probabilities should be similar. However, the details of how the histories of the reward and the choice affect behavior differ. Specifically, differences arise from the main effects, interactions between reward and choice histories, and their balance. The analytical results of this study play an important role in clarifying how the model parameter affects the pattern of the interaction. Figure 4, for example, shows how the critic learning rate, , determines the strength of the interaction between the choice history and reward history. A notable feature of the actor-critic model is that the model produces a negative dependence on choice history when the state value (i.e., expectation of reward) is positive. This effect arises because the expectation has a negative impact on RPE (the higher the expectation, the smaller the RPE) and, thus, decreases the preference for the chosen option. This negative effect of choice can be stronger when the critic learns optimistically (i.e., updates the value of a positive outcome more than of negative ones) or when the initial state value is high and persists due to a low critic learning rate. The negative choice history effects inherent in the actor-critic model might explain the exploratory behavior. Suzuki et al. (2021) found that participants tended to avoid the most recently chosen option in a three-armed bandit task. While Suzuki et al. modeled this tendency with a Q-learning model with choice-autocorrelation factor, the choice history effect may also be explained by the actorcritic model. Suzuki et al. also found that the tendency to avoid the most recently selected option was weaker in individuals with obsessive-compulsive tendencies (Suzuki et al. 2021). Given the results of the present study, another possibility might be that such individuals tend to have a low expectation of reward.
This negative effect of the expectation causes the qualitative differences between actor-critic models and action-value based models, such as Q-learning, in the effects of reward expectation. In the "Implications for Habit Formation: the Effect of the Initial Expectation" section, we show that in the actor-critic learning, if the initial expectation is too high and persists (due to the small critic learning rate), the behavior is less likely to be retained. These results may provide the implications of the actor-critic model for habit formation. For example, this model may provide guidelines for how to intervene when people want to continue a behavior. If the actor-critic model is plausible for habit formation (although this is not yet widely appreciated), when a behavior needs to be continued and made a habit, it may be important to keep expectations not too high and maintain a positive RPE. The effects of such initial expectations have not yet been empirically observed and should be confirmed in future studies. Furthermore, the basic findings presented in this study could be useful in designing experimental paradigms and models.
In computational modeling, the effect of model misspecification is crucial when the purpose is to interpret the parameters as a measure of the characteristics of psychological or cognitive processes (Nassar & Gold 2013;Katahira 2018;Toyama et al. 2019;Sugawara & Katahira 2021). As the present study shows, if the "true processes" and the model used for fitting are different, a change in a single true process can affect the estimates of multiple parameters in the fitted model. Specifically, we have shown that if the actor-critic model is the "true model" and the Q-learning model is fitted, the difference in a single parameter of the actor-critic model will appear as a difference in almost all parameters in the Q-learning model (e.g., a change in the policy learning rate influences both the learning rate and the inverse temperature in Q-learning). The true processes and the fitted models are generally considered to be different ). These facts may lead to inconsistency across studies that report the association between RL parameters and psychiatric disorders. For example, several studies reported that the inverse temperature parameter (or equivalently, reward sensitivity) of Q-learning is smaller in individuals with depression, whereas other studies associated a diminished learning rate with depressive symptoms (for a review, see Robinson & Chase (2017)). However, if the policy learning rate is weakened in such patients, the argument regarding which parameters, i.e., inverse temperature or learning rates, are associated with the symptoms may be meaningless; the inconsistency may be due to the one-to-many mapping from true processes to fitted model parameters. The situation examined in this study may serve as an illustration of a possible scenario of such an issue.
Applying a dimensionality reduction technique, such as a principal component analysis (PCA) or factor analysis, to a set of parameter estimates may be an approach to address this issue Moutoussis et al. 2021). The interpretation of a dimension consisting of multiple parameters rather than an individual parameter may lead to more consistent results. In addition, while it is important to evaluate a model by its goodness of fit, it may be better to evaluate a model by considering the ease of the interpretation of its parameters as follows: a model with fewer parameters that correspond to individual characteristics of a behavior may be considered a better model.
To mitigate the problem of model misspecification, it is recommended that researchers focus on model-independent measures that capture the characteristics of the processes in which they are interested Wilson & Collins 2019). As a model-independent analysis, logistic regression with only main effects of past events has been frequently performed. However, along with previous studies (Katahira 2015(Katahira , 2018, the present study revealed that there are elements of the computational process that do not directly emerge as main effects; instead, such effects (e.g., the difference in the critic learning rate) may appear in certain interactions between past events. Thus, including an interaction term in the regression model is sometimes required. However, the longer the history to be considered, the larger the number of interactions. To determine which interactions are most likely to reveal traces of a particular process, it may be useful to plan the analysis in advance by simulating a RL model that explicitly models the process.
A limitation of the present study is that we only consider RL models with no state variable. Although much research in psychology and neuroscience has been conducted using RL models without state variables, several studies employing RL model-based analyses have used RL models that incorporate state variables (e.g., Tanaka et al. 2004;Schweighofer et al. 2008;Spiegler et al. 2020). However, the basic results of the present study can be applied to models with state variables. For example, the results could be applied to examine how choice in a particular state is influenced by the history of events experienced when staying at the same state as that of the past. Specifically, the negative dependence on choice history can also occur in a multistate actor-critic model: the action selected in a particular state will be less likely to be selected when the agent visits the next time. In addition, the specific interactions found in the present study may be observed between the reward (plus next state value) and choice histories under a particular state.
Although the actor-critic model is only one of a wider class of RL models, the findings of the present study and the framework of the analysis may be applicable to other models. For example, the results may have implications for algorithms that perform value estimations and policy updates in different systems, such as policy-gradient approaches, which have attracted attention (Mongillo et al. 2014;Bennett et al. 2021). When considering the online learning of continuous actions, such as in determining response vigor, a policygradient method based on the REINFORCE algorithm (Williams 1992) is often used (Niv 2007;Lindström et al. 2021).
In such models, the effect of the reward reflected in the reward expectation (average reward) does not appear immediately; it appears only after the expectation is used in RPE in a subsequent policy update. Thus, the effect of rewards can have an interaction with subsequent actions depending on the values of the parameters as the present study shows in the actor-critic model. These complex effects may cause the model to not behave as intended and may require careful parameter tuning and model design. However, by analyzing data focusing on such interactions, it is possible to verify whether such computational processes actually exist. The findings and analytical framework of the present study could be useful when considering such issues.
The actor-critic model without the state variable itself also has several possible variants. For example, the opponent actor learning (OpAL) model is an extension of actor-critic learning that incorporates neuroscientific knowledge (Collins & Frank 2014). The OpAL model is characterized by the fact that for each action, preference is determined by a combination of two weights, called the Go weight and NoGo weight, and these update rules include a nonlinear Hebbian component. Additionally, the gradient bandit algorithm (Sutton & Barto 2018) can be considered another variant of the actor critic model; it updates the preference of the unchosen action in the opposite direction to the preference of the chosen action. Another possible extension would be a model that adds a decay parameter (such as for the preference update) not only to preference updates but also to state value updates. From the insights gained from our analytical results, these models can also be expected to have the basic properties discussed in our study, but understanding their specific properties will be a subject for future work.

Appendix 1. Simulation settings
Here, we detail the settings of each simulation. All simulations, statistical analyses, and figure plots were conducted using R version 4.0.0 (R Core Team 2015).
The procedures of the simulations of the choice task based on the actor-critic model are as follows. First, we generated choice data from actor-critic models that were applied to a two-armed bandit task (see the "Task Settings" section for details).
To generate the sample behavior of RL models (Fig. 1), the parameters of the actor-critic model were set as = 0.7 , = 2.0 , and = 0.2 . The model performed the two-armed bandit task consisting of 4000 trials. The Q-learning model without learning asymmetry and choice-autocorrelation factor (Q model) was fit to the data.
To investigate the history dependence of the actor-critic models (logistic regression analysis; Figs. 2, 4, 9, and 10), we generated hypothetical data for 1000 sessions per condition.

3
Each session consisted of 2000 trials. The logistic regression model was fitted to the simulated data using the maximum likelihood method. Specifically, we used the "glm" function implemented in the R programming language. The length of the history included in the logistic regression model with only main effects was set to 12 for both the reward and choice histories ( M r = M c = 12 ). In the logistic regression with the interaction terms, all combinations of histories from two trials ago were included as interaction terms. However, in the resulting figure, only the terms that could be non-zero were plotted.
Regarding the simulations in which the mapping from the actor-critic model to the Q-learning model was examined (Figs. 6 and 11), in total, 200 actor-critic parameter sets were generated, and 2000 trials of choice data were generated for each. For the simulations shown in Fig. 5, the parameters and were fixed at 0.8 and 1.5, respectively. Regarding the parameters to be varied, the critic learning rate was varied among 0, 0.2, 0.4, and 0.8 in Fig. 5 and was generated from a uniform distribution on the range [0.1, 0.9] in Fig. 6A. In Fig. 6B, was generated from a uniform distribution on the range [0.5, 3.0]. In the simulation in Fig. 11 considering the asymmetry of the learning rate, the critic learning rates were set as + = 0.5 + x and − = 0.5 − x (Fig. 11A), and the policy learning rate was set + = 1.5 + x and − = 1.5 − x (Fig. 11B), where x was a uniform random number on the range [−0.4, 0.4].

Appendix 2. Model fitting to the simulated choice data
To estimate the model parameters, a maximum likelihood estimation (MLE) was separately performed for the data of each session. MLE searches a single parameter set to maximize the log-likelihood of the model for all trials. This maximization was performed using the rsolnp 1.15 package, which implements the augmented Lagrange multiplier method with an SQP interior algorithm (Ghalanos  Fig. 9 The effect of asymmetry in the critic. A Logistic regression with main effects only. The convention is the same as that in Fig. 4. B Logistic regression with interaction. The critic learning rate was varied (see legend), and the other parameters were fixed as = 0.5 , = 1.5 As this indicates, the impact of a past reward on the current choice is influenced by all subsequent choices as follows: the effect of the reward decreases the more often the same choice is repeated (i.e., C t−m+n = C t−m ).

Appendix 5. Asymmetric learning in actor-critic learning
In this Appendix, we consider an extension of the actorcritic model and discuss its statistical properties. Specifically, we consider the asymmetries in learning that have been considered mainly in action value-based models in the previous literature (Frank et al. 2007;Niv et al. 2012;Gershman 2015;Lefebvre et al. 2017;Ohta et al. 2021). In actor-critic learning, two types of asymmetry, namely, asymmetries in the critic (state value update) and the actor (policy update), can be considered, although such models have not yet been used for model fitting to behavioral data. 3 We examine the effect of these asymmetries on the history dependence of choice behavior. We also examine the relationship between these asymmetric learning rates in the actor-critic model and the parameters of the Q-learning model with asymmetric learning rates.

Asymmetry in the critic
First, we consider a case in which the critic learning rate is asymmetric. Let the critic learning rate for positive RPE be + and that for negative RPE be − . Figure 9 shows the results obtained from applying logistic models fitted to the simulated data when the critic learning rate was varied as + ∕ − = 0.7∕0.3 ("optimistic"), 0.5/0.5 (symmetric), and 0.3/0.7 ("pessimistic"). When we fit a logistic regression with only main effects (panel A), the effect of the choice history becomes more negative as the critic becomes more optimistic ( + is larger than − ). This trend is opposite to the effect of asymmetry in Q-learning, where the effect of choice history is more positive as it becomes more optimistic (Katahira 2018). The mechanism of this effect can be understood based on the result presented in the "When the State Value Can Be Regarded as a Constant" section; The more optimistic the critic, the higher the state value V. As discussed in the analysis in which the state value was assumed to be constant, the higher the state value, the larger the negative effect of choice history.
The results of the logistic regression including interactions between the history of the previous two trials are shown in Fig. 9B. In the asymmetric case of Q-learning, there is an interaction between rewards R t−1 and R t−2 (given that C t−1 = C t−2 ; Katahira 2018), but no such interaction between past rewards appears in the actor-critic model. This finding is explained as follows. The impact of a reward through the state value appears only two trials later. In addition, the value of − has no effect on the subsequent choice because when a reward is given, the critic learning rate is + , and when no reward is given, the reward r has no contribution to the logit. Thus, the effect of asymmetry in the critic learning rate is first observed in the influence of the reward three trials ago as explained in the following.
Let − = + − be the learning rate corresponding to the negative RPE: represents the strength of the asymmetry. Expanding the logit to three trials ago, we obtain the following: From this, we can observe that the asymmetry of the critic learning rate emerges only with the interaction term C t−1 C t−2 C t−3 R t−2 R t−3 and the main effect term R t−3 .
The analysis shown in Fig. 9B only examined the effects for up to two trials, and thus, we did not observe the interaction between rewards.

Asymmetry in the actor
Next, we consider a case in which there is asymmetry in the actor (i.e., in the update of policy). Here, the policy learning rate for positive PRE is denoted by + and that for negative RPE by − . Figure 10 shows the results of the history dependence when varying the policy learning rate as + ∕ − = 1.0∕2.0 ("pessimistic"), 1.5/1.5 ("symmetric"), and 1.0/2.0 ("optimistic"). When we applied the logistic regression with only main effects (panel A), the effect of choice history becomes positive as the actor becomes optimistic, which contrasts the case in which the critic is asymmetric (compare with Fig. 9A). This finding is consistent with the effect of asymmetry in the learning rate in Q-learning. The mechanism is the same as that in Q-learning as follows: the more optimistic the agent, the greater the effect of the reward, and the smaller the effect of the negative outcome (absence of reward). This results in a positive choice dependence (Katahira 2018). The results of the logistic regression including the interaction terms up to two trials ago are shown in Fig. 9B. In addition to the interaction term among C t−1 , C t−2 , and R t−2 , which also has a non-zero value in the standard actor-critic model, the coefficient of the interaction among R t−1 , R t−2 , and C t−2 has a non-zero value when there is asymmetry in the policy learning rate. The emergence of this interaction is explained as follows. We define the degree of asymmetry such that − = + − . Expanding the logit to up to two trials ago, we obtain the following: From this, the interaction R t−1 R t−2 C t−2 has a non-zero coefficient only when there is asymmetry in the policy update ( ≠ 0 ). Notably, this interaction is enhanced as the critic learning rate increases.

Fitting the asymmetric Q-learning model
Finally, we examine how these different types of asymmetry in the actor-critic model are mapped onto the Q-learning model with asymmetric learning rates. We generated the choice data by using the actor-critic model with different levels and different types of asymmetry. Then, the Q-learning model with asymmetric learning rates and choice-autocorrelation factor (Q+CA model) was fitted to the data. Figure 11A and B shows the resulting parameter estimates when the asymmetry in the actor (panel A) and the critic (panel B) were varied. The asymmetry of the actor ( + − − ) is strongly correlated with the asymmetry of the learning rate in Q-learning ( + L − − L ), but it also modestly affects the other parameters (panel A). The asymmetry of the critic is also moderately correlated with the asymmetry of the learning rate in Q-learning, but the trend is opposite to the asymmetry of the actor (panel B).
For the standard actor-critic model without learning asymmetry, the estimated learning rate for Q-learning (Q+CA model) is near symmetric ( + L − − L = 0 ), as shown by the results for + − − = 0 in the upper left panel of Fig. 11A, i.e., learning asymmetry was not identified. On the other hand, if the Q-learning model without the choice autocorrelation factor (Q+A model) is fitted, learning asymmetry is expected to be identified for the following reason. As we show in the present study, the actor-critic model has a negative choice history effect (when the state value is positive). The negative effect of choice history is represented by including the choice autocorrelation factor in Q-learning (as perseverance parameter is negative (25) h t = + Δ t−1 r t−1 + + Δ t−2 r t−2 − ( + − + r t−1 ) Δ t−1 r t−2 = + R t−1 + + R t−2 − ( + − )C t−1 C t−2 R t−2 − R t−1 R t−2 C t−2 .
in Fig. 11A). In Q-learning without the autocorrelation factor, however, the negative effect of choice history is captured by learning rate asymmetry since the symmetric learning rate can also induce choice autocorrelation, albeit with the interaction between past rewards; when the learning rate for the negative outcome is higher than that for the positive outcome, negative dependence on the choice history will arise (Katahira 2018).

Appendix 6. Model recovery
To show that actor-critic and four variants of Q-learning models (Q, Q+C, Q+A, and Q+CA) are statistically distinguishable, we performed a model recovery analysis (Wilson & Collins 2019). Specifically, we simulated the behavior of the actor-critic model and Q-learning models with the two-armed bandit task described in the "Task Settings" section with each involving 1000 trials. The parameters of the simulated model were sampled from uniform distributions on the following ranges: [0.2, 0.8] for , L , and ; [0.5, 3.0] for ; [2.0, 5.0] for ; and [1.0, 3.0] for . One hundred simulated datasets per model were generated, and the frequency of being the best fit model was calculated for each model. Model selection was performed based on Akaike's information criterion (AIC): where LL is the log likelihood function for the model and k is the number of estimated parameters in the model. A smaller AIC denotes a better model fit to the data. Figure 12 shows the confusion matrix resulting from the model recovery. The numbers denote the frequency with which each fitted model was selected for each simulated model. Although confusion was often observed among the variants of the Q-learning models (especially when Q+CA was the simulated model), the actor-critic model was rarely selected when a Q-learning model was true. The opposite was not the case at all. These results indicate that the actor-critic model and the Q-learning models are identifiable in our simulation setting.

Appendix 7. Parameter recovery
In parameter recovery analysis, we examined whether the true parameters used for simulating choice data were successfully estimated by the model fitting to the simulated data (Wilson & Collins 2019). Specifically, 100 sets of three parameters of the actor-critic model were generated Parameter recovery analysis of the actor-critic model. For each of the three parameters (critic learning rate , policy learning rate , and decay rate ), the recovered parameter estimates are plotted against the true ones used in the simulation from uniform distributions, and each parameter set was used to simulate the choice data in the two-armed bandit task involving 2000 trials. The ranges of uniform distributions from which each parameter was drawn were [0.1, 0.9] for the critic learning rate , [0.5, 2.0] for the policy learning rate , and [0.1, 0.9] for the decay rate . We fit the actor-critic model to the choice data and examined the correlation between the estimates and the true parameters. Figure 13 shows the results. For all three parameters, the recovered and true parameters are highly correlated, confirming that these parameters are identifiable.

Appendix 8. Mapping Q-learning to actor-critic learning
Here we examined how changes in Q-learning parameters manifest as changes in the parameters of an actor-critic model. We simulated the choice data using the Q-learning model with a choice-autocorrelation factor (Q+C model) while randomly sampling a single focal parameter from a uniform distribution whose range is [0.2, 0.8] for L , [0.5, 3.0] for , [0.5, 3.0] for , and [0.2, 0.8] for , while the other parameters were fixed to L = 0.5 , = 2.0 , = 2.0 , and = 0.5 . The actor-critic model with symmetric learning rates was fitted to the simulated data. Figure 14 shows the resulting parameter estimates. These results show that changes in any Q-learning parameter are mapped onto changes in almost all of the parameters of an actor-critic model. Thus, we conclude that there is also one-to-many mapping from Q-learning to actor-critic learning.