Introduction

The ultimate goal of behavioral control is to select policies that maximize reward and minimize punishment. To achieve this, animals are endowed with a flexible controller (typically referred to as instrumental) that learns choices on the basis of their contingent consequences. However, animals are also endowed with an additional controller (called a Pavlovian controller) which produces stereotyped, hard-wired behavioral responses to the occurrence of affectively important outcomes or to learned predictions of those outcomes (Dickinson and Balleine 2002). Two central forms of Pavlovian control are active approach and engagement given the prospect of reward, and inhibition and withdrawal given the prospect of punishment (Gray and McNaughton 2000). Thus, in Pavlovian control, vigor and valence are coupled, and this coupling can be a source of suboptimal behavior. Instrumental and Pavlovian controllers often prescribe the same policies, in a manner that can accelerate the expression of good performance. These are the most common circumstances encountered by animals and humans alike. Everyone knows that obtaining a reward normally requires some sort of overt behavioral response (go to win), from picking berries in the forest, to buying them in a shop, to going to a restaurant to eat them. Similarly, the most efficient way to avoid a punishment is to avoid those actions that may lead to it (no-go to avoid losing); it is better to keep off the road if you want to avoid being run over. However, when the Pavlovian and instrumental controllers are in opposition, behavioral output becomes suboptimal (Boureau and Dayan 2011; Breland and Breland 1961; Dayan et al. 2006). For example, if an unexpected car threatens a pedestrian crossing the street, it is not uncommon for the pedestrian to freeze (a highly suboptimal Pavlovian influence) before starting the appropriate running response (go to avoid losing). Similarly, a hunter will often need to remain completely still in the proximal presence of a potential prey (no-go to win), waiting for the optimal moment to act. Failure to remain inactive during this critical period (another highly suboptimal Pavlovian influence) results in the prey escaping and the omission of the potential reward.

An important source of influence on the coupling between action and valence may arise from monoaminergic neuromodulation (Boureau and Dayan 2011; Cools et al. 2011; Gray and McNaughton 2000). Dopamine is believed to generate active motivated behavior (Berridge and Robinson 1998; Niv et al. 2007; Salamone et al. 2007) and to support instrumental learning (Daw and Doya 2006; Frank et al. 2004; Wickens et al. 2007) through model-free reward prediction errors (Bayer and Glimcher 2005; Morris et al. 2006; Schultz et al. 1997). These joint roles of dopamine in action invigoration and model-free reward prediction error signalling resonate with the involvement of dopamine in Pavlovian behaviors observed in experimental animals (Flagel et al. 2011; Parkinson et al. 1999). On the other hand, the role of the serotonergic system is more debated, but it appears closely related to behavioral inhibition in aversive contexts (Crockett et al. 2009; Dayan and Huys 2009; Soubrie 1986). In order to manipulate action and valence orthogonally, we and others have designed go/no-go tasks that involve four different conditions: go to win, go to avoid losing, no-go to win, and no-go to avoid losing. These tasks have been used to show the involvement of serotonin in punishment-induced inhibition (Crockett et al. 2009) and of dopamine in the invigoration of actions that lead to reward (Guitart-Masip et al. 2012a).

However, the precise role played by these neuromodulators during learning has yet to be investigated. To explore these effects, we manipulated the dopaminergic and serotonergic systems during learning. Participants received placebo, levodopa, or citalopram. These pharmacological agents are assumed to affect postsynaptic levels of dopamine (Koller and Rueda 1998) and serotonin (Spinks and Spinks 2002), respectively. However, the balance of their influences on phasic and tonic aspects of these neuromodulators, and the anatomical location of their sites of action, are not clear. If the predominant effect were to enhance the coupling between action and valence typically associated with the Pavlovian control system, we would expect to see increased valence-specific Pavlovian interference with instrumental learning. Indeed, based on the bulk of the literature reviewed above, one would expect exactly that after levodopa administration. However, if the predominant effects of the drugs lay elsewhere, for instance, in the modulation of the contribution of prefrontal cortex to control (Hitchcott et al. 2007), then other effects might arise, such as a decrease in the extent of suboptimal behavior. Our results bear out the latter expectation. We found differential, but not opposing, roles for dopamine and serotonin in instrumental learning: boosting dopamine levels decreased the coupling between action and valence, while boosting serotonin resulted in a valence-independent decrease in behavioral inhibition.

Methods and materials

Subjects

Ninety healthy volunteers were recruited from a subject pool associated with University College London’s Psychology Department and completed the pharmacological experiment. They received full written instructions and provided written consent in accordance with the provisions of University College London Research Ethics Committee. Participants were randomly assigned to one of three treatment groups: 30 participants received levodopa (13 female; age range, 17 years; mean, 24.07, SD = 4.08 years), 30 participants received citalopram (17 female; age range, 15 years; mean, 23.31, SD = 3.77 years), and 30 participants received placebo (13 female; age range, 11 years; mean, 24.38, SD = 3.22 years). The study was double blind. All participants were right-handed and had normal or corrected-to-normal visual acuity. None of the participants reported a history of neurological, psychiatric, or any other current medical problems. Two participants were excluded (one from the placebo and one from the citalopram groups) because of deterministic performance. Two further participants did not complete the task, one because of technical problems and the other because of gastrointestinal side effects after receiving citalopram.

Experimental procedure for the drug study

Participants completed the task (see below) 60 min after receiving levodopa (150 mg + 37.5 mg benserazide; time to reach peak blood concentration after oral administration, 1–2 h) or 180 min after receiving citalopram (24 mg in drops, which is equivalent to 30 mg in tablet form; time to reach peak blood concentration after oral administration, 1–4 h). To ensure participants and investigators were blind to the treatment condition, each participant received one glass containing either citalopram or placebo. Two hours later, they received a second glass containing either placebo or levodopa and waited for another hour before engaging with the go/no-go learning task. Participants in the placebo group received a placebo on both occasions. Participants earned between £10 and £35, according to their performance in the current task. In addition, after performing the go/no-go task, participants engaged in an unrelated task and received between £5 and £20 for their participation in this second task. Participants completed a subjective state analogue-scales questionnaire on three occasions. We did not detect any difference in subjective ratings between treatment groups (data not shown).

Behavioral paradigm

We used the learning version of an experimental design that orthogonalizes action and valence (Guitart-Masip et al. 2012b). The trial timeline is displayed in Fig. 1. Each trial consisted of three events: a fractal cue, a target detection task, and a probabilistic outcome. At the beginning of each trial, one of four distinct fractal cues was presented, indicating whether the best choice in the subsequent target detection task was a go (emitting a button press to a target) or a no-go (withholding any response to a target). The fractal also signalled the valence of any outcome consequent on the subject’s behavior (reward/no reward or punishment/no punishment). The meaning of the fractal images (go to win; no-go to win; go to avoid losing; no-go to avoid losing) was randomized across participants. As in Guitart-Masip et al. (2012b), but unlike Guitart-Masip et al. (2011), subjects had to learn these contingencies by trial and error. Participants were instructed that the correct choice for each fractal image could be either go or no-go, and they were informed about the probabilistic nature of the task.

Fig. 1

Experimental paradigm. On each trial, one of four possible fractal images indicated the combination of action (making a button press on go trials or withholding a button press on no-go trials) and valence at outcome (win or lose). Actions were required in response to a circle that followed the fractal image after a variable delay. On go trials, subjects indicated via a button press on which side of the screen the circle appeared. On no-go trials, they withheld a response. After a brief delay, the outcome was presented: a green upward arrow indicated a win of £1 and a red downward arrow a loss of £1. A horizontal bar indicated the absence of a win or a loss. On go to win trials, a correct button press was rewarded; on go to avoid losing trials, a correct button press avoided punishment; on no-go to win trials, correctly withholding a button press led to reward; and on no-go to avoid losing trials, correctly withholding a button press avoided punishment

The target was a circle on one side of the screen and was displayed for 1,500 ms starting 250 to 2,000 ms after the offset of the fractal image. Based on the fractal image, participants had to decide whether (go) or not (no-go) to press the key to indicate the target location. A response was classified as a correct go choice if participants pressed the key corresponding to the correct side within 1,000 ms after target onset, and a no-go choice otherwise. At 1,000 ms following offset of the target, the outcome was displayed for 1,000 ms: A green upward arrow indicated a £1 win; a red downwards arrow indicated a £1 loss, and a yellow horizontal bar indicated no win or loss. The outcome was probabilistic: In win trials, 80 % of correct choices and 20 % of incorrect choices were rewarded (the remaining 20 % of correct and 80 % of incorrect choices led to no outcome); in lose trials, 80 % of correct choices and 20 % of incorrect choices avoided punishment.
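For concreteness, the probabilistic feedback rule can be summarized in a short sketch (Python; the function and argument names are ours and purely illustrative, not part of the original task code):

```python
import random

def outcome(valence, correct_choice):
    """Probabilistic feedback as described above (illustrative sketch).

    valence: 'win' or 'lose'; correct_choice: True if the subject made the
    correct go/no-go choice for the current fractal, False otherwise.
    Returns the outcome in pounds: +1 (win), 0 (nothing), or -1 (loss).
    """
    # Correct choices lead to the favourable outcome on 80 % of trials,
    # incorrect choices on only 20 % of trials.
    favourable = random.random() < (0.8 if correct_choice else 0.2)
    if valence == 'win':
        return 1 if favourable else 0
    else:  # 'lose' (avoid-losing) conditions
        return 0 if favourable else -1
```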

The task included 240 trials in total, i.e., 60 trials per condition. Before starting with the learning task, subjects performed 20 trials of the target detection task in order to get familiarized with the speed requirements.

Behavioral data analysis

The behavioral data were analyzed using the statistics software SPSS, version 16.0. The probability of correct choice in the target detection task (a correct button press in go conditions and a correct omission of a response in no-go conditions) was collapsed across time bins of ten trials per condition and analyzed with a mixed ANOVA with time bin, action (go/no-go), and valence (win/lose) as within-subject factors and treatment (levodopa, citalopram, and placebo) as a between-subjects factor. Greenhouse-Geisser correction was applied when the sphericity assumption was violated.

Reinforcement learning models

Following Guitart-Masip et al. (2012b), we built six nested models incorporating different instrumental and Pavlovian reinforcement-learning hypotheses and fit these to the observed behavioral data. All models assigned probabilities to each action $a_t$ (here, go or no-go) on each trial $t$. These probabilities were based on action propensities $w(a_t, s_t)$ that depended on the stimulus on that trial and that were passed through a squashed sigmoid function (Sutton and Barto 1998):

$$ p\left({a}_t|{s}_t\right)=\left[\frac{\exp \left(w\left({a}_t,{s}_t\right)\right)}{\sum_{a^{\prime}}\exp \left(w\left(a^{\prime},{s}_t\right)\right)}\right]\left(1-\xi \right)+\frac{\xi }{2} $$
(1)
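In code, Eq. 1 amounts to a softmax over the action propensities mixed with a uniform lapse term. A minimal sketch (Python; the function and variable names are ours):

```python
import numpy as np

def choice_probabilities(w, xi):
    """Eq. 1: softmax over the action propensities w (one entry per action,
    here go and no-go), mixed with the irreducible noise xi in [0, 1]."""
    w = np.asarray(w, dtype=float)
    p = np.exp(w - np.max(w))        # subtract the max for numerical stability
    p = p / p.sum()
    return p * (1.0 - xi) + xi / 2.0
```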

The models differed in the construction of the action propensities and in the value of the irreducible noise ξ. They allowed us to test the following hypotheses: that behavior was purely instrumental or also included a Pavlovian component (‘Pav’), which captures the critical coupling between affect and effect; that there were or were not asymmetries between the subjects’ sensitivities to reward versus punishment (rew/pun); that subjects had an intrinsic propensity to go versus no-go (bias), or to repeat or avoid their previous choice (stick); and that there was or was not irreducible stochasticity (or trembling) in their behavior (noise).

More completely, ξ was kept at 0 for one of the models (RW) but was free to vary between 0 and 1 for all other models. For models RW and RW + noise, w(a,s) = Q(a,s) was based on a simple Rescorla-Wagner or delta rule update equation:

$$ {Q}_t\left({a}_t,{s}_t\right)={Q}_{t-1}\left({a}_t,{s}_t\right)+\varepsilon \left(\rho {r}_t-{Q}_{t-1}\left({a}_t,{s}_t\right)\right) $$
(2)

where ε is the learning rate. Reinforcements enter the equation through $r_t \in \{-1, 0, 1\}$, and ρ is a free parameter that determined the effective size of reinforcements. For some models (RW, RW + noise, and RW + noise + bias), there was only one value of ρ per subject, meaning that those models assumed the loss of a reward to be equally as aversive as obtaining a punishment. Other models (RW(rew/pun) + noise + bias, RW(rew/pun) + noise + bias + Pav, and RW(rew/pun) + noise + bias + Pav + stick) included different sensitivities to reward and punishment, allowing different values of the parameter ρ on reward and punishment trials and thus not assuming that the loss of a reward was equally as aversive as obtaining a punishment.
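A minimal sketch of the update in Eq. 2, assuming the action values for the current stimulus are stored as a small array over the two actions (the function name is ours):

```python
def rescorla_wagner_update(q, action, r, epsilon, rho):
    """Eq. 2: delta-rule update of the instrumental value Q(a_t, s_t).

    q      : list of action values for the current stimulus (go, no-go)
    action : index of the chosen action
    r      : reinforcement r_t in {-1, 0, 1}
    epsilon: learning rate; rho: reinforcement sensitivity
    """
    q[action] += epsilon * (rho * r - q[action])
    return q
```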

Further models added extra factors to the action propensities. For models that contained a bias parameter, the action weight was modified to include a static bias parameter b:

$$ {w}_t\left(a,s\right)=\left\{\begin{array}{l}{Q}_t\left(a,s\right)+b\kern1em \mathrm{if}\kern0.5em a=\mathrm{go}\hfill \\ {}{Q}_t\left(a,s\right)\kern4em \mathrm{else}\hfill \end{array}\right. $$
(3)

For the model including a Pavlovian factor (RW(rew/pun) + noise + bias + Pav), the action weight consisted of three components:

$$ {w}_{\mathrm{t}}\left(a,s\right)=\left\{\begin{array}{l}{Q}_t\left(a,s\right)+b+\pi {V}_t(s)\kern1em \mathrm{if}\kern0.5em a=\mathrm{go}\hfill \\ {}{Q}_t\left(a,s\right)\kern8em \mathrm{else}\hfill \end{array}\right. $$
(4)
$$ {V}_t\left({s}_t\right)={V}_{t-1}\left({s}_t\right)+\varepsilon \left(\rho {r}_t-{V}_{t-1}\left({s}_t\right)\right) $$
(5)

where π was again a free parameter. Thus, in the “avoid losing” conditions, in which V(s) would be non-positive, the Pavlovian parameter inhibited the go tendency in proportion to the negative value V(s) of the stimulus, while it similarly promoted the tendency to go in the “win” conditions.
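The action weights of Eqs. 3–5 can be sketched as follows; setting π = 0 recovers the bias-only weight of Eq. 3, and in the models with separate sensitivities ρ would simply take different values on win and avoid-losing trials (Python; all names are ours):

```python
GO, NOGO = 0, 1  # action indices

def action_weights(q, b, pi, v):
    """Eqs. 3-4: the static go bias b and the Pavlovian term pi * V(s) are
    added to the propensity of the go action only, so positive state values
    promote go and negative state values suppress it."""
    w = list(q)
    w[GO] += b + pi * v
    return w

def update_state_value(v, r, epsilon, rho):
    """Eq. 5: the state value V(s_t) is learned with the same delta rule,
    irrespective of which action was taken."""
    return v + epsilon * (rho * r - v)
```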

For the model including stickiness (RW(rew/pun) + noise + bias + Pav + stick), the action weight consisted of four components:

$$ {w}_t\left(a,s\right)=\left\{\begin{array}{l}{Q}_t\left(a,s\right)+b+\pi {V}_t(s)+c{\chi}_{a=a\left(t-1\right)}\kern1em \mathrm{if}\kern0.5em a=\mathrm{go}\hfill \\ {}{Q}_t\left(a,s\right)+c{\chi}_{a=a\left(t-1\right)}\kern8em \mathrm{else}\hfill \end{array}\right. $$
(6)

where c is a free parameter that boosts or suppresses the action performed on the previous trial. This component was added because subjects often show a tendency either to repeat or to avoid repeating the same action on consecutive trials (Lau and Glimcher 2005; Schoenberg et al. 2007; Rutledge et al. 2010), and because dietary tryptophan depletion results in increased value-independent choice perseveration (Seymour et al. 2012).
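Extending the previous sketch, the perseveration term of Eq. 6 adds the bonus c to whichever action was chosen on the previous trial (again, the names are ours):

```python
GO, NOGO = 0, 1  # action indices

def action_weights_stick(q, b, pi, v, c, prev_action):
    """Eq. 6: go bias, Pavlovian term, and a perseveration bonus c that
    boosts (c > 0) or suppresses (c < 0) the previously chosen action."""
    w = list(q)
    w[GO] += b + pi * v
    if prev_action is not None:      # no perseveration term on the first trial
        w[prev_action] += c
    return w
```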

As in previous reports (Guitart-Masip et al. 2012a; Huys et al. 2011), we used a hierarchical Type II Bayesian (or random effects) procedure using maximum likelihood to fit simple parameterized distributions for higher-level statistics of the parameters. Since the values of parameters for each subject are “hidden”, this employs the expectation–maximization procedure. On each iteration, the posterior distribution over the group for each parameter is used to specify the prior over the individual parameter fits on the next iteration. For each parameter, we used a single distribution for all participants. Therefore, the fitting procedure was blind to the existence of different treatment groups with putatively different parameter values. Before inference, all parameters except the action bias were suitably transformed to enforce constraints (log and inverse sigmoid transforms).
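A heavily simplified sketch of such an expectation–maximization scheme is given below. Unlike the full procedure of Huys et al. (2011), which also propagates each subject's posterior uncertainty into the group-level update, this version uses only the per-subject MAP estimates; all function and variable names are ours:

```python
import numpy as np
from scipy.optimize import minimize

def fit_hierarchical_em(neg_log_lik, data_per_subject, n_params, n_iter=30):
    """neg_log_lik(theta, data) -> -log p(choices | theta) for one subject,
    with theta already in the unconstrained (transformed) space."""
    mu = np.zeros(n_params)            # group-level prior mean
    sigma2 = np.full(n_params, 10.0)   # group-level prior variance (broad start)
    for _ in range(n_iter):
        # E-step: MAP estimate per subject under the current Gaussian prior
        maps = []
        for data in data_per_subject:
            penalised = lambda th: (neg_log_lik(th, data)
                                    + 0.5 * np.sum((th - mu) ** 2 / sigma2))
            maps.append(minimize(penalised, x0=mu, method="L-BFGS-B").x)
        maps = np.asarray(maps)
        # M-step: re-estimate the group prior from the individual fits
        mu, sigma2 = maps.mean(axis=0), maps.var(axis=0) + 1e-6
    return mu, sigma2, maps
```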

Models were compared using the integrated Bayesian Information Criterion (iBIC), where small iBIC values indicate a model that fits the data better after penalizing for the number of parameters. The iBIC is not the sum of individual likelihoods, but the integral of the likelihood function over the individual parameters (for details, see Huys et al. 2011). Comparing iBIC values is akin to a likelihood ratio test (Kass and Raftery 1995). The model fitting and selection procedures were verified on surrogate data generated from a known decision process (Electronic supplementary material figures 1 and 2).
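One way to approximate such an integrated BIC is by Monte Carlo, sampling individual parameters from the fitted group-level prior; the following sketch reflects our reading of that logic rather than the authors' actual code (names are ours):

```python
import numpy as np

def approximate_ibic(log_lik, data_per_subject, mu, sigma2, n_samples=2000, seed=0):
    """log_lik(theta, data) -> log p(choices | theta) for one subject.
    Each subject's likelihood is integrated over the fitted group prior by
    sampling; the penalty counts the group-level parameters (a mean and a
    variance per model parameter) against the total number of choices."""
    rng = np.random.default_rng(seed)
    n_params = len(mu)
    log_evidence, n_obs = 0.0, 0
    for data in data_per_subject:
        thetas = rng.normal(mu, np.sqrt(sigma2), size=(n_samples, n_params))
        ll = np.array([log_lik(th, data) for th in thetas])
        # log of the average likelihood over prior samples (log-sum-exp)
        log_evidence += np.logaddexp.reduce(ll) - np.log(n_samples)
        n_obs += len(data)
    return -2.0 * log_evidence + 2 * n_params * np.log(n_obs)
```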

The parameters of the winning model were compared across treatment groups using a one-way ANOVA when they were normally distributed (the sensitivity to reward and the action bias) and the Kruskal–Wallis test when they were not. Normality was assessed by means of the Kolmogorov–Smirnov test. Independent-sample t tests or Mann–Whitney U tests were used as post hoc tests when appropriate.
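Assuming per-group arrays of per-subject parameter estimates, these comparisons could be run along the following lines with SciPy (an illustrative sketch; the function name and the particular post hoc contrasts shown are ours):

```python
import numpy as np
from scipy import stats

def compare_parameter(placebo, levodopa, citalopram, alpha=0.05):
    """Omnibus test across the three treatment groups, choosing the test
    family according to a Kolmogorov-Smirnov check of normality."""
    pooled = np.concatenate([placebo, levodopa, citalopram])
    _, p_norm = stats.kstest(pooled, 'norm', args=(pooled.mean(), pooled.std()))
    if p_norm > alpha:                 # approximately normal -> parametric tests
        omnibus = stats.f_oneway(placebo, levodopa, citalopram)
        post_hoc = {'placebo vs levodopa': stats.ttest_ind(placebo, levodopa),
                    'placebo vs citalopram': stats.ttest_ind(placebo, citalopram)}
    else:                              # otherwise -> rank-based tests
        omnibus = stats.kruskal(placebo, levodopa, citalopram)
        post_hoc = {'placebo vs levodopa': stats.mannwhitneyu(placebo, levodopa),
                    'placebo vs citalopram': stats.mannwhitneyu(placebo, citalopram)}
    return omnibus, post_hoc
```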

Results

Levodopa and citalopram differentially impact on the effects of reward and punishment on go and no-go choices

A mixed ANOVA with time bins, action (go/no-go), and valence (win/lose) as within-subject factors, and treatment (levodopa, citalopram, and placebo) as a between-subjects factor revealed two key patterns across all participants, as previously reported (Cavanagh et al. 2013; Guitart-Masip et al. 2012a, b). First, overall performance across the entire experiment was better in the go to win condition compared with the go to avoid losing condition and in the no-go to avoid losing condition compared with the no-go to win condition (see Table 1). This resulted in a significant action by valence interaction (F(1,85) = 69.29, p < 0.001), which is consistent with a Pavlovian process linking action to valence. Second, participants showed overall better performance in go compared with no-go conditions, reflected in a main effect of action (F(1,85) = 64.17, p < 0.001).

Table 1 Raw overall behavioral performance

These effects were modulated by the pharmacological treatments. First, there was a significant action by valence by treatment interaction (F(2,85) = 3.82, p = 0.026), which was driven by the levodopa group. Levodopa decreased the difference in overall performance between the go to win and the go to avoid losing conditions and between the no-go to avoid losing and the no-go to win conditions (see Fig. 2a). Second, we also observed a trend toward a treatment by action interaction (F(2,85) = 2.7, p = 0.073), driven by an enhanced main effect of action in the citalopram group (see Fig. 2b). Interestingly, the two drug treatments differed significantly only in the no-go to win condition, in which participants who received levodopa showed higher performance than participants who received citalopram (t(57) = 2.34; p = 0.023). This is a key result, as it allows us to distinguish a decoupling of action and valence from a facilitation of go responses regardless of valence. Whereas a decoupling of action and valence is associated with simultaneous facilitation of the go to avoid losing and the no-go to win conditions, a facilitation of go responses regardless of valence is associated with facilitation of the go to avoid losing condition but impairment of the no-go to win condition.

Fig. 2

Effects of levodopa and citalopram on choice performance. a Mean (±SEM) difference in proportion of correct trials between go to win and go to avoid losing (left) and between no-go to avoid losing and no-go to win (right). These two difference scores represent the two terms of the interaction between action and valence in choice accuracy. Green represents the difference scores for the placebo group, blue for the levodopa group, and red for the citalopram group. Levodopa decreased the disadvantage of learning go to avoid losing when compared with go to win observed in the placebo group. Levodopa also decreased the disadvantage of learning no-go to win when compared with no-go to avoid losing observed both in the placebo and the citalopram groups. Post hoc comparisons were implemented by means of t test: *p < 0.05. b Mean (±SEM) difference in proportion of correct trials between go and no-go conditions, that is, the main effect of action. Green represents the difference scores for the placebo group, blue for the levodopa group, and red for the citalopram group. Citalopram increased the advantage of learning the go conditions when compared with the no-go conditions observed in the levodopa and the placebo groups. Post hoc comparisons were implemented by means of t test: *p < 0.05

Finally, we found a main effect of time (F(2.9,242.3) = 199.1, p < 0.001) in the absence of any treatment × time interaction (p > 0.05). The learning curves for each trial type in each treatment group are available in the supplementary material. These drug effects are unlikely to reflect nonspecific arousal effects because we did not find any difference in subjective ratings between treatment groups.

Effects of drugs on model parameters

We examined these effects in more detail using reinforcement-learning models to parameterize a fine-grained account of the interaction between action and valence while participants learnt the reward structure of the environment. We built a nested collection of models incorporating different instrumental and Pavlovian reinforcement-learning hypotheses, which have been discussed in detail previously (Guitart-Masip et al. 2012b). In brief, the base model (RW) is purely instrumental, learning action values independently of outcome valence using the Rescorla-Wagner rule. This model was augmented in successive steps: RW + noise adds irreducible choice noise to the instrumental system; RW + noise + bias further includes a value-independent action bias that promotes or suppresses go choices equally in all conditions; RW(rew/pun) + noise + bias gives the instrumental system separate reward and punishment sensitivities, allowing the loss of a reward not to be equally as aversive as obtaining a punishment; RW(rew/pun) + noise + bias + Pav further includes a (Pavlovian) parameter that adds a fraction of the state value to the action values learned by the instrumental system, thus effectively coupling action and valence during learning; finally, the last model, RW(rew/pun) + noise + bias + Pav + stick, includes a value-independent perseveration parameter that boosts or suppresses the action performed on the previous trial.

The most parsimonious model turned out to be RW(rew/pun) + noise + bias + Pav. Critically, this model includes a Pavlovian bias parameter that increased the probability of go choices in proportion to the overall (action-independent) state value of each stimulus. This Pavlovian bias parameter thus increased the probability of go choices when the state values were positive (winning conditions) and decreased it when the state values were negative (avoid losing conditions). The fact that the winning model included a Pavlovian component demonstrates that the observed learning behavior is best characterized by including a component that couples action and valence. Previous incarnations of the learning version of the task have also implicated this model (Cavanagh et al. 2013; Guitart-Masip et al. 2012b), except that here including separate reward and punishment sensitivity parameters improved the model fit independently of the Pavlovian parameter (see Table 2). We did not consider this possibility in our previous report, where the winning model included only a single reinforcement sensitivity parameter (Guitart-Masip et al. 2012b).

Table 2 Model comparison

Having identified that the model best characterizing the observed learning asymmetries comprises an instrumental learning system with irreducible noise and a value-independent action bias, along with a Pavlovian system that effectively couples action and valence during learning, we examined whether the pharmacological manipulations had any effect on the parameters of this model. For each parameter of the winning model, the median and the 25th and 75th posterior percentiles across the whole sample are displayed in Table 3. We detected a difference between treatment groups on the Pavlovian parameter (Kruskal–Wallis test, χ²(2) = 6.5, p = 0.039) and the bias parameter (one-way ANOVA, F(2,85) = 3.94; p = 0.023). As shown in Fig. 3a, levodopa decreased the Pavlovian parameter compared with placebo (Mann–Whitney U test, Z = 2.14, p = 0.033) and with citalopram (Mann–Whitney U test, Z = 2.2, p = 0.028). On the other hand, citalopram increased the action bias parameter compared with placebo (Fig. 3b, t test, t(56) = 2.62, p = 0.011) and with levodopa (t test, t(57) = 2.27, p = 0.027). The effects of citalopram were not related to changes in stickiness.

Table 3 Parameters of the winning model
Fig. 3

Effects of levodopa and citalopram on model parameters. a Maximum a posteriori (MAP) median parameter estimates of the best model for the Pavlovian parameter. Green represents the placebo group, blue the levodopa group, and red the citalopram group. Levodopa decreased the Pavlovian parameter when compared with placebo and citalopram. Post hoc comparisons were implemented by means of Mann–Whitney U test: *p < 0.05. b Maximum a posteriori (MAP) median parameter estimates of the best model for the action bias parameter. Green represents the placebo group, blue the levodopa group, and red the citalopram group. Citalopram increased the action bias parameter when compared with placebo and levodopa. Post hoc comparisons were implemented by means of t test: *p < 0.05

A recent study showed that dietary tryptophan depletion increased a value-independent choice perseveration or stickiness (Seymour et al. 2012). To rule out the possibility that the effect of citalopram is explained by increased stickiness, we compared the action bias and the stickiness posterior parameter estimates for the model including the stickiness parameter (despite the fact that this model did not provide a better account of the data). We did not find any significant difference in the stickiness parameter between the placebo and the citalopram group (Mann–Whitney U test Z = 0.63, p = 0.53), whereas the difference in action bias parameter remained significant (t(56) = 2.32; p = 0.024).

Discussion

The current data reveal differential, but not opponent, effects of levodopa (l-DOPA) and citalopram on instrumental control, which we assume arise from effects on dopamine and serotonin, respectively. As in previous experiments, we detected a striking asymmetry during instrumental learning in the placebo group, whereby participants learnt better to emit a behavioral response in anticipation of reward and to withhold a response in anticipation of punishment. This asymmetry was attenuated after administration of levodopa, and our computational analysis indicated that this attenuation was mediated by a decreased influence of a Pavlovian controller that corrupts instrumental control. Conversely, administration of citalopram increased the propensity to perform go choices regardless of outcome valence, as reflected in an increased magnitude of the value-independent action bias parameter.

A wealth of studies suggests at least three different roles dopamine might play in guiding behavior. Two of these roles relate to value/action learning via reward prediction errors (Bayer and Glimcher 2005; Montague et al. 1996; Morris et al. 2006; Schultz et al. 1997) and action invigoration via phasic (Satoh et al. 2003) and tonic release (Salamone et al. 1994) in Pavlovian (Lex and Hauber 2008; Parkinson et al. 2002) and instrumental (Dayan 2012; Guitart-Masip et al. 2012a; Niv et al. 2007) contexts. Both are tied to dopamine’s effects in ventral and dorsal striatum, for example, through influencing the balance between go-related direct and no-go-related indirect pathways (Frank et al. 2004; Wickens et al. 2007).

The role of dopamine in learning provides a plausible mechanism for the acquisition of active responses through positive reinforcement and of passive responses through punishment. According to a prevalent view in reinforcement learning and decision making, dopamine neurons signal reward prediction errors (Bayer and Glimcher 2005; Montague et al. 1996; Schultz et al. 1997) in the form of phasic bursts for positive prediction errors and dips below baseline for negative prediction errors (Bayer et al. 2007), broadcast to target structures including the striatum (McClure et al. 2003; O’Doherty et al. 2003, 2004; Pessiglione et al. 2006). In the striatum, increases in dopamine when an unexpected reward is obtained reinforce the direct pathway and generate go choices, while dips in dopamine levels when an unexpected punishment is obtained reinforce the indirect pathway and generate no-go choices (Frank et al. 2007; Frank et al. 2004; Hikida et al. 2010; Wickens et al. 2007). However, this framework provides no clear mechanism for learning to go to avoid losing or to no-go to win. For this reason, we have argued that a coupling between action and valence within the corticostriatal system could underlie the strong Pavlovian influences on instrumental learning observed in our task (Guitart-Masip et al. 2012b). However, if l-DOPA, through its impact on dopamine, mediated this function, then we would expect increased asymmetries in task performance, rather than the decrease in asymmetric learning that we observed.

It is known that dopamine depletion results in decreased motor activity and decreased motivated behavior (Palmiter 2008; Ungerstedt 1971), along with decreased vigor or motivation to work for rewards in demanding reinforcement schedules (Niv et al. 2007; Salamone et al. 2005). Conversely, boosting dopamine levels with levodopa invigorates motor responding in healthy humans (Guitart-Masip et al. 2012a), possibly by increasing the invigorating effect exerted by the average reward rate on response times (Beierholm et al. 2013). Additionally, enhancing dopamine in the nucleus accumbens increases vigor in appetitive Pavlovian-instrumental transfer (Lex and Hauber 2008; Parkinson et al. 2002) and invigorates appetitive instrumental actions (Taylor and Robbins 1984, 1986). However, if this were the predominant effect of levodopa, then we would have expected increased action biases and/or Pavlovian influences, which in fact did not arise.

A third potential role for dopamine arises from its influence on the balance between different sorts of control (Hitchcott et al. 2007). This function can be achieved, for instance, by facilitating the operation of prefrontal processes such as working memory or rule learning (Clatworthy et al. 2009; Cools and D’Esposito 2011; Mehta et al. 2005; Williams and Goldman-Rakic 1995), perhaps reducing the error in their outputs and thereby increasing their influence on behavior (Daw et al. 2005). An alternative mechanism by which dopamine may arbitrate between different sorts of control is through recruitment of the prefrontal–subthalamic nucleus pathway (Aron and Poldrack 2006; Fleming et al. 2010) to raise a decision threshold within the basal ganglia and thereby prevent execution of a biased decision computed in the striatum (Cavanagh et al. 2011; Frank 2006; Zaghloul et al. 2012).

The involvement of a prefrontal mechanism in overcoming Pavlovian interference is supported by recent evidence that theta power over midline frontal sensors exerts a moderating effect on Pavlovian influences on trials where Pavlovian and instrumental control conflict (Cavanagh et al. 2013). Furthermore, successfully learning to perform the no-go conditions in our task is known to involve recruitment of the inferior frontal gyrus (Guitart-Masip et al. 2012b). A related finding is the observation that levodopa increases the degree to which young healthy participants employ model-based, as opposed to model-free, control in a two-step choice task (Wunderlich et al. 2012), a task sensitive to manipulations of working memory load (Otto et al. 2013). Concomitantly, depleting dopamine can boost model-free control (de Wit et al. 2012). Thus, the effects we observe may in part depend on dopamine’s actions on functions implemented in prefrontal cortex (Hitchcott et al. 2007). However, future imaging experiments are required to localize the anatomical site and component processes that account for the effects of levodopa seen in the context of the current task.

Two previous experiments from our laboratory also suggest a role for dopamine in decoupling the instrumental and Pavlovian learning systems in the current task, although they do not pin down the site of its action. In one, older adults showed the same asymmetric association between action and valence that we report in younger adults. Furthermore, in older adults, we also found that the integrity of the substantia nigra/ventral tegmental area, as measured with structural magnetic resonance imaging, is positively correlated with performance in the no-go to win condition, the condition with the highest Pavlovian–instrumental conflict and the worst performance (Chowdhury et al. 2013). In the other, we studied the effects of levodopa on striatal BOLD signal in subjects who had explicitly been taught the contingencies of the task rather than having to learn them for themselves. In this context, striatal BOLD responses are dominated by action requirements (go > no-go) rather than valence (Guitart-Masip et al. 2011). However, after administering levodopa, there was a decreased BOLD response in the no-go to win condition along with increased BOLD in the go to win case (Guitart-Masip et al. 2012a). This neuronal effect may be a homologue of the decrease in Pavlovian bias observed in the current study after administration of levodopa, and it suggests yet another mechanism by which a supposedly prefrontal effect of levodopa may decrease the Pavlovian bias, namely by modulation of model-free representations of prediction errors at a subcortical level (Daw et al. 2011; Doll et al. 2011).

There is good evidence of a role for serotonin in inhibition (Soubrie 1986), with serotonin depletion in rats impairing their ability to withhold action in a symmetrically rewarded go/no-go task (Harrison et al. 1999) and increasing the number of premature responses in the five-choice serial reaction time task (Carli and Samanin 2000; Harrison et al. 1997). Furthermore, selectively inhibiting serotonin neurons and preventing serotonin increases in the prefrontal cortex abolishes the ability of rats to wait for rewards over a long delay (Miyazaki et al. 2012). Nevertheless, the involvement of serotonin in behavioral inhibition is typically complicated (Cools et al. 2011; Drueke et al. 2010). A previous study found that dietary tryptophan depletion abolishes punishment-induced inhibition (as measured with reaction times), akin to the disadvantage of performing a go response in the avoid losing condition when compared with the winning condition (Crockett et al. 2009), and a recent follow-up study suggests that this effect is driven by a Pavlovian mechanism (Crockett et al. 2012).

By themselves, these results depict a complex picture without any clear expectation about the effects of citalopram. Citalopram is a selective serotonin reuptake inhibitor whose direct effect is a local increase in serotonin availability. However, acute citalopram administration results in decreased total postsynaptic serotonin availability, at least at the cortical level (Selvaraj et al. 2012), possibly through a presynaptic inhibitory mechanism (Artigas et al. 1996; Hajos et al. 1995). Citalopram is likely to have weaker effects than dietary tryptophan depletion, and the study suggesting a Pavlovian source for punishment-induced inhibition (Crockett et al. 2012) is most closely analogous to our steady-state, instructed study, in which we did not observe an effect of citalopram (Guitart-Masip et al. 2012a).

Our data suggest that the effects of citalopram were confined to a valence-independent reduction of behavioral inhibition, with no apparent modulation of the strength of the Pavlovian parameter. Given the lack of effect on learning, one might have thought that citalopram, as in Guitart-Masip et al. (2012a), would have had no effect at all. However, two functional differences between the taught and learnt versions of our task are worth noting. First, subjects in the learning version have to overcome a value-independent action bias, which may have arisen because participants performed 20 target detection trials to become familiar with the speed requirements before embarking on the learning task; second, as a result of this, they choose go more frequently, rendering themselves more susceptible to the effects of perseveration. Our computational model clearly shows that the effects of citalopram are captured by an increase in action bias, which may explain why we only find an effect of citalopram when the task involves the requirement to overcome an action bias.

The current data again highlight the importance of orthogonally manipulating action requirements and outcome valence if one wants to reveal the full complexity of the roles played by dopamine and serotonin in instrumental learning. We found that boosting dopamine via levodopa decreases the pervasive coupling between Pavlovian and instrumental control systems. On the other hand, our data reveal a differential, but not opponent, effect of citalopram, which reduced motor inhibition through its manipulation of serotonin. Overall, the data speak to the need for a wider panoply of methods for manipulating dopamine and serotonin in human subjects, allowing the more fine-grained range of effects evident in more pharmacologically and spatially restricted studies in animals to be examined.