Model-based learning protects against forming habits

Studies in humans and rodents have suggested that behavior can at times be “goal-directed” (that is, planned and purposeful) and at times “habitual” (that is, inflexible and automatically evoked by stimuli). This distinction is central to conceptions of pathological compulsion, as in drug abuse and obsessive-compulsive disorder. Evidence for the distinction has primarily come from outcome devaluation studies, in which the sensitivity of a previously learned behavior to motivational change is used to assay the dominance of habits versus goal-directed actions. However, little is known about how habits and goal-directed control arise. Specifically, in the present study we sought to reveal the trial-by-trial dynamics of instrumental learning that would promote, and protect against, developing habits. In two complementary experiments with independent samples, participants completed a sequential decision task that dissociated two computational-learning mechanisms, model-based and model-free. We then tested for habits by devaluing one of the rewards that had reinforced behavior. In each case, we found that individual differences in model-based learning predicted the participants’ subsequent sensitivity to outcome devaluation, suggesting that an associative mechanism underlies a bias toward habit formation in healthy individuals. Electronic supplementary material: The online version of this article (doi:10.3758/s13415-015-0347-6) contains supplementary material, which is available to authorized users.

gold and silver states were learned independently of one another, as was the structure of the task. The model uses an eligibility trace to propagate second-stage reward information to the first-stage values. Specifically, at the end of each trial, the first-stage values are updated using this eligibility trace, where λ is the trace decay parameter (Sutton & Barto, 1998). We assume that eligibility traces are reset to 0 between episodes (i.e., that eligibility does not carry over from trial to trial). Additionally, at the end of each trial, we decayed the Q values for all of the non-selected actions by multiplying them by 1 − α (Ito & Doya, 2009; Lau & Glimcher, 2005). This decay makes the present model correspond more closely to the one-trial-back regression model described in the main text, in the limit as α → 1.
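
The update described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the function name, the array layout (two first-stage states by two actions, four second-stage states), and the exact scope of the decay (every non-selected first-stage entry and every unvisited second-stage state) are assumptions made for the example. The update form follows the standard SARSA(λ) treatment of a two-stage trial.

```python
import numpy as np

def model_free_update(q1, q2, s1, a1, s2, reward, alpha, lam):
    """Apply one trial of a SARSA(lambda)-style update; returns new copies.

    q1: array [n_first_states, n_actions] of first-stage values (assumed layout)
    q2: array [n_second_states] of second-stage values (assumed layout)
    """
    q1 = np.array(q1, dtype=float)  # np.array copies, so inputs are untouched
    q2 = np.array(q2, dtype=float)

    # First-stage prediction error: no reward is delivered at the first stage,
    # so the target is the value of the second-stage state that was reached.
    delta1 = q2[s2] - q1[s1, a1]
    # Second-stage prediction error: the trial ends after the reward.
    delta2 = reward - q2[s2]

    # The eligibility trace passes the second-stage error back to the
    # first-stage value, scaled by lambda.
    q1[s1, a1] += alpha * delta1 + alpha * lam * delta2
    q2[s2] += alpha * delta2

    # Decay everything that was not selected on this trial by (1 - alpha).
    # Which entries count as "non-selected" is an assumption of this sketch.
    unchosen = np.ones(q1.shape, dtype=bool)
    unchosen[s1, a1] = False
    q1[unchosen] *= 1.0 - alpha
    unvisited = np.ones(q2.shape, dtype=bool)
    unvisited[s2] = False
    q2[unvisited] *= 1.0 - alpha
    return q1, q2
```

With λ = 0 the first-stage value learns only from the second-stage value estimate; with λ = 1 the reward prediction error reaches the first-stage value in full on the same trial.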

Model-based component
In general, a model-based RL algorithm works by learning a transition function (mapping state-action pairs to a probability distribution over the subsequent state) and immediate reward values for each state, then computing cumulative state-action values by iterative expectation over these. Specialized to the structure of the current task, this amounts to, first, deciding which first-stage action maps to which second-stage state (because subjects were instructed that this was the structure of the transition contingencies) and, second, learning reward values for each of the second-stage states.
At the second stage (where immediate rewards were offered), the problem of learning immediate rewards is equivalent to that for TD above, because $Q_{MF}(s_{2t})$ is just an estimate of the immediate reward $r_t$; with no further stages to anticipate, the SARSA learning rule reduces to a delta rule for predicting the immediate reward. Thus, the two approaches coincide at the second stage, and we define $Q_{MB} = Q_{MF}$ at those states. Critically, the top-level model-based values are defined from both the transition and reward estimates using the Bellman equation (Bellman, 1957), where we have assumed that these are recomputed on each trial from the current estimates of the transition probabilities and rewards.
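
As an illustration of the Bellman step, the first-stage model-based values can be computed as the expectation of the second-stage values under the transition estimates. This is a sketch under stated assumptions: the function name and the array layout are hypothetical, and the 0.7/0.3 transition probabilities in the usage lines are taken from the task instructions rather than from the fitted model.

```python
import numpy as np

def model_based_values(transitions, q2):
    """Expected second-stage value for each first-stage (state, action) pair.

    transitions: array [n_first_states, n_actions, n_second_states], where
                 transitions[s, a, s2] estimates P(s2 | s, a) (assumed layout)
    q2:          array [n_second_states] of second-stage value estimates
    """
    transitions = np.asarray(transitions, dtype=float)
    q2 = np.asarray(q2, dtype=float)
    # Sum over second-stage states: Q_MB(s, a) = sum_s2 P(s2 | s, a) * Q(s2)
    return np.einsum("sap,p->sa", transitions, q2)

# Usage: one first-stage state whose two actions lead to boxes 0 and 1
# with the common/rare probabilities described to participants.
example_transitions = np.zeros((2, 2, 4))
example_transitions[0, 0] = [0.7, 0.3, 0.0, 0.0]
example_transitions[0, 1] = [0.3, 0.7, 0.0, 0.0]
example_q2 = np.array([1.0, 0.0, 0.5, 0.5])
example_q_mb = model_based_values(example_transitions, example_q2)
```

Because the values are recomputed from the current estimates on every trial, a change in a second-stage value affects the first-stage values immediately, without further experience of the first-stage action.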

Choice rule
Finally, to connect the values to choices, we use a softmax choice rule, which assigns a probability to each action according to a weighted combination of the two state-action value estimates, $Q_{MB}$ and $Q_{MF}$, weighted according to a free parameter w: $Q_{net}(s, a) = w\,Q_{MB}(s, a) + (1 - w)\,Q_{MF}(s, a)$. Choice is then softmax in the net state-action values.
The probability of each choice at the first stage is calculated accordingly from these net values, where the inverse temperature parameter β governs the stochasticity of choices. The indicator function rep(a) is defined as 1 if a is the same action as was chosen on the previous trial, and 0 otherwise. Together with the "stickiness" parameter p, this captures first-order perseveration (p > 0) or switching (p < 0) in the first-stage choices (Lau & Glimcher, 2005).
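
The choice rule can be sketched as follows. This is an illustrative version only: the function name is hypothetical, and placing the stickiness term p · rep(a) inside the term scaled by β is an assumption of this sketch (it follows the form used by Daw et al., 2011).

```python
import numpy as np

def first_stage_choice_probs(q_mb, q_mf, w, beta, p, prev_action=None):
    """Softmax choice probabilities over first-stage actions.

    q_mb, q_mf:  1-D arrays of model-based and model-free values per action
    w:           weight on the model-based values, in [0, 1]
    beta:        inverse temperature
    p:           stickiness (p > 0 perseveration, p < 0 switching)
    prev_action: index of the previous first-stage choice, or None
    """
    q_mb = np.asarray(q_mb, dtype=float)
    q_mf = np.asarray(q_mf, dtype=float)
    q_net = w * q_mb + (1.0 - w) * q_mf

    # rep(a) is 1 for the action repeated from the previous trial, else 0.
    rep = np.zeros_like(q_net)
    if prev_action is not None:
        rep[prev_action] = 1.0

    # Assumed placement of the stickiness bonus inside the beta scaling.
    logits = beta * (q_net + p * rep)
    logits -= logits.max()  # subtract the max for numerical stability
    expd = np.exp(logits)
    return expd / expd.sum()
```

For example, with equal net values and p = 0 the two actions are equally likely, and a positive p shifts probability toward the previously chosen action.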

Group-Level Modeling
Thus far we have described the modeling of a single subject's data. To estimate the model for all subjects simultaneously, we embedded it within a multi-level random effects model of the population variation in its parameters. All of the free parameters of the model (α, λ, β, w, p) were taken as random effects, instantiated separately for each subject s from a common group-level distribution. For all parameters, the group-level distributions were Gaussian with free group-level mean (μ) and SD (σ), plus an additional free slope (e.g., $\beta_d$) allowing the parameter to scale, across subjects, with their (z-scored) devaluation score $d_s$. This was implemented identically in Experiments 1 and 2, using their respective indices of devaluation sensitivity. For instance, $\beta_s \sim \mathcal{N}(\mu_\beta + \beta_d d_s, \sigma_\beta)$, and analogously for p. However, for the parameters with support confined to [0,1] (α, λ, and w), the resulting Gaussian variables were transformed according to a logistic sigmoid, $x \mapsto 1/(1 + e^{-x})$.
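
To make the group-level structure concrete, the following sketch maps group-level quantities to one subject's parameter. It is not the Stan model used for estimation: the function name and argument names are hypothetical, and the subject-level deviation is passed in as a standard-normal draw (`noise`) so that the mapping is explicit.

```python
import numpy as np

def subject_parameter(mu, sigma, slope, d_z, noise, bounded=False):
    """Map group-level quantities to one subject's parameter value.

    mu, sigma: group-level mean and SD of the (untransformed) parameter
    slope:     change in the parameter per unit of z-scored devaluation score
    d_z:       the subject's z-scored devaluation score
    noise:     a standard-normal deviate for this subject (assumed input)
    bounded:   if True, squash to [0, 1] with a logistic sigmoid
               (as for alpha, lambda, and w)
    """
    raw = mu + slope * d_z + sigma * noise
    if bounded:
        return 1.0 / (1.0 + np.exp(-raw))
    return raw

# Usage: a hypothetical weight w for a subject one SD above the mean
# devaluation score, with made-up group-level values.
rng = np.random.default_rng(0)
example_w = subject_parameter(mu=0.0, sigma=0.5, slope=0.3, d_z=1.0,
                              noise=rng.standard_normal(), bounded=True)
```

Under this parameterization, a positive slope for w would mean that subjects with larger devaluation scores tend to have larger model-based weights, which is the kind of relationship the slope terms are designed to test.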

Estimation
We estimated the joint distribution of the parameters of the model, conditional on all subjects' observed choices and rewards. For this, we used Markov chain Monte Carlo (MCMC) techniques (specifically the No-U-Turn variant of Hamiltonian Monte Carlo) as implemented in the Stan modeling language (v2.5, 2014). Given a probabilistic generative model (the above equations) and a subset of observed variables, MCMC techniques provide samples from the conditional joint distribution over the remaining random variables. We ran four chains of 4,000 samples each, discarding the first 2,000 samples of each chain for burn-in. We examined the chains visually for convergence and also computed Gelman and Rubin's (1992) potential scale reduction factors. For these diagnostics, large values indicate convergence problems, whereas values near 1 are consistent with convergence. We ensured that they were less than 1.2 for all variables.
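
The convergence check can be illustrated with a simplified potential scale reduction factor. This sketch uses the basic between-chain and within-chain variance form and omits the degrees-of-freedom correction from Gelman and Rubin (1992). It is not the routine used by Stan, and the function name is hypothetical.

```python
import numpy as np

def potential_scale_reduction(chains):
    """Simplified Gelman-Rubin statistic for one scalar parameter.

    chains: array [n_chains, n_samples] of post-burn-in draws
    """
    chains = np.asarray(chains, dtype=float)
    n_samples = chains.shape[1]

    # W: average of the within-chain sample variances.
    within = chains.var(axis=1, ddof=1).mean()
    # B: between-chain variance, scaled by the number of samples per chain.
    between = n_samples * chains.mean(axis=1).var(ddof=1)

    # Pooled estimate of the posterior variance.
    pooled = (n_samples - 1) / n_samples * within + between / n_samples
    return np.sqrt(pooled / within)

# Usage with the chain layout described above: draws of shape (4, 4000),
# keeping only the second half of each chain, e.g.
#   r_hat = potential_scale_reduction(draws[:, 2000:])
```

Chains that have settled on the same distribution have similar means, so the between-chain term is small and the statistic stays close to 1; chains stuck in different regions inflate it well above the 1.2 threshold.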

Supplementary Regression Analyses
Given our relatively large sample size and the presence of some unique task features, we carried out additional exploratory analyses that may be of general interest (but were unrelated to our hypotheses). In contrast to the original task (Daw et al., 2011), there were two independent states, one gold and another silver. Therefore, on some trials the previous state experienced was different from the current state. We tested whether this affected choice behavior by including state information as an additional predictor in our logistic regression analysis of the learning task behavior (same state: 1, different state: −1, relative to the previous trial). This factor was included as a random effect (i.e., allowed to vary across subjects) and interacted with the other explanatory variables. We found that when the State on the current trial was the same as that on the previous one, participants were more likely to make the same choice (main effect of State, β = 0.13, SE = .033, p < .001), were more model-free (interaction of State by Reward, β = 0.26, SE = .03, p < .001), and trended towards being more model-based (three-way interaction of State, Reward, and Transition, β = −.048, SE = .03, p = .068). Therefore, not unexpectedly, the three basic effects on the choices were more robustly observed when participants made choices in the same state as the immediately previous trial, whereas these effects were diminished when experience with the other state intervened.

SCREEN 1
Welcome. Today you will be playing a game where you can earn some extra money.
On every turn, you will make a choice between two boxes, which will bring you to another box, which may or may not contain a coin.
At the end of the game, we will convert a proportion of what you have won into real money, which you can keep in addition to your hit payment. Simply put, the more coins you collect in the game, the more cash we will pay you when you are finished.
Coins can be gold or silver, and are always worth 25c, regardless of color. The border of the screen on each turn will let you know whether gold or silver coins are available.
You must read all of the instructions carefully. There will be a quiz before you begin the game and if you do not answer all of the questions correctly, you will be sent back to the beginning and will need to re-read them.

SCREEN 2
Before you start making choices, we will have a tutorial to show you how the game works. There are two things for you to learn in order to do well. We will practice these separately.

The first is:
You need to keep track of which boxes have the highest chance of having a coin inside. If a box contains a coin, it will appear below the box. If not, you will see a zero.
Every time you find a box, the computer will decide whether or not to give you a coin based on a 'chance', which has been assigned to that box. Some boxes have a higher chance of having a coin inside than others. Importantly, the chances of each box containing a coin will change slowly, and independently, over time. It is your job to keep track of which boxes are currently better than others and to try and get to these boxes.
There are no strange patterns to this game, such as a box containing coins on every other choice. The computer is not trying to play tricks on you; it strictly works on the chance assigned to each box, which will change slowly over time.

SCREEN 3
We will now let you practice this part of the game. You need to track which 'coin boxes' have a better chance of having coins than others.
Two of the coin boxes that we will show you sometimes contain silver coins, but will never contain gold coins. Likewise, the other two boxes will sometimes contain gold coins, but will never contain silver coins.
Unlike the real game, you won't be making any choices in this tutorial. Instead you simply need to pay attention to what happens on-screen so that you can work out which coin boxes are better than others. Learning how to do this will help you win money later on in the game.

SCREEN 4 (following 20 practice trials passively learning which boxes are more likely to yield coins)
Good. Hopefully, you saw that some boxes had a higher chance of having a coin inside than others. Also you probably noticed that even 'good' coin boxes didn't have a coin every single time.
The second thing you need to learn is how to make choices that will bring you to coin boxes that are currently good.
At the start of each turn, you will be able to choose between two boxes, which appear on the left and right. One of these boxes usually brings you to one of the coin boxes, and the other box usually brings you to the other coin box. For example, one box that you choose might bring you to one of the coin boxes on 7 out of 10 turns. But that means that on 3 out of 10 turns, it will take you to another box, by mistake. These chances are fixed, so you just need to learn these rules once. This is unlike the goodness of the coin boxes, which will change slowly over time during the game.
To sum-up: learning which boxes are more (or less) likely to bring you to each coin box is very important to playing the game well. If you can do this, you will be able to make good choices that will bring you to the coin boxes that are currently best.
We will practice making the choices now. We won't show you the coins in this tutorial, so that you can concentrate on learning which choices bring you to which coin boxes.
Press the:
E key to choose the LEFT BOX
I key to choose the RIGHT BOX
You have a limited amount of time to make each choice. Try practicing a few times right now.

SCREEN 5 (following 20 practice trials actively learning the transition structure through making choices and viewing their consequences)
Good. Hopefully you learned which choices usually lead to which coin boxes. You also will have noticed that by 'usually', we mean 70% of the time. That means that 30% of the time, you ended up in the other coin box.
It is important to know that the chance that a certain choice takes you to a certain coin box does not change over time. This means that if one box usually takes you to a certain coin box, that relationship will stay the same throughout the game.

SCREEN 6
Now that you understand the two parts you have practiced, we will remind you how they fit together in the game you are about to play. On each turn, you have a choice between two boxes, you will choose a box that will take you to a coin box and you will see if it contains a coin. After you find out whether or not your box contains coins, you will go back to the start and make another choice and try to earn another coin and so on.
While some of the coin boxes may become very good at times (that is, they often contain a coin), these same coin boxes may become bad later in the game. You need to stay on top of which coin boxes are best. You must use this information to make good choices that are likely to bring you to the boxes that are currently good.
It costs you 1 cent to choose between the boxes and play the game on each turn. We will deduct this from your bonus at the end of the game. This is a good deal, because if you know that a box is good, you will often find a 25 cent coin inside. If you choose not to open one of the boxes, or are too slow to respond, there is no way that you can get a 25 cent coin on that trial, but will save yourself 1 cent. Sometimes it might make sense to save yourself that 1 cent if you are really sure that there is nothing of value inside the coin boxes.

SCREEN 7
IMPORTANT: We will store your gold and silver coins in separate containers, one for gold coins and one for silver coins, so we can keep track of how much you have won. Please note that if either of these containers fills up completely during the game, you have maxed-out on storing that kind of coin! You won't be able to keep any more coins of that color for the rest of the game. That means even if you continue to find that kind of coin inside boxes, we won't store them for you, and you won't get to keep them. We will alert you if at any point your containers get half-way full, and again if they get completely full.

SCREEN 8 (Comprehension Test) (correct answers are emboldened)
Here is a short quiz to test if you understood the instructions correctly.
If you missed any important things, we will bring you to go through the instructions again.
Are silver and gold coins worth the same amount of money?
-I don't know
-Silver coins are more valuable than gold coins
-Gold coins are more valuable than silver coins
-Yes! They are both worth 25c
Can some boxes contain both silver and gold coins at different times?
-I don't know
-No. Boxes contain either silver or gold coins, never both
-Sometimes boxes will have both types of coins inside
-Yes! Boxes often have both types of coins inside

SCREEN 9
Well done! You answered the questions correctly! Before you start the game, let's review. Your aim is to collect as much money as possible by finding coins. To do this, you will make choices between pairs of boxes which will bring you to another box that may contain a coin. Often one box will be better than another because it pays you more often. However, the goodness of the coin boxes will change slowly over time, so you need to stay on top of which coin boxes are currently good.
Getting to the best coin boxes relies on your choice between the two boxes at the start of each turn. You have already learned how some choices are more likely to take you to certain coin boxes. This will not change over the game.
You will need to put these parts together to do well at the game and win as much money as possible. You need to keep track of which are the best coin boxes as that changes slowly, and make good choices that you think are likely to bring you to them.
Now we're going to play the real game where you can win money, which we will pay you as a bonus, in addition to your regular hit payment.
Use the 'E' and 'I' keys to choose the left and right boxes.

Consumption test
Free Coin Collection! For a brief time, you can collect coins by clicking on them using the mouse/trackpad. Once collected, coins will disappear from the screen and be placed in their containers, if they are not already full.
You will have 4 seconds to collect as many coins as you like. Get ready!

Withholding outcome information
We will now no longer show you the results of your choices (i.e. whether or not you get a coin on each trial). Apart from that, nothing about the game has changed, and you should continue playing as before.