Problem definition
We now describe a Markov decision process to model the stopping problem described at the beginning of this article. We consider the simplest possible case of this problem, where we: (i) restrict the number of choice alternatives to two (buy or sell), (ii) assume that observations are made at discrete (and constant) intervals, (iii) assume that observations consist of binary outcomes (up or down transitions), and (iv) restrict the difficulty of each decision to one of two possible levels (assets could be rising (or falling) at one of two different rates).
The decision-maker faces repeated decision-making opportunities (trials). On each trial the world is in one of two possible states (the asset is rising or falling), but the decision-maker does not know which at the start of the trial. At a series of time steps t = 1,2,3,… the decision-maker can choose to wait and accumulate evidence (observe whether the value of the asset goes up or down). Once the decision-maker feels sufficient evidence has been gained, they can choose to go, and decide either buy or sell. If the decision is correct (the advisor recommends buy and the asset is rising, or recommends sell and the asset is falling), they receive a reward; if the decision is incorrect, they receive a penalty. Under both outcomes the decision-maker then faces a delay before starting the next trial. If we assume that the decision-maker will undertake multiple trials, it is reasonable that they will aim to maximize their average reward per unit time. A behavioral policy that achieves the optimal reward per unit time will be found using average reward dynamic programming (Howard, 1960; Ross, 1983; Puterman, 2005).
We formalize the task as follows. Let t = 1,2,… be discrete points of time during a trial, and let X denote the evidence accumulated by the decision-maker up to that point in time. The decision-maker’s state in a trial is given by the pair (t, X). Note that, in contrast to previous accounts that use dynamic programming to establish optimal decision boundaries (e.g., Drugowitsch et al., 2012; Huang & Rao, 2013), we compute optimal policies directly in terms of evidence and time, rather than (posterior) belief and time. The reasons for doing so are elaborated in the Discussion. In any state, (t, X), the decision-maker can take one of two actions: (i) wait and accumulate more evidence (observe whether the asset value goes up or down), or (ii) go and choose the more likely alternative (buy/sell).
If action wait is chosen, the decision-maker observes the outcome of a binary random variable, δX, where \(\mathbb {P}(\delta X=1)=u=1-\mathbb {P}(\delta X=-1)\). The up-probability, u, depends on the state of the world. We assume throughout that u ≥ 0.5 if the true state of the world is rising, and u ≤ 0.5 if the true state is falling. The parameter u also determines the trial difficulty. When u is equal to 0.5, the two outcomes are equally probable (the asset value is as likely to go up as down); observing an outcome is then like flipping an unbiased coin and provides the decision-maker with no evidence about which hypothesis is correct. On the other hand, if u is close to 1 or 0 (the asset value almost always goes up or down), observing an outcome provides a large amount of evidence about the correct hypothesis, making the trial easy. After observing δX, the decision-maker transitions to a new state (t + 1, X + δX), as a result of the progression of time and the accumulation of the new evidence δX. Since the decision-maker does not know the state of the world, and consequently does not know u, the distribution over the possible successor states (t + 1, X ± 1) is non-trivial and is calculated below. In the most general formulation of the model, an instantaneous cost (or reward) would be obtained on making an observation, but throughout this article we assume that rewards and costs are only obtained when the decision-maker selects a go action. Thus, in contrast to some approaches (e.g., Drugowitsch et al., 2012), the cost of making an observation is 0.
If action go is chosen then the decision-maker transitions to one of two special states, C or I, depending on whether the decision made after the go action is correct or incorrect. As with transitions under wait, the probability that the decision is correct depends in a non-trivial way on the current state, and is calculated below. From the states C and I, there is no action to take, and the decision-maker transitions to the initial state (t, X) = (0,0). From state C the decision-maker receives a reward \(R_{C}\) and suffers a delay of \(D_{C}\); from state I they receive a reward (penalty) of \(R_{I}\) and suffer a delay of \(D_{I}\).
In much of the theoretical literature on sequential sampling models, it is assumed, perhaps implicitly, that the decision-maker knows the difficulty level of a trial. This corresponds to knowledge that the up-probability of an observation is u = 0.5 + 𝜖 when the true state is rising, and u = 0.5 − 𝜖 when the true state is falling. However, in ecologically realistic situations, the decision-maker may not know the difficulty level of the trial in advance. This can be modeled by assuming that the task on a particular trial is chosen from several different difficulties. In the example above, it could be that up/down observations come from different sources and some sources are noisier than others. To illustrate the simplest conditions resulting in varying decision boundaries, we model the situation where there are only two sources of observations: an easy source with \(u \in \mathcal {U}_{e} = \{\frac {1}{2}-\epsilon _{e}, \frac {1}{2}+\epsilon _{e}\}\) and a difficult source with \(u \in \mathcal {U}_{d} = \{\frac {1}{2}-\epsilon _{d}, \frac {1}{2}+\epsilon _{d}\}\), where \(\epsilon _{e}, \epsilon _{d} \in [0,\frac {1}{2}]\) are the drifts of the easy and difficult stimuli, with \(\epsilon_{d} < \epsilon_{e}\). Thus, during a difficult trial, u is close to 0.5, while for an easy trial u is close to 0 or 1. We assume that these two types of tasks can be mixed in any fraction, with \(\mathbb {P}(U \in \mathcal {U}_{e})\) the probability that the randomly selected drift corresponds to an easy task in the perceptual environment. For now, we assume that within both \(\mathcal {U}_{e}\) and \(\mathcal {U}_{d}\), u is equally likely to be above or below 0.5, i.e., there is equal probability of the asset rising and falling. In the section titled “Extensions of the model” below, we show how our results generalize to the situation of unequal prior beliefs about the state of the world.
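To make this generative setup concrete, the following Python sketch samples a trial’s up-probability from a two-difficulty mixture and simulates a sequence of up/down outcomes. The specific drift values, mixture probability, and function names are illustrative assumptions, not values used in the article.

```python
import numpy as np

# Illustrative parameters (not taken from the article): an easy and a difficult
# drift, and the fraction of easy trials in the environment.
rng = np.random.default_rng(0)
eps_easy, eps_hard = 0.2, 0.05      # hypothetical drifts, with eps_hard < eps_easy
p_easy = 0.5                        # P(U in U_e)

def sample_trial_u():
    """Draw the (hidden) up-probability u for a single trial."""
    eps = eps_easy if rng.random() < p_easy else eps_hard
    sign = 1 if rng.random() < 0.5 else -1   # rising vs. falling, equal prior
    return 0.5 + sign * eps

def simulate_outcomes(u, n_steps):
    """Simulate n_steps binary outcomes: +1 (up) with probability u, else -1 (down)."""
    return np.where(rng.random(n_steps) < u, 1, -1)

u = sample_trial_u()
outcomes = simulate_outcomes(u, 10)
evidence = np.cumsum(outcomes)      # accumulated evidence X_1, ..., X_t
```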
Figure 1a depicts evidence accumulation as a random walk in two-dimensional space with time along the x-axis and the evidence accumulated, \(X_{1},\ldots, X_{t}\), based on the series of outcomes, +1, +1, −1, +1, along the y-axis. The figure shows both the current state of the decision-maker at \((t, X_{t}) = (4,2)\) and their trajectory in this state-space. In this current state, the decision-maker has two available actions: wait or go. As long as they choose to wait, they will make a transition to either (5,3) or (5,1), depending on whether the next δX outcome is +1 or −1. Figure 1b shows the transition diagram for the stochastic decision process that corresponds to the random walk in Fig. 1a once the go action is introduced. Transitions under go take the decision-maker to one of the states C or I, and subsequently back to (0,0) for the next trial.
Our formulation of the decision-making problem has stochastic state transitions, decisions available at each state, and transitions from any state (t, X) depending only on the current state and the selected action. This is therefore a Markov decision process (MDP) (Howard, 1960; Puterman, 2005), with states (t, x) and the two dummy states C and I corresponding to the correct and incorrect choice. A policy is a mapping from states (t, x) of this MDP to wait/go actions. An optimal policy that maximizes the average reward per unit time in this MDP can be determined by using the policy iteration algorithm (Howard, 1960; Puterman, 2005). A key component of this algorithm is to calculate the average expected reward per unit time for fixed candidate policies. To do so, we must first determine the state-transition probabilities under either action (wait/go) from each state for a given set of drifts (Eqs. 6 and 7 below). These state-transition probabilities can then be used to compare the wait and go actions in any given state using the expected reward under each action in that state.
Computing state-transition probabilities
Computing the transition probabilities is trivial if one knows the up-probability, u, of the process generating the outcomes: the probability of transitioning from (t, x) to (t + 1, x + 1) is u, and to (t + 1, x − 1) is 1 − u. However, when each trial is of an unknown level of difficulty, the observed outcomes (up/down) during a particular decision provide information not only about the correct final choice but also about the difficulty of the current trial. Thus, the current state provides information about the likely next state under a wait action, through information about the up-probability, u. Therefore, the key step in determining the transition probabilities is to infer the up-probability, u, based on the current state and use this to compute the transition probabilities.
As already specified, we model a task that has trials drawn from two difficulties (it is straightforward to generalize to more than two difficulties): easy trials with u in the set \(\mathcal {U}_{e} = \{\frac {1}{2}-\epsilon _{e}, \frac {1}{2}+\epsilon _{e}\}\) and difficult trials with u in the set \(\mathcal {U}_{d} = \{\frac {1}{2}-\epsilon _{d}, \frac {1}{2}+\epsilon _{d}\}\) (note that this does not preclude a zero drift condition, \(\epsilon_{d} = 0\)). To determine the transition probabilities under the action wait, we must marginalize over the set of all possible drifts, \(\mathcal {U} = \mathcal {U}_{e} \cup \mathcal {U}_{d}\):
$$\begin{array}{rcl} p_{(t,x) \rightarrow (t+1,x+1)}^{wait} &=& \mathbb{P}(X_{t+1}=x+1 \mid X_{t}=x)\\ &=& \sum\limits_{u \in \mathcal{U}} \mathbb{P}(X_{t+1}=x+1 \mid X_{t}=x, U=u) \cdot \mathbb{P}(U=u \mid X_{t}=x) \\ p_{(t,x) \rightarrow (t+1,x-1)}^{wait} &=& 1 - p_{(t,x) \rightarrow (t+1,x+1)}^{wait} \end{array} $$
(1)
where U is the (unobserved) up-probability of the current trial. \(\mathbb {P}(X_{t+1}=x+1 | X_{t}=x,U=u)\) is the probability that δX = 1 conditional on \(X_{t} = x\) and the up-probability being u; this is simply u (the current evidence level \(X_{t}\) is irrelevant when we also condition on U = u). All that remains is to calculate the term \(\mathbb {P}(U=u|X_{t}=x)\).
This posterior probability of U = u at the current state can be inferred using Bayes’ law:
$$ \mathbb{P}(U=u \mid X_{t}=x) \;=\; \frac{\mathbb{P}(X_{t}=x \mid U=u) \cdot \mathbb{P}(U=u)}{\sum_{\tilde{u}\in\mathcal{U}} \mathbb{P}(X_{t}=x \mid U=\tilde{u}) \cdot \mathbb{P}(U=\tilde{u})} $$
(2)
where \(\mathbb {P}(U=u)\) is the prior probability of the up-probability being equal to u. The likelihood term, \(\mathbb {P}(X_{t}=x | U=u)\), can be calculated by summing the probabilities of all paths that would result in state (t, x). We use the standard observation about random walks that each of the paths that reach (t, x) contains \(\frac {t+x}{2}\) upward transitions and \(\frac {t-x}{2}\) downward transitions. Thus, the likelihood is given by the summation over paths of the probability of seeing this number of upward and downward moves:
$$ \mathbb{P}(X_{t}=x \mid U=u) = \sum\limits_{\text{paths}} u^{(t+x)/2}(1-u)^{(t-x)/2} = n_{\text{paths}}\, u^{(t+x)/2}(1-u)^{(t-x)/2}. $$
(3)
Here \(n_{\text{paths}}\) is the number of paths from state (0,0) to state (t, x), which may depend on the current decision-making policy. Plugging the likelihood into (2) gives
$$ \mathbb{P}(U=u \mid X_{t}=x) \;=\; \frac{ n_{\text{paths}}\, u^{(t+x)/2}(1-u)^{(t-x)/2}\, \mathbb{P}(U=u)}{\sum_{\tilde{u}\in\mathcal{U}} n_{\text{paths}}\, \tilde{u}^{(t+x)/2}(1-\tilde{u})^{(t-x)/2}\, \mathbb{P}(U=\tilde{u})}. $$
(4)
Some paths from (0,0) to (t, x) would have resulted in a decision to go (based on the decision-making policy), and therefore could not actually have resulted in the state (t, x). Note, however, that the number of paths \(n_{\text{paths}}\) is identical in both numerator and denominator, so it can be cancelled:
$$ \mathbb{P}(U=u \mid X_{t}=x) \;=\; \frac{ u^{(t+x)/2}(1-u)^{(t-x)/2}\, \mathbb{P}(U=u)}{\sum_{\tilde{u}\in\mathcal{U}} \tilde{u}^{(t+x)/2}(1-\tilde{u})^{(t-x)/2}\, \mathbb{P}(U=\tilde{u})}. $$
(5)
Using Eq. 1, the transition probabilities under the action wait can therefore be summarized as:
$$ p_{(t,x) \rightarrow (t+1,x+1)}^{wait} = \sum\limits_{u \in \mathcal{U}} u \cdot \mathbb{P}(U=u \mid X_{t}=x) \;=\; 1 - p_{(t,x) \rightarrow (t+1,x-1)}^{wait} $$
(6)
where the term \(\mathbb {P}(U=u | X_{t}=x)\) is given by Eq. 5. Equation 6 gives the decision-maker the probability of an increase or decrease in evidence in the next time step if they choose to wait.
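As an illustration of Eqs. 5 and 6, the following Python sketch computes the posterior over the up-probability at a state (t, x) and the resulting wait transition probability. The drift values and prior in the example are hypothetical choices for illustration, not parameters from the article.

```python
import numpy as np

def posterior_u(t, x, u_values, prior):
    """P(U = u | X_t = x) for each candidate u (Eq. 5)."""
    u_values = np.asarray(u_values, dtype=float)
    prior = np.asarray(prior, dtype=float)
    n_up, n_down = (t + x) / 2, (t - x) / 2     # up/down moves on any path to (t, x)
    unnorm = u_values**n_up * (1 - u_values)**n_down * prior
    return unnorm / unnorm.sum()

def p_wait_up(t, x, u_values, prior):
    """Probability of moving to (t+1, x+1) under the wait action (Eq. 6)."""
    return float(np.dot(np.asarray(u_values, dtype=float),
                        posterior_u(t, x, u_values, prior)))

# Hypothetical example: eps_e = 0.2, eps_d = 0.05, all four drifts equally likely.
u_values = [0.3, 0.7, 0.45, 0.55]
prior = [0.25, 0.25, 0.25, 0.25]
print(p_wait_up(4, 2, u_values, prior))     # > 0.5, since x = 2 favors "rising"
```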
Similarly, we can work out the state-transition probabilities under the action go. Under this action, the decision-maker makes a transition to either the correct or incorrect state. The decision-maker will transition to the Correct state if they choose buy and the true state of the world is rising, i.e., the true u is in \(\mathcal {U}_{+} = \{\frac {1}{2}+\epsilon _{e}, \frac {1}{2}+\epsilon _{d}\}\), or if they choose sell and the true state of the world is falling, i.e., the true u is in \(\mathcal {U}_{-} = \{\frac {1}{2}-\epsilon _{e}, \frac {1}{2}-\epsilon _{d}\}\) (assuming \(\epsilon_{d} > 0\); see the end of this section for how to handle \(\epsilon_{d} = 0\)).
The decision-maker will choose the more likely alternative: they compare the probability of the unobserved drift U coming from the set \(\mathcal {U}_{+}\) versus the set \(\mathcal {U}_{-}\), given the data observed so far. The decision-maker will respond buy when \(\mathbb {P}(U\in \mathcal {U}_{+}|X_{t}=x)>\mathbb {P}(U\in \mathcal {U}_{-}|X_{t}=x)\) and respond sell when \(\mathbb {P}(U\in \mathcal {U}_{+}|X_{t}=x)<\mathbb {P}(U\in \mathcal {U}_{-}|X_{t}=x)\). The probability of each of these decisions being correct is simply the posterior probability of the true state being rising or falling, respectively, given the information observed so far. Thus, when \(\mathbb {P}(U\in \mathcal {U}_{+}|X_{t}=x)>\mathbb {P}(U\in \mathcal {U}_{-}|X_{t}=x)\), the probability of a correct decision is \(\mathbb {P}(U\in \mathcal {U}_{+}|X_{t}=x)\); when the inequality is reversed, it is \(\mathbb {P}(U\in \mathcal {U}_{-}|X_{t}=x)\). Overall, the probability of being correct is the larger of the two, so the state-transition probabilities for the optimal decision-maker taking the action go in state (t, x) are:
$$\begin{array}{rcl} p_{(t,x) \rightarrow C}^{go} &=& \max\left\{\mathbb{P}(U \in \mathcal{U}_{+} \mid X_{t}=x),\; \mathbb{P}(U \in \mathcal{U}_{-} \mid X_{t}=x)\right\}\\ p_{(t,x) \rightarrow I}^{go} &=& 1 - p_{(t,x) \rightarrow C}^{go}. \end{array} $$
(7)
Assuming that the prior probability for each state of the world is the same, i.e., \(\mathbb {P}(U \in \mathcal {U}_{+}) = \mathbb {P}(U \in \mathcal {U}_{-})\), the posterior probabilities satisfy \(\mathbb {P}(U \in \mathcal {U}_{+} | X_{t}=x) > \mathbb {P}(U \in \mathcal {U}_{-} | X_{t}=x)\) if and only if the likelihoods satisfy \(\mathbb {P}(X_{t}=x | U \in \mathcal {U}_{+}) > \mathbb {P}(X_{t}=x | U \in \mathcal {U}_{-})\). In turn, this inequality in the likelihoods holds if and only if x > 0. Thus, in this situation of equal prior probabilities, the optimal decision-maker will select buy if x > 0 and sell if x < 0, so that the transition probability \(p_{(t,x) \rightarrow C}^{go}\) is equal to \(\mathbb {P}(U \in \mathcal {U}_{+} | X_{t}=x)\) when x > 0 and \(\mathbb {P}(U \in \mathcal {U}_{-} | X_{t}=x)\) when x < 0.
Note that when \(\epsilon_{d} = 0\), a situation which we study below, the sets \(\mathcal {U}_{+}\) and \(\mathcal {U}_{-}\) intersect, with \(\frac {1}{2}\) being a member of both. This corresponds to the difficult trials having an up-probability of \(\frac{1}{2}\) whether the true state of the world is rising or falling. Therefore, in the calculations above, we need to replace \(\mathbb {P}(U \in \mathcal {U}_{+} | X_{t}=x)\) in the calculation of the transition probability \(p_{(t,x) \rightarrow C}^{go}\) with \(\mathbb {P}(U = \frac {1}{2}+\epsilon _{e} | X_{t}=x) + \frac {1}{2} \mathbb {P}(U=\frac {1}{2} | X_{t}=x)\), and \(\mathbb {P}(U \in \mathcal {U}_{-} | X_{t}=x)\) with \(\mathbb {P}(U = \frac {1}{2}-\epsilon _{e} | X_{t}=x) + \frac {1}{2} \mathbb {P}(U=\frac {1}{2} | X_{t}=x)\).
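A corresponding sketch of Eq. 7 is given below, including the \(\epsilon_{d} = 0\) adjustment just described: any posterior mass on u = 1/2 is split evenly between the rising and falling hypotheses. Again, the example drifts and priors are hypothetical.

```python
import numpy as np

def p_go_correct(t, x, u_values, prior):
    """Probability of reaching the Correct state under the go action (Eq. 7)."""
    u_values = np.asarray(u_values, dtype=float)
    prior = np.asarray(prior, dtype=float)
    # Posterior over u at state (t, x), as in Eq. 5.
    unnorm = u_values**((t + x) / 2) * (1 - u_values)**((t - x) / 2) * prior
    post = unnorm / unnorm.sum()
    # Posterior probability that the world is rising / falling; any mass on
    # u = 0.5 (the eps_d = 0 case) is split equally between the two hypotheses.
    p_half = post[u_values == 0.5].sum()
    p_rising = post[u_values > 0.5].sum() + 0.5 * p_half
    p_falling = post[u_values < 0.5].sum() + 0.5 * p_half
    return max(p_rising, p_falling)

# Hypothetical example with eps_e = 0.2 and eps_d = 0:
print(p_go_correct(4, 2, u_values=[0.3, 0.7, 0.5], prior=[0.25, 0.25, 0.5]))
```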
Finding optimal actions
In order to find the optimal policy, a dynamic programming procedure called policy iteration is used. The remainder of this section provides a sketch of this standard procedure as applied to the model we have constructed. For a more detailed account, the reader is directed towards standard texts on stochastic dynamic programming such as Howard (1960), Ross (1983) and Puterman (2005). The technique searches for the optimal policy amongst the set of all policies by iteratively computing the expected returns for all states for a given policy (step 1) and then improving the policy based on these expected returns (step 2).
Step 1: Compute values of states for given π
To begin, assume that we have a current policy, π, which maps states to actions, and which may not be the optimal policy. Observe that fixing the policy reduces the Markov decision process to a Markov chain. If this Markov chain is allowed to run for a long period of time, it will return an average reward \(\rho^{\pi}\) per unit time, independently of the initial state (Howard, 1960; Ross, 1983). However, the short-run expected earnings of the system will depend on the current state, so that each state, (t, x), can be associated with a relative value, \(v_{(t,x)}^{\pi }\), that quantifies the relative advantage of being in state (t, x) under policy π.
Following the standard results of Howard (1960), the relative value of state (t, x), \(v_{(t,x)}^{\pi }\), is the expected value over successor states of three components: (i) the instantaneous reward obtained in making the transition, (ii) the relative value of the successor state, and (iii) a penalty term equal to the length of the delay in making the transition multiplied by the average reward per unit time. From a state (t, x), under action wait, the possible successor states are (t + 1, x + 1) and (t + 1, x − 1), with transition probabilities given by Eq. 6; under action go, the possible successor states are C and I, with transition probabilities given by Eq. 7; the delay for all of these transitions is one time step, and no instantaneous reward is received. Both C and I transition directly to (0,0), with reward \(R_{C}\) or \(R_{I}\) and delay \(D_{C}\) or \(D_{I}\), respectively. The general dynamic programming equations reduce to the following:
$$\begin{array}{rcl} v_{(t,x)}^{\pi} &=& \left\{\begin{array}{ll} p_{(t,x)\to(t+1,x+1)}^{wait}\, v_{(t+1,x+1)}^{\pi} + p_{(t,x)\to(t+1,x-1)}^{wait}\, v_{(t+1,x-1)}^{\pi} - \rho^{\pi} &\quad \text{if }\; \pi(t,x)=wait\\ p_{(t,x)\to C}^{go}\, v_{C}^{\pi} + p_{(t,x)\to I}^{go}\, v_{I}^{\pi} - \rho^{\pi} &\quad \text{if }\; \pi(t,x)=go \end{array}\right. \\ v_{C}^{\pi} &=& R_{C}+v_{(0,0)}^{\pi} - D_{C}\rho^{\pi}\\ v_{I}^{\pi} &=& R_{I}+v_{(0,0)}^{\pi} - D_{I}\rho^{\pi} \end{array} $$
(8)
The unknowns of the system are the relative values \(v_{(t,x)}^{\pi }\), \(v_{C}^{\pi }\), and \(v_{I}^{\pi }\), and the average reward per unit time \(\rho^{\pi}\). The system is underconstrained, with one more unknown (\(\rho^{\pi}\)) than equations. Note also that adding a constant term to all \(v^{\pi }_{\cdot }\) terms will produce an alternative solution to the equations. We therefore identify a solution by fixing \(v_{(0,0)}^{\pi }=0\) and interpreting all other \(v^{\pi }_{\cdot }\) terms as values relative to state (0,0).
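The sketch below sets up and solves this linear system for a fixed policy π, using the normalization \(v_{(0,0)}^{\pi}=0\). It assumes transition-probability functions of the form sketched earlier (p_wait_up and p_go_correct) and treats the rewards, delays, and time horizon t_max as free parameters; the truncation at t_max (forcing a transition to I when waiting there) anticipates the implementation detail described at the end of this section. It is a sketch under those assumptions, not the article's implementation.

```python
import numpy as np

def evaluate_policy(pi, p_wait_up, p_go_correct, R_C, R_I, D_C, D_I, t_max):
    """Solve Eq. 8 for a fixed policy pi: returns relative values v, v_C, v_I, and rho."""
    # Enumerate the states (t, x) reachable from (0,0): x has the same parity as t.
    states = [(t, x) for t in range(t_max + 1) for x in range(-t, t + 1, 2)]
    idx = {s: i for i, s in enumerate(states)}
    iC, iI, irho = len(states), len(states) + 1, len(states) + 2
    n = len(states) + 3
    A, b = np.zeros((n, n)), np.zeros(n)

    for (t, x) in states:
        i = idx[(t, x)]
        A[i, i] = 1.0
        A[i, irho] = 1.0                    # the "- rho" term, moved to the left-hand side
        if pi[(t, x)] == 'wait':
            if t < t_max:
                pu = p_wait_up(t, x)
                A[i, idx[(t + 1, x + 1)]] = -pu
                A[i, idx[(t + 1, x - 1)]] = -(1.0 - pu)
            else:
                A[i, iI] = -1.0             # waiting at t_max forces a transition to I
        else:                                # go
            pc = p_go_correct(t, x)
            A[i, iC], A[i, iI] = -pc, -(1.0 - pc)

    # v_C = R_C + v_(0,0) - D_C * rho   and   v_I = R_I + v_(0,0) - D_I * rho
    A[iC, iC], A[iC, idx[(0, 0)]], A[iC, irho], b[iC] = 1.0, -1.0, D_C, R_C
    A[iI, iI], A[iI, idx[(0, 0)]], A[iI, irho], b[iI] = 1.0, -1.0, D_I, R_I
    # Normalization that identifies the solution: v_(0,0) = 0.
    A[irho, idx[(0, 0)]] = 1.0

    z = np.linalg.solve(A, b)
    v = {s: z[idx[s]] for s in states}
    return v, z[iC], z[iI], z[irho]
```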
Step 2: Improve \(\pi \rightarrow \pi ^{new}\)
So far, we have assumed that the policy, π, is arbitrarily chosen. In the second step, we use the relative values of states, determined using Eq. 8, to improve this policy. This improvement can be performed by applying the principle of optimality (Bellman, 1957): in any given state on an optimal trajectory, the optimal action can be selected by finding the action that maximizes the expected return and assuming that an optimal policy will be followed from there on.
When updating the policy, the decision-maker thus selects, for each state, the action that maximizes the expectation of the immediate reward plus the relative value of the successor state, penalized by the opportunity cost, with successor-state values and opportunity cost calculated under the incumbent policy π. In our model, actions need only be selected in states (t, x), and we compare the two possible evaluations of \(v_{(t,x)}^{\pi }\) in Eq. 8. The decision-maker therefore sets \(\pi^{new}(t, x) = wait\) if
$$\begin{array}{rcl} p_{(t,x)\to(t+1,x+1)}^{wait}\, v_{(t+1,x+1)}^{\pi} &+& p_{(t,x)\to(t+1,x-1)}^{wait}\, v_{(t+1,x-1)}^{\pi}\\ &>& p_{(t,x)\to C}^{go}\, v_{C}^{\pi} + p_{(t,x)\to I}^{go}\, v_{I}^{\pi} \end{array} $$
(9)
and selects go otherwise. Note also that, by Eq. 8 and the identification \(v^{\pi }_{(0,0)}=0\), the relative values of the correct and incorrect states satisfy \(v_{C}^{\pi } = R_{C}-D_{C}\rho ^{\pi }\) and \(v_{I}^{\pi } = R_{I}-D_{I}\rho ^{\pi }\). We therefore see the trade-off between choosing to wait, receiving no immediate reward and simply transitioning to a further, potentially more profitable, state, and choosing go, in which there is a probability of receiving a good reward but a delay will be incurred. It will only be sensible to choose go if \(p_{(t,x)\to C}^{go}\) is sufficiently high in comparison to the average reward \(\rho^{\pi}\) calculated under the current policy π. Intuitively, since \(\rho^{\pi}\) is the average reward per time step, deciding to go and incur the delays requires that the expected return from doing so outweighs the expected opportunity cost \(\bar {D}\rho ^{\pi }\) (where \(\bar {D}\) is a suitably weighted average of \(D_{C}\) and \(D_{I}\)). The new policy can be shown to have a better average reward \(\rho ^{\pi ^{new}}\) than \(\rho^{\pi}\) (Howard, 1960; Puterman, 2005).
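A sketch of the corresponding improvement step is given below. It applies the comparison in Eq. 9 to each state, using the relative values and \(v_{C}^{\pi}\), \(v_{I}^{\pi}\) returned by the evaluation step sketched earlier; as in the text, the \(\rho^{\pi}\) terms cancel from both sides of the comparison and are omitted. The treatment of t_max follows the truncation described at the end of this section.

```python
def improve_policy(v, v_C, v_I, p_wait_up, p_go_correct, t_max):
    """Apply the comparison in Eq. 9 to every state and return the improved policy."""
    pi_new = {}
    for (t, x) in v:
        if t < t_max:
            pu = p_wait_up(t, x)
            q_wait = pu * v[(t + 1, x + 1)] + (1.0 - pu) * v[(t + 1, x - 1)]
        else:
            q_wait = v_I                    # waiting at t_max leads to the incorrect state
        pc = p_go_correct(t, x)
        q_go = pc * v_C + (1.0 - pc) * v_I
        pi_new[(t, x)] = 'wait' if q_wait > q_go else 'go'
    return pi_new
```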
This policy iteration procedure can be initialized with an arbitrary policy and iterates over steps 1 and 2 to improve the policy. The procedure stops when the policy \(\pi^{new}\) is unchanged from π, which occurs after a finite number of iterations; when it does so, it has converged on an optimal policy, \(\pi^{*}\). This optimal policy determines the action in each state that maximizes the long-run expected average reward per unit time.
For computing the optimal policies shown in this article, we initialized the policy to one that maps all states to the action go and then performed policy iteration until the algorithm converged. The theory above does not put any constraints on the size of the MDP: the decision-maker can continue to wait an arbitrarily long time before taking the action go. However, due to computational limitations, we limit the largest value of time in a trial to a fixed value \(t_{max}\) by forcing the decision-maker to make a transition to the incorrect state at \(t_{max} + 1\); that is, for any x, \(p_{(t_{max},x) \rightarrow I}^{wait} = 1\). In the policies computed below, we set \(t_{max}\) to a value much larger than the interval of interest (the time spent during a trial) and verified that the value of \(t_{max}\) does not affect the policies in the chosen intervals. The code for computing the optimal policies as well as the state-transition probabilities is contained in a Toolbox available on the Open Science Framework (https://osf.io/gmjck/).
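Putting the pieces together, the following sketch runs the full procedure: it initializes the policy to go in every state and alternates evaluation and improvement until the policy stops changing. It relies on the evaluate_policy and improve_policy sketches above, and all numerical values (drifts, prior, rewards, delays, t_max) are illustrative stand-ins rather than the settings used for the article’s figures or the OSF Toolbox.

```python
import numpy as np

# Illustrative environment: eps_e = 0.2, eps_d = 0.05, equal priors on the four drifts.
u_values = np.array([0.3, 0.7, 0.45, 0.55])
prior = np.array([0.25, 0.25, 0.25, 0.25])
t_max = 30                                       # truncation horizon (illustrative)

def _posterior(t, x):
    unnorm = u_values**((t + x) / 2) * (1 - u_values)**((t - x) / 2) * prior
    return unnorm / unnorm.sum()

def p_wait_up(t, x):
    return float(np.dot(u_values, _posterior(t, x)))

def p_go_correct(t, x):
    post = _posterior(t, x)
    return float(max(post[u_values > 0.5].sum(), post[u_values < 0.5].sum()))

# Start from the all-go policy and iterate evaluation/improvement to convergence.
pi = {(t, x): 'go' for t in range(t_max + 1) for x in range(-t, t + 1, 2)}
while True:
    v, v_C, v_I, rho = evaluate_policy(pi, p_wait_up, p_go_correct,
                                       R_C=1.0, R_I=0.0, D_C=2.0, D_I=2.0,
                                       t_max=t_max)
    pi_new = improve_policy(v, v_C, v_I, p_wait_up, p_go_correct, t_max)
    if pi_new == pi:                             # policy stable: optimal policy found
        break
    pi = pi_new

print('average reward per unit time:', rho)
```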