Hierarchical Reinforcement Learning Explains Task Interleaving Behavior

How do people decide how long to continue in a task, when to switch, and to which other task? It is known that task interleaving adapts situationally, showing sensitivity to changes in expected rewards, costs, and task boundaries. However, the mechanisms that underpin the decision to stay in a task versus switch away are not thoroughly understood. Previous work has explained task interleaving by greedy heuristics and a policy that maximizes the marginal rate of return. However, it is unclear how such a strategy would allow for adaptation to environments that offer multiple tasks with complex switch costs and delayed rewards. Here, we develop a hierarchical model of supervisory control driven by reinforcement learning (RL). The core assumption is that the supervisory level learns to switch using task-specific approximate utility estimates, which are computed on the lower level. We show that a hierarchically optimal value function decomposition can be learned from experience, even in conditions with multiple tasks and arbitrary and uncertain reward and cost structures. The model also reproduces well-known key phenomena of task interleaving, such as the sensitivity to costs of resumption and immediate as well as delayed in-task rewards. In a demanding task interleaving study with 211 human participants and realistic tasks (reading, mathematics, question-answering, recognition), the model yielded better predictions of individual-level data than a flat (non-hierarchical) RL model and an omniscient-myopic baseline. Corroborating emerging evidence from cognitive neuroscience, our results suggest hierarchical RL as a plausible model of supervisory control in task interleaving.


Christoph Gebhardt (cgebhard@ethz.ch), Eidgenössische Technische Hochschule Zürich, Stampfenbachstrasse 48, 8092 Zürich, Switzerland

Introduction
How long will you keep reading this paper before you return to email? Knowing when to persist and when to do something else is a hallmark of cognitive functioning and is intensely studied in the cognitive sciences (Altmann and Trafton 2002;Brumby et al. 2009;Duggan et al. 2013;Janssen and Brumby 2010;Jersild 1927;Monsell 2003;Norman and Shallice 1986;Oberauer and Lewandowsky 2011;Payne et al. 2007;Wickens and McCarley 2008). In the corresponding decision problem, the task interleaving

Fig. 1
Example of the task interleaving problem with two tasks: Given a limited time window and N tasks with reward/cost structures, an agent has to decide what to focus on at any given time such that the totally attained reward gets maximized. Attending a task progresses its state and collects the associated rewards $r_T(s)$, while switching to another task incurs a cost $c_T(s)$

It is well-known that human interleaving behavior is adaptive. In particular, the timing of switches shows sensitivity to task engagement (Janssen and Brumby 2015;Wickens and McCarley 2008). Factors that define the engagement of a task are interest (Horrey and Wickens 2006) and priority (Iani and Wickens 2007), commonly modeled as in-task rewards. Task interleaving is also sensitive to interruption costs (Trafton et al. 2003) and to resumption costs (Altmann and Trafton 2002;Gutzwiller et al. 2019;Iqbal and Bailey 2008). These costs represent additional processing demands due to the need to alternate back and forth between different tasks and the resulting additional time it takes to complete them (Jersild 1927;Oberauer and Lewandowsky 2011). This is affected by skill-level (Janssen and Brumby 2015) and memory recall demands (Altmann and Trafton 2007;Oulasvirta and Saariluoma 2006).
In addition, task switches tend to be pushed to boundaries between tasks and subtasks because a task can be resumed more rapidly on return when it was left at a good stopping point as switch costs are lower (Altmann and Trafton 2002;Janssen et al. 2012;McFarlane 2002).
Previous models have shed light on possible mechanisms underlying these effects: (i) According to a time-based switching heuristic, the least attended task receives resources, to balance resource-sharing among tasks (Salvucci and Taatgen 2008;Salvucci et al. 2009), or in order to refresh it in memory (Oberauer and Lewandowsky 2011); (ii) According to a foraging-based model, switching maximizes in-task reward (Payne et al. 2007;Duggan et al. 2013), which is tractable for diminishing-returns reward functions using the marginal value theorem; (iii) According to a multi-attribute decision model, task switches are determined based on task attractiveness, defined by importance, interest, and difficulty (Wickens et al. 2015).
While these models have enhanced our understanding, we still have an incomplete picture of how human interleaving adapts to multiple tasks and complex reward/cost structures, including delayed rewards. Examples with non-diminishing rewards are easy to construct: in food preparation, the reward is collected only after cooking has finished. In choosing cooking over Netflix, people demonstrate an ability to avoid being dominated by immediately achievable rewards. In addition, we also need to explain people's ability to interleave tasks they have not experienced before. If you have never read this paper, how can you decide to switch away to email or continue to read?
Here we propose hierarchical reinforcement learning (HRL) as a unified account of adaptive supervisory control in task interleaving. While there is extensive work on HRL in machine learning, we propose it here specifically as a model of human supervisory control that keeps track of ongoing tasks and decides which to switch to (Norman and Shallice 1986;Wickens and McCarley 2008). We assume a two-level supervisory control system, where both levels use RL to approximate utility based on experience.
From a machine learning perspective, RL is a method for utility approximation in conditions that are uncertain and where gratifications are delayed (Sutton and Barto 1998). In task interleaving, we use it to model how people estimate the value of continuing in a task and can anticipate a high future reward even if the immediate reward is low. Hierarchical RL extends this by employing temporal abstractions that describe state transitions of variable durations. Hierarchicality has cognitive appeal thanks to its computational tractability. Selecting among higher level actions reduces the number of decisions required to solve a problem (Botvinick 2012). We demonstrate significant decreases in computational demands when compared with a flat agent equal in performance.
Emerging evidence has shed light on the neural implementation of RL and HRL in the human brain. The temporal difference error of RL correlates with dopamine signals that update reward expectation in the striatum and also explains the release of dopamine related to levels of uncertainty in neurobiological systems (Gershman and Uchida 2019). The prefrontal cortex (PFC) is proposed to be organized hierarchically for supervisory control (Botvinick 2012;Frank and Badre 2011), with dopaminergic signaling contributing to temporal difference learning and the PFC representing currently active subroutines. As a consequence, HRL has been applied to explain brain activity during complex tasks (Botvinick et al. 2009;Rasmussen et al. 2017;Balaguer et al. 2016). However, no related work considers hierarchically optimal problem decomposition of cognitive processes in task interleaving. Hierarchical optimality is crucial in the case of task interleaving, since rewards of the alternative tasks influence the decision to continue the attended task.
To test the idea of hierarchical RL, it is necessary to develop computational models that are capable of performing realistic tasks and replicating human data closely by reference to neurobiologically plausible implementation (Kriegeskorte and Douglas 2018). Computational models that generate task performance can expose interactions among cognitive components and thereby subject theories to critical testing against human behavior. If successful, such computational models can, in turn, serve as reference and inspiration for further research on neuroscientific events and artificial intelligence methods. However, such models will necessarily involve substantial parametric complexity, which calls for methods from Bayesian inference and large behavioral datasets (Kangasrääsiö et al. 2019).
In this spirit, we present a novel computational implementation of HRL for task interleaving and assess it against a rich set of empirical findings. The defining feature of our implementation is a two-level hierarchical decomposition of the RL problem. (i) On the lower, or task type, level, a state-action value function is kept for each task type (e.g., writing, browsing) and updated with experience of each ongoing task instance (e.g., writing task A, browsing task B, browsing task C). (ii) On the higher, or task instance, level, a reference is kept to each ongoing task instance. HRL decides the next task based on value estimates provided from the lower level. This type-instance distinction permits taking decisions without previously experiencing the particular task instance. By modeling task type-level decisions with a semi-Markov decision process (SMDP), we model how people decide to switch at decision points rather than at a fixed sampling interval. In addition, the HRL model allows learning arbitrarily shaped reward and cost functions. For each task, a reward and a cost function is defined over its states (see Fig. 1).
While the optimal policies of hierarchically optimal HRL and flat RL produce the same decision sequence given the same task interleaving problem, differences in how these policies are learned from experience render HRL a cognitively more plausible model than flat RL. Flat RL learns expected rewards for each task type to task type transition and, hence, needs to observe a particular transition in training to be able to make a rational decision at test time. In contrast, through hierarchical decomposition, our HRL agent does not consider the task from which a switch originates, but learns expected rewards for transitions by only considering the switch destination. This enables the HRL agent to make rational decisions for task type to task type switches that have not been observed in training, provided the task types themselves are familiar. We hypothesize that this better matches human learning of task interleaving.
Modeling task interleaving with RL assumes that humans learn by trial and error when to switch between tasks to maximize the attained reward while minimizing switching costs. For the example in Fig. 1, this means that they have learned by conducting several writing tasks that the majority of the reward is attained at the end of the task (states 8 and 9). In addition, they have experienced that a natural break point is reached when finishing a paragraph (states 6 and 7) and that one can switch after that without incurring any costs. Similarly, they have learned that switching to and from a browsing task is generally not very costly due to its simplicity. However, its reward also quickly diminishes, as no interesting new information can be attained. The acquired experiences are encoded in memory, which provides humans with an intuition on which task to attend in unseen similar situations. RL also accounts for the human capability to make decisions based on future rewards that cannot be attained immediately. For instance, it can depict the behavior of a human who finishes the writing task to attain the high reward at its end while resisting a switch to the browsing task, which would provide a low immediate gratification.
In the rest of the paper, we briefly review the formalism of hierarchical reinforcement learning before presenting our model and its implementation. We then report evidence from simulations and empirical data. The model reproduces known patterns of adaptive interleaving and predicts individual-level behavior measured in a challenging and realistic interleaving study with six tasks (N = 211). The HRL model performed as well as or better than the flat RL and omniscient-myopic baseline models, the latter of which does not consider long-term rewards. HRL also showed more human-like patterns, such as sensitivity to subtask boundaries and delayed gratification. We conclude that human interleaving behavior appears better described by hierarchically decomposed optimal planning under uncertainty than by heuristics, myopic strategies, or flat RL strategies.

Markov and Semi-Markov Decision Processes
The family of Markov decision processes (MDP) is a mathematical framework for decision-making in stochastic domains (Kaelbling et al. 1998). The MDP is a four-tuple $(S, A, P, R)$, where $S$ is a set of states, $A$ a set of actions, $P$ the state transition probability for going from a state $s$ to state $s'$ after performing action $a$ (i.e., $P(s'|s, a)$), and $R$ the reward for action $a$ in state $s$ (i.e., $R : S \times A \rightarrow \mathbb{R}$). The expected discounted reward for action $a$ in $s$ when following policy $\pi$ is known as the Q value:

$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a,\ \pi\right],$$

where $\gamma$ is a discount factor. Q values are related via the Bellman equation:

$$Q^{\pi}(s, a) = \sum_{s'} P(s'|s, a)\left[R(s', s, a) + \gamma\, Q^{\pi}(s', \pi(s'))\right].$$

The optimal policy can then be computed as $\pi^{*} = \arg\max_{a} Q^{\pi}(s, a)$. Classic MDPs assume a uniform discrete step size. To model temporally extended actions, semi-Markov decision processes (SMDPs) are used. SMDPs represent snapshots of a system at decision points where the time between transitions can be of variable temporal length. An SMDP is a five-tuple $(S, A, P, R, F)$, where $S$, $A$, $P$, and $R$ describe an MDP and $F$ gives the probability of transition times for each state-action pair. Its Bellman equation is:

$$Q^{\pi}(s, a) = \sum_{s', t} F(t|s, a)\, P(s'|s, a)\left[R(s', s, a) + \gamma^{t} Q^{\pi}(s', \pi(s'))\right], \tag{1}$$

where $t$ is the number of time units after the agent chooses action $a$ in state $s$ and $F(t|s, a)$ is the probability that the next decision epoch occurs within $t$ time units.
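To make the role of the transition-time term concrete, the following is a minimal Python sketch of a single SMDP Bellman backup. It assumes a tabular setting in which `P` and `F` are dictionaries of discrete probabilities (treating $F(t|s, a)$ as a probability mass over durations $t$) and in which the reward depends only on $(s, a)$. The function name, the containers, and the toy numbers are illustrative assumptions, not part of the model described in this paper.

```python
def smdp_backup(s, a, P, F, R, Q, policy, gamma):
    """Return one SMDP Bellman backup for the state-action pair (s, a).

    P[(s, a)] maps next states to probabilities, F[(s, a)] maps durations t
    to probabilities, R[(s, a)] is the reward, Q holds current value
    estimates, and policy maps states to actions. All containers are
    hypothetical tabular stand-ins.
    """
    value = 0.0
    for s_next, p in P[(s, a)].items():
        for t, f in F[(s, a)].items():
            future = Q[(s_next, policy[s_next])]
            # Longer transitions discount the future value more strongly.
            value += f * p * (R[(s, a)] + gamma ** t * future)
    return value


# Toy example: one action leads to state 1 after either 1 or 3 time units.
P = {(0, "go"): {1: 1.0}}
F = {(0, "go"): {1: 0.5, 3: 0.5}}
R = {(0, "go"): 1.0}
Q = {(1, "stay"): 2.0}
policy = {1: "stay"}
q_value = smdp_backup(0, "go", P, F, R, Q, policy, gamma=0.5)
```

In this toy case the slower transition contributes less future value than the faster one, which is the only difference from a fixed-step MDP backup.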

Reinforcement Learning
Reinforcement learning solves Markov decision processes by learning a state-action value function $Q(s, a)$ that approximates the Q value of the Bellman equation $Q^{\pi}(s, a)$.
There are two classes of algorithms for RL: model-based and model-free algorithms. In model-based algorithms, the state transition probabilities $F(t|s, a)$ and $P(s'|s, a)$ are known and policies are found by enumerating the possible sequences of states that are expected to follow a starting state and action while summing the expected rewards along these sequences. In this paper, we use model-free RL algorithms to solve an MDP. These algorithms learn the approximate state-action value function $Q(s, a)$ in an environment where the state transition probability functions $F(t|s, a)$ and $P(s'|s, a)$ are unknown but can be sampled. One model-free algorithm that learns the approximate state-action value function via temporal difference learning is Q-learning:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[R_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right],$$

where $\alpha$ is the learning rate and $s_t$, $s_{t+1}$, $a_t$, and $R_{t+1}$ are sampled from the environment.
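As an illustration of the temporal difference update, here is a minimal tabular Q-learning step in Python. The learning rate `alpha`, the dictionary-based table, and the toy state and action labels are assumptions made for this sketch; it shows the generic algorithm rather than the specific implementation used in this paper.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the TD error."""
    best_next = max(Q[(s_next, b)] for b in actions)
    td_error = reward + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


Q = defaultdict(float)        # unseen state-action pairs default to 0.0
Q[(1, "continue")] = 2.0      # pretend state 1 is already known to be valuable
actions = ["continue", "leave"]
td = q_learning_update(Q, 0, "continue", 1.0, 1, actions, alpha=0.5, gamma=0.5)
```

The returned temporal difference error is the quantity that the introduction relates to dopamine signals.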

Hierarchical Reinforcement Learning
Hierarchical RL (HRL) is based on the observation that a variable can be irrelevant to the optimal decision in a state even if it affects the value of that state (Dietterich 1998). The goal is to decompose a decision problem into subroutines, encapsulating the internal decisions such that they are independent of all external variables other than those passed as arguments to the subroutine. There are two types of optimality of policies learned by HRL algorithms. A policy which is optimal with respect to the non-decomposed problem is called hierarchically optimal (Andre and Russell 2002;Ghavamzadeh and Mahadevan 2002). A policy optimized within its subroutine, ignoring the calling context, is called recursively optimal (Dietterich 1998).

Task Model
We model tasks via the reward $r_T(s)$ and cost $c_T(s)$ functions defined over discrete states $s$ (see Fig. 2). The reward represents subjective attractiveness of a state in a task (Norman and Shallice 1986;Wickens and McCarley 2008). The cost represents overheads caused by a switch to a task (Jersild 1927;Oberauer and Lewandowsky 2011;Oulasvirta and Saariluoma 2006). A state is a discrete representation of progress within a task and the progress is specific to a task type. For instance, in our reading task model, progress is approximated by the position of the scroll bar in a text box. Reward and cost functions can be arbitrarily shaped. This affords flexibility to model tasks with high interest (Horrey and Wickens 2006;Iani and Wickens 2007), tasks with substructures (Bailey and Konstan 2006;Monk et al. 2004), as well as complex interruption and resumption costs (Trafton et al. 2003;Rubinstein et al. 2001).

Fig. 2
An exemplary task model for paper writing. The model specifies in-task rewards with function $r_T$ and resumption/interruption costs with function $c_T$. Both are specified over a discrete state $s$ that defines progress in a task
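A task model of this kind can be represented as two equally long arrays indexed by the progress state. The sketch below is one possible Python encoding; the class name `TaskModel`, its methods, and the numeric values are hypothetical and chosen only to mirror the verbal description (reward at the end, a cheap break point in the middle).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskModel:
    """Reward r_T(s) and cost c_T(s) over discrete progress states."""
    name: str
    reward: List[float]
    cost: List[float]

    def __post_init__(self):
        if len(self.reward) != len(self.cost):
            raise ValueError("reward and cost must cover the same states")

    @property
    def n_states(self) -> int:
        return len(self.reward)

    def r(self, s: int) -> float:
        return self.reward[s]

    def c(self, s: int) -> float:
        return self.cost[s]


# Hypothetical writing task: reward arrives only at the end, and switching is
# free at a natural break point (state 2) but costly elsewhere.
writing = TaskModel(
    name="writing",
    reward=[0.0, 0.0, 0.0, 0.0, 8.0],
    cost=[3.0, 3.0, 0.0, 3.0, 3.0],
)
```

Because both functions are plain arrays, arbitrarily shaped reward and cost profiles can be expressed without changing the structure.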

Hierarchical Decomposition of Task Environments
The literature views task interleaving as governed by a human supervisory control mechanism that keeps track of ongoing tasks and decides which to switch to (Norman and Shallice 1986;Wickens and McCarley 2008). We propose to model this mechanism with hierarchical reinforcement learning, assuming a two-level supervisory control system. Intuitively, this means that we assume humans make two separate decisions at each task switch: first, when to leave the current task and second, which task to attend next. These decisions are learned with two separate memory structures (i.e., state-action value functions) and updated with experience. The lower level learns to decide whether to continue or leave the current task. Thus, it keeps a state-action value function for each task type (e.g., writing, browsing) and updates it with experience of each ongoing task instance (e.g., writing task A, browsing task B, browsing task C). The higher level learns to decide which task to attend next based on the learned reward expectations of the lower level. In contrast, in flat RL, task interleaving is learned with one memory structure and every task switch is a single decision to attend a next task. We explain the difference between flat and hierarchical RL in more detail in "Comparison with Flat RL."

Fig. 3
Hierarchical decomposition of the task interleaving problem. $\mathit{Root}$ selects among task instances, e.g., $\mathit{Task}_{11}(s)$, which in turn call the subroutine of their respective type, e.g., $\mathit{TaskType}_{1}(s)$. A subroutine can either continue, $\mathit{Continue}(s)$, or leave, $\mathit{Leave}(s)$, a task

Figure 3 shows the hierarchical decomposition of the problem. We decompose the task interleaving decision problem into several partial programs that represent the different tasks. Each task is modeled as a behavioral subroutine that makes decisions independently of all other tasks, considering only the variables passed to it as arguments (see "Hierarchical Reinforcement Learning" for background). Rectangles represent composite actions that can be performed to call a subroutine or a primitive action. Each subroutine (triangle) is a separate SMDP. Primitive actions (ovals) are the only actions that directly interact with the task environment. The problem is decomposed by defining a subroutine for each task type: $\mathit{TaskType}_{1}(s)$ to $\mathit{TaskType}_{N}(s)$. A subroutine estimates the expected cumulative reward of pursuing a task from a starting state $s$ until the state in which it is expected to leave the task. At a given state $s$, it can choose between the actions of continuing, $\mathit{Continue}(s)$, or leaving, $\mathit{Leave}(s)$, the task. These actions then call the respective action primitives: continue, leave. The higher level routine, $\mathit{Root}$, selects among all available task instances, $\mathit{Task}_{11}(s)$ to $\mathit{Task}_{NN}(s)$, the one which returns the highest expected reward. When a task instance is selected, it calls its respective task type subroutine, passing its in-task state $s$ (e.g., $\mathit{Task}_{11}(s)$ calls $\mathit{TaskType}_{1}(s)$).

Reward Functions
We define two reward functions. On the task type level, the reward for proceeding with a subroutine from its current state $s$ with action $a$ is:

$$R_{\mathrm{type}}(s, a) = \begin{cases} -c_T(s) & \text{if } a = \mathit{leave} \\ r_T(s) & \text{if } a = \mathit{continue} \end{cases}$$

where $c_T(s)$ and $r_T(s)$ are the respective cost and reward functions of the task. This covers cases in which the agent gains a reward by pursuing a task ($r_T(s)$, $a = \mathit{continue}$).
It also captures human sensitivity to interruption costs (Trafton et al. 2003) and future resumption costs (Altmann and Trafton 2002;McFarlane 2002), when deciding to terminate task execution ($-c_T(s)$, $a = \mathit{leave}$). Finally, it models how decreasing reward and increasing effort both increase the probability of leaving a task (Gutzwiller et al. 2019). On the task instance level, we penalize state changes to model reluctance to continue tasks that require excessive effort to recall relevant knowledge (Altmann and Trafton 2007;Oulasvirta and Saariluoma 2006). The respective reward function is

$$R_{\mathrm{root}}(s) = -c_T(z(s)),$$

where $s$ is the state on the root level, $z(s)$ maps $s$ to the state of its child's SMDP, and $c_T(s)$ is again the cost function of the task.
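The two reward functions can be written down directly. The following Python sketch assumes that rewards and costs are stored as lists indexed by state and that the mapping $z$ defaults to the identity; the function names and the example values are hypothetical.

```python
CONTINUE, LEAVE = "continue", "leave"


def reward_type(r_T, c_T, s, a):
    """Task type-level reward: r_T(s) for continuing, -c_T(s) for leaving."""
    if a == CONTINUE:
        return r_T[s]
    if a == LEAVE:
        return -c_T[s]
    raise ValueError(f"unknown action: {a!r}")


def reward_root(c_T, s, z=lambda state: state):
    """Root-level reward: the negative cost at the child state z(s)."""
    return -c_T[z(s)]


r_T = [0.0, 1.0, 5.0]   # hypothetical in-task rewards
c_T = [2.0, 0.0, 2.0]   # hypothetical switch costs
```

Leaving at state 1, where the cost is zero, carries no penalty, whereas leaving at state 0 or resuming there is penalized by the same cost value.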

Hierarchical Optimality
Modeling task interleaving with hierarchical reinforcement learning (HRL) raises the question of whether policies of this problem should be recursively (Dietterich 1998) or hierarchically optimal (Andre and Russell 2002;Ghavamzadeh and Mahadevan 2002). In our setting, recursive optimality would mean that humans decide to continue or leave the currently pursued task only by considering its rewards and costs. However, rewards of the alternative tasks influence a human's decision to continue the attended task. This is captured with hierarchically optimal HRL, which can be implemented using the three-part value function decomposition proposed in Andre and Russell (2002):

$$Q^{\pi}(s, a) = Q^{\pi}_{r}(s, a) + Q^{\pi}_{c}(s, a) + Q^{\pi}_{e}(s, a),$$

where $Q^{\pi}_{r}(s, a)$ expresses the expected discounted reward for executing the current action, $Q^{\pi}_{c}(s, a)$ for completing the rest of a subroutine, and $Q^{\pi}_{e}(s, a)$ for all the reward external to this subroutine. Applied on the lower level of our task interleaving hierarchy, it changes the Bellman equation of task type subroutines as follows:

$$\begin{aligned}
Q^{\pi}_{\mathrm{type},r}(s, a) &= \sum_{s'} P_{\mathrm{type}}(s'|s, a)\, R_{\mathrm{type}}(s, a) \\
Q^{\pi}_{\mathrm{type},c}(s, a) &= \sum_{SS(s', t)} F_{\mathrm{type}}(t|s, a)\, P_{\mathrm{type}}(s'|s, a)\, \gamma_{\mathrm{type}}^{t}\, Q^{\pi}_{\mathrm{type}}(s', \pi_{\mathrm{type}}(s')) \\
Q^{\pi}_{\mathrm{type},e}(s, a) &= \sum_{EX(s', t)} F_{\mathrm{type}}(t|s, a)\, P_{\mathrm{type}}(s'|s, a)\, \gamma_{\mathrm{type}}^{t}\, Q^{\pi}_{\mathrm{root}}(p(s'), \pi_{\mathrm{root}}(p(s')))
\end{aligned} \tag{3}$$

where $s$, $a$, $s'$, $P_{\mathrm{type}}$, $F_{\mathrm{type}}$, $\pi_{\mathrm{type}}$, and $\gamma_{\mathrm{type}}$ are the respective functions or parameters of a task type-level semi-Markov decision process (SMDP). $\pi_{\mathrm{root}}$ is the optimal policy on root level, $p(s)$ maps from a state $s$ to the corresponding state in its parent's SMDP, and $Q^{\pi}_{\mathrm{root}}$ is the Bellman equation on the root level of our HRL model. $SS(s', t)$ and $EX(s', t)$ are functions that return the subsets of next states $s'$ and transition times $t$ that are, respectively, states and exit states defined by the environment of the subroutine. The three Q-functions in Eq. 3 specify the respective parts of the three-part value function decomposition of Andre and Russell (2002). The full Bellman equation of a task type subroutine is then defined as $Q^{\pi}_{\mathrm{type}}(s, a) = Q^{\pi}_{\mathrm{type},r}(s, a) + Q^{\pi}_{\mathrm{type},c}(s, a) + Q^{\pi}_{\mathrm{type},e}(s, a)$.
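The effect of the external term can be illustrated with a small planning sketch in Python. It makes strong simplifying assumptions that are not part of the model: transitions are deterministic along a chain of states, every action takes the same duration `t`, and the root-level value after exiting is a known constant `v_external`. It is a backward-induction shortcut for illustration, not the learning algorithm used in this paper, and all names and numbers are hypothetical.

```python
def solve_task_type(r_T, c_T, v_external, gamma=0.9, t=1):
    """Backward induction over a deterministic chain of task states.

    v_external stands in for the root-level value obtained after exiting
    the subroutine. Returns the lists (q_continue, q_leave).
    """
    n = len(r_T)
    q_continue = [0.0] * n
    q_leave = [0.0] * n
    v_next = v_external  # continuing from the last state also exits to root
    for s in reversed(range(n)):
        # Leave: immediate part -c_T(s) plus the discounted external part.
        q_leave[s] = -c_T[s] + gamma ** t * v_external
        # Continue: immediate part r_T(s) plus the discounted value of the
        # successor (completion part, or external part at the last state).
        q_continue[s] = r_T[s] + gamma ** t * v_next
        v_next = max(q_continue[s], q_leave[s])
    return q_continue, q_leave


r_T = [0.0, 0.0, 10.0]  # hypothetical: reward only at the end
c_T = [5.0, 0.0, 5.0]   # hypothetical: state 1 is a cheap break point

qc_low, ql_low = solve_task_type(r_T, c_T, v_external=0.0, gamma=0.5)
qc_high, ql_high = solve_task_type(r_T, c_T, v_external=40.0, gamma=0.5)
```

With no attractive alternative (`v_external=0.0`), continuing is preferred at the break point, whereas a sufficiently valuable alternative (`v_external=40.0`) makes leaving preferable there. A recursively optimal policy, which ignores the external term, could not capture this dependence.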
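On root level, the decomposed Bellman equation is specified as

$$\begin{aligned}
Q^{\pi}_{\mathrm{root},r}(s, a) &= \sum_{s'} P_{\mathrm{root}}(s'|s, a)\left[R_{\mathrm{root}}(s) + Q^{\pi}_{\mathrm{type}}(z(s), \pi_{\mathrm{type}}(z(s)))\right] \\
Q^{\pi}_{\mathrm{root},c}(s, a) &= \sum_{s', t} F_{\mathrm{root}}(t|s, a)\, P_{\mathrm{root}}(s'|s, a)\, \gamma_{\mathrm{root}}^{t}\, Q^{\pi}_{\mathrm{root}}(s', \pi_{\mathrm{root}}(s'))
\end{aligned}$$

where $s$, $a$, $P_{\mathrm{root}}$, $F_{\mathrm{root}}$, $\pi_{\mathrm{root}}$, and $\gamma_{\mathrm{root}}$ are the respective functions or parameters of the root-level SMDP. $z(s)$ is the mapping function from root level state to the state of its child's SMDP. Again, $Q^{\pi}_{\mathrm{root},r}$ and $Q^{\pi}_{\mathrm{root},c}$ are the respective parts of the three-part value function decomposition. Note that there is no Q-function to specify the expected external reward, as root is not called by another routine. Following Andre and Russell (2002), $Q^{\pi}_{\mathrm{root},r}(s, a)$ is rewarded according to the expected reward values of its subroutine, $Q^{\pi}_{\mathrm{type}}(z(s), \pi_{\mathrm{type}}(z(s)))$. In addition, to model reluctance to continue tasks that require excess effort to recall relevant knowledge, it is penalized according to $R_{\mathrm{root}}(s)$. The full Bellman equation of the root routine is defined as $Q^{\pi}_{\mathrm{root}}(s, a) = Q^{\pi}_{\mathrm{root},r}(s, a) + Q^{\pi}_{\mathrm{root},c}(s, a)$.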

Decision Processes
We design the decision processes of our HRL agent to model human supervisory control in task interleaving (Norman and Shallice 1986;Wickens and McCarley 2008). Our hypothesis is that humans do not learn expected resumption costs for task type to task type transitions. Instead, they learn the resumption costs of each task type separately and compute the expected costs of a switch by adding the respective terms of the two tasks. In part, this behavior is modeled through the hierarchical decomposition of the task environment, allowing us to learn the cost expectations of leaving the current task and continuing the next task on separate levels. However, it is also necessary to model the SMDPs of the HRL agent accordingly. Figure 4 shows the transition graph of our model. We define a single supervisory state $S$ for the higher level decision process. In this state, our agent chooses among the available tasks by selecting the respective action $t_i$. This calls the lower level SMDP, where the agent can decide to continue a task ($c_i$) and proceed to its next state $s_i$, or leave it ($l_i$) and proceed to the exit state $e$. Once the exit state is reached, control is handed to the root level and the agent is again situated in $S$. To avoid reward interactions between state-action pairs on the higher level, we set $\gamma_{\mathrm{root}}$ to zero. While the higher level resembles a multi-armed bandit, HRL allows modeling task interleaving in a coherent and cognitively plausible model.

Fig. 4
Transition graph of the two SMDPs of our HRL model. On root level, $S$ is the supervisory control state and $t_i$ are actions representing available tasks. Once a task is selected, its task type-level SMDP is called. $s_i$ are discrete states representing progress, $c_i$ is the continue action, $l_i$ is the leave action, and $e$ the exit state handing control to root level
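The root-level choice with $\gamma_{\mathrm{root}} = 0$ can be sketched as a one-step comparison across task instances. The Python example below assumes that learned task type values are available as per-state dictionaries and that each instance stores only its type and in-task state; the data layout, names, and values are hypothetical.

```python
def select_task(instances, q_type, cost):
    """Pick the task instance with the highest root-level value.

    instances: name -> (task_type, in-task state)
    q_type:    task_type -> list of {action: value} dicts, one per state
    cost:      task_type -> list of resumption costs c_T(s)
    With gamma_root = 0 the root value has no bootstrapped future term.
    """
    def root_value(name):
        task_type, s = instances[name]
        return -cost[task_type][s] + max(q_type[task_type][s].values())

    return max(instances, key=root_value)


# Hypothetical learned values; both browsing instances share one table.
q_type = {
    "writing": [{"continue": 4.0, "leave": -3.0}, {"continue": 8.0, "leave": 0.0}],
    "browsing": [{"continue": 2.0, "leave": 0.0}, {"continue": 0.5, "leave": 0.0}],
}
cost = {"writing": [3.0, 0.0], "browsing": [0.5, 0.5]}
instances = {
    "writing A": ("writing", 0),
    "browsing B": ("browsing", 0),
    "browsing C": ("browsing", 1),
}
choice = select_task(instances, q_type, cost)
```

Because values are stored per task type, a task instance that has never been attended can still be evaluated, as long as its type is familiar.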

Modeling State Transition Times
In our HRL model, we assume that primitive actions of the lower, task type level of the hierarchy follow an SMDP rather than an MDP. This models human behavior, as humans do not make decisions at a fixed sampling rate, but rather decide at certain decision points whether to continue or leave the attended task. To model non-continuous decision rates, an SMDP accounts for actions with varying temporal length by retrieving state transition times from a probability function $F(t|s, a)$ (see Eq. 1). Transition times are used to discount the reward of actions relative to their temporal length.
To be able to solve the task type-level SMDP with model-free RL, our environment also needs to account for actions with varying temporal length. This is done by sampling a transition time $t$ uniformly at random for each taken action from an unconditioned probability distribution $E_{T}^{P}(t)$, defined for each task type $T$ and participant $P$. These distributions are computed per participant by saving the transition times of all logged state transitions of a task type in all trials (excluding the test trial). For this purpose, we log participants' actions every 100 ms. This rate is high enough to ensure that their task switches are not missed and, hence, that the correct transition times are used in the model (cf. the shortest time (700 ms) on high workload tasks (Raby and Wickens 1994)).
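A minimal Python sketch of this sampling step follows. It assumes the logged transition times of one participant and task type are kept in a list, expressed in units of the 100 ms logging interval; the function names, the example log, and the choice of time unit are assumptions for illustration.

```python
import random


def sample_transition_time(logged_times, rng):
    """Draw one transition time uniformly from the logged times."""
    return rng.choice(logged_times)


def discounted_target(reward, next_value, gamma, t):
    """Reward plus future value discounted over t time units."""
    return reward + gamma ** t * next_value


# Hypothetical log for one participant and task type, in units of 100 ms.
logged_times = [7, 9, 12, 30]
rng = random.Random(0)
t = sample_transition_time(logged_times, rng)
target = discounted_target(reward=1.0, next_value=5.0, gamma=0.99, t=t)
```

Drawing uniformly from the stored samples reproduces the empirical distribution, so frequently observed durations are sampled more often without fitting a parametric form.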

Simulations
We report simulation results showing how the model adapts to changing cost/reward structures. To this end, the two-task interleaving problem of Fig. 1 is considered. The writing task $T_w$ awards a high reward when completed. Switching away is costly, except upon completing a chapter. The browsing task $T_b$, by contrast, offers a constant small reward and switch costs are low. In the simulations, we trained the agent for 250 episodes, which was sufficient for saturation of expected reward. The HRL agent was trained using the discounted reward HO-MAXQ algorithm (Andre and Russell 2002) (see Appendix 1 for details). In the simulations, the HRL agent was forced to start with the writing task.
Cost and Task Boundaries: In Fig. 5c, the agent only switches to browsing after reaching a subtask boundary in writing, accurately modeling sensitivity to costs of resumption (Altmann and Trafton 2002;Gutzwiller et al. 2019;Iqbal and Bailey 2008).

Fig. 5
Interleaving sequences (a-d) generated by our hierarchical reinforcement learner on the task interleaving problem specified in Fig. 1 for different values of the discount factor $\gamma_{\mathrm{type}}$. Discount factors specify the length of the RL reward horizon

Reward Structure: The HRL agent is sensitive to rewards (Horrey and Wickens 2006;Iani and Wickens 2007;Norman and Shallice 1986;Wickens and McCarley 2008), as shown by comparison of interleaving trajectories produced with different values of $\gamma_{\mathrm{type}}$ in Fig. 5. For example, when $\gamma_{\mathrm{type}} = 0$, only immediate rewards are considered in RL, and the agent immediately switches to browsing.

Level of Supervisory Control:
The discount factor $\gamma_{\mathrm{type}}$ approximates the level of executive control of individuals. Figure 5d illustrates the effect of high executive control: writing is performed uninterruptedly while inhibiting switches to tasks with higher immediate but lower long-term gains.

Comparison with Human Data
Novel experimental data were collected to assess (i) how well the model generalizes to an unseen task environment and (ii) whether it can account for individual differences. The study was conducted on an online crowd-sourcing platform. Participants' data were only used if they completed a minimum of 6 trials, switched tasks within trials, and did not exceed or fall below reasonable thresholds in trial times and attained rewards. Participants practiced each task type separately prior to entering interleaving trials. Six task instances were made available on a browser view. The reward structure of each task was explained, and users had to decide how to maximize points within a limited total time. Again, the agent was trained for 250 episodes using the discounted reward HO-MAXQ algorithm (Andre and Russell 2002).

Experimental Environment
The trials of the experiment were conducted on a web page presenting a task interleaving problem (see Fig. 6). Each interleaving trial consisted of six task instances of four different task types. The four task types were math, visual matching, reading, and typing. Each task type consisted of different subtasks. All task instances were shown as buttons in a menu on the left side of the UI. Task instances were color coded according to their respective task type. The attended task was shown in a panel to the right of the task instances menu. Participants were informed about the score they attained in the current trial with a label on the top right. The label also showed the attained reward of the last subtask in brackets. For all task types, participants were allowed to leave a task at any point and were able to continue it at the position where they had left it earlier. However, it was not possible to revisit previously completed subtasks. Tasks for which all subtasks were completed could no longer be selected. No new tasks were introduced into the experimental environment after a trial started.

Tasks and Task Models
In this section, we explain the tasks of our experiment and how we designed the respective task models. In-task rewards were designed to be realistic and clear. Participants were told about the reward structures of tasks and how points translate into monetary reward (shown in table). The explanation of reward structures was kept simple for all task types (e.g., "you receive 1 point for each correctly solved equation/answered question."). Feedback on attained rewards was provided (see score label in Fig. 6). A mapping was created between what is shown on the display and task state $s$. Figure 7a illustrates this for the reading task, where text paragraphs are mapped to the state of the reading model. Task models were used in the RL environment of the HRL agent. All tasks could be left in any state.

Reading:
Reading tasks featured a text box on top, displaying the text of an avalanche bulletin, and two multiple-choice questions to measure text comprehension displayed in a panel below (see Fig. 6). The progress of participants within a reading task was tracked with the text's scroll bar. After finish reading a paragraph, participants had to click the "Next paragraph" button to advance. The button was only enabled when the scroll bar of the current paragraph reached its end. Per correctly answered question participants attained ten points of reward.
Reading Model: An example of a reading task model is presented in Fig. 7b. Each state represents several lines of text of the avalanche bulletin. The bumps in the cost function $c_{T_r}(s)$ match the ends of paragraphs. The two states of the reward function $r_{T_r}(s)$ that provide ten points of reward match the respective lines of the avalanche bulletin which provide an answer to one of the multiple-choice questions.

Fig. 7 Task models of the four tasks used in the experiment: a Example of how task state is assigned to visible state on display: passages of text in the reading task are assigned to the discrete states of its task model (column of numbers) over which reward (green) and cost function (red) are specified. The row highlighted yellow provides the answer to a comprehension query at the end. Exemplary task models for b reading, c visual matching, d math, and e typing tasks
Visual Matching: Visual matching tasks featured a scrollable list of images (see Fig. 8). From these images participants had to identify those that display airplanes. This was done by clicking on the respective image. Per correct click, participants attained one point of reward. A visual matching task contained six of these lists and participants could proceed to the next one by clicking the "Next subtask" button. Again, this button was only enabled when the scroll bar reached its end. Progress was tracked using the scroll bar.
Visual Matching Model: An example of a visual matching task model is presented in Fig. 7c. Each state represents several images. The bumps in the cost function $c_{T_v}(s)$ mark the end of an image list. The number of points returned by the reward function $r_{T_v}(s)$ for a specific state $s$ depends on the number of images in that state that are airplanes (1 point per plane).

Math:
In math tasks, equations were displayed in a scrollable list (see Fig. 9). Only one number or operator was shown at a time, and the next one was revealed only when scrolling down. Participants received one point of reward for each correctly solved equation. A math task contained six equations. Participants could proceed by clicking the "Next equation" button. The button was only enabled when the scroll bar of the current equation reached its end. Progress was logged via the position of the scroll bar.
Math Model: An example of a math task model is presented in Fig. 7d. Each state represents several numbers or operators. The states at which the cost function $c_{T_m}(s)$ returns zero represent the end of one of the six equations of a math task. Between these ends, the penalty returned by $c_{T_m}(s)$ increases linearly with the number of operators and numbers in the equation. The reward function $r_{T_m}(s)$ returns one point of reward in the last state of each equation.
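The structure of the math task model can be illustrated with a minimal sketch. The function name, the per-equation state counts, and the slope of the cost ramp are hypothetical choices made for illustration only; the sketch merely encodes the two properties stated above (a cost that rises linearly within an equation and is zero at its end, and one point of reward in the last state of each equation).

```python
from typing import List, Tuple


def build_math_task_model(states_per_equation: List[int],
                          cost_slope: float = 0.1) -> Tuple[List[float], List[float]]:
    """Return (rewards, costs) over the discrete states of one math task.

    states_per_equation and cost_slope are hypothetical inputs; the values
    used in the study are not given in the text.
    """
    rewards: List[float] = []
    costs: List[float] = []
    for n_states in states_per_equation:
        for k in range(1, n_states + 1):
            is_last = (k == n_states)
            # One point of reward in the last state of each equation.
            rewards.append(1.0 if is_last else 0.0)
            # Cost grows linearly inside an equation and is zero at its end.
            costs.append(0.0 if is_last else cost_slope * k)
    return rewards, costs


# Two short equations with three and two states, respectively.
rewards, costs = build_math_task_model([3, 2])
```

In this sketch, leaving in the middle of an equation is penalized more the further the participant has progressed, while leaving right after completing an equation is free.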
Typing: Typing tasks featured a sentence to copy at the top and a text box to type in below (see Fig. 10). Using HTML functionality, we prevented participants from copy-pasting the sentence into the text box. Participants received one point of reward for each correctly copied sentence. In a typing task, participants had to copy six sentences. Progress was tracked via the edit distance (Levenshtein 1966) between the text written by participants and the sentence to copy. They could proceed by clicking the "Next sentence" button, which was enabled when the edit distance of the current sentence was zero.
Typing Model: An example of a typing task model is presented in Fig. 7e. Each state represents a discrete fraction of the maximal edit distance (Levenshtein 1966) of a sentence to copy (capped at 1.0). The bumps in the cost function $c_{T_t}(s)$ match the ends of sentences. The reward function $r_{T_t}(s)$ provides a reward in the last state of each sentence.
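How typed text could be mapped to a discrete typing state can be sketched as follows. The Levenshtein distance is the standard dynamic-programming formulation; the number of bins, the use of the sentence length as the maximal edit distance, and the direction of the mapping (state 0 for an empty text box, the final state only for an exact copy) are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def typing_state(typed: str, target: str, n_bins: int = 10) -> int:
    """Map typing progress to a state index in 0..n_bins (hypothetical binning).

    The edit distance is capped at the maximal distance (empty input versus
    the target sentence), so the distance fraction never exceeds 1.0.
    """
    max_dist = max(len(target), 1)
    dist = min(levenshtein(typed, target), max_dist)
    # Integer arithmetic avoids floating-point rounding at bin boundaries.
    return ((max_dist - dist) * n_bins) // max_dist


state = typing_state("hel", "hello")  # edit distance 2 of 5
```

With this binning, the final state `n_bins` is reached only when the edit distance is zero, which matches the condition under which the "Next sentence" button was enabled.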

Procedure
After instructions, informed consent, and task type specific practice, the participants were asked to solve a minimum of two task interleaving trials but were allowed to solve up to five trials to attain more reward. Every trial contained six task instances, each sampled from a distribution of its general type. Trial durations were sampled from a random distribution unknown to the participant. The distribution was constrained to lie between 4 and 5 min. This limit was chosen empirically to ensure that participants could not complete all task instances of a trial and were forced to interleave them to maximize reward. The stated goal was to maximize total points linked to monetary rewards. No task instance was presented more than once to a participant. The average task completion time was 39 min. The average number of completed task interleaving trials was 3.

Fig. 10 Typing task with text to copy (top) and text box to type in (bottom)
Participants
A total of 218 participants completed the study. Ten were recruited from our institutions, and the rest from Amazon Mechanical Turk. Monetary fees were designed to meet and surpass the US minimum wage requirements. A fee of 5 USD was awarded to all participants who completed the trial, and an additional 3 USD was awarded as a linear function of points attained in the interleaving trials. We excluded 7 participants who did not exhibit any task interleaving behavior or whose attained rewards or trial times fell above or below the respective thresholds.

Model Fitting
Empirical Parameters: Given the same set of tasks, humans choose different interleaving strategies to accomplish them. This can be attributed to personal characteristics like varying levels of executive control or a different perception of the resumption costs of a particular task (Janssen and Brumby 2015). In our method, we model individual differences with a set of personal parameters. More specifically, we introduce parameters that can scale the cost function of each task type and a parameter to model a constant cost that is paid for every switch. In this way, each cost function can be adjusted to model the perceived costs of an individual person. The personal cost function $c^P_T$ of a task $T$ is defined as $c^P_T(s) = c^P + s^P_T \, c_T(s)$, where $0.0 < c^P < 0.3$ is a constant switch cost paid for each switch and $0.0 < s^P_T < 1.0$ is a scaler of the task type's general cost function $c_T(s)$. In addition, we also fit $\gamma_{type}$, the discount factor of the task type hierarchy of our model, to data ($0.0 < \gamma_{type} < 1.0$). $\gamma_{type}$ is used to model various degrees of executive control.
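As a small worked illustration of the personal cost function, the following sketch evaluates the constant switch cost plus the scaled task cost for one state. The function and argument names are hypothetical; only the formula and the parameter ranges are taken from the text.

```python
def personal_cost(task_cost: float, c_p: float, s_p_t: float) -> float:
    """Personal cost: constant switch cost c_p plus scaled task cost.

    task_cost is the task type's general cost c_T(s) at the state of interest.
    The open intervals mirror the parameter ranges stated in the text.
    """
    if not 0.0 < c_p < 0.3:
        raise ValueError("c_p must lie in (0.0, 0.3)")
    if not 0.0 < s_p_t < 1.0:
        raise ValueError("s_p_t must lie in (0.0, 1.0)")
    return c_p + s_p_t * task_cost


# A participant with switch cost 0.1 who perceives half of the general cost.
perceived = personal_cost(task_cost=0.5, c_p=0.1, s_p_t=0.5)
```

For a general cost of 0.5, this participant would perceive a cost of 0.1 + 0.5 · 0.5 = 0.35 for leaving in that state.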

Inverse Modeling Method:
To fit these parameters to an individual's data, we used approximate Bayesian computation (ABC) (Kangasrääsiö et al. 2017; Lintusaari et al. 2018). ABC is a sample-efficient and robust likelihood-free method for fitting simulator models to data. It yields a posterior distribution over parameter values given the data. The to-be-minimized discrepancy function (Eq. 7) is an aggregate index of interleaving similarity, in which $S_s$ is the set of states in which participants switched tasks, $A_s$ is the set of chosen actions (tasks), $S_l$ are the states in which participants left a task, and $A_l$ is the set of leave actions. $N_s$ and $N_l$ are the numbers of task switches and leave actions, respectively. $\pi_{root}$ is the root-level policy of the HRL agent, and $\pi_{type}$ is its type-level policy. Note that this accuracy metric collapses the next-task and leaving-a-task accuracies reported in the paper.
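The kind of quantity such a discrepancy is built from can be sketched as a normalized count of reproduced actions. This is an illustrative assumption and not the paper's exact discrepancy function: any weighting is omitted, and policies are represented as plain callables from a state to an action, which is a hypothetical interface.

```python
from typing import Callable, Hashable, Sequence

Policy = Callable[[Hashable], Hashable]


def reproduced_fraction(pi_root: Policy, pi_type: Policy,
                        switch_states: Sequence[Hashable],
                        switch_actions: Sequence[Hashable],
                        leave_states: Sequence[Hashable],
                        leave_actions: Sequence[Hashable]) -> float:
    """Fraction of logged participant actions that the two policies reproduce."""
    n_s, n_l = len(switch_states), len(leave_states)
    if n_s + n_l == 0:
        raise ValueError("no logged actions to compare against")
    # Root level: did the agent pick the same next task as the participant?
    hits_root = sum(pi_root(s) == a for s, a in zip(switch_states, switch_actions))
    # Type level: did the agent make the same leave decision?
    hits_type = sum(pi_type(s) == a for s, a in zip(leave_states, leave_actions))
    return (hits_root + hits_type) / (n_s + n_l)


def discrepancy(*args) -> float:
    """Hypothetical to-be-minimized form: one minus the reproduced fraction."""
    return 1.0 - reproduced_fraction(*args)


# Toy policies backed by lookup tables.
root = {"s0": "math", "s1": "typing"}.get
leave = {"m3": "leave", "m4": "continue"}.get
d = discrepancy(root, leave, ["s0", "s1"], ["math", "reading"],
                ["m3", "m4"], ["leave", "leave"])
```

In the toy example, one of two next-task choices and one of two leave decisions are reproduced, so half of the logged actions match.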
Fitting Procedure: We held out the last trial of a participant for testing and used the preceding interleaving trials for parameter fitting. We ran the above fitting method on these data for 60 iterations. In each iteration, we trained the HRL agent ten times using the same set of parameters in a task interleaving environment matching that of the participant in question. For the Gaussian process proxy model in ABC, we used a Matérn kernel parameterized for twice-differentiable functions. On a commodity desktop machine (Intel Core i7 4 GHz CPU), learning a policy took on average 10.3 s (SD 4.0), and fitting to the full data of a participant took 103.8 min (SD 28.2). The reported results come from the policy with the lowest discrepancy to data obtained in 15 repetitions of this procedure with different weights (best: w = 100).

Baseline Models
To analyze the capability of our HRL model in terms of reproducing human task interleaving, we compared it against several variants of two other models: a flat RL agent and an omniscient-myopic agent. In total, our experiment had ten baseline models. For the flat RL agent, $s$, $a$, $s'$, $t$, $P_{flat}$, $F_{flat}$, $\pi_{flat}$, and $\gamma_{flat}$ denote the respective functions or parameters.

$SA(s', t)$ is a function that returns the subset of available next states $s'$ and transition times $t$ of other tasks in the environment. $R_{flat}(s, a, s')$ is the reward function of the flat RL agent. In its definition, task is a function that returns the task of the respective state, $c_T(s)$ represents interruption and future resumption costs of state $s$, and $c_T(s')$ the resumption costs of the next state $s'$ (see "Reward Functions" for more details). Using Q-learning, we trained the flat RL agent for 250 episodes, which was sufficient for the expected reward to saturate. For parameter fitting of the RL model, we define a dedicated discrepancy function (Eq. 10).

5. RL-Up is the upper bound of our RL model. It is trained like HRL-Up.
6. RL-Myopic is the myopic version of the RL model (with $\gamma_{flat} = 0$).
7. Om.-Myopic is an omniscient-myopic policy that chooses the task $T$ that provides the highest reward in its next state $s'_T$, where $s'_T$ is the next state of task $T$, $s$ is the current state of the ongoing task, and $c_T$ is the respective task's cost function. To compare against a strong model, it decides based on the true rewards and costs of the next states. By contrast, HRL decides based on learned estimates.

Myopic models only consider the reward attainable in the next state in their task switching decisions and, hence, tend to switch to tasks with immediate gratification. Intuitively, these models would switch to the browsing task in the example of Fig. 1 as soon as there is no higher immediate reward available in writing (states 0-2 and 5-7). All omniscient models possess a myopic reward horizon. However, rather than deciding on estimated expected rewards, they know the actual reward (and/or cost) in the next state and decide based on it. RL-Up and HRL-Up can be considered omniscient models as they are trained on the task environment of the test trial. In contrast to the myopic models, they consider time-discounted future rewards when deciding which task to attend to. Considering the example of Fig. 1, they could exhibit behavior where switches to the browsing task are inhibited to attain the large delayed reward of writing (states 8-9). In general, RL and HRL models differ in that HRL makes two decisions to switch between tasks (decision 1: leave the current task; decision 2: task to attend to next), while in RL a task switch is a single decision (see "Hierarchical Decomposition of Task Environments").
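The decision rule of an omniscient-myopic agent can be sketched as a one-step lookahead over the true next-state values. The scoring below is an assumption made for illustration: staying in the ongoing task is scored by its next reward, while switching is scored by the other task's next reward minus the cost of leaving the ongoing task and the cost of resuming the other one. The function name and dictionary-based interface are hypothetical.

```python
from typing import Dict


def omniscient_myopic_choice(next_reward: Dict[str, float],
                             resume_cost: Dict[str, float],
                             current: str,
                             leave_cost: float) -> str:
    """Pick the task with the highest true one-step net reward.

    next_reward[T]: true reward of task T's next state.
    resume_cost[T]: true cost of resuming task T at its next state.
    leave_cost:     true cost of leaving the ongoing task in its current state.
    Ties are resolved in favor of the task listed first in next_reward.
    """
    def score(task: str) -> float:
        if task == current:
            return next_reward[task]  # staying incurs no switch cost
        return next_reward[task] - leave_cost - resume_cost[task]

    return max(next_reward, key=score)


# Writing currently offers no immediate reward, browsing offers a small one.
choice = omniscient_myopic_choice(
    next_reward={"writing": 0.0, "browsing": 1.0},
    resume_cost={"writing": 0.0, "browsing": 0.2},
    current="writing",
    leave_cost=0.3,
)
```

Because the rule looks only one state ahead, a large delayed reward in the ongoing task has no influence on the choice, which is the behavior the discounted RL and HRL models are meant to overcome.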
All HRL and flat RL models were fitted to the data of individual participants using the model fitting procedure described in "Model Fitting." We did not compare against marginal rate of return (Duggan et al. 2013) or information foraging models (Payne et al. 2007), as in-task states can have zero reward. Both models would switch tasks in this case, rendering them weaker baselines than Om.-Myopic. The multi-criteria model of Wickens et al. (2015) does not adapt to received task rewards and offers no implementation. Models of concurrent multitasking (e.g., Oberauer and Lewandowsky 2011; Salvucci and Taatgen 2008) are not designed for sequential task interleaving.

Results
Predictions of HRL were made for the held-out trial and compared with human data. Analyzing base rates for continuing versus leaving a task in the behavioral sample revealed that task continuation dominates events (base rate = 0.95). For this reason, we analyze the capability of models to predict whether participants leave or continue a task separately. As normality assumptions are violated, we use the Kruskal-Wallis test for significance testing throughout. Pairwise comparisons are conducted using Tukey's post hoc test.

Empirical Data: Before comparing the performance of the various models, we inspected participant data. The most popular task type was visual matching, which was selected by 95% of participants in one of their interleaving trials (see Fig. 11). It was followed by math (78.5%), typing (70.0%), and reading tasks (65.5%). The unconditioned probability distribution $E_T(t)$ of logged state transition times per task type over all participants shows that these differ between the task types of our study (see Fig. 12). Participants seem to be faster in transitioning between states in reading and visual matching tasks compared with math and typing tasks. We use $E_T(t)$ to approximate $F(t|s, a)$ in Eq. 1 when training our HRL agent (see "Modeling State Transition Times").

Fig. 11 Fraction of participants that selected task instances of a particular type in an interleaving trial
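Using the empirical distribution of logged transition times as a stand-in for the transition-time distribution can be sketched as resampling from the logs of the respective task type. The class name, the data layout, and the example values are hypothetical; the sketch only illustrates that the sampled time depends on the task type and not on the specific state or action.

```python
import random
from typing import Dict, List


class EmpiricalTransitionTimes:
    """Resamples logged state transition times per task type."""

    def __init__(self, logged: Dict[str, List[float]]):
        for task_type, times in logged.items():
            if not times:
                raise ValueError(f"no logged times for task type {task_type!r}")
        self._logged = {k: list(v) for k, v in logged.items()}

    def sample(self, task_type: str, rng: random.Random) -> float:
        # Unconditioned on state and action: only the task type matters.
        return rng.choice(self._logged[task_type])


# Made-up log values in seconds, for illustration only.
times = EmpiricalTransitionTimes({"reading": [1.2, 0.9, 1.5], "math": [3.4, 2.8]})
t = times.sample("math", random.Random(0))
```

During training, an environment step in a math task would then advance the trial clock by a time drawn from the math logs.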
Reward: Participants attained a mean reward of 33.18 (SD 11.92) in our study (see Fig. 13). [...]

Fig. 13 Means and 95% confidence intervals for a attained rewards (significance notation with respect to participants), b accuracy in predicting next task, c accuracy in predicting leaving of a task, d accuracy in predicting continuing of a task, and e error in predicting order of tasks (lower is better). For b-e, significance notation is with respect to HRL

[...] 33.1), HRL (M 26.22, SD 30.14), and HRL-Up (M 28.85, SD 35.32). However, these differences were not statistically significant. Random was the worst (M 182.47, SD 124). All models had a significantly smaller error than Random (p < 0.001 for all).

State Visitations:
We computed histograms of state visitation frequencies per task type (see Fig. 14). As visual inspection confirms, HRL-Up (0.95) and HRL (0.93) had a higher histogram intersection with participants than the other models. They were followed by RL-Up [...] and Random (0.81). The step-like patterns in the histograms of participants were reproduced by HRL and RL models, illustrating that their policies switched at the same subtask boundaries as participants (e.g., see top row in Fig. 14). However, the histograms of HRL models show a higher overlap with participants' histograms than those of RL models.
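The histogram intersection used to compare state visitations can be illustrated with the common definition for normalized histograms, the sum of bin-wise minima. Normalizing each histogram to sum to one is an assumption here, as the text does not state how the histograms were scaled.

```python
from typing import Sequence


def histogram_intersection(p: Sequence[float], q: Sequence[float]) -> float:
    """Intersection of two histograms after normalizing each to sum to 1.

    Returns a value in [0, 1]; 1 means identical visitation distributions.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must contain at least one visit")
    return sum(min(a / total_p, b / total_q) for a, b in zip(p, q))


# Visit counts over four task states for a participant and a model.
participant = [4, 4, 1, 1]
model = [3, 5, 2, 0]
overlap = histogram_intersection(participant, model)
```

In the example, the normalized bin-wise minima are 0.3, 0.4, 0.1, and 0.0, giving an intersection of 0.8.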

Comparison with Flat RL
To better understand the implications of hierarchicality in the context of task interleaving, we further compared our HRL model with the flat RL implementation. To this end, we learned 100 policies for a ten-task, six-instance problem and the same simulated user, using default values for the cost scalers ($c^P$ and $s^P_T$) and $\gamma_{type}$. Figure 15 shows the learning curves of the two methods. HRL converged faster than flat RL, which is in line with prior work (Dietterich 1998; Andre and Russell 2002; Ghavamzadeh and Mahadevan 2002). This is due to a significant decrease in the number of states (43-fold for this example). It is important to note that the optimal policies of flat RL and HRL for a given problem are the same. This experiment exemplified this, as both perform similarly in terms of attained reward after convergence.

Fig. 14 State visitations: HRL shows better match with state visitation patterns than Myopic and Random. y-axis shows fraction of states visited aggregated over all trials

Fig. 15 Learning curves of the flat RL and our HRL agent. Solid line denotes mean reward (y-axis) per episode (x-axis). Shaded area represents standard deviation

Table 1 reports the mean fraction of reproduced actions per participant for each iteration of our model fitting procedure. Fractions are computed using the normalized sum of reproduced actions of Eq. 7. Results on training trials improve with each iteration of the procedure and show that learned parameters generalize to the held-out test trials.

Model Inspection
To further inspect the performance of our HRL agent, we compared interleaving sequences of individual participants with those reproduced by the agent for the particular participant. Figure 16 shows the interleaving sequences produced by the HRL agent which attained (b) the lowest, (c) the closest to average, and (d) the highest error in task order compared with the sequences of the respective participant (according to Eq. A2.1). The interleaving sequence with the lowest error reproduces the order of task types of the participant almost exactly. In contrast, in the sequences with the closest-to-average and the highest error, the task order of participant and agent is interchanged. However, both of these participants exhibit a lot of task switches and conduct particular tasks without attaining points for them. Figure 17 shows the state-action value functions ($Q(s, a)$, see Eq. 1) for the different levels and task types of our HRL agent trained using the optimal parameters of one participant of our study. On the task type level, the pattern of the state value of the action Continue matches the reward function of the respective task type (see Fig. 7). The same holds for the action Leave and the cost function of a task type. The state-action value functions on the root level of our HRL agent (Task NN) show the expected reward for entering a task type instance at a particular state. These values have converged to approximate the sum of expected rewards of Continue and Leave actions of the task type level.

Discussion
The results of our study provide evidence for hierarchical reinforcement learning (HRL) as a mechanism in task interleaving. Considering all metrics, it reproduced human task interleaving significantly better than all other baselines. In particular, state visitation histograms show that HRL exhibits more human-like behavior in terms of leaving at certain subtask boundaries and avoiding the continuation of tasks with low gratification.
The omniscient-myopic models proved to be strong baselines as they managed to reproduce human task interleaving behavior on most metrics. Interestingly, the model that only considers costs (Om.-Costs) reproduced participant behavior better than the omniscient models that considered rewards and costs (Om.-Myopic) or only rewards (Om.-Rewards). This indicates that humans may prioritize avoiding cognitive costs over gaining rewards when making task switching decisions. Given our study setting, this intuitively makes sense, as costs were certain to be paid while reward was not necessarily gained (e.g., when answering questions incorrectly). This finding is also in line with related work that revealed the human tendency to push task switches to task and subtask boundaries (Altmann and Trafton 2002; Janssen et al. 2012; McFarlane 2002). However, all omniscient-myopic models have the tendency to leave a task when participants still continued with it. In contrast, HRL and RL models reproduce this aspect of participant behavior significantly better. This highlights the necessity for task interleaving models to consider long-term rewards in tasks to model human executive control.
The importance of considering long-term rewards not only within the task but also to choose the correct next task is indicated by comparing the accuracy of reproduced task switches of HRL models that consider discounted future rewards within tasks (HRL, HRL-Up) and HRL-Myopic. The former models choose next tasks that match significantly better with participants' selections than myopic HRL. To understand if our model can generalize task type knowledge to previously not encountered instances, we compare the results of HRL and HRL-Up. HRL-Up was learned and tested on the test trial of our study and, hence, has encountered the instances it is tested on during training. The absence of significant differences between HRL and HRL-Up on all metrics shows that our model can indeed generalize to unobserved task instances.

Fig. 16 Examples of reproduced task interleaving sequences of the HRL agent with respect to a particular participant. a Task models of the reproduced trial. Task interleaving sequences of participant and HRL agent with the b lowest, c closest to average, d highest task order error according to Eq. A2.1
All models performed worse when predicting participants' next task than when predicting whether they continue or leave the current one. This can be partly explained by the fact that the first problem is harder (Random was at 0.24) than the latter two binary predictions (where Random was at 0.51). In addition, we assume that this discrepancy comes from our reward function not capturing all factors that affect the decision to attend a particular task. Research found that the attributes of task difficulty, task priority, task interest, and task salience influence humans' task decisions (Wickens et al. 2015). Our reward function only models priority, defined by the attainable points per task, which determine participants' monetary reward. While task salience and interest can be assumed to be the same among all our task types, their task difficulty differs significantly. This difference affected participants' choices in our experiment. For example, task instances of type reading gave the most reward and still were the least popular, possibly due to their difficulty (see Fig. 11). Not modeling task difficulty in the reward function may have caused the low prediction accuracy in participants' next task decisions. In contrast, it did not affect the prediction of whether to leave or to continue a task, as these decisions are primarily made to avoid switching costs. In future work, it will be interesting to consider the identified task attributes of Wickens et al. (2015) in the reward functions of task types. As factors such as task difficulty and interest differ from human to human, we will use ABC to fit such a reward function to participant data. Such a setting would allow the model to adapt to individual differences in the perception of task reward.

Fig. 17 Continue and Leave show the expected reward per state for the respective action on task type level. Task NN shows the expected reward for entering a particular task type on root level
Furthermore, to model real-world task interleaving problems, it will be interesting to learn entirely unknown reward and cost functions of tasks from human demonstrations (similar to Krishnan et al. 2016).
Comparing the RL models with all other baselines shows that they perform best in terms of attained reward. However, they do not match participant behavior as well as the HRL and the omniscient-myopic models. Results indicate that RL exhibits superhuman performance even after fitting its parameters to participant data. As converged policies of HRL and flat RL should perform the same (see "Comparison with Flat RL"), this difference can only be explained by the interplay of the ABC fitting procedure with the RL or HRL model, respectively. HRL disentangles the current task from the next task in its model (see "Decision Processes"). Thus, ABC can consider the lower and higher level policy separately in the discrepancy function (see Eq. 7) and, hence, identify differences in perceived costs per task type and participant. In contrast, RL learns the expected reward for a task switch over the current and the next task. Thus, it cannot distinguish between costs that are induced by the current task and costs that are induced by the next task in the discrepancy function (see Eq. 10). As a result, ABC cannot successfully identify perceived costs per participant and task type for the flat RL model. Thus, ABC's ability to identify the correct perceived costs of task types depends on the difference in how HRL and flat RL model task switching. Therefore, we conclude that the significantly better match of HRL models with participant behavior compared with RL models might indicate that humans hierarchically abstract task environments when learning decision strategies for task interleaving. However, future research is necessary to confirm this hypothesis.
Participants of our study continued tasks at a much higher rate (95%) than reported in the literature (60%; Wickens et al. 2015). This can be explained by the fact that in previous studies participants were asked whether they wanted to continue or switch to another task. In contrast, we assume that participants decide to stay within a task each time a state transition according to a task model is logged. The large number of states in these models explains the high task continuation rate of our study. In addition, the tasks of our study were more difficult than the tasks analyzed by Wickens et al. (e.g., judgment tasks), which has been shown to increase task continuation (Gutzwiller 2014).
In this paper, we have assumed that instead of learning policies per task instance, people learn expected rewards and costs per task type. This assumption is cognitively plausible, because it drastically reduces the number of parameters, and because it allows generalizing past experience to previously unseen instances. How people recognize tasks as instances of types, and how these expectations are generalized, is an exciting question for future work.

Conclusion
In this paper, we propose a theoretically justified hierarchical reinforcement learning model of human task interleaving. It overcomes the shortcomings of the previous state of the art by being able to adapt to multiple tasks and complex reward/cost structures. In addition, it can consider non-diminishing, delayed rewards for task switching decisions, which are common in many day-to-day activities. The model assumes a two-level supervisory control system and both levels approximate utility based on experience. The higher level keeps track of ongoing tasks and decides which task to conduct next. The lower level decides based on the expected reward within a task whether to continue. The hierarchically optimal decomposition of the RL-task interleaving problem enables our lower level to consider the expected rewards of all other available tasks in its decision.
This model has been shown to be capable of reproducing common phenomena of task interleaving, like sensitivity to costs of resumption (Altmann and Trafton 2002; Gutzwiller et al. 2019; Iqbal and Bailey 2008) and the human tendency to push switches to task boundaries where switch costs are lower (Altmann and Trafton 2002; McFarlane 2002). In addition, the HRL agent proved to be sensitive to in-task rewards (Horrey and Wickens 2006; Iani and Wickens 2007; Norman and Shallice 1986; Wickens and McCarley 2008) even when gratification was delayed. This allows the model to depict varying levels of executive control of individuals. These results corroborate emerging evidence for hierarchical reinforcement learning in supervisory control (Botvinick 2012; Frank and Badre 2011).
Our study has provided new evidence for hierarchical reinforcement learning (HRL) as a model of task interleaving. The resemblance between simulated and empirical data is very encouraging. Comparison against myopic baselines suggests that human interleaving is better described as optimal planning under uncertainty than by a myopic strategy. We have shown that hierarchically optimal value decomposition is a tractable solution to the planning problem that the supervisory control system faces. In particular, it (i) can achieve a high level of control via experience, (ii) adapts to complex and delayed rewards/costs, avoiding being dominated by immediate rewards, and (iii) can generalize task type knowledge to instances not encountered previously. Moreover, only a small number of empirical parameters was needed for characterizing individual differences.
One exciting remaining question is if humans indeed learn task interleaving from experience by hierarchically decomposing the computation of when to attend a new task and which task to attend. While our results provide first indications that this might be the case, further research is necessary to pinpoint the computational mechanism. In particular, it would be interesting to investigate if humans need to observe a particular task type to task type switch to compute a meaningful expectation of gratification or if, due to hierarchical abstraction, humans do not consider the current task when computing expectations of in-task rewards of next tasks.
Another promising direction concerns modeling further cognitive factors that are involved in human task interleaving. We model recall effort as the sole source of switch costs. In future work, it would be interesting to extend cost functions to account for other factors, e.g., cognitive load. Furthermore, we will model task interleaving as a nonstationary (restless) problem, where memory traces decay during processing of other tasks and require more attentional refreshing the longer they were not accessed (Monk et al. 2004; Oberauer and Lewandowsky 2011). Finally, it would be interesting to model increased switching costs between similar tasks caused by interference of recall structures in working memory (Edwards and Gronlund 1998; Kiesel et al. 2010).
Funding Open access funding provided by Swiss Federal Institute of Technology Zurich. This work was funded in parts by the Swiss National Science Foundation (UFO 200021L 153644).
Data Availability Collected data of crowd-sourcing task interleaving study will be published after acceptance.
Code Availability Code of the model will be published after acceptance.

Conflict of Interest
- Andrew Howes, University of Birmingham
- Duncan P. Brumby, University College London
- Christian P. Janssen, Utrecht University
- James Hillis, Facebook Reality Labs
- Bas van Opheusden, Princeton University

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix 1. Learning a hierarchically optimal policy
For model-free training of our HRL agent, we use the discrete-time discounted reward HO-MAXQ algorithm proposed in Andre and Russell (2002). Algorithm 1 shows its pseudo code. $Q^c_t(i, s, a)$ and $Q^e_t(i, s, a)$ specify the respective part of the value function decomposition at time $t$ and hierarchy level $i$. $Q_t(i, s, a)$ is the overall state-action value function at time $t$ and on level $i$. a is the subtask of action a that is taken in state s.

Appendix 2. Task order error
To calculate the task order error, we execute model policies in the task environments of participants' test trials. The task order error is the distance between a model's and a participant's task interleaving sequence (Eq. A2.1). It is defined piecewise for $0 \le i \le |u|$, $0 \le j \le |v|$, with a dedicated case if $v_j$ is not in $u$ and a value of 0 otherwise, where $u$ is the sequence of tasks of the participant and $v$ the sequence of tasks of the respective model. Note that by considering Eq. A2.1 in the discrepancy function of the parameter fitting procedure, the performance of the HRL agent with respect to this metric can be improved. However, this is a trade-off, as with this additional discrepancy measure the agent performs worse in the other reported metrics.