Abstract
To make good decisions in the real world, people need efficient planning strategies because their computational resources are limited. Knowing which planning strategies would work best for people in different situations would be very useful for understanding and improving human decision-making. Our ability to compute those strategies used to be limited to very small and very simple planning tasks. Here, we introduce a cognitively inspired reinforcement learning method that overcomes this limitation by exploiting the hierarchical structure of human behavior. We leverage it to understand and improve human planning in large and complex sequential decision problems. Our method decomposes sequential decision problems into two subproblems: setting a goal and planning how to achieve it. It can discover optimal human planning strategies for larger and more complex tasks than was previously possible. The discovered strategies achieve a better tradeoff between decision quality and computational cost than both human planning and existing planning algorithms. We demonstrate that teaching people to use those strategies significantly increases their level of resource-rationality in tasks that require planning up to eight steps ahead. By contrast, none of the previous approaches was able to improve human performance on these problems. These findings suggest that our cognitively informed approach makes it possible to leverage reinforcement learning to improve human decision-making in complex sequential decision problems. Future work can leverage our method to develop decision support systems that improve human decision-making in the real world.
Introduction
People make many decisions that have important ramifications for their lives and the lives of others. Previous research has found that people’s choices in important real-life domains, such as personal finance, health, and education, are sometimes bad for them in the long run (O’Donoghue & Rabin, 2015). Making good choices in those domains often requires long-term planning. Long-term planning is a complex problem because the number of possible action sequences grows exponentially with the number of steps and people’s cognitive resources are limited. Therefore, one of the reasons why people make decisions that are bad for them in the long run might be that they lack adequate cognitive strategies for long-term planning.
Previous research has developed different approaches to improving human decision-making. One approach is to outsource difficult decisions to computers or provide computer-powered decision support for a specific problem (Aronson et al., 2005). Another solution is to nudge people towards choices that society deems to be good, for instance, by exploiting people’s bias towards sticking with the default option (Johnson & Goldstein, 2003). Although these approaches are prominent, they leave people unsupported in the vast majority of situations for which there are no decision support systems or nudges. Here, we therefore pursue a third alternative: enabling people to make more farsighted decisions for themselves by teaching them effective planning strategies that they can use in novel real-world situations where decision support systems and nudges are unavailable.
Our approach is an instance of improving people’s decision competencies, which is known as boosting (Hertwig & Grüne-Yanoff, 2017). In contrast to the first two approaches, boosting protects and enhances people’s freedom to make their own decisions based on their own goals, views, and preferences. This is especially important for personal choices that people do not want to give up. Recent work has shown that teaching people clever decision strategies is a promising way to improve their decision-making in the real world (Hafenbrädl et al., 2016). This suggests that it might be possible to enable people to make their own farsighted decisions in a wide range of complex real-life situations by teaching them one or a few effective strategies for long-term planning.
Early work on boosting taught people the normative principles of probability theory, logic, and expected utility theory (Larrick, 2004). Although these educational interventions succeeded in improving people’s performance on simple textbook problems, they failed to improve people’s performance in the real world (Larrick, 2004). The likely reason is that the amount of mental computation that would be required to apply those normative principles to complex real-world problems far exceeds people’s cognitive capacities (Lieder & Griffiths, 2020b). For instance, applying the principle of expected utility theory to the game of chess would require people to evaluate more sequences of moves than there are atoms in the universe, and real life is much more complex than a game of chess. Thus, if a person seriously applied the normative principles of standard decision theory to planning their life, then they would spend all of their life planning without ever arriving at a decision about what to do first. This illustrates that normative principles that fail to consider people’s limited time and bounded cognitive resources are ill-suited as practical prescriptions for which decision procedure a person should use to do well in the real world. Instead, people must rely on heuristics that allow them to arrive at reasonable decisions in a reasonable amount of time.
For a simple heuristic to lead to good decisions, it has to be well-adapted to the structure of the particular decision environment in which it is used (Simon, 1956; Gigerenzer & Selten, 2002; Todd & Gigerenzer, 2012). This means that to help people make better decisions in a particular class of situations, we first have to discover adaptive decision strategies that efficiently exploit the structure of those kinds of situations. The main bottleneck of boosting is that we still do not know highly effective cognitive strategies for the vast majority of the important real-life decisions that people have to make for themselves. We know some reasonably effective fast-and-frugal heuristics for a few simple kinds of decisions (Hafenbrädl et al., 2016). But we still do not know any excellent heuristics for solving complex sequential decision problems. That is, we do not know which heuristics people should use to solve complex sequential decision problems as well as possible given their finite time and limited cognitive resources. Discovering and articulating clever decision strategies that work well in the real world is very challenging for people. That might be why boosting has yet to be applied to the domain of long-term planning.
To make boosting people’s long-term planning skills possible, we leverage machine learning to discover optimal heuristics that are superior to people’s intuitive strategies. We develop an intelligent tutor that teaches people to use those strategies and demonstrate that people’s performance improves as they adopt the automatically discovered heuristics. In doing so, we extend our previous work on improving human decision-making in small toy problems (Lieder et al., 2019, 2020) to substantially larger, more complex, and more naturalistic sequential decision problems. In addition to laying the scientific and technological foundations for improving human decision-making, this line of work also elucidates the algorithmic principles of optimal—rather than typical—human planning.
Because our goal is to improve the mental strategies that people use to make their own decisions, our strategy discovery method strives to find cognitive strategies that are optimal for the human mind rather than optimal plans or planning algorithms that only computers can execute. To achieve this, we apply the normative principle of resource-rationality (Lieder & Griffiths, 2020b) to mathematically define planning strategies that are optimally adapted to the problems people have to solve and the cognitive resources people can use to solve those problems (Callaway et al., 2018b, 2020). The principle of resource-rationality defines the sweet spot between using an astronomical amount of computation to guarantee arriving at the best possible decision (expected utility theory) and making bad decisions without giving them any thought. Unlike previous work that used the theory of resource-rationality to develop descriptive models of how people normally make decisions, this article applies it to derive prescriptions for how people should make decisions to perform better than they normally do. Deriving such prescriptions serves to improve human decision-making. Here, we pursue this goal by teaching people automatically discovered resource-rational heuristics using intelligent tutors.
Previous work has established that it is possible to compute resource-rational planning strategies (Callaway et al., 2018b) by modeling the problem of deciding how to plan as a metalevel Markov Decision Process (metalevel MDP; Hay et al., 2014). The metalevel MDP defines a resource-rational model of human planning by modeling not only the problem to be solved but also the cognitive operations people have available to solve that problem and how costly those operations are (Callaway et al., 2018b, 2020; Lieder et al., 2017; Griffiths, 2020). We refer to this approach as automatic strategy discovery. This approach frames planning strategies as policies for selecting planning operations. Its methods use algorithms from dynamic programming and reinforcement learning (Sutton & Barto, 2018) to compute the policy that maximizes the expected reward of executing the resulting plan minus the cost of the computations that the policy would perform to arrive at that plan (Callaway et al., 2018a; Lieder & Griffiths, 2020b; Griffiths et al., 2019).
Recent work used dynamic programming to discover optimal planning strategies for planning up to three steps ahead in small instances of the Mouselab-MDP paradigm (Callaway et al., 2017), a process-tracing paradigm that externalizes planning as information gathering. This work found that it is possible to improve human planning on those problems by teaching people the automatically discovered strategies (Lieder et al., 2019, 2020). Subsequent work applied reinforcement learning to approximate strategies for planning up to six steps ahead in a task where each step entailed choosing between two options and there were only two possible rewards (Callaway et al., 2018a). Additionally, multiple experiments have shown that people are able to transfer the learned strategies to different environments and reward structures, demonstrating that people generalize the learned strategies and benefit from the tutoring beyond the specific task they are taught in (Lieder et al., 2020). But none of the existing strategy discovery methods (Callaway et al., 2018a; Kemtur et al., 2020) is scalable enough to discover good planning strategies for more complex environments. This is because the run time of these methods grows exponentially with the size of the planning problem. This has confined automatic strategy discovery methods to very small and very simple planning tasks. Discovering planning strategies that achieve—let alone exceed—the computational efficiency of human planning is still out of reach for virtually all practically relevant sequential decision problems.
To overcome this computational bottleneck, we developed a scalable method for discovering planning strategies that achieve a (super)human level of computational efficiency on some of the planning problems that are too large for existing strategy discovery methods. Our approach draws inspiration from the hierarchical structure of human behavior (Botvinick, 2008; Miller et al., 1960; Carver & Scheier, 2001; Tomov et al., 2020). Research in cognitive science and neuroscience suggests that the brain decomposes long-term planning into goal-setting and planning at multiple hierarchically nested timescales (Carver & Scheier, 2001; Botvinick, 2008). Furthermore, Solway et al. (2014) found that human learners spontaneously discover optimal action hierarchies. Inspired by these findings, we extend the near-optimal strategy discovery method proposed in Callaway et al. (2018a) by incorporating hierarchical structure into the space of possible planning strategies. Concretely, the planning task is decomposed into first selecting one of the possible final destinations as a goal, solely based on its own value, and then planning the path to this selected goal.
We find that imposing hierarchical structure makes automatic strategy discovery methods significantly less computationally expensive without compromising the resource-rationality score of the discovered strategies. Our hierarchical decomposition leads to a substantial reduction in the computational complexity of the strategy discovery problem, which makes it possible to scale up automatic strategy discovery to many planning problems that were prohibitively large for previous strategy discovery methods. This allowed our method to discover planning strategies that achieve a superhuman level of computational efficiency on nontrivial planning problems, in the sense that the discovered strategies had a higher resource-rationality score than the strategies used by people. The resource-rationality score is the sum of the rewards along the chosen path minus the cost of the information-gathering operations that were performed to select that path. That is, we use the term “computational efficiency” to refer to how well a planning strategy trades off the quality of its decisions against the cost of the time and computational resources it consumes to reach its decisions.
In three training experiments, we demonstrate that this advance enables us to improve human decision-making in larger and more complex planning tasks that were previously intractable. Our findings suggest that people internalize the taught strategies and then spontaneously use them in subsequent decisions without any guidance. In addition, we have previously shown that people transfer the strategies taught by our intelligent tutors to new tasks and new domains (Lieder et al., 2019, 2020). We are therefore optimistic that it might be possible to leverage machine learning to help people learn general cognitive strategies that enable them to independently make their own farsighted decisions in novel situations where decision support is unavailable.
The plan for this article is as follows: We start by introducing the frameworks and methods that our approach builds on. We then present our new reinforcement learning method for discovering hierarchical planning strategies. Next, we demonstrate that our method is able to discover near-optimal planning strategies for larger environments than previous methods were able to handle and characterize these optimal strategies. We then test whether the resulting advances are sufficient to improve human decision-making in complex planning problems and close by discussing the implications of our findings for the development of more intelligent agents, understanding human planning (Lieder & Griffiths, 2020b), and improving human decision-making (Lieder et al., 2019).
Background and Related Work
Before we introduce, evaluate, and apply our new method for discovering hierarchical planning strategies, we briefly review the concepts and methods that it builds on. We start with the theoretical framework we use to define what constitutes a good planning strategy.
Resource-Rationality
Previous work has shown that people’s planning strategies are jointly shaped by the structure of the environment and the cost of planning (Callaway et al., 2018b, 2020). This idea has been formalized within the framework of resource-rational analysis (Lieder & Griffiths, 2020b). Resource-rational analysis is a cognitive modeling paradigm that derives process models of people’s cognitive strategies from the assumption that the brain makes optimal use of its finite computational resources. These computational resources are modeled as a set of elementary information processing operations. Each of these operations has a cost that reflects how many computational resources it requires. Those operations are assumed to be the building blocks of people’s cognitive strategies. To be resource-rational, a planning strategy has to achieve the optimal tradeoff between the expected return of the resulting decision and the expected cost of the planning operations it will perform to reach that decision. Both depend on the structure of the environment. The degree to which a planning strategy (h) is resource-rational in a given environment (e) can be quantified by the sum of expected rewards achieved by executing the plan it generates (R_{total}) minus the expected computational cost it incurs to make those choices, that is

\( \text {RR}(h, e) = \mathbb {E}\left [R_{total}\right ] - \lambda \cdot \mathbb {E}\left [N\right ], \)
(1)

where λ is the cost of performing one planning operation and N is the number of planning operations that the strategy performs. Throughout this article, we refer to this measure as the resource-rationality score and use it as our primary criterion for the performance of planning algorithms, automatically discovered strategies, and people.
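As a concrete illustration, the resource-rationality score of a single trial can be computed directly from the rewards along the chosen path and the number of planning operations performed. This is a minimal sketch; the function name and inputs are illustrative and not taken from the paper's implementation.

```python
def rr_score(path_rewards, num_operations, lam):
    """Resource-rationality score (Eq. 1): the total reward collected
    along the chosen path minus the cost of the planning operations
    performed to select that path."""
    return sum(path_rewards) - lam * num_operations

# A path worth 10 + 5 - 2 = 13, found with 4 planning operations at cost 1 each
score = rr_score([10, 5, -2], num_operations=4, lam=1.0)
```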
Discovering Resource-Rational Planning Strategies by Solving Metalevel MDPs
Equation 1 specifies a criterion that optimal heuristics must meet. But it does not directly tell us what those optimal strategies are. Finding out what those optimal strategies are is known as strategy discovery.
Callaway et al. (2018b) developed a method to automatically derive resource-rational planning strategies by modeling the optimal planning strategy as a solution to a metalevel Markov Decision Process (metalevel MDP). In general, a metalevel MDP \(M = ({\mathscr{B}}, \mathcal {C}, T, r)\) is defined as an undiscounted MDP where \(b \in {\mathscr{B}}\) represents the belief state, \(T(b,c,b^{\prime })\) is the probability of transitioning from belief state b to belief state \(b^{\prime }\) by performing computation \(c \in \mathcal {C}\), and r(b,c) is a reward function that describes the costs and benefits of computation (Hay et al., 2014). It is important to note that the actions in a metalevel MDP are computations that are different from object-level actions—the former are planning operations and the latter are physical actions that move the agent through the environment. Previous methods for discovering near-optimal decision strategies (Lieder et al., 2017; Callaway et al., 2018b; Callaway et al., 2018a) have been developed for and evaluated in a planning task known as the Mouselab-MDP paradigm (Callaway et al., 2017, 2020).
The Mouselab-MDP Paradigm
The Mouselab-MDP paradigm was developed to make people’s elementary planning operations observable. This is achieved by externalizing the process of planning as information seeking. Concretely, the Mouselab-MDP paradigm illustrated in Fig. 1 shows the participant a map of an environment where each location harbors an occluded positive or negative reward. To find out which path to take, the participant has to click on the locations they consider visiting to uncover their rewards. Each of these clicks is recorded and interpreted as the reflection of one elementary planning operation. The cost of planning is externalized by the fee that people have to pay for each click. People can stop planning and start navigating through the environment at any time. But once they have started to move through the environment they cannot resume planning. The participant has to follow one of the paths along the arrows to one of the outermost nodes.
To evaluate the resource-rational performance metric specified in Eq. 1 in the Mouselab-MDP paradigm, we measure R_{total} by the sum of rewards along the chosen path, set λ to the cost of clicking, and measure N by the number of clicks that a strategy made on a given trial.
The structure of a Mouselab-MDP environment can be modeled as a directed acyclic graph (DAG), where each node is associated with a reward that is sampled from a probability distribution, and each edge represents a transition from one node to another. In this article, we refer to the agent’s initial position as the root node, the most distant nodes as goal nodes, and all other nodes as intermediate nodes.
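Such a DAG can be represented with a plain adjacency mapping. The toy environment below is purely illustrative (the node names are made up and it is not one of the environments used in the experiments); it shows how root-to-goal paths are enumerated from the graph structure.

```python
# A small Mouselab-MDP-style environment as a DAG: the agent starts at
# the root, the leaves are goal nodes, and everything else is intermediate.
edges = {
    "root": ["a", "b"],
    "a": ["g1"],
    "b": ["g2"],
    "g1": [], "g2": [],  # goal nodes have no outgoing edges
}

def paths_to_goals(node, edges):
    """Enumerate all paths from `node` to the goal (leaf) nodes."""
    if not edges[node]:
        return [[node]]
    return [[node] + p
            for child in edges[node]
            for p in paths_to_goals(child, edges)]
```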
Mouselab-MDP environments were selected in which the variance of each node’s reward distribution increases with the node’s depth.^{Footnote 1} This models that the values of distant states are more variable than the values of proximal states. Therefore, the goal nodes have a higher variance than the intermediate nodes.
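One simple way to realize this depth-dependent variance is to scale a node's reward standard deviation with its depth. The zero mean and linear scaling below are illustrative assumptions; the paradigm only requires that variance increases with depth.

```python
def reward_distribution(depth, base_sd=1.0):
    """(mean, sd) of a node's reward distribution: the standard deviation
    grows with the node's depth, so the most distant (goal) nodes have the
    most variable rewards. Linear scaling is an illustrative choice."""
    return (0.0, base_sd * depth)

# Goal nodes (e.g., depth 3) are more variable than first-step nodes (depth 1)
```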
Resource-Rational Planning in the Mouselab-MDP Paradigm
Using the Mouselab-MDP as a model of human planning makes it possible to measure the effectiveness of resource-rational planning strategies. As an alternative to observing human planning, it is also possible to algorithmically find novel planning strategies that solve the Mouselab-MDP in a near-optimal way (Callaway et al., 2018a), which we term strategy discovery. Since the Mouselab-MDP models human planning operations, we believe that devising algorithms that solve it is a highly valuable endeavor. Recent work by Callaway et al. (2018a) has shown that it is possible to discover near-optimal planning strategies for small decision problems, and Lieder et al. (2019, 2020) demonstrated that teaching these strategies to humans can improve their own planning strategies in the Mouselab-MDP paradigm.
To discover optimal planning strategies, we can draw on previous work that formalized resource-rational planning in the Mouselab-MDP paradigm as the solution to a metalevel MDP (Callaway et al., 2018b, 2020). In the corresponding metalevel MDP each belief state \(b \in {\mathscr{B}}\) encodes probability distributions over the rewards that the nodes might harbor. The possible computations are \(\mathcal {C} = \{ \xi _{1}, ..., \xi _{M}, c_{1,1}, ..., c_{M,N}, \perp \}\), where c_{g,n} reveals the reward at intermediate node n on the path to goal g, and ξ_{g} reveals the value of the goal node g. For simplicity, we set the cost of each computation to 1. When the value of a node is revealed, the new belief about the value of the inspected node assigns a probability of one to the observed value. The metalevel operation ⊥ terminates planning and triggers the execution of the plan. The agent selects one of the paths to a goal state that has the highest expected sum of rewards according to the current belief state.
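The belief-state dynamics described above can be sketched as follows: revealing a node collapses its distribution onto the observed value, and path values are evaluated in expectation under the current belief. The dictionary representation is a simplification for illustration, not the paper's implementation.

```python
def reveal(belief, node, observed_value):
    """Perform computation c: the belief about `node` now places
    probability one on the observed value (non-destructive update)."""
    new_belief = dict(belief)
    new_belief[node] = {"observed": True, "value": observed_value}
    return new_belief

def expected_path_value(belief, path):
    """Expected sum of rewards along a path under the current belief:
    observed nodes contribute their value, unobserved ones their prior mean."""
    return sum(b["value"] if b["observed"] else b["prior_mean"]
               for b in (belief[n] for n in path))

belief = {
    "a":  {"observed": False, "prior_mean": 0.0},
    "g1": {"observed": False, "prior_mean": 0.0},
}
belief = reveal(belief, "g1", 8.0)
```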
Methods for Solving Metalevel MDPs
In their seminal paper, Russell and Wefald (1991) introduced the theory of rational metareasoning. In Russell and Wefald (1992), they define the value of computation VOC(c,b) to be the expected improvement in decision quality achieved by performing computation c in belief state b and continuing optimally, minus the cost of computation c. Using this formalization, the optimal planning strategy \(\pi _{\text {meta}}^{*}\) is a selection of computations that maximizes the value of computation (VOC), that is

\( \pi _{\text {meta}}^{*}(b) = \arg \max _{c \in \mathcal {C}} \text {VOC}(c, b). \)
(2)

When the VOC is nonpositive for all available computations, the policy terminates (c =⊥) and executes the best object-level action according to the current belief state. Hence, VOC(⊥,b) = 0. In general, the VOC is computationally intractable but it can be approximated (Callaway et al., 2018a). Lin et al. (2015) estimated the VOC by the myopic value of computation (VOI_{1}), which is the expected improvement in decision quality that would be attained by terminating deliberation immediately after performing the computation. Hay et al. (2014) approximated rational metareasoning by solving multiple smaller metalevel MDPs that each defines the problem of deciding between one object-level action and its best alternative.
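A greedy metalevel policy based on any VOC approximation follows directly from this rule: keep performing the computation with the highest estimated VOC and stop as soon as none is positive. The sketch below abstracts the belief update (`perform`) and the VOC estimate (`voc`) into caller-supplied callables; it is an illustration of the selection rule, not the paper's algorithm.

```python
def metareason(belief, computations, voc, perform):
    """Greedy metalevel policy: repeatedly perform the computation with
    the highest estimated value of computation; terminate as soon as no
    computation has positive VOC, mirroring VOC(⊥, b) = 0."""
    trace = []
    while True:
        best = max(computations, key=lambda c: voc(c, belief))
        if voc(best, belief) <= 0:
            return belief, trace  # c = ⊥: stop planning and act
        belief = perform(belief, best)
        trace.append(best)
```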
Bayesian Metalevel Policy Search
Inspired by research on how people learn how to plan (Krueger et al., 2017), Callaway et al. (2018a) developed a reinforcement learning method, Bayesian Metalevel Policy Search (BMPS), for learning when to select which computation. This method uses Bayesian optimization to find a policy that maximizes the expected return of a metalevel MDP. The policy space is parameterized by weights that determine to which extent computations are selected based on the myopic VOC versus less shortsighted approximations of the value of computation. It thereby improves upon approximating the value of computation by the myopic VOC by considering the possibility that the optimal metalevel policy might perform additional computations afterward. Concretely, BMPS approximates the value of computation by interpolating between the myopic VOI and the value of perfect information, that is

\( \widehat {\text {VOC}}(c, b; \mathbf {w}) = w_{1} \cdot \text {VOI}_{1}(c, b) + w_{2} \cdot \text {VPI}(b) + w_{3} \cdot \text {VPI}_{\text {sub}}(c, b) - w_{4} \cdot \text {cost}(c), \)
(3)

where VPI(b) denotes the value of perfect information. VPI(b) assumes that all computations possible at a given belief state would take place. Furthermore, VPI_{sub}(c,b) measures the benefit of having full information about the subset of parameters that the computation reasons about (e.g., the values of all paths that pass through the node evaluated by the computation), cost(c) is the cost of the computation c, and w = (w_{1},w_{2},w_{3},w_{4}) is a vector of weights. Since the VOC and VPI_{sub} are bounded by the VOI_{1} from below and by the VPI from above, the approximation of the VOC (i.e., \(\mathrm {\widehat {VOC}}\)) is a convex combination of these features, and the weights w_{1}, w_{2}, and w_{3} are constrained to lie on a probability simplex. Finally, the weight associated with the cost function satisfies w_{4} ∈ [1,h], where h is the maximum number of computations that can be performed. The values of these weights are computed using Bayesian optimization (Mockus, 2012). Finding the optimal weights is analogous to discovering the optimal policy for the environment.
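The weighted feature combination in Eq. 3 can be written as a small function. The feature implementations themselves (VOI_1, VPI, VPI_sub, cost) are nontrivial and are passed in as callables here, so this sketch only shows how the weights combine them; the dictionary keys are made-up names.

```python
def voc_hat(c, b, w, features):
    """BMPS approximation of the value of computation (Eq. 3): a convex
    combination of the myopic VOI, the VPI, and the VPI over the subset
    of parameters the computation informs, minus a weighted cost."""
    w1, w2, w3, w4 = w
    # w1..w3 must lie on the probability simplex; w4 is in [1, h]
    assert min(w1, w2, w3) >= 0 and abs((w1 + w2 + w3) - 1.0) < 1e-9
    return (w1 * features["voi1"](c, b)
            + w2 * features["vpi"](b)
            + w3 * features["vpi_sub"](c, b)
            - w4 * features["cost"](c))
```

Bayesian optimization then searches over the weight vector w to maximize the metalevel return achieved by acting greedily with respect to this approximation.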
Alternative Approaches
Alternative methods for solving metalevel MDPs include the works of Sezener and Dayan (2020) and Svegliato and Zilberstein (2018). Sezener and Dayan (2020) solve a multi-armed bandit problem using a Monte Carlo tree search based on the static and dynamic value of computations. In a bandit problem, unlike in most models of planning, transitions depend purely on the chosen action and not on the current state. Svegliato and Zilberstein (2018) devised an approximate metareasoning algorithm that uses temporal difference (TD) learning to decide when to terminate the planning process.
Intelligent Cognitive Tutors
One of the reasons for the ineffectiveness of teaching people to make decisions according to normative principles that ignore practical constraints (i.e., limited time and bounded cognitive resources) is that the exhaustive planning that would be required to apply those principles to the real world would waste virtually all of a person’s time on the very first decision (e.g., “Should I get up or go back to sleep?”), thereby depriving them of the opportunities afforded by later decisions. By contrast, resource-rational heuristics allocate people’s limited time and bounded cognitive resources in such a way that they earn the highest possible sum of rewards across the very long series of decisions that constitutes life. This is why teaching resource-rational heuristics might succeed in improving people’s decisions in large and complex real-world problems where exhaustive planning would either be impossible or take a disproportionately large amount of time that could be better spent on other things.
Utilizing the resource-rational planning strategies discovered by solving metalevel MDPs, Lieder et al. (2019, 2020) have developed intelligent tutors that teach people the optimal planning strategies for a given environment. Most of the tutors let people practice planning in the Mouselab-MDP paradigm and provide them with immediate feedback on each chosen planning operation. The feedback is given in two ways: (1) information about what the optimal planning strategy would have done; and (2) an affective element given as positive feedback (e.g., “Good job!”) or negative feedback. The negative feedback included a slightly frustrating timeout penalty during which participants were forced to wait idly for a duration that was proportional to how suboptimal their planning operation had been.
Lieder et al. (2020) found that participants were able to learn to use the automatically discovered strategies, remember them, and use them in novel environments with a similar structure. These findings suggest that automatic strategy discovery can be used to improve human decision-making if the discovered strategies are well-adapted to the situations where people might use them. Additionally, Lieder et al. (2020) found that video demonstrations of the click sequences performed by the optimal strategy are as effective a teaching method as providing immediate feedback. Here, we build on these findings to develop cognitive tutors that teach automatically discovered strategies by demonstrating them to people.
Discovering Hierarchical Planning Strategies
All previous strategy discovery methods evaluate and compare the utilities of all possible computations in each step. As such, these algorithms have to explore the entire metalevel MDP’s state space, which grows exponentially with the number of nodes.^{Footnote 2} As a consequence, these methods do not scale well to problems with large state spaces and long planning horizons. This is especially true of the Bayesian Metalevel Policy Search algorithm (BMPS; Callaway et al., 2018a), whose run time is exponential in the number of nodes of the planning problem. In contrast to the exhaustive enumeration of all possible planning operations performed by those methods, people would not even consider making detailed low-level motor plans for navigating to a specific distant location (e.g., Terminal C of San Juan Airport) until they have arrived at a high-level plan that leads them to or through that location (Tomov et al., 2020). Here, we build on insights about human planning to develop a more scalable method for discovering efficient planning strategies.
Hierarchical Problem Decomposition
To efficiently plan over long horizons, people (Botvinick, 2008; Carver & Scheier, 2001; Tomov et al., 2020) and hierarchical planning algorithms (Kaelbling & Lozano-Pérez, 2010; Sacerdoti, 1974; Marthi et al., 2007; Wolfe et al., 2010) decompose the problem into first setting goals and then planning how to achieve them. This two-stage process breaks large planning problems down into smaller problems that are easier to solve. To discover hierarchical planning strategies automatically, our proposed strategy discovery algorithm decomposes the problem of discovering planning strategies into the subproblems of discovering a strategy for selecting a goal and discovering a strategy for planning the path to the chosen goal. A pictorial representation is given in Fig. 2.
Formally, this is accomplished by decomposing the metalevel MDP defining the strategy discovery problem into two metalevel MDPs with smaller state and action spaces. Constructing metalevel MDPs for goal-setting and path planning is easy when there is a small set of candidate goals. Such candidate goals can often be identified based on prior knowledge or the structure of the domain (Schapiro et al., 2013; Solway et al., 2014). A low-level controller solves the goal-achievement MDP, whereas a high-level controller solves the goal-setting MDP. Whichever controller is currently in control selects a computation from its corresponding metalevel MDP and performs it. The metacontroller compares the expected reward of the current goal with the expected reward of the next-best goal and decides when control should be passed from the high-level controller to the low-level controller. Hence, when the low-level controller discovers that the current goal is not as valuable as expected, the metacontroller allows for goal-switching.
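The metacontroller's goal-switching rule amounts to comparing the current goal's expected value under the updated belief against the next-best alternative. A minimal sketch, with an assumed `expected_value` function supplied by the caller (the names are illustrative, not from the paper):

```python
def should_switch_goal(belief, current_goal, goals, expected_value):
    """Metacontroller rule (sketch): hand control back to the high-level
    goal-setting controller when, under the updated belief, the current
    goal's expected value drops below the next-best alternative's."""
    alternatives = [g for g in goals if g != current_goal]
    best_alternative = max(expected_value(belief, g) for g in alternatives)
    return expected_value(belief, current_goal) < best_alternative
```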
The metalevel MDP model of the subproblem of goal selection (“Goal-Setting Metalevel MDP”) only includes computations for estimating the values of a small set of candidate goal states (V (g_{1}),⋯ ,V (g_{M})). This means that goals are chosen without considering how costly it would be to achieve them. This makes sense when all goals are known to be achievable and the differences between the values of alternative goal states are substantially larger than the differences between the costs of reaching them. This is arguably true for many challenges people face in real life. For instance, when a high school student plans their career, the difference between the long-term values of studying computer science versus becoming a janitor is likely much larger than the difference between the costs of achieving either goal. This is to be expected when the time it will take to achieve the goals is short relative to a person’s lifetime.
The goal-achievement MDP (“Goal-Achievement Metalevel MDP”) only includes computations that update the estimated costs of alternative paths to the chosen goal by determining the costs or rewards of state-action pairs r(s,a) that lie on those paths. Restricting the computations to the selected goal raises a potential issue: some computations that are irrelevant within the goal-achievement MDP can nevertheless be highly valuable for the complete problem. One example are computations that reveal the value of a node lying on an unavoidable path to the selected goal. This problem is accentuated when such a node can harbor a highly positive or negative reward. To rectify this problem, we introduce a metacontroller that facilitates goal-switching. A real-world example of the need to switch goals after discovering an unlikely, highly negative event would be switching from investing in the stock market to investing in real estate after discovering that a stock market crash is likely.
Decomposing the strategy discovery problem into these two components reduces the number of possible computations that the metareasoning method has to choose between from M ⋅ N to M + N, where M is the number of possible final destinations (goals) and N is the number of steps to the chosen goal (see Appendix 4). Perhaps the most promising metareasoning method for automatic strategy discovery is the Bayesian Metalevel Policy Search algorithm (BMPS; Callaway et al., 2018a; Kemtur et al., 2020). To solve the two types of metalevel MDPs introduced below more effectively, we also introduce an improvement of the BMPS algorithm in “Hierarchical Bayesian Metalevel Policy Search.” An additional computational optimization is described in Appendix 8.
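To make the reduction concrete, consider hypothetical sizes roughly matching the benchmark environments used below (5 candidate goals with 17 nodes each):

```python
# The decomposition shrinks the metareasoner's choice set from M * N to M + N.
# Sizes are hypothetical, chosen to roughly match the benchmark environments.
M, N = 5, 17  # M candidate goals, N nodes/steps per goal

flat_computations = M * N          # non-hierarchical strategy discovery: 85
hierarchical_computations = M + N  # goal-setting (M) + goal-achievement (N): 22
```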
Goal-Setting Metalevel MDP
The optimal strategy for setting the goal can be formalized as the solution to the metalevel MDP \(M^{\mathrm {H}} = ({\mathscr{B}}^{\mathrm {H}}, \mathcal {C}^{\mathrm {H}}, T^{\mathrm {H}}, R^{\mathrm {H}})\), where the belief state \(b^{\mathrm {H}}(g) \in {\mathscr{B}}^{\mathrm {H}}\) denotes the expected cumulative reward that the agent can attain starting from the goal state \(g \in \mathcal {G}\). The high-level computations are \(\mathcal {C}^{\mathrm {H}} = \{\xi _{1}, ..., \xi _{M}, \perp ^{\mathrm {H}} \}\), where ξ_{g} reveals the value V (g) of the goal node g, and ⊥^{H} terminates high-level planning, leading the agent to select the goal with the highest value according to its current belief state. The reward function is R^{H}(b^{H},c^{H}) = −λ^{H} for c^{H} ∈ {ξ_{1},...,ξ_{M}} and \( R^{\mathrm {H}}(b^{\mathrm {H}}, \perp ^{\mathrm {H}}) = \max \limits _{k \in \mathcal {G}}\mathbb {E}[b^{\mathrm {H}}(k)] \).
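The reward function of this metalevel MDP can be sketched in a few lines (a toy stand-in, with "stop" playing the role of ⊥^{H} and a hypothetical cost λ^{H} = 1):

```python
lambda_h = 1.0  # hypothetical cost per goal evaluation (lambda^H)

def reward_H(belief, computation):
    """Goal-setting reward function: each goal evaluation xi_g costs lambda_h;
    terminating ("stop" stands in for the bottom symbol) pays max_k E[b^H(k)],
    the expected value of the apparently best goal."""
    if computation == "stop":
        return max(belief.values())
    return -lambda_h

belief = {"g1": 50.0, "g2": 120.0, "g3": -10.0}  # E[b^H(g)] for each goal
r_eval = reward_H(belief, "g1")    # -1.0: cost of one more evaluation
r_stop = reward_H(belief, "stop")  # 120.0: commit to the best goal, g2
```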
Goal-Achievement Metalevel MDP
Having set a goal to pursue, the agent has to find the optimal planning strategy for achieving it. This planning strategy is formalized as the solution to the metalevel MDP \(M^{\mathrm {L}} = ({\mathscr{B}}^{\mathrm {L}}, \mathcal {C}^{\mathrm {L}}, T^{\mathrm {L}}, R^{\mathrm {L}})\), where the belief state \(b \in {\mathscr{B}}^{\mathrm {L}}\) denotes the expected reward of each node. The agent can only perform the subset of metalevel actions \(\mathcal {C}_{g,\mathrm {L}} = \{ c_{g,1}, ..., c_{g,N}, \perp ^{\mathrm {L}}\}\), where c_{g,n} reveals the reward at node n in the goal set \(h_{g} \in {\mathscr{H}}\). A goal set \(h_{g} \in {\mathscr{H}}\) comprises all nodes, including the goal node, that lie on the paths leading to goal \(g \in \mathcal {G}\). Furthermore, ⊥^{L} terminates planning and leads the agent to select the path with the highest expected sum of rewards according to the current belief state. The reward function is R^{L}(b,c_{g}) = −λ^{L} for c_{g} ∈ {c_{g,1},...,c_{g,N}} and \( R^{\mathrm {L}}(b, \perp ^{\mathrm {L}}) = \max \limits _{p \in \mathcal {P}} {\sum }_{n \in p} \mathbb {E}[b_{n}] \), where \(\mathcal {P}\) is the set of all paths and b_{n} is the belief about the reward of node n.
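The corresponding low-level reward function can be sketched analogously (again a toy stand-in with hypothetical names and λ^{L} = 1):

```python
lambda_l = 1.0  # hypothetical cost per node inspection (lambda^L)

def reward_L(belief, paths, computation):
    """Goal-achievement reward function: inspecting a node costs lambda_l;
    terminating ("stop") pays the best expected return over all paths,
    max_p sum_{n in p} E[b_n]."""
    if computation == "stop":
        return max(sum(belief[n] for n in path) for path in paths)
    return -lambda_l

belief = {1: 5.0, 2: -3.0, 3: 10.0, 4: 0.0}  # E[b_n] for each node
paths = [(1, 3), (2, 4)]                     # two paths to the chosen goal
r_stop = reward_L(belief, paths, "stop")     # 15.0: path (1, 3)
```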
Hierarchical Bayesian Metalevel Policy Search
Having introduced the hierarchical problem decomposition, we now present how this decomposition can be leveraged to make BMPS and other automatic strategy discovery methods more scalable. BMPS approximates the value of computation (VOC) according to Eq. 3. We propose to utilize BMPS to solve the goal-selection metalevel MDP and the goal-achievement metalevel MDP separately. The metacontroller then decides which of the two discovered policies is in control at any given time. A detailed analysis of the computational time is presented in Appendix 4.
High-level Policy Search
The VOC for the high-level policy is approximated using three features: (1) the myopic utility of performing a goal-state evaluation (\(\mathrm {VO{I_{1}^{H}}}\)), (2) the value of perfect information about all goals (VPI^{H}), and (3) the cost of the respective computation (cost^{H}):

$$\mathrm{VOC^{H}}(b^{\mathrm{H}}, c^{\mathrm{H}}) \approx w_{1}^{\mathrm{H}}\,\mathrm{VO{I_{1}^{H}}}(b^{\mathrm{H}}, c^{\mathrm{H}}) + w_{2}^{\mathrm{H}}\,\mathrm{VPI^{H}}(b^{\mathrm{H}}, c^{\mathrm{H}}) - w_{3}^{\mathrm{H}}\,\mathrm{cost^{H}}(c^{\mathrm{H}}),$$
where \(w_{1}^{\mathrm {H}}, w_{2}^{\mathrm {H}}\) are constrained to a probability simplex, \(w_{3}^{\mathrm {H}} \in \mathbb {R}_{[1, M]}\), and M is the number of goals. Additionally, the cost cost^{H}(c^{H}) is defined as the cost λ^{H} of a single goal evaluation, which the weight \(w_{3}^{\mathrm {H}}\) scales between the cost of one computation and the cost of evaluating all M goals.
Low-level Policy Search
In a similar manner as for the high-level policy, the value of computation for the low-level policy is approximated by a mixture of VOI features and the anticipated cost of the current and future computations, that is:

$$\mathrm{VOC^{L}}(b, c_{g}) \approx w_{1}^{\mathrm{L}}\,\mathrm{VO{I_{1}^{L}}}(b, c_{g}) + w_{2}^{\mathrm{L}}\,\mathrm{VPI^{L}}(b, c_{g}) + w_{3}^{\mathrm{L}}\,\mathrm{VPI_{sub}^{L}}(b, c_{g}) - w_{4}^{\mathrm{L}}\,\mathrm{cost^{L}}(c_{g}),$$
where the weights \(w_{1}^{\mathrm {L}}\), \(w_{2}^{\mathrm {L}}\), \(w_{3}^{\mathrm {L}}\) are constrained to a probability simplex, \(w_{4}^{\mathrm {L}} \in \mathbb {R}_{[1, |h_{g}|]}\), and |h_{g}| is the number of nodes in the goal set h_{g}. The weight values for both levels are optimized over 100 iterations of Bayesian optimization (Mockus, 2012) using the GPyOpt library (The GPyOpt Authors, 2016).
The cost feature of the original BMPS algorithm introduced by Callaway et al. (2018a) only considered the cost of a single computation, whereas its VOI features consider the benefits of performing a sequence of computations. As a consequence, policies learned with the original version of BMPS are biased towards inspecting nodes that many paths converge on, even when the values of those nodes are irrelevant. To rectify this problem, we redefine the cost feature so that it considers the costs of all computations assumed by the VOI features (for an explanation of the features, see Appendix 1). Concretely, to compute the low-level policy, we define the cost feature of BMPS as the weighted average of the costs of generating the information assumed by the VOI features \(\mathcal {F}=\lbrace \mathrm {VO{I_{1}^{L}}}, \mathrm {VPI^{L}}, \mathrm {VPI_{sub}^{L}}\rbrace \) (with w_{f} denoting the mixture weight of feature f), that is

$$\mathrm{cost^{L}}(c) = \lambda^{\mathrm{L}} \sum\limits_{f \in \mathcal{F}} w_{f} \sum\limits_{n \in h_{g}} \mathbb{I}(c, f, n),$$
where \(\mathbb {I}(c,f,n)\) returns 1 if node n is relevant when computing feature f for computation c and 0 otherwise.
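A minimal sketch of this redefined cost feature (the feature names, weights, and indicator sets below are hypothetical stand-ins for the learned quantities):

```python
lambda_l = 1.0  # hypothetical cost per computation (lambda^L)

def cost_feature(c, feature_weights, relevant_nodes):
    """Redefined cost feature (sketch): a weighted average over the VOI
    features f of the cost of generating the information each one assumes,
    i.e. lambda_l times the number of nodes with I(c, f, n) = 1.
    relevant_nodes[f](c) plays the role of the indicator I(c, f, .) by
    returning the set of nodes feature f assumes to be observed."""
    return sum(w * lambda_l * len(relevant_nodes[f](c))
               for f, w in feature_weights.items())

# Hypothetical goal set {1,...,5}: VOI_1 assumes only the inspected node is
# observed, VPI assumes all nodes, VPI_sub the nodes on paths through c.
weights = {"VOI1": 0.5, "VPI": 0.3, "VPI_sub": 0.2}
relevant = {
    "VOI1": lambda c: {c},
    "VPI": lambda c: {1, 2, 3, 4, 5},
    "VPI_sub": lambda c: {c, 4},
}
cost = cost_feature(3, weights, relevant)  # 0.5*1 + 0.3*5 + 0.2*2 = 2.4
```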
In the remainder of this article, we refer to the resulting strategy discovery algorithm as hierarchical BMPS and to the original version of BMPS as non-hierarchical BMPS.
Evaluating the Performance, Scalability, and Robustness of Our Method for Discovering Hierarchical Planning Strategies
We evaluated our new method for discovering resource-rational strategies for human planning in large and complex sequential decision problems in terms of the resource-rationality of the discovered strategies and the scope of its applicability. Concretely, we measured the resource-rationality of the discovered strategies by the resource-rationality score defined in Eq. 1 (see “Resource-Rationality” and “The Mouselab-MDP Paradigm”). In brief, we found that our new method makes it possible to discover resource-rational strategies for substantially larger, and hence more naturalistic, sequential decision problems than previous strategy discovery methods could handle. Despite the approximations that went into making our method so scalable, the discovered strategies are no worse than the strategies discovered by state-of-the-art strategy discovery methods and substantially more resource-rational than both the strategies that people use intuitively and standard planning algorithms that people could be thought to use instead. These findings hold even when the structure of the sequential decision problem violates the assumptions of our method. This section presents the details of this evaluation. Readers who are primarily interested in the effects of teaching the automatically discovered strategies to people may skip ahead to the next section.
To evaluate our method for discovering resource-rational strategies for human planning in large sequential decision problems, we benchmarked its performance in the two types of environments illustrated in Figs. 3 and 6. The first type of environment conforms to the structure that motivated our hierarchical problem decomposition (i.e., the variability of rewards increases from each step to the next) and the second type does not. In the second type of environment, we introduced a high-risk node on the path to each goal (see Fig. 6). This violates the assumption that motivated the hierarchical decomposition and makes goal-switching essential for a high resource-rationality score.
For each environment, we assessed the degree to which the automatically discovered strategies are resource-rational against the resource-rationality of human planning, existing planning algorithms, and the strategies discovered by state-of-the-art strategy discovery methods. The Mouselab-MDP planning task and the corresponding metalevel MDP are set up in such a way that the scores of people and automatically discovered strategies measure their level of resource-rationality rather than just their performance. This is because the score in this task is the sum of the external rewards along the chosen path minus the cost of the planning that was invested to select the path. We therefore refer to it as the resource-rationality score. A strategy achieves the highest resource-rationality score if it selects the most informative planning operations (computations/clicks) to maximally improve its plan with as few computations as possible. Each planning operation has an associated cost, which is used to model the cost of thinking. A strategy is maximally computationally efficient if it achieves the highest possible resource-rationality score.
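The score of a single trial can be sketched directly from this definition (function name and numbers are illustrative, not taken from the experiments):

```python
def resource_rationality_score(path_rewards, num_clicks, cost_per_click=1.0):
    """Resource-rationality score of one trial: the external rewards collected
    along the chosen path minus the cost of the planning operations (clicks)
    used to select that path."""
    return sum(path_rewards) - cost_per_click * num_clicks

# Hypothetical trial: the chosen route earns 10 - 3 + 25 = 32 points and the
# strategy made 6 clicks at 1 point each, so the score is 32 - 6 = 26.
score = resource_rationality_score([10, -3, 25], num_clicks=6)
```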
In addition to evaluating the resource-rationality of the discovered strategies, we also evaluated how well our method scales to larger planning problems with longer horizons, in terms of the largest environments for which it can discover resource-rational heuristics (see Appendix 2). In brief, our method can solve metalevel MDPs with 5^{2484} times as many possible belief states as previous methods were able to handle. This improvement increases the size of the largest problem for which resource-rational planning strategies can be discovered to planning 5 steps ahead when there are five possible actions in each state that each lead to a different outcome.
Evaluation in Environments That Conform to the Assumed Structure
We first evaluated the performance of our method in environments whose structure conforms to the assumptions that motivated the hierarchical problem decomposition. To do so, we compared the resource-rationality of the discovered strategies against the resource-rationality of existing planning algorithms, the strategies discovered by previous strategy discovery methods, and human resource-rationality in four increasingly challenging environments of this type with 2–5 candidate goals. The reward of each node is sampled from a normal distribution with mean 0. The variance of the rewards available at non-goal nodes was 5 for nodes reachable within a single step (level 1) and doubled from each level to the next. The variance of the distribution from which the reward of the goal node was sampled starts at 100 and increases by 20 for every additional goal node, up to a maximum of 180, after which it resets to 100. The environment was partitioned into one subgraph per goal. Each of those subgraphs contains 17 intermediate nodes, forming 10 possible paths that reach the goal state in at most 5 steps (see Fig. 3). The cost of planning is 1 point per click (λ = 1).
To estimate an upper bound on the resource-rationality score of existing planning algorithms on our benchmark problems, we selected Backward Search and Bidirectional Search (Russell and Norvig, 2002) because, unlike most planning algorithms, they start by considering potential final destinations, which is optimal for planning in our benchmark problems. These search algorithms terminate when they find a path whose expected return exceeds a threshold, called the aspiration value. The aspiration value was selected using Bayesian optimization (Mockus, 2012) to get the best possible performance from each planning algorithm. We also evaluated the resource-rationality score of a random-search algorithm, which chooses computations uniformly at random from the set of metalevel operations that have not been performed yet.
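A minimal sketch of such a random-search baseline with an aspiration-based stopping rule (function and variable names are hypothetical; `belief` here simply maps each node to its hidden reward):

```python
import random

def random_search(node_rewards, aspiration, paths, cost=1.0, rng=None):
    """Baseline sketch: uncover node rewards uniformly at random (without
    replacement) until the best path's known return reaches the aspiration
    value, then stop. Returns the resulting score: best known return minus
    the cost of the clicks made."""
    rng = rng or random.Random(0)       # seeded for reproducibility
    unexplored = sorted(node_rewards)
    observed = {}
    clicks = 0
    best_known = 0.0
    while unexplored:
        node = unexplored.pop(rng.randrange(len(unexplored)))
        observed[node] = node_rewards[node]
        clicks += 1
        best_known = max(sum(observed.get(n, 0.0) for n in p) for p in paths)
        if best_known >= aspiration:    # aspiration met: terminate planning
            break
    return best_known - cost * clicks
```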
In addition to those planning algorithms, our baselines also include three state-of-the-art methods for automatic strategy discovery: the greedy myopic VOC strategy discovery algorithm (Lin et al., 2015), which approximates the VOC by its myopic utility (VOI_{1}); BMPS (Callaway et al., 2018a); and the Adaptive Metareasoning Policy Search algorithm (AMPS; Svegliato & Zilberstein, 2018), which uses approximate metareasoning to decide when to terminate planning. Our implementation of the AMPS algorithm uses a deep Q-network (Mnih et al., 2013) to estimate the difference between the values of stopping and continuing planning, respectively. It learns this estimate based on the expected termination reward of the best path. The planning operations are selected by maximizing the myopic value of information (VOI_{1}). When applying hierarchical BMPS to this environment, goal-switching is unnecessary because the cumulative variance of the intermediate nodes is less than the variance of the goal nodes. Accordingly, goal-switching does not occur in the automatically discovered strategies for this environment. To illustrate the versatility of our hierarchical problem decomposition, we also applied it to the greedy myopic VOC strategy discovery algorithm (Fig. 4).
How Resource-Rational Are the Automatically Discovered Strategies?
Table 1 and Fig. 5a compare the resource-rationality score of the strategies discovered by hierarchical BMPS and the hierarchical greedy myopic VOC method against the resource-rationality score of the strategies discovered by the two state-of-the-art methods, two standard planning algorithms, and human resource-rationality scores (see “How Does the Performance of the Automatically Discovered Strategies Compare to Human Planning?”) on the benchmark problems described above (“Evaluation in Environments That Conform to the Assumed Structure”). These results show that the strategies discovered by our new hierarchical strategy discovery methods^{Footnote 3} outperform extant planning algorithms and the strategies discovered by the AMPS algorithm across all of our benchmark problems (p < .01 for all pairwise Wilcoxon rank-sum tests). Critically, imposing hierarchical constraints on the strategy search of BMPS and the greedy myopic VOC method had no negative effect on the resource-rationality score of the resulting strategies (p > .770 for all pairwise Wilcoxon rank-sum tests).
As illustrated in Fig. 4, the planning strategy our hierarchical BMPS algorithm discovered for this type of environment is qualitatively different from all existing planning algorithms. In general, the strategy proceeds as follows: it first evaluates goal nodes until it finds one with a sufficiently high reward. It then plans backward from the chosen goal to the current state. While evaluating candidate paths from the goal to the current state, it discards a path from further exploration as soon as it encounters a large negative reward on that path. This phenomenon is known as pruning and has previously been observed in human planning (Huys et al., 2012).^{Footnote 4} The non-hierarchical version of BMPS also discovered this type of planning strategy. This suggests that goal-setting with backward planning is the resource-rational strategy for this environment rather than an artifact of our hierarchical problem decomposition. Unlike this type of planning, most extant planning algorithms plan forward, and the few planning algorithms that plan backward (e.g., Bidirectional Search and Backward Search) do not terminate the exploration of a path prematurely.
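The path-evaluation phase of this discovered strategy can be sketched as follows (the pruning threshold and all names are hypothetical stand-ins for the learned policy's behavior):

```python
def backward_plan_with_pruning(goal_value, paths, node_reward, prune_below=-40.0):
    """Sketch of the discovered strategy's path-evaluation phase: walk each
    candidate path backward from the chosen goal and abandon it as soon as a
    large loss is revealed (pruning, cf. Huys et al., 2012)."""
    best_path, best_return = None, float("-inf")
    for path in paths:
        total = goal_value
        for node in reversed(path):       # backward: nodes nearest the goal first
            reward = node_reward[node]
            if reward <= prune_below:     # large loss: stop exploring this path
                total = float("-inf")
                break
            total += reward
        if total > best_return:
            best_path, best_return = path, total
    return best_path, best_return

# Toy example: path (1, 2) is pruned at node 1's -50 loss; (3, 4) wins with 105.
rewards = {1: -50.0, 2: 5.0, 3: 2.0, 4: 3.0}
best_path, best_return = backward_plan_with_pruning(100.0, [(1, 2), (3, 4)], rewards)
```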
Evaluation in Risky Environments
To accommodate environments whose structure violates the assumption that more distant rewards are more variable than more proximal ones, the hierarchical strategies discovered by our method can alternate between goal selection and goal planning.^{Footnote 5} We now demonstrate the benefits of this goal-switching functionality by comparing the resource-rationality score of our method with versus without goal-switching. In particular, we demonstrate that switching goals leads to better performance if the assumption of increasing variance is violated and does not harm performance when that assumption is met. First, we compare the resource-rationality scores of the two algorithms in an environment where switching goals should improve performance. This environment has a total of 60 nodes split into four different goals, each consisting of 15 nodes in the low-level MDP. The difference from the previously used environments is that one of the unavoidable intermediate nodes has a 10% probability of harboring a large loss of − 1500 (see Fig. 6). The cost of computation in this environment is 10 points per click (λ = 10). The optimal strategy for this environment selects a goal, checks the high-risk node on the path leading to the selected goal, and switches to a different goal if it uncovers the large loss. We compare the performance of hierarchical BMPS with goal-switching to the performance of hierarchical BMPS without goal-switching, the non-hierarchical BMPS method with tree contraction,^{Footnote 6} and human performance. The three strategy discovery algorithms were all trained in the same environment following the same training steps. Their resource-rationality scores are reported in Table 2.
According to Shapiro-Wilk tests, none of the resource-rationality scores follows a normal distribution (p < .001 for each). The performance of the individual algorithms was therefore compared with Wilcoxon rank-sum tests, adjusting the critical alpha value via Bonferroni correction. Comparing the score of goal-switching to both our method without goal-switching (W = 14.07, p < .001) and the original BMPS algorithm (W = 11.38, p < .001) shows a significant benefit of goal-switching. Comparing the original BMPS method to the algorithm without goal-switching, the original BMPS version performs significantly better (W = 18.7, p < .001).
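The Bonferroni adjustment used for these pairwise comparisons amounts to testing each p-value against α/m, which can be sketched as (illustrative p-values, not the ones reported above):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Bonferroni correction for m pairwise tests: each p-value is compared
    against the adjusted critical value alpha / m."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# Three pairwise tests -> adjusted critical value 0.05 / 3 ~= 0.0167.
flags = bonferroni_significant([0.001, 0.04, 0.0001])  # [True, False, True]
```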
To show that enabling our algorithm’s capacity for goal-switching has no negative effect on its performance even when the assumption of the hierarchical decomposition is met, we performed a second comparison on the two-goal environment with increasing variance from “How Resource-Rational Are the Automatically Discovered Strategies?” Since in this environment the rewards are most variable at the goal nodes, switching goals should usually be unnecessary. Therefore, due to the environment’s structure, we do not expect the goal-switching strategy to perform better than the purely hierarchical strategy. Comparing the resource-rationality scores in this environment, we observe that both versions of the algorithm perform similarly well (see Table 3). A Wilcoxon rank-sum test (W = 0.03, p = .98) shows no significant difference between the two. This demonstrates that adding goal-switching to the algorithm does not impair its performance, even when goal-switching is not beneficial.
How Does the Performance of the Automatically Discovered Strategies Compare to Human Planning?
To be able to compare the performance of the automatically discovered planning strategies to the performance of people, we conduct experiments on Amazon Mechanical Turk (Litman et al., 2017).
Methods
We measured human resourcerationality scores in Flight Planning tasks that are analogous to the environments we use to evaluate our method (see Fig. 7).
For the environments that conform to the increasing-variance structure (e.g., Fig. 3), we recruited 78 participants for each of the four environments (average age 34.71 years, range: 19–70 years; 46 female). Participants were paid $2.00 plus a performance-dependent bonus (average bonus $1.52). The average duration of the experiment was 25.1 min. For the risky environment (see Fig. 6), we recruited 48 participants (average age 36.98 years, range: 19–70 years; 25 female). Participants were paid $1.75 plus a performance-dependent bonus (average bonus $0.34). The average duration of the experiment was 14.86 min. Following instructions that informed the participants about the range of possible reward values, participants were given the opportunity to familiarize themselves with the task in 5 practice trials of the Flight Planning task. After this, participants were evaluated on 15 test trials of the Flight Planning task for the first type of environment and 5 test trials for the second type. To ensure high data quality, we applied the same predetermined exclusion criterion throughout all presented experiments. We excluded participants who did not make a single click on more than half of the test trials, because not clicking is highly indicative of a participant not engaging with the task and speeding through it. In the first environment type, we excluded 3 participants and in the second environment type, we excluded 22 participants.
Results
The results of this experiment are shown in the last row of Table 1. Surprisingly, we found that, contrary to what has been observed in small sequential decision problems (Callaway et al., 2018b, 2020), human performance is far from resource-rational in large sequential decision problems.
Concretely, in the 15 increasing-variance environments, human participants performed much worse than the strategy discovered by our hierarchical method, regardless of the number of goals (p < .02 for all pairwise Wilcoxon rank-sum tests). Compared to the human baseline, the hierarchical strategies discovered by our method appear to achieve a higher level of computational efficiency on multiple instances of environments with an increasing-variance reward structure. The computational efficiency of a strategy is defined by its resource-rationality score, which combines the quality of the solution (the reward earned by moving through the environment) with the computational cost of planning.
In the high-risk environment, the average human resource-rationality score was only − 79.92 (see Table 2), whereas our method achieved a resource-rationality score of 41 on the same environment instances. A Shapiro-Wilk test detected no significant violation of the assumption that participants’ average scores are normally distributed (p = .33). We therefore compared the average human performance to our method in a one-sample t-test. We found that human participants performed significantly worse than the strategy discovered by our method (t(25) = − 13.06, p < .001). This suggests that the strategy discovered by our method performs at a superhuman level in trading off the value of gathering information against the associated time cost, resulting in a higher overall resource-rationality score.
Improving Human Decision-making by Teaching Automatically Discovered Planning Strategies
Having shown that our method discovers planning strategies that achieve a superhuman resource-rationality score (i.e., the strategy discovered by our method performs fewer clicks than the strategies that people use intuitively), we now evaluate whether we can improve human decision-making by teaching people the automatically discovered strategies. Building on the Mouselab-MDP paradigm introduced in “The Mouselab-MDP Paradigm,” we investigate this question in the context of the Flight Planning task illustrated in Fig. 7. Participants are tasked with planning the route of an airplane across a network of airports. Each flight yields a profit or a loss. Participants can find out how much profit or loss an individual flight would generate by clicking on its destination for a fee of $1. The participant’s goal is to maximize the sum of the flights’ profits minus the cost of planning. Participants can make as few or as many clicks as they like before selecting a route using their keyboard.
To teach people the automatically discovered strategies, we developed cognitive tutors (see “Intelligent Cognitive Tutors”) that show people step-by-step demonstrations of what the optimal strategies for different environments would do to reach a decision (see Fig. 8). In each step, the strategy selects one click based on which information has already been revealed. At some point, the tutor stops clicking and moves the airplane down the best route indicated by the revealed information. Moving forward, we will refer to cognitive tutors teaching the hierarchical planning strategies discovered by hierarchical BMPS as hierarchical tutors and to the tutors teaching the strategies discovered by non-hierarchical BMPS as non-hierarchical tutors.
To evaluate the effectiveness of these demonstration-based cognitive tutors, we conducted two training experiments in which participants were taught the optimal strategies for flight planning problems equivalent to the two types of environments in which we evaluated our strategy discovery method in “Evaluating the Performance, Scalability, and Robustness of Our Method for Discovering Hierarchical Planning Strategies.” To assess the potential benefits of the hierarchical tutors enabled by our new scalable strategy discovery method, these experiments compare the performance of people who were taught by hierarchical tutors against the performance of people who were taught by non-hierarchical tutors, people who were taught by the original feedback-based tutor for small environments (Lieder et al., 2019, 2020), and people who practiced the task on their own. We developed the best version of each tutor possible given the limited scalability of the underlying strategy discovery method. The increased scalability of our new method enabled the hierarchical tutor to demonstrate the optimal strategy for the task participants faced, whereas the other tutors could only show demonstrations on smaller versions of the task. We found that showing people a small number of demonstrations of the optimal planning strategy significantly improved their decision-making not only when the assumption underlying our method’s hierarchical problem decomposition is met (Experiment 1) but also when it is violated (Experiment 2).
In “Control Experiment,” we report an additional experiment showing that teaching the strategy discovered by our method through demonstrations leads to better decision-making and closer adherence to the optimal strategy than teaching people with random demonstrations that reveal similar information about the environment’s structure.
Training Experiment 1: Teaching People the Optimal Strategy for an Environment with Increasing Variance
In Experiment 1, participants were taught the optimal planning strategy for an environment in which 10 final destinations can be reached through 9 different paths comprising between 4 and 6 steps each (see Fig. 7). The most important property of this environment is that the variance of available rewards doubles from each step to the next, starting from 5 in the first step. Therefore, in this environment, the optimal planning strategy is to first inspect the values of alternative goals, then commit to the best goal one could find, and then plan how to achieve it without ever reconsidering other goals.
Methods
We recruited 168 participants on Amazon Mechanical Turk (average age 34.9 years, range: 19–70 years; 98 female) (Litman et al., 2017). Participants were paid $2.50 plus a performance-dependent bonus (average bonus $2.86). The average duration of the experiment was 46.9 min. Participants could earn a performance-dependent bonus of 1 cent for every 10 points they won in the test trials.
All participants first had to agree to a consent form stating that they were above 18, a US citizen residing in the USA, and fluent in English. After this, instructions about the range of possible rewards (− 250 to 250), the cost of clicking ($1), and the movement keys were presented. Then, participants went through 5 trials to familiarize themselves with the experiment. Following this, participants were either given 10 additional practice trials or completed 10 trials with the cognitive tutor, depending on their experimental condition. Finally, participants were given 15 test trials in the Flight Planning task with 10 possible final destinations illustrated in Fig. 7. Participants started with 50 points at the beginning of the test block.
To evaluate the efficacy of the cognitive tutors, participants were assigned to 4 groups. In the experimental group, participants were taught by the hierarchical tutor. The first control group was taught by the non-hierarchical tutor. The second control group was taught by the feedback-based cognitive tutor (Lieder et al., 2019, 2020) illustrated in Fig. 9. The third control group practiced the Flight Planning task 10 times without feedback. The hierarchical tutor taught the strategy that the hierarchical BMPS algorithm discovered for the task participants had to perform in the test block. It first demonstrated 3 trials with the goal-selection strategy; it then showed three demonstrations of the goal-planning strategy and finally presented 4 demonstrations of the complete strategy combining both parts. The non-hierarchical tutor showed 10 demonstrations of the strategy that non-hierarchical BMPS discovered for the largest version of the Flight Planning task it could handle (i.e., 2 goals instead of 10 goals). Computational bottlenecks confined the feedback-based tutor to a three-step planning task with six possible final destinations shown in Fig. 9 (Lieder et al., 2019, 2020). Participants received feedback on each of their clicks and on their decision about when to stop clicking, as illustrated in Fig. 9. When a participant chose a suboptimal planning operation, they were shown a message stating which planning operation the optimal strategy would have performed instead. In addition, they received a timeout penalty whose duration was proportional to how suboptimal their planning operation had been.
Counterbalanced assignment ensured that participants were equally distributed across four experimental conditions (i.e., 42 participants per condition). To ensure high data quality, we applied a predetermined exclusion criterion. We excluded 7 participants who did not make a single click on more than half of the test trials because not clicking is highly indicative of speeding through the experiment without engaging with the task.
On each trial, the participant’s expected resource-rationality score was calculated as the expected reward of the path they chose minus the cost of the clicks they had made.^{Footnote 7} We analyzed the data using the robust f1.ld.f1 function of the nparLD R package (Noguchi et al., 2012). The function performs a non-parametric ANOVA-type statistic with Box approximation (Box, 1954) to determine the main effect and additional ANOVA-type statistics (ATS) with the denominator degrees of freedom set to infinity for post hoc comparisons (Noguchi et al., 2012).
Results
Table 4 shows the average resource-rationality scores of the four groups on the test trials. According to a Shapiro-Wilk test, participants’ scores on the test trials were not normally distributed in any of the four groups (all p < .001). Therefore, we tested our hypothesis using a non-parametric ANOVA-type statistic with Box approximation (Box, 1954) to test whether there are any significant differences between the groups in our repeated-measures design. We found that people’s performance differed significantly across the four experimental conditions (F(2.85, 144.17) = 3.92, p = .0113). Pairwise post hoc ATS comparisons confirmed that teaching people strategies discovered by the hierarchical method significantly improved their performance (204.48 points/trial) compared to the no-feedback control condition (177.36 points/trial, F(1) = 7.57, p = .006), the feedback-based cognitive tutor (167.02 points/trial, F(1) = 11.89, p < .001), and the non-hierarchical demonstrations (181.58 points/trial, F(1) = 5.02, p = .025). By contrast, neither the feedback-based cognitive tutor (F(1) = 0.57, p = .45) nor the non-hierarchical demonstrations (F(1) = 0.19, p = .67) were more effective than letting people practice the task on their own.
We investigated what percentage of participants followed the general strategy of first planning which goal to pursue and then planning a path to the chosen goal. We measured this by checking whether the participant first clicked any number of goal nodes and then clicked one of the nodes leading to the best discovered goal. A participant is considered to have learned this aspect of the strategy if the described behavior is shown in more than half of the five test trials. The results in Table 5 show that participants from the non-hierarchical demonstration (95%) and hierarchical demonstration (97.56%) conditions learned goal planning to a high degree. Participants of the no-feedback (78.05%) and feedback-based tutor (76.92%) conditions also performed well in mastering this general aspect of the optimal strategy. We compared the proportion of participants who learned this behavior in the hierarchical demonstration condition to the other conditions using a z-test and corrected the p-values for multiple comparisons using the Benjamini-Hochberg method (Benjamini & Hochberg, 1995). This showed a significant difference between participants of the hierarchical demonstration condition versus the no-feedback condition (z = 2.7, p = .01) and the feedback-based tutor condition (z = 2.79, p = .01). We observed no significant difference between participants of the hierarchical demonstration condition and the non-hierarchical demonstration condition (z = 0.61, p = .542). To test how well participants learned to follow the optimal strategy, we further calculated to what extent each participant’s planning operations matched the near-optimal strategy discovered by our method. These additional agreement measures were calculated per planning operation the participant performed. For each planning operation, we reconstructed the current belief state and then used our method to calculate the set of optimal actions, which we compared to the action taken by the participant.
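The criterion for having learned this general goal-planning strategy can be sketched as follows (a minimal illustration with a hypothetical data representation: each trial is a click sequence plus the set of goal nodes and the set of nodes leading to the best discovered goal):

```python
def follows_goal_planning(clicks, goal_nodes, nodes_to_best_goal):
    """Check whether a trial's click sequence first inspects any number of
    goal nodes and then clicks one of the nodes leading to the best
    discovered goal (hypothetical data representation)."""
    i = 0
    while i < len(clicks) and clicks[i] in goal_nodes:
        i += 1
    # At least one goal click, followed by a click on the path to the best goal.
    return i > 0 and i < len(clicks) and clicks[i] in nodes_to_best_goal

def learned_goal_planning(trials):
    """A participant counts as having learned this aspect of the strategy
    if the behavior appears in more than half of the test trials."""
    hits = sum(follows_goal_planning(*t) for t in trials)
    return hits > len(trials) / 2
```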
We found that participants in the hierarchical demonstration condition matched the automatically discovered strategy better (across all test trials performed by participants of this condition, participants chose one of the optimal planning operations in 35.84% of their actions, with a standard deviation of ± 15%) than participants in the no-feedback control condition (22.08% ± 12%), the feedback condition (27.44% ± 13%), and the non-hierarchical demonstration condition (28.51% ± 13%). The ANOVA-type statistic with Box approximation (Box, 1954) showed a significant effect of the condition on click agreement (F(3.00,156.70) = 21.73, p < .001). Post hoc ATS comparisons showed significant differences between hierarchical demonstrations and the control conditions: the no-feedback condition (F(1) = 64.00, p < .001), the feedback-based cognitive tutor (F(1) = 20.90, p < .001), and the non-hierarchical demonstration (F(1) = 16.59, p < .001). Additional significant effects were present between the no-feedback condition and the feedback-based cognitive tutor (F(1) = 11.44, p < .001) as well as the non-hierarchical demonstration (F(1) = 16.36, p < .001). No significant difference was found between the participants who were taught by the feedback-based cognitive tutor and those who saw the non-hierarchical demonstration (F(1) = 0.35, p = .56). We further analyzed which aspects of the optimal strategy participants learned. Results of the extended analysis, consisting of 6 additional measures, can be found in Table 5. In this analysis, we split the click agreement measure into three subcategories: goal agreement, path agreement, and termination agreement. Goal agreement and path agreement measure how well participants’ clicks match the optimal strategy when planning which goal to pursue (goal agreement) and how to achieve the selected goal (path agreement), respectively. The higher a participant’s agreement with the optimal strategy, the better they performed.
We further split the termination measure into overall termination agreement, goal termination agreement, and path termination agreement. The overall termination agreement measured how well participants stop planning when it is no longer beneficial. Goal and path termination agreement measured how well participants switch between planning which goal to pursue and how to achieve the selected goal. When calculating path agreement and path termination agreement, we allowed for mistakes in goal selection (i.e., planning an optimal path to a suboptimally chosen goal will still result in a high path agreement score). The termination agreement measures were calculated using the balanced accuracy measure to equally account for both not terminating too early and not terminating too late. Participants of all conditions matched the optimal strategy for selecting goals roughly equally well, with agreement scores ranging from 32% to 38%. Participants trained by the feedback-based tutor performed slightly better at selecting goals matching the optimal strategy (38.08%) than participants of the hierarchical demonstration condition (34.86%). These differences were not statistically significant (F(2.94,151.95) = 2.6, p = .056). The agreement measure for planning how to achieve the selected goal showed significant differences (F(2.95,152.54) = 28.45, p < .001). Participants trained by the hierarchical demonstration achieved a higher agreement score (44.45%) than participants of the other conditions: no-feedback (17.82%; F(1) = 74.57, p < .001), feedback-based tutor (23.00%; F(1) = 42.39, p < .001), and non-hierarchical demonstration (30.37%; F(1) = 15.88, p < .001). Overall, path and goal agreement were relatively low across conditions, with under 50% of participants’ clicks matching the resource-rational strategy demonstrated by our intelligent tutor. We identified the complicated reward structure of the goal nodes as a potential cause for the low overall agreement.
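Balanced accuracy, which we used for the termination agreement measures, averages the accuracy on decision points where the optimal strategy terminates and those where it continues, so that neither terminating too early nor terminating too late dominates the score. A minimal sketch (hypothetical representation of the terminate/continue labels):

```python
def balanced_accuracy(optimal_terminates, participant_terminates):
    """Balanced accuracy for binary terminate/continue decisions: the mean
    of sensitivity (correctly terminating when the optimal strategy does)
    and specificity (correctly continuing when it does not)."""
    pairs = list(zip(optimal_terminates, participant_terminates))
    tp = sum(1 for t, p in pairs if t and p)          # correct terminations
    tn = sum(1 for t, p in pairs if not t and not p)  # correct continuations
    pos = sum(1 for t in optimal_terminates if t)
    neg = len(optimal_terminates) - pos
    sensitivity = tp / pos if pos else 0.0
    specificity = tn / neg if neg else 0.0
    return (sensitivity + specificity) / 2
```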
The increase in the variance of goal rewards from 100 to 180 (see “Evaluation in Environments That Conform to the Assumed Structure”) resulted in an optimal strategy that first investigates goals 5 and 10, while investigating other goal nodes first was slightly suboptimal. This detail of the optimal strategy was not well understood by participants: while participants learned to click a goal node as the first planning action in the vast majority of trials (93.66% for the no-feedback condition, > 99% for all other conditions), participants learned to click either goal 5 or goal 10 as the first action in only slightly more than half of the trials (50.73% for participants of the no-feedback condition, 56.58% for the feedback-based tutor condition, 57.67% for the non-hierarchical demonstration condition, and 62.11% for the hierarchical demonstration condition). Lastly, we compared the termination agreement for overall termination, goal termination, and path planning termination. Across all termination agreement measures and conditions, participants learned when to terminate quite well (> 60% agreement). Participants trained by the hierarchical demonstration matched the optimal strategy slightly more closely than participants of the other conditions on all three measures. For termination agreement, conditions differed significantly (F(2.93,151.19) = 3.29, p = .023): the post hoc ATS showed that participants of the hierarchical demonstration condition achieved a significantly higher agreement than participants of the non-hierarchical demonstration condition (F(1) = 11.22, p < .001), but not than participants of the no-feedback condition or the feedback-based tutor condition (p > .1 for both). We did not detect a significant difference between the conditions for either goal termination agreement (F(2.90,148.66) = 2.14, p = .099) or path termination agreement (F(2.84,144.14) = 2.62, p = .057).
Overall, these results showed that the hierarchical method is better able to discover and teach resource-rational strategies for environments in which previous methods failed.
Training Experiment 2: Teaching People the Optimal Strategy for a Risky Environment
In Experiment 2, participants were taught the strategy our method discovered for the 8-step decision problem illustrated in Fig. 6. Critically, in this environment, each path contains one risky node that harbors an extreme loss with a probability of 10%. Therefore, the optimal strategy for this environment inspects the risky node while planning how to achieve the selected goal and then switches to another goal when it encounters a large negative reward on the path to the initially selected goal.
Method
To test whether our approach can also improve people’s performance in environments with this more complex structure, we created two demonstration-based cognitive tutors that teach the strategies discovered by hierarchical BMPS with goal-switching and hierarchical BMPS without goal-switching, respectively, and a feedback-based tutor that teaches the optimal strategy for a 3-step version of the risky environment.
This experiment used a Flight Planning Task that is analogous to the environment described in “Evaluation in Risky Environments” (see Fig. 6). Specifically, the environment comprises 4 goal nodes and 60 intermediate nodes (i.e., 15 per goal). Although each goal can be reached through multiple paths, all of those paths lead through an unavoidable node that has a 10% risk of harboring a large loss of − 1500. The aim of this experiment was to verify that we are still able to improve human planning even when the environment requires a more complex strategy that occasionally switches goals during planning. To test this hypothesis, we showed participants in the experimental condition demonstrations of the strategy discovered by our method and compared their performance to the resource-rationality scores of three control groups. The first control group was shown demonstrations of the strategy discovered by the version of our method without goal-switching; the second control group discovered their own strategy in five training trials; the third control group practiced planning on a three-step task with a feedback-based tutor (see “Intelligent Cognitive Tutors”) (Lieder et al., 2019, 2020). The environment used by the feedback-based tutor mimicked the high-risk environment. To achieve this, we changed the reward distribution of the intermediate nodes so that there is a 10% chance of a negative reward of − 96, a 30% chance of − 4, a 30% chance of + 4, and a 30% chance of + 8. We then recomputed the optimal feedback using dynamic programming (Lieder et al., 2019, 2020).
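As a quick sanity check on this modified reward distribution, its expected per-node reward can be computed directly (a minimal sketch; the rare large loss makes the expectation clearly negative, mirroring the risk structure of the larger environment):

```python
# Reward distribution of intermediate nodes in the feedback tutor's mimic
# environment: a 10% chance of -96, and 30% chances each of -4, +4, and +8.
outcomes = {-96: 0.10, -4: 0.30, 4: 0.30, 8: 0.30}

assert abs(sum(outcomes.values()) - 1.0) < 1e-9  # probabilities sum to one

# The rare large loss dominates: the expected reward per node is -7.2.
expected_reward = sum(r * p for r, p in outcomes.items())
```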
We recruited 201 participants (average age 34.01 years, range: 19–70 years; 101 female) on Amazon’s Mechanical Turk (Litman et al., 2017) over three consecutive days. All but two of them completed the assignment. Applying the same predetermined exclusion criterion as in Experiment 1 (i.e., excluding participants who did not engage with the environment in more than half of the test trials) led to the exclusion of 30 participants (15%), leaving us with 169 participants. Participants were paid $1.30 and a performance-dependent bonus of up to $1. The average bonus was $0.56, and the average duration of the experiment was 16.28 min. Participants were randomly assigned to one of four experimental conditions determining their training in the planning task. All groups were tested on five identical test trials. The data was analyzed using the robust f1.ld.f1 function of the nparLD R package (Noguchi et al., 2012). As in Training Experiment 1, the participant’s resource-rationality score for each trial was calculated as the expected reward of the path they chose minus the cost of the clicks they had made (Footnote 8).
Results
The results of the experiment are summarized in Table 6. Since the Shapiro-Wilk test showed that the scores in none of the four conditions were normally distributed (p < .001 for all), we again used the non-parametric ANOVA-type statistic with Box approximation (Box, 1954) to evaluate the data. The test showed significant differences between the four groups (F(2.79,150.28) = 20.19, p < .001). Pairwise robust ATS post hoc comparisons showed that participants trained with demonstrations of the strategy discovered by hierarchical BMPS with goal-switching significantly outperformed all other conditions: participants trained by purely hierarchical demonstrations without goal-switching (F(1) = 58.72, p < .001), participants who did not receive demonstrations (F(1) = 31.13, p < .001), and participants who had practiced planning with optimal feedback on a smaller analogous environment (F(1) = 32.07, p < .001). The performance of the three control groups was statistically indistinguishable (all p ≥ .18). Participants trained by purely hierarchical demonstrations did not perform significantly better than participants who trained with optimal feedback (F(1) = 1.82, p = .18) or participants who did not receive demonstrations (F(1) = 0.34, p = .56). Additionally, there was no significant difference between the optimal feedback condition and the no-feedback condition (F(1) = 0.28, p = .6).
The participants from the goal-switching demonstration condition learned goal planning to a high degree (i.e., 60%). We again used a z-test and applied the Benjamini-Hochberg method (Benjamini & Hochberg, 1995) to compare goal planning performance between conditions. This showed that participants of the goal-switching demonstration condition learned goal planning better than those of the other conditions: no-feedback (35.71%; z = 2.265, p = .023), feedback-based tutor (20.93%; z = 3.725, p < .001), and hierarchical demonstration (25.64%; z = 3.164, p = .002). We repeated our statistical analysis for the click agreement measure, finding a significant effect of the experimental condition (F(2.58,132.19) = 47.45, p < .001). Comparing participants’ strategies to the strategy discovered by our method using the post hoc ATS showed that the planning operations of participants trained with our goal-switching strategy matched those of the optimal strategy more often (59.94% ± 29%) than those of participants in the no-feedback condition (24.33% ± 24%; F(1) = 56.62, p < .001), the feedback tutor condition (17.22% ± 14%; F(1) = 164.80, p < .001), and the hierarchical demonstration condition without goal-switching (42.98% ± 23%; F(1) = 10.82, p = .001). We found additional significant differences between the hierarchical demonstration condition without goal-switching and the other two control conditions: optimal feedback (F(1) = 82.61, p < .001) and no-feedback (F(1) = 22.80, p < .001). Comparing the optimal feedback and no-feedback conditions revealed no significant difference (F(1) = 3.01, p = .082). We again analyzed which aspects of the optimal strategy participants learned using the additional click agreement measures reported in “Results.” The results for the additional measures can be found in Table 7. We found significant differences in how well participants planned which goal to pursue (F(2.41,120.22) = 32.33, p < .001).
Participants in the goal-switching demonstration condition (78.21%) and the hierarchical demonstration condition (77.73%) performed best at selecting goals according to the optimal strategy and did not differ significantly (F(1) = 0.05, p = .83). Participants in the goal-switching demonstration condition had a significantly higher goal agreement than participants in the no-feedback condition (57.76%; F(1) = 11.36, p < .001) and participants of the feedback tutor condition (36.01%; F(1) = 110.5, p < .001). The path planning agreement also showed significant differences between the conditions (F(2.45,117.43) = 19.12, p < .001). Participants trained by the goal-switching demonstration achieved an agreement score of 72.71%, whereas the other conditions achieved significantly lower agreement: no-feedback (24.47%; F(1) = 42.53, p < .001), hierarchical demonstration (21.50%; F(1) = 36.96, p < .001), and feedback tutor (30.12%; F(1) = 37.89, p < .001). Similarly, participants trained by the goal-switching demonstration achieved higher termination agreement scores, matching the optimal strategy’s termination rule more closely than participants of the no-feedback, feedback-based tutor, and hierarchical demonstration conditions across all three termination agreement measures. For all three termination measures, the differences between conditions were statistically significant: termination agreement (F(2.96,161.39) = 10.69, p < .001), goal termination agreement (F(2.77,145.41) = 17.62, p < .001), and path termination agreement (F(2.86,152.32) = 10.54, p < .001). When deciding when to stop planning (termination agreement), participants who received goal-switching demonstrations matched the optimal strategy to a higher percentage (76.35%) than participants of the no-feedback condition (69.19%; F(1) = 5.65, p = .017), the hierarchical demonstration condition (68.35%; F(1) = 7.41, p = .006), and the feedback tutor condition (63.05%; F(1) = 34.31, p < .001).
When deciding when to stop planning which goal to pursue (goal termination agreement), participants of the goal-switching demonstration condition (84%) performed better than participants who received no feedback (71.75%; F(1) = 8.62, p = .003) and participants in the feedback-tutor condition (57.31%; F(1) = 61.80, p < .001), but there was no significant difference to participants who received hierarchical demonstrations (78.19%; F(1) = 2.65, p = .103). Lastly, participants who received the goal-switching demonstration (79.66%) also performed significantly better than all other conditions when deciding when to stop planning how to achieve the pursued goal (path termination agreement): no-feedback (61.85%; F(1) = 12.26, p < .001), hierarchical demonstration (54.91%; F(1) = 21.35, p < .001), and feedback-tutor (58.21%; F(1) = 26.83, p < .001).
The results of this experiment showed that we can significantly improve human decision-making by showing people demonstrations of the automatically discovered hierarchical planning strategy with goal-switching. This is a unique advantage of our new method because none of the other approaches was able to improve people’s decision-making in this large and risky environment. Comparing human performance to the optimal performance of our algorithm in the same environment (see Table 2) shows that, even though we were able to improve human performance, participants still did not fully understand the strategy based on the demonstrations alone. This reveals the limitations of teaching planning strategies purely with demonstrations, especially for more complex strategies. Improving upon the pedagogy of our purely demonstration-based hierarchical tutor is an important direction for future work. The different click agreement measures further showed that while participants who received demonstrations of our method did not entirely master the optimal strategy, they were able to find the optimal action in a significantly higher proportion of trials.
Control Experiment
To investigate a potential confound, we ran an additional experiment using the 8-step decision problem shown in Fig. 6 (i.e., the environment used in Experiment 2). The goal of this experiment was to investigate whether participants truly benefit from demonstrations of the near-optimal strategy or if they merely learn about the environment’s structure and then use this knowledge to inform their own strategy.
Methods
The hypothesis we wanted to verify in this experiment is that participants trained by our tutor would learn to follow the optimal strategy more closely than participants who learn the same information about the environment structure (e.g., potential rewards of different nodes) but do not see any demonstrations of the strategy discovered by our method. To test this hypothesis, we compared the performance of participants who were shown demonstrations of our method (analogous to the goal-switching demonstration condition in Experiment 2) to a new control condition. Participants in the control condition were shown random demonstrations in which the demonstrator clicks random nodes and then moves along a random path through the environment. The actions of the random demonstrations were matched to the types of nodes clicked in the demonstration of our method (e.g., if the demonstration of our method revealed two goal nodes and two risky nodes, the random demonstration in that environment would also click two random goal nodes and two random risky nodes). Additionally, the random demonstrations also contained 46 randomly selected additional clicks that were not included in our method’s demonstration, arguably teaching participants more information about the structure of the environment than the demonstration of our method.
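The construction of type-matched random demonstrations can be sketched as follows (a minimal illustration with a hypothetical data representation: each node carries a type label, and the random demonstration clicks the same number of nodes of each type as our method’s demonstration did):

```python
import random
from collections import Counter

def random_demonstration(method_clicks, node_type, nodes_by_type, rng=random):
    """Generate a random demonstration whose clicks match the *types* of
    nodes clicked in the method's demonstration: e.g., two goal-node clicks
    in the original yield two clicks on randomly chosen goal nodes."""
    type_counts = Counter(node_type[n] for n in method_clicks)
    clicks = []
    for t, k in type_counts.items():
        clicks.extend(rng.sample(nodes_by_type[t], k))  # sample without replacement
    rng.shuffle(clicks)
    return clicks
```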
We recruited 100 participants (average age 25.06 years, range: 18–48; 53 female) on Prolific (http://www.prolific.co). Applying the exclusion criterion used in Experiment 1 (excluding participants who did not engage with the task) led to the exclusion of 20 participants (20%). Two additional participants were removed because they indicated that they did not try their best to achieve a high score. Participants were paid £2.00 plus a performance-dependent bonus of up to £1.00. The average bonus was £0.50, and the average completion time was 18.9 min. Participants were randomly assigned to one of the two conditions. The data was analyzed analogously to Experiments 1 and 2 using the nparLD R package (Noguchi et al., 2012).
Results
The results for this experiment can be found in Tables 8 and 9. We used the non-parametric ANOVA-type statistic with Box approximation (Box, 1954) to analyze the data. Our analysis showed a significant difference between the resource-rationality scores of the two conditions (F(1,74.86) = 5.53, p = .021). Participants of the goal-switching demonstration condition achieved a higher resource-rationality score than participants who observed random demonstrations. With 59.52%, a significantly higher percentage of participants of the goal-switching demonstration condition learned to follow the general goal planning strategy of first selecting a goal and then planning how to achieve it than in the control condition (22.22%; z = 3.324, p < .001). We again compared how well participants matched the optimal strategy (click agreement) and observed a higher match for the goal-switching demonstration condition (48.71% ± 28%) than for the random demonstration condition (25.28% ± 23%). Analogous to the analyses of the previous experiments, we used the ANOVA-type statistic to confirm the statistical significance of this difference (F(1,74.92) = 29.43, p < .001). Comparing the detailed click agreement measures reported in Table 9, participants who received the goal-switching demonstration of the optimal strategy achieved a significantly higher match with the optimal strategy across the following measures: goal agreement (F(1,64.17) = 8.09, p = .006), path agreement (F(1,75.59) = 5.38, p = .023), and termination agreement (F(1,72.97) = 4.73, p = .033). We did not detect a significant difference between the two conditions for goal termination agreement (F(1,71.91) = 1.77, p = .188) or path termination agreement (F(1,74.26) = 0.39, p = .53).
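The z-tests used here and in the training experiments to compare the share of participants who learned the goal planning strategy can be sketched as a standard pooled two-proportion test (an illustration, not the exact code used in the analysis):

```python
from math import sqrt, erf

def two_proportion_z(success1, n1, success2, n2):
    """Pooled two-proportion z-test comparing the fraction of 'learners'
    in two independent groups; returns the z statistic and the two-sided
    p-value under the normal approximation."""
    p1, p2 = success1 / n1, success2 / n2
    pooled = (success1 + success2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF via erf.
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p
```

For multiple comparisons, the resulting p-values would then be adjusted with the Benjamini-Hochberg procedure, as described above.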
The results of this experiment showed that teaching participants the discovered strategy leads to significantly better resource-rationality scores than purely teaching them about the reward structure of the environment. Although participants of the random demonstration condition received even more information about the environment’s rewards, they were unable to achieve the same level of resource-rationality. The click agreement measures further showed that participants of the goal-switching demonstration condition learned the optimal strategy to a greater extent. We observed that participants in the goal-switching demonstration condition of this experiment performed worse than participants of the same condition in Training Experiment 2 (see Table 6). Apart from some randomness due to sampling from a different online crowdsourcing platform, we believe this was mostly caused by a small number of outliers. Comparing the median expected scores shows a much smaller discrepancy between the experiments: − 57.5 (interquartile range: 145) for this experiment and − 20 (interquartile range: 140) for Training Experiment 2. Importantly, in both experiments we observed a statistically significant improvement in expected resource-rationality score and click agreement.
General Discussion
To make good decisions in complex situations, people and machines have to use efficient planning strategies because planning is costly. Efficient planning strategies can be discovered automatically, but computational challenges confined previous strategy discovery methods to tiny problems. To overcome this problem, we devised a more scalable machine learning approach to automatic strategy discovery. The central idea of our method is to decompose the strategy discovery problem into discovering goal-setting strategies and discovering goal-achievement strategies. In addition, we made a substantial algorithmic improvement to the state-of-the-art method for automatic strategy discovery (Callaway et al., 2018a) by introducing the tree contraction method. We found that this hierarchical decomposition of the planning problem, together with our tree contraction method, drastically reduces the time complexity of automatic strategy discovery without compromising the quality of the discovered strategies in many cases. Furthermore, by introducing the tree contraction method, we have extended the set of environment structures that automatic strategy discovery can be applied to from trees to directed acyclic graphs. These advances significantly extend the range of strategy discovery problems that can be solved by making the algorithm faster, more scalable, and applicable to environments with more complex structures. This is an important step towards discovering efficient planning strategies for real-world problems.
Recent findings suggest that teaching people automatically discovered efficient planning strategies is a promising way to improve their decisions (Lieder et al., 2019, 2020). Due to computational limitations, this approach was previously confined to sequential decision problems with at most three steps (Lieder et al., 2019, 2020). The strategy discovery methods developed in this article make it possible to scale up this approach to larger and more realistic planning tasks. As a proof of concept, we showed that our method makes it possible to improve people’s decisions in planning tasks with up to 8 steps and up to 10 final goals. We evaluated the effectiveness of showing people demonstrations of the strategies discovered by our method in two separate experiments in which the environments were so large that previous methods were unable to discover planning strategies within a time budget of 8 h. Thus, the best one could do at training people with previous methods was to construct cognitive tutors that taught the optimal strategy for a smaller environment with a similar optimal strategy, or to have people practice without feedback. Evaluating our method against these alternative approaches, we found that our approach was the only one that was significantly more beneficial than having people practice the task on their own. To the best of our knowledge, this makes our algorithm the only strategy discovery method that can improve human performance on sequential decision problems of this size. This suggests that our approach makes it possible to leverage reinforcement learning to improve human decision-making in problems that were out of reach for previous intelligent tutors.
Our empirical findings have several important implications for the debate about human rationality. First, we found that across many instances of two different classes of large sequential decision-making problems, the resource-rationality score of human planning was substantially lower than that of the approximately resource-rational planning strategies discovered by our model. Second, we found that teaching people the automatically discovered planning strategies significantly improved their resource-rationality score in both environments. These findings suggest that, unlike in small problems (Callaway et al., 2018b, 2020), the planning strategies people use in large sequential decision problems might be far from resource-rational. This interpretation should be taken with a grain of salt, since some of the discrepancies could be due to unaccounted-for cognitive costs, but people choosing their planning operations suboptimally likely plays an important role as well. Our findings add three important nuances to the resource-rational perspective on human rationality (Lieder & Griffiths, 2020a, b). First, they suggest that human cognition might systematically deviate from the principles of resource-rationality. Second, they suggest that the magnitude of these deviations depends on the complexity of the task. Third, they suggest that those suboptimalities can be ameliorated by teaching people resource-rational strategies.
Our method’s hierarchical decomposition of the planning problem exploits the fact that people can typically identify potential mid- or long-term goals that might be much more valuable than any of the rewards they could attain in the short run. This corresponds to the assumption that the rewards available in more distant states are more variable than the rewards available in more proximal states. When this assumption is satisfied, our method discovers planning strategies much more rapidly than previous methods, and the discovered strategies are as good as or better than those discovered with the best previous methods. When this assumption is violated, the goal-switching mechanism of our method can compensate for the mismatch. This allows the discovered strategies to perform almost as well as the strategies discovered by BMPS. The more strongly its assumption is violated, the more our method relies on this mechanism. In doing so, it automatically trades off its computational speedup against the quality of the resulting strategy. This shows that our method is robust to violations of its assumptions about the structure of the environment; it exploits simplifying structure only when it exists.
Some aspects of our work share similarities with recent work on goal-conditioned planning (Nasiriany et al., 2019; Pertsch et al., 2020), although the problem we solved is conceptually different. Both of the aforementioned methods optimize the route to a given final location, whereas our method learns a strategy for solving sequential decision problems in which the strategy chooses the final state itself. Furthermore, while Nasiriany et al. (2019) specified a fixed strategy for selecting the sequence of goals, our method learns such a strategy itself. Critically, while the policies learned by Nasiriany et al. (2019) select physical actions (e.g., move left), the metalevel policies learned by our method select planning operations (i.e., simulate the outcome of taking action a in state s and update the plan accordingly). Finally, our method explicitly considers the cost of planning to find algorithms that achieve the optimal tradeoff between the cost of planning and the quality of the resulting decisions.
Our method’s scalability comes at a price. Since our approach decomposes the full sequential decision problem into two subproblems (goal selection and goal planning), its accuracy can be limited by the fact that it never considers the whole problem space at once. This is unproblematic when the environment’s structure matches our method’s assumption that the rewards of potential goals are more variable than more proximal rewards, but it could be problematic when this assumption is violated too strongly. We mitigated this potential problem by allowing the strategy discovery algorithm to switch goals. Even with this adaptation, the discovered strategy is not optimal in all cases: since the representation of an alternative goal’s reward is defined as its average expected reward, the algorithm will only switch goals if the current goal’s reward is below average. However, if the current goal’s expected return is above average, the discovered strategy will not explore other goals even when doing so would lead to a higher reward. On balance, we think that the scalability of our method to large environments outweighs this minor loss in resource-rationality score.
The advances presented in this article open up many exciting avenues for future work. For instance, our approach could be extended to plans with potentially many levels of hierarchically nested subgoals. Future work might also extend our method so that any state can be selected as a goal.
In its current form, our algorithm always selects only the environment’s most distant states (leaf nodes) as candidate goals. Future versions might allow the set of candidate goals to be chosen more flexibly, such that some leaf nodes can be ignored and especially important intermediate nodes in the tree can be considered as potential subgoals. A more flexible definition, and potentially a dynamic selection, of goal nodes could increase the strategy discovery algorithm’s performance and possibly allow us to solve a wider range of more complex problems. It would also mitigate the limitations of the increasing-variance assumption by considering all potentially valuable states as (sub)goals regardless of where they are located. Another important direction for future work is to define more realistic decision problems within the meta-MDP framework. Having shown in this work that our method is able to improve human planning in a number of abstract example environments, a logical next step is to apply it to a wider range of more realistic scenarios and validate that people can still learn and benefit from the discovered strategies in those environments.
The advances reported in this article have potential applications in artificial intelligence, cognitive science, and human-computer interaction. First, since the hierarchical structure exploited by our method exists in many real-world problems, it may be worthwhile to apply our approach to discovering planning algorithms for other real-world applications of artificial intelligence where information is costly. This could be a promising step towards AI systems with a (super)human level of computational efficiency. Second, our method also enables cognitive scientists to scale up the resource-rational analysis methodology for understanding the cognitive mechanisms of decision-making (Lieder and Griffiths, 2020b) to increasingly more naturalistic models of the decision problems people face in real life. Third, future work will apply the methods developed in this article to train and support people in making real-world decisions they frequently face. Our approach is especially relevant when acquiring information that might improve a decision is costly or time-consuming, as is the case in many real-world decisions. For instance, when a medical doctor plans how to treat a patient’s symptoms, acquiring an additional piece of information might mean ordering an MRI scan that costs $1000. Similarly, a holiday-planning app would have to be mindful of the user’s time when deciding which series of places and activities the user should evaluate to efficiently plan their road trip or vacation. Similar tradeoffs exist in project planning, financial planning, and time management. Furthermore, our approach can also be applied to support the information collection process of hiring decisions, purchasing decisions, and investment decisions. Discovering optimal strategies for such complex tasks requires a hierarchical approach. Our approach could be used to help train people to be better decision-makers with intelligent tutors (Lieder et al., 2019, 2020).
Alternatively, the strategies could be conveyed by decision support systems that guide people through real-life decisions by asking a series of questions. In this case, each question the system asks would correspond to an adaptively chosen information-gathering operation. For example, more powerful versions of our approach could be used to help people plan what they want to do to achieve a high quality of life until the age of 80. Such problems require multi-step planning, and their outcomes are affected by one’s decisions early on. Even if people do not engage in multi-step decision-making, the problem persists, and our approach would be able to improve human performance wherever the outcome depends on a long series of actions. In summary, the reinforcement learning method developed in this article is an important step towards intelligent systems with (super)human-level computational efficiency, towards understanding how people make decisions, and towards leveraging artificial intelligence to improve human decision-making in the real world. At a high level, our findings support the conclusion that incorporating cognitively informed hierarchical structure into reinforcement learning methods can make them more useful for real-world applications.
Availability of Data and Material (Data Transparency)
All materials of the behavioral experiments we conducted are available at https://github.com/RationalityEnhancement/SSD_Hierarchical/master/HumanExperiments. Anonymized data from the experiments is available at https://github.com/RationalityEnhancement/SSD_Hierarchical/master/HumanExperiments.
Code Availability (Software Application or Custom Code)
The code of the machine learning methods introduced in this article is available at https://github.com/RationalityEnhancement/SSD_Hierarchical.
Notes
A node’s depth is defined as the length of the longest path connecting this node to the root node.
The environments used in this paper have up to 5^{90} possible belief states.
The performance with and without goal-switching is identical because the discovered optimal strategy never switches goals in the given environment structure.
While this strategy was discovered assuming that the cost of evaluating a potential goal node is the same as the cost of evaluating an intermediate node, we found that the discovered strategy remained the same as we increased the cost of evaluating goal nodes to 2, 5, or 10.
Detailed analysis of the robustness to different variance ratios is available in Appendix 3.
Without our tree contraction method, the original version of BMPS would not have been scalable enough to handle this environment.
As in the simulations, the cost per click in this environment was 1.
As in the simulations, the cost per click was 10.
The continuous normal distribution is discretized into 4 bins. Including the undiscovered state, each node therefore has 5 possible states.
A loose upper bound for the case with goal-switching can be obtained trivially. The upper bound for the goal-setting phase would not change, and the time taken for the goal-achievement phase can be added for each goal switch. Since the maximum possible number of goal switches is M, the time taken for the goal-achievement phase can be added M times, leading to the overall upper bound O(M^{2} ⋅ B + B^{M} + M ⋅ N^{2} ⋅ B^{N}).
Two nodes are consecutive if they are in a direct parent-child relation.
References
Aronson, J E, Liang, T P, & MacCarthy, R V. (2005). Decision support systems and intelligent systems (Vol. 4). Upper Saddle River: Pearson Prentice-Hall.
Benjamini, Y, & Hochberg, Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300.
Botvinick, M M (2008). Hierarchical models of behavior and prefrontal function. Trends in Cognitive Sciences, 12(5), 201–208.
Box, G E, et al. (1954). Some theorems on quadratic forms applied in the study of analysis of variance problems, i. Effect of inequality of variance in the oneway classification. The Annals of Mathematical Statistics, 25(2), 290–302.
Callaway, F, Lieder, F, Krueger, PM, & Griffiths, TL (2017). MouselabMDP: a new paradigm for tracing how people plan. In The 3rd multidisciplinary conference on reinforcement learning and decision making. https://osf.io/vmkrq/. Ann Arbor.
Callaway, F, Gul, S, Krueger, P M, Griffiths, T L, & Lieder, F. (2018a). Learning to select computations. Uncertainty in Artificial Intelligence. 34th Conference on Uncertainty in Artificial Intelligence 2018 (pp. 776–785).
Callaway, F, Lieder, F, Das, P, Gul, S, Krueger, P, & Griffiths, T. (2018b). A resourcerational analysis of human planning. In C. Kalish, M. Rau, J. Zhu, & T. Rogers (Eds.) CogSci 2018.
Callaway, F, van Opheusden, B, Gul, S, Das, P, Krueger, P, Lieder, F, & Griffiths, T. (2020). Human planning as optimal information seeking. Manuscript under review.
Carver, C S, & Scheier, M F. (2001). On the selfregulation of behavior. Cambridge: Cambridge University Press.
Gigerenzer, G, & Selten, R. (2002). Bounded rationality: the adaptive toolbox. Cambridge, MA, USA: MIT Press.
Griffiths, T L (2020). Understanding human intelligence through human limitations. Trends in Cognitive Sciences, 24(11), 873–883.
Griffiths, T L, Callaway, F, Chang, M B, Grant, E, Krueger, P M, & Lieder, F (2019). Doing more with less: metareasoning and metalearning in humans and machines. Current Opinion in Behavioral Sciences, 29, 24–30.
Hafenbrädl, S, Waeger, D, Marewski, J N, & Gigerenzer, G (2016). Applied decision making with fastandfrugal heuristics. Journal of Applied Research in Memory and Cognition, 5(2), 215–231.
Hay, N, Russell, S, Tolpin, D, & Shimony, SE. (2014). Selecting computations: theory and applications. arXiv:1408.2048.
Hertwig, R, & Grüne-Yanoff, T (2017). Nudging and boosting: steering or empowering good decisions. Perspectives on Psychological Science, 12(6), 973–986.
Huys, Q J, Eshel, N, O’Nions, E, Sheridan, L, Dayan, P, & Roiser, J P (2012). Bonsai trees in your head: how the Pavlovian system sculpts goaldirected choices by pruning decision trees. PLoS Computational Biology, 8(3), e1002410.
Johnson, E J, & Goldstein, D. (2003). Do defaults save lives?
Kaelbling, L P, & Lozano-Pérez, T. (2010). Hierarchical planning in the now. In Workshops at the twenty-fourth AAAI conference on artificial intelligence.
Kemtur, A, Jain, Y, Mehta, A, Callaway, F, Consul, S, Stojcheski, J, & Lieder, F. (2020). Leveraging machine learning to automatically derive robust planning strategies from biased models of the environment. In CogSci 2020, CogSci.
Krueger, P M, Lieder, F, & Griffiths, T. L. (2017). Enhancing metacognitive reinforcement learning using reward structures and feedback. In Proceedings of the 39th annual conference of the cognitive science society. Cognitive Science Society.
Larrick, R P. (2004). Debiasing. Blackwell handbook of judgment and decision making, pp 316–338.
Lieder, F, & Griffiths, T L (2020a). Advancing rational analysis to the algorithmic level. Behavioral and Brain Sciences, 43, e27.
Lieder, F, & Griffiths, T L (2020b). Resourcerational analysis: understanding human cognition as the optimal use of limited computational resources. Behavioral and Brain Sciences, 43, e1.
Lieder, F, Krueger, P M, & Griffiths, T. (2017). An automatic method for discovering rational heuristics for risky choice. In CogSci.
Lieder, F, Callaway, F, Jain, Y, Krueger, P, Das, P, Gul, S, & Griffiths, T. (2019). A cognitive tutor for helping people overcome present bias. In RLDM 2019.
Lieder, F, Callaway, F, Jain, Y R, Das, P, Iwama, G, Gul, S, Krueger, P, & Griffiths, T L. (2020). Leveraging artificial intelligence to improve people’s planning strategies. Manuscript in revision.
Lin, C H, Kolobov, A, Kamar, E, & Horvitz, E. (2015). Metareasoning for planning under uncertainty. In Twenty-fourth international joint conference on artificial intelligence.
Litman, L, Robinson, J, & Abberbock, T (2017). Turkprime.com: a versatile crowdsourcing data acquisition platform for the behavioral sciences. Behavior Research Methods, 49(2), 433–442.
Marthi, B, Russell, S J, & Wolfe, J. A. (2007). Angelic semantics for highlevel actions. In Seventeenth international conference on automated planning and scheduling (pp. 232–239).
Miller, G A, Galanter, E, & Pribram, K H. (1960). Plans and the structure of behavior.
Mnih, V, Kavukcuoglu, K, Silver, D, Graves, A, Antonoglou, I, Wierstra, D, & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv:1312.5602.
Mockus, J. (2012). Bayesian approach to global optimization: theory and applications (Vol. 37). Springer Science & Business Media.
Nasiriany, S, Pong, V, Lin, S, & Levine, S. (2019). Planning with goalconditioned policies. In Advances in neural information processing systems (pp. 14843–14854).
Noguchi, K, Gel, YR, Brunner, E, & Konietschke, F (2012). nparLD: an R software package for the nonparametric analysis of longitudinal data in factorial experiments. Journal of Statistical Software, 50 (12), 1–23. http://www.jstatsoft.org/v50/i12/.
O’Donoghue, T, & Rabin, M (2015). Present bias: lessons learned and to be learned. American Economic Review, 105(5), 273–79.
Pertsch, K, Rybkin, O, Ebert, F, Finn, C, Jayaraman, D, & Levine, S. (2020). Long-horizon visual planning with goal-conditioned hierarchical predictors. arXiv:2006.13205.
Russell, S, & Norvig, P. (2002). Artificial intelligence: a modern approach.
Russell, SJ, & Wefald, E. (1991). Do the right thing: studies in limited rationality. Cambridge, MA, USA: MIT Press.
Russell, S, & Wefald, E (1992). Principles of metareasoning. Artificial Intelligence, 49(1–3), 361–395.
Sacerdoti, E D (1974). Planning in a hierarchy of abstraction spaces. Artificial Intelligence, 5(2), 115–135.
Schapiro, A C, Rogers, T T, Cordova, N I, Turk-Browne, N B, & Botvinick, M M (2013). Neural representations of events arise from temporal community structure. Nature Neuroscience, 16(4), 486.
Sezener, E, & Dayan, P. (2020). Static and dynamic values of computation in mcts. In Conference on uncertainty in artificial intelligence, PMLR (pp. 31–40).
Simon, H A (1956). Rational choice and the structure of the environment. Psychological Review, 63(2), 129.
Solway, A, Diuk, C, Córdova, N, Yee, D, Barto, A G, Niv, Y, & Botvinick, M M (2014). Optimal behavioral hierarchy. PLoS Computational Biology, 10(8), e1003779.
Sutton, RS, & Barto, AG. (2018). Reinforcement learning: an introduction. Cambridge, MA, USA: MIT Press.
Svegliato, J, & Zilberstein, S. (2018). Adaptive metareasoning for bounded rational agents. In IJCAI-ECAI workshop on architectures and evaluation for generality, autonomy and progress in AI (AEGAP). Stockholm.
The GPyOpt Authors. (2016). GPyOpt: a Bayesian optimization framework in Python. http://github.com/SheffieldML/GPyOpt.
Todd, PM, & Gigerenzer, GE. (2012). Ecological rationality: intelligence in the world. Oxford: Oxford University Press.
Tomov, M S, Yagati, S, Kumar, A, Yang, W, & Gershman, S J (2020). Discovery of hierarchical representations for efficient planning. PLoS Computational Biology, 16(4), e1007594.
Wolfe, J, Marthi, B, & Russell, S. (2010). Combined task and motion planning for mobile manipulation. In Twentieth international conference on automated planning and scheduling.
Acknowledgements
The authors would like to thank Yash Raj Jain, Frederic Becker, Aashay Mehta, and Julian Skirzynski for helpful discussions.
Funding
Open Access funding enabled and organized by Projekt DEAL. This project was funded by grant number CyVyRF201902 from the Cyber Valley Research Fund.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Ethics Approval
The experiments reported in this article were approved by the IEC of the University of Tübingen under IRB protocol number 667/2018BO2 (“OnlineExperimente über das Erlernen von Entscheidungsstrategien”).
Consent to Participate
Informed consent was obtained from all individual participants included in the study.
Consent for Publication
Not applicable
Competing Interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Saksham Consul and Lovis Heindrich contributed equally to this work.
Appendices
Appendix 1: VOC features
The features used to approximate the value of information can be explained using a simplified Mouselab-MDP environment. The ground-truth value of each node is depicted in Fig. 10a. The rewards are sampled from equiprobable categorical distributions over {− 2,− 1, 1, 2}, {− 8,− 4, 4, 8}, and {− 48,− 24, 24, 48} for nodes of depth 1, 2, and 3, respectively. In this example, the value of information of two meta-level actions is considered, marked in Fig. 10b.
To compute the VOC, the possible values of a subset of nodes are considered, and the expected cumulative reward of the maximal path is computed to find the expected return if those node values were known. Subtracting from this the cumulative reward of the maximal path given the current belief state gives the value of information of performing the meta-level action. A greater difference between the two quantities means that more information can potentially be gained by performing the computation.
In the case of the myopic VOC computation, the subset considered contains just the node whose reward would be revealed by the meta-level action. VPI considers the entire set of nodes in the environment. VPI_{sub} considers all nodes lying on paths that pass through the node revealed by the computation. For example, when computing VPI_{sub} for the computation that reveals the node marked in blue in Fig. 10b, nodes 1, 2, and 4 would be considered, whereas nodes 1, 3, and 7 would be considered for the computation corresponding to revealing the value of the green node.
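The subset logic above can be made concrete with a small sketch. The snippet below is our own illustration, not the paper's implementation: it assumes a hypothetical two-path tree with node names a1, a2, b1, b2, uses the equiprobable depth-dependent rewards from above, and replaces unrevealed nodes by their expected value (0 for these symmetric distributions).

```python
import itertools

# Hypothetical two-path tree: path A = [a1, a2], path B = [b1, b2],
# with equiprobable depth-dependent rewards as in Appendix 1.
VALUES = {"a1": [-2, -1, 1, 2], "a2": [-8, -4, 4, 8],
          "b1": [-2, -1, 1, 2], "b2": [-8, -4, 4, 8]}
PATHS = [["a1", "a2"], ["b1", "b2"]]

def best_path_value(known):
    """Value of the best path, replacing unrevealed nodes by their
    expected value (0 for these symmetric distributions)."""
    return max(sum(known.get(n, 0.0) for n in path) for path in PATHS)

def voi(subset):
    """Expected improvement from learning the values of `subset`:
    E[best path | subset revealed] - best path under the current belief."""
    outcomes = list(itertools.product(*(VALUES[n] for n in subset)))
    gains = [best_path_value(dict(zip(subset, o))) for o in outcomes]
    return sum(gains) / len(gains) - best_path_value({})

print(voi(["a1"]))          # myopic VOC of revealing a1 (0.75 here)
print(voi(["a1", "a2"]))    # VPI_sub: all nodes on paths through a1 (3.0)
print(voi(list(VALUES)))    # VPI: all nodes in the environment
```

In this toy example, revealing only a1 yields an expected improvement of 0.75, while additionally considering a2 (as VPI_{sub} does) yields 3.0, illustrating why VPI_{sub} attributes more value to computations on nodes that lie on high-variance paths.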
Appendix 2: Scalability analysis for discovery of hierarchical planning strategy
We analyze the scalability of algorithms by comparing the computation time and the maximum size of solvable planning environments to previous computational methods. Table 10 and Fig. 5b compare the run times of our hierarchical strategy discovery methods against their non-hierarchical counterparts and adaptive metareasoning policy search as discussed in “Evaluating the Performance, Scalability, and Robustness of Our Method for Discovering Hierarchical Planning Strategies.” This comparison shows that imposing hierarchical structure substantially increased the scalability of BMPS and the greedy myopic VOC method. The improved run time profile reflects a reduction in the asymptotic upper bound on the algorithms’ run times when hierarchical structure is imposed on the strategy space, as seen in Appendix 4. As shown in the last column of Table 10, this reduction in computational complexity increases the size of environments for which we can discover resource-rational planning strategies by a factor of 14–15, depending on the required quality of the planning strategy. This makes it possible to automatically discover planning strategies for sequential decision problems with up to 2520 states. Consequently, our method scales to meta-level MDPs with up to 5^{2520} possible belief states, whereas the original version of BMPS was limited to problems with only up to 5^{36} possible belief states.^{Footnote 9} This shows that our approach increased the scalability of automatic algorithm discovery by a factor of 5^{2484}. This is a significant step towards discovering resource-rational planning strategies for the complex planning problems people face in the real world. While the hierarchical greedy myopic VOC method is the most scalable strategy discovery method, the fastest method on our four benchmarks was our hierarchical BMPS algorithm with tree contraction.
Comparing the first two rows shows that the tree contraction method described in Appendix 8 contributed significantly to this speedup; the speedup is analyzed in detail in Appendix 7.
Appendix 3: Robustness to violations of assumed structure
To determine the range of planning problems for which our hierarchical decomposition can be used to discover resource-rational planning strategies at scale, we varied the variance structure of the 2-goal environment and compared the performance of BMPS with and without hierarchical structure (see Table 12). We found that the usefulness of the hierarchical decomposition depends on the variance ratios of the goal values versus path costs. Concretely, the performance loss of the algorithm utilizing the hierarchical structure is below 5% when the variance of goal values is at least as high as the variance of the path costs, but increases to 10% as the variance of goal values drops to only one-third of the variance of the path costs.
Appendix 4: Time complexity upper bound analysis
In this section, we analyze the upper bound on the computation time of the methods we used. For simplicity, in the hierarchical case, we assume that goal-switching is not possible.^{Footnote 10} That means that the high-level controller runs once, followed by the low-level controller. To ensure readability, we explicitly define the notation used in this section and throughout the paper:

N: number of intermediate nodes to goal

M: number of goal states

B: number of bins to discretize a continuous probability distribution

RUN: number of unrevealed nodes relevant to computing a feature
The hierarchical decomposition reduced the time complexity of the myopic strategy discovery problem from O((M ⋅ N)^{2} ⋅ B) to O((M^{2} + N^{2}) ⋅ B). For the BMPS algorithm, the hierarchical structure reduces the computational time upper bound from O((M ⋅ N)^{2} ⋅ B^{N}) to O(M^{2} ⋅ B + B^{M} + N^{2} ⋅ B^{N}). The reduction in the upper bound implies that algorithms that use hierarchical structure are scalable to more complex environments.
As discussed in “Methods for Solving Meta-level MDPs,” meta-level actions are selected by maximizing the approximate VOC: at each step, a meta-level action that maximizes the approximate VOC is chosen. Performing a meta-level action converts the probability distribution of the chosen node into a Dirac delta function concentrated at the revealed node value. Calculating the VOI features of continuous probability distributions requires computing multiple cumulative distribution functions (CDFs). In general, this procedure is computationally expensive, and approximating it inherently requires discretization. Hence, the probability density function (PDF) of the continuous probability distribution associated with a node in the environment is discretized into B bins. As the number of bins B increases, the discrepancy between the approximation and the true PDF/CDF decreases, shrinking to 0 as \(B \to \infty \), at the cost of more computation.
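As a concrete illustration of this discretization step, the sketch below approximates a normal distribution by B equiprobable point masses. The particular binning rule (equal probability mass per bin, each bin represented by its quantile midpoint) is our assumption; the text only specifies that B bins are used.

```python
from statistics import NormalDist

def discretize(mu, sigma, bins):
    """Approximate Normal(mu, sigma) by `bins` equiprobable point masses.

    Each bin covers an equal slice of probability mass and is
    represented by the value at the midpoint of its quantile range.
    """
    dist = NormalDist(mu, sigma)
    support = [dist.inv_cdf((i + 0.5) / bins) for i in range(bins)]
    probs = [1.0 / bins] * bins
    return support, probs

# Discretize a node's reward distribution into B = 4 bins,
# as in the environments used in the experiments.
values, probs = discretize(0.0, 10.0, 4)
```

The resulting 4-point categorical distribution (plus the "undiscovered" state) yields the 5 possible states per node mentioned in the notes.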
The number of relevant unrevealed nodes (RUN) is the count of unrevealed nodes relevant to computing a VOI feature. The RUN varies across features and directly affects the time required to compute their values for a given state-action pair. For calculating the approximate VOC, the number of possible value combinations to be compared is B^{RUN}. Each combination requires \(\alpha \in \mathbb {R}_{\neq 0}\) time to compute the highest cumulative return over all possible outcomes of the set. Hence, the time required to calculate the approximate VOC scales with α ⋅ B^{RUN}. For the myopic VOC estimate, the number of relevant unrevealed nodes is 1. For BMPS, the number of relevant nodes differs for each of the VOI features \(\mathcal {F}=\lbrace \mathrm {VOI_{1}}, \text {VPI}, \mathrm {VPI_{sub}} \rbrace \). The VOI_{1} feature requires the least computation time since its value depends on only one node in the environment. By contrast, the most time-consuming calculation is for the VPI feature, since its value depends on all nodes in the environment. The VPI_{sub} feature considers the values of all paths that pass through the node evaluated by the meta-level action; hence, the most time-consuming VPI_{sub} calculations are for meta-level actions that correspond to goal nodes. In terms of big-O notation, it takes O(B) time to calculate the myopic VOC for a given state-action pair, whereas it takes O(B^{RUN}) time to calculate the VPI_{sub} value for a given state-action pair.
The maximum amount of computation time to calculate the approximate VOC directly depends on the selection of all possible meta-level actions, for which we prove upper bounds for both the non-hierarchical (Appendix 5) and the hierarchical strategy discovery problem (Appendix 6).
Appendix 5: Time complexity of the nonhierarchical strategy discovery problem
In the setting of the non-hierarchical strategy discovery problem, the meta-level policy initially has to select the best meta-level action from M ⋅ (N + 1) + 1 possible actions. This selection requires M ⋅ (N + 1) VOI feature computations. Since VPI needs to be computed only once for a given state, the next most computationally expensive feature is VPI_{sub}, which is computed for each possible action. Of all possible actions, the ones whose VPI_{sub} computation demands the most time are those corresponding to the goal nodes. In this case, computing VPI_{sub} takes O(B^{N}) time.
In general, if there are M goals in the environment and each goal consists of N + 1 nodes (i.e., N intermediate nodes + 1 goal node), the maximum number of meta-level actions performed, including the termination action, is M ⋅ (N + 1) + 1.
In the non-hierarchical BMPS strategy discovery problem, the upper bound on the computation time to perform all meta-level actions and terminate is O((M ⋅ N)^{2} ⋅ B^{N}).
For the greedy myopic strategy discovery algorithm, the number of relevant nodes (RUN = 1) reduces the cost of each feature computation to O(B). In this case, the upper bound on the computation time to perform all meta-level actions and terminate is O((M ⋅ N)^{2} ⋅ B).
Appendix 6: Time complexity of the hierarchical strategy discovery problem
In the setting of the hierarchical strategy discovery problem, the action space shrinks considerably for each meta-level action selection. During the goal-setting phase, the number of high-level actions, including the high-level policy’s termination action, is M + 1. The selection of a high-level action requires one computation of VPI^{H} and at most M computations of \(\mathrm {VO{I_{1}^{H}}}\). Therefore, the upper bound on the computation time for this phase is O(M^{2} ⋅ B + B^{M}).
Similarly, during the goal-achievement phase of the algorithmic procedure, the number of low-level meta-level actions for each goal is N + 1. The most time-consuming feature calculated in the goal-achievement phase is \(\mathrm {VPI_{sub}^{L}}\). Therefore, the upper bound on the computation time for this phase per goal is O(N^{2} ⋅ B^{N}).
To calculate the upper bound for the hierarchical strategy discovery algorithm, the combined computation time of the high-level and low-level policies is the sum of the computation times consumed on the two levels independently. Additionally, since the value of a goal node is not required in the goal-achievement procedure, the computation time of meta-level actions at the low level is bounded by the number of intermediate nodes N for each goal separately.
The upper bound on the computation time of hierarchical BMPS is therefore O(M^{2} ⋅ B + B^{M} + N^{2} ⋅ B^{N}).
In the case of the myopic approximation, the number of relevant nodes at each level is 1. Therefore, the upper bound on the computation time to perform all meta-level actions and terminate is O((M^{2} + N^{2}) ⋅ B).
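To make the scalability gain concrete, the following sketch evaluates the non-hierarchical and hierarchical BMPS bounds for one parameter setting. The values of M, N, and B are our own illustrative choices, not taken from the paper.

```python
def nonhier_bmps_ops(M, N, B):
    """Upper bound for non-hierarchical BMPS: O((M * N)^2 * B^N)."""
    return (M * N) ** 2 * B ** N

def hier_bmps_ops(M, N, B):
    """Upper bound for hierarchical BMPS: O(M^2 * B + B^M + N^2 * B^N)."""
    return M ** 2 * B + B ** M + N ** 2 * B ** N

# Example: 6 goals, 5 intermediate nodes per goal, 4 discretization bins.
M, N, B = 6, 5, 4
print(nonhier_bmps_ops(M, N, B))  # 921600
print(hier_bmps_ops(M, N, B))     # 29840
```

Even at this modest size, the hierarchical decomposition reduces the bound by more than an order of magnitude; the gap widens rapidly as N grows because the dominant B^{N} term is no longer multiplied by M^{2}.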
Appendix 7: Analysis of the speedup achieved by the tree contraction method
Table 14 shows an example computation of a single VPI feature (Callaway et al., 2018a). The computational speedup allows us to solve larger environments in the same amount of time and contributes to scaling the algorithm to more realistic problems.
Appendix 8: Tree contraction method for faster BMPS feature computation
To further increase the scalability of BMPS, we make an additional improvement to how it computes the features used to approximate the value of computation (Callaway et al., 2018a). Specifically, we improve computational efficiency by combining nodes in the meta-MDP according to a set of predefined conditions, ultimately reducing the complexity of the necessary computations. Nodes are combined by merging two nodes into a single new node whose probability distribution represents their combined reward value. The aggregated state is then used to speed up the calculation of the VOC features; the contraction is performed on the fly for each considered feature. The main planning operations are still performed in the full, uncontracted meta-MDP; the contraction is only applied to the VOC feature calculation.
The algorithm consists of three operations that combine node distributions. A list of conditions determines which operation to apply, and the algorithm stops when the distributions of all nodes within the MDP have been collapsed into a single root node. Graphical examples of how the contraction operations are applied can be found in Fig. 11.

Add: Combines the distributions of two consecutive nodes^{Footnote 11} by adding them. This operation can be applied to two consecutive nodes in the tree as long as the parent node has no other child nodes and the child node has no other parents. The two nodes are combined by taking the Cartesian product of their categorical reward distributions, adding the reward values and multiplying the probabilities. Example: Assuming both nodes 1 and 2 in the Add example of Fig. 11 follow a categorical distribution with possible outcomes {− 5 (p = 0.5), 5 (p = 0.5)}, there are four possible combinations of node values: {(− 5,− 5), (− 5, 5), (5,− 5), (5, 5)}. Adding the reward values and multiplying the corresponding probabilities yields the categorical distribution of the combined node: {− 10 (p = 0.25), 0 (p = 0.5), 10 (p = 0.25)}.

Maximize: Combines two parallel nodes by taking the maximum value for each combination of values the nodes can take, merging the nodes’ distributions while taking into account that the optimal path will always lead through the higher-valued of the two nodes. This operation can be applied to two nodes that have a single identical parent and, optionally, an identical child node. When combining the two nodes’ categorical reward distributions, the maximum of the reward values is used for each combination of outcomes, and the probabilities of the outcomes are multiplied. Example: As in the example for the Add operation, we assume both nodes 2 and 3 in the Maximize example of Fig. 11 follow a categorical distribution with possible outcomes {− 5 (p = 0.5), 5 (p = 0.5)}, giving four possible combinations of node values: {(− 5,− 5), (− 5, 5), (5,− 5), (5, 5)}. For each combination, the maximum value is used to create the new node: {− 5 (p = 0.25), 5 (p = 0.75)}.

Split: Splits a child node into two separate nodes by duplicating that node. The whole tree is then duplicated as many times as the node has possible values, fixing the node’s distribution to each possibility in turn. The duplicated trees are then individually reduced to single root nodes, and the individual root nodes are combined into a single node by pairwise application of the Add operation. This operation is applied to nodes that have multiple parent nodes; after splitting, each of the individual nodes is connected to only one of its parent nodes. Example: We again define node 3 in the Split example of Fig. 11 to follow a categorical distribution with possible outcomes {− 5 (p = 0.5), 5 (p = 0.5)}. Applying the Split operation results in two copies of the shown tree structure with 5 nodes. In one copy, both nodes 2 and 4 are assigned the value − 5, and in the other copy both nodes are assigned the value 5. This operation simplifies the tree structure in a way that allows subsequent applications of the Add and Maximize rules.
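The Add and Maximize operations can be sketched directly on categorical distributions represented as value-to-probability dictionaries. The snippet below is our own minimal illustration of the two worked examples above, not the paper's implementation:

```python
from itertools import product

def add(d1, d2):
    """Add operation: Cartesian product of two categorical reward
    distributions, summing values and multiplying probabilities."""
    out = {}
    for (v1, p1), (v2, p2) in product(d1.items(), d2.items()):
        out[v1 + v2] = out.get(v1 + v2, 0.0) + p1 * p2
    return out

def maximize(d1, d2):
    """Maximize operation: keep the larger value for every
    combination of outcomes of two parallel nodes."""
    out = {}
    for (v1, p1), (v2, p2) in product(d1.items(), d2.items()):
        v = max(v1, v2)
        out[v] = out.get(v, 0.0) + p1 * p2
    return out

node = {-5: 0.5, 5: 0.5}
print(add(node, node))       # {-10: 0.25, 0: 0.5, 10: 0.25}
print(maximize(node, node))  # {-5: 0.25, 5: 0.75}
```

The printed results reproduce the Add and Maximize examples from Fig. 11 described above.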
The Split operation is the most computationally expensive and is therefore only applied when the Add and Maximize operations are insufficient to reduce the tree to a single node. Specifically, this happens when a node that would need to be combined has an additional parent or child node. Since the structure of the environment stays identical while the rewards and discovered states vary, we precompute the sequence of operations needed to reduce the tree and then apply the reduction individually for each problem instance.
Our adjustment is purely algorithmic: it does not change the value of computation and therefore does not impair the performance of the discovered strategies. An additional effect of the tree contraction method is that it extends the types of environments BMPS can solve. Previously, BMPS could only handle environments with a branching tree structure, in which nodes can have multiple children but never multiple parents. Our new formulation also allows us to compute the BMPS features for structures in which nodes have multiple parent nodes. This is possible through the Maximize operation, which combines multiple parent nodes into a single node, making them solvable through the value-of-computation calculation. The range of solvable environments is thereby extended from trees to directed acyclic graphs. This extension is especially relevant for environments containing goal nodes, since multiple intermediate nodes often converge on the same goal node. The tree contraction improvement is compatible with both our new hierarchical BMPS formulation and the original version of BMPS. We applied tree contraction to the original BMPS version as well, to allow comparisons on environments that hierarchical BMPS was unable to solve before.
Consul, S., Heindrich, L., Stojcheski, J. et al. Improving Human Decisionmaking by Discovering Efficient Strategies for Hierarchical Planning. Comput Brain Behav 5, 185–216 (2022). https://doi.org/10.1007/s42113022001283