Automatic Discovery of Interpretable Planning Strategies

When making decisions, people often overlook critical information or are overly swayed by irrelevant information. A common approach to mitigate these biases is to provide decision-makers, especially professionals such as medical doctors, with decision aids, such as decision trees and flowcharts. Designing effective decision aids is a difficult problem. We propose that recently developed reinforcement learning methods for discovering clever heuristics for good decision-making can be partially leveraged to assist human experts in this design process. One of the biggest remaining obstacles to leveraging the aforementioned methods for improving human decision-making is that the policies they learn are opaque to people. To solve this problem, we introduce AI-Interpret: a general method for transforming idiosyncratic policies into simple and interpretable descriptions. Our algorithm combines recent advances in imitation learning and program induction with a new clustering method for identifying a large subset of demonstrations that can be accurately described by a simple, high-performing decision rule. We evaluate our new AI-Interpret algorithm and employ it to translate information-acquisition policies discovered through metalevel reinforcement learning. The results of three large behavioral experiments showed that the provision of decision rules as flowcharts significantly improved people's planning strategies and decisions across three different classes of sequential decision problems. Furthermore, a series of ablation studies confirmed that our AI-Interpret algorithm was critical to the discovery of interpretable decision rules and that it is ready to be applied to other reinforcement learning problems. We conclude that the methods and findings presented in this article are an important step towards leveraging automatic strategy discovery to improve human decision-making.

Fig. 1 Our framework for improving human decision-making through automatic strategy discovery [24,25]. As illustrated in the upper row, the approach starts with modeling the decision problems people face in everyday life, and how they can make those decisions, in the framework of metalevel MDPs [5,7,17,24,25,26,27,28]. The optimal algorithm for human decision-making can be discovered by computing the optimal metalevel policy through metalevel reinforcement learning [6,28]. The contribution of this paper is to develop an algorithm called AI-Interpret that translates the resulting metalevel policies into flowcharts that people can follow to make better decisions.

Introduction
Human decision-making is plagued by many systematic errors that are known as cognitive biases [16,41]. To mitigate these biases, professionals, such as medical doctors, can be given decision aids, such as decision trees and flowcharts, that guide them through a decision process that considers the most important information [18,23,30]. To be practical in real life, the strategies suggested by decision aids have to be simple [13,15] and mindful of the decision-maker's valuable time and the constraints on what people can and cannot do to arrive at a decision [27]. Previous research has identified a small set of simple heuristics that satisfy these criteria and work well for specific decisions [13,15,27]. In principle, this approach could be applied to help people in a wide range of different situations, but discovering clever strategies is very difficult. Our recent work suggests that this problem can be solved by leveraging machine learning to discover near-optimal strategies for human decision-making automatically [24,25,26,27] (see Section 2). Equipped with an automatically discovered decision strategy, we may tackle many real-world problems for which there are no existing heuristics, but which are nevertheless crucial for industrial applications (Figure 2). One of the biggest remaining challenges is to formulate the discovered strategies in such a way that people can readily understand and apply them. This is especially problematic when strategies are discovered in the form of complex stochastic black-box policies. Here, we address this problem by developing an algorithm that transforms decision policies discovered through reinforcement learning into human-interpretable decision rules. As illustrated in Figure 1, the resulting algorithm may be incorporated into the reinforcement learning framework for automatic strategy discovery, enabling us to automatically discover flowcharts that people can follow to arrive at better decisions.
We start by describing the background of our approach in Section 2 and present our problem statement in Section 3. Section 4 focuses on related work. In Section 5, we introduce a new approach to interpretable RL, AI-Interpret, along with a pipeline for generating decision aids through automatic strategy discovery. In Section 6, using behavioral experiments, we demonstrate that decision aids designed with the help of automatic strategy discovery and AI-Interpret can significantly improve human decision-making. The results in Section 7 show that AI-Interpret was critical to this success. We close by discussing potential real-world applications of our approach and directions for future work.

Fig. 2 Formalizing the optimal algorithms for human decision-making in the real world as the solution to a metalevel MDP. a) Illustration of a real-life decision problem and an efficient heuristic decision strategy for making such decisions. In this example, an employee of a purchasing department was tasked to quickly select a camera for somebody in their company. Critically, such decision problems and the optimal strategies for making such decisions can be formalized in the computational framework of metalevel MDPs illustrated in Panel b. The decision-maker's goal can be modelled as maximizing the expected value of the chosen option minus the time cost of making the decision [27]. The expected subjective value of an alternative $a$ given the acquired information $B_T$ at the time of the decision ($T$) can be modelled as a weighted sum of its scores on several attributes (e.g., $E[U(a) \mid B_T] = 0.2 \cdot \text{isBestSeller}(a) + 0.3 \cdot \text{customerSatisfaction}(a) + 0.4 \cdot \text{inexpensiveness}(a) + 0.05 \cdot \ldots$, where the weights reflect the company's preferences). To estimate the alternatives' subjective values, the decision-maker has to perform computations $C$ by acquiring information (e.g., the customer rating of the second camera) and updating their beliefs ($B$) accordingly. Each computation $C_t$ has a cost $\text{cost}(B_t, C_t)$. The optimal decision strategy maximizes the expected subjective value of the final decision minus the cumulative cost of the decision operations that had to be performed to reach that decision. b) Illustration of a metalevel MDP (see Definition 2). A metalevel MDP is a Markov decision process whose actions are computations ($C$) and whose states encode the agent's beliefs ($B$). The rewards for computations ($R_1, R_2, \ldots$) measure the cost of computation, and the reward for terminating deliberation ($R_T$) is the expected return for executing the plan that is best given the current belief state ($B_T$). Discovering the optimal strategies corresponds to computing the optimal metalevel policy, which achieves an optimal trade-off between decision quality and computational cost.

Background
In this section, we define the formal machinery that the methods presented in this article are based on. We start by discussing the basics of reinforcement learning (RL). Next, we describe the theory of resource rationality, as it underpins the framework of automatic strategy discovery. Then, we move to the topic of imitation learning to describe the family of methods that our algorithm for interpretable RL belongs to. We finish by defining disjunctive normal form formulas, which constitute the formal output of our algorithm.

Basic Notions in Reinforcement Learning
In general, AI-Interpret considers reinforcement learning policies defined on finite Markov Decision Processes. A Markov Decision Process (MDP) is a formal tool for representing problems in which an agent interacts with an environment, i.e., observes states and takes actions. To serve our primary goal of teaching humans, however, we apply this algorithm to policies found for metalevel Markov Decision Processes, which model beliefs and computations instead of states and actions.
Definition 1 (Markov Decision Process) A Markov decision process (MDP) is a finite process that satisfies the Markov property (the next state depends only on the current state and action, not on the earlier history). It is represented by a tuple $(S, A, T, R, \gamma)$ where $S$ is a set of states; $A$ is a set of actions; $T(s, a, s') = P(s_{t+1} = s' \mid s_t = s, a_t = a)$ is a state transition function; $\gamma \in (0, 1)$ is a discount factor; and $R : S \to \mathbb{R}$ is a reward function.
Note that $R$ could also be represented as a function of state-action pairs, $R : S \times A \to \mathbb{R}$. A deterministic policy $\pi : S \to A$ is a function that controls the agent's behavior in an MDP, and a nondeterministic policy $\pi : S \to \mathit{Prob}(A)$ defines a probability distribution over the actions.
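To make Definition 1 concrete, the sketch below solves a small finite MDP with value iteration; the dictionary-based encoding of $S$, $A$, $T$, and $R$ is a hypothetical representation chosen for illustration, not part of the formalism.

```python
# Minimal value-iteration sketch for a finite MDP (S, A, T, R, gamma).
# The data layout (dicts of lists) is illustrative, not prescribed by the text.

def value_iteration(S, A, T, R, gamma=0.95, tol=1e-6):
    """T[s][a] is a list of (next_state, probability) pairs; R[s] is the
    reward for entering state s. Returns a deterministic policy pi: S -> A."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            q = {a: sum(p * (R[s2] + gamma * V[s2]) for s2, p in T[s][a]) for a in A}
            best = max(q.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged value function.
    return {s: max(A, key=lambda a, s=s: sum(p * (R[s2] + gamma * V[s2])
                                             for s2, p in T[s][a])) for s in S}
```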

Definition 2 (Metalevel Markov Decision Process)
A metalevel MDP is a finite process represented by a 4-tuple $(B, C, T_{meta}, R_{meta})$ where $B$ is a set of beliefs; $C$ is a set of computational primitives; $T_{meta}(b, c, b') = P(b_{t+1} = b' \mid b_t = b, c_t = c)$ is a belief transition function; and $R_{meta} : C \cup \{\bot\} \to \mathbb{R}$ is a reward function which captures the cost of the computations in $C$ and the utility of the optimal course of action after terminating the computations with $\bot$.
Analogously to the previous case, a deterministic metalevel policy $\pi_{meta} : B \to C$ controls which computations the agent performs, and a nondeterministic policy $\pi_{meta} : B \to \mathit{Prob}(C)$ defines a probability distribution over the computations.
The class of methods that learn the optimal policy $\pi^*$ or $\pi^*_{meta}$, which maximizes the reward, through a trial-and-error optimization process is called reinforcement learning (RL).
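The reward structure of Definition 2 can be sketched as follows; `TERMINATE` and `best_expected_return` are hypothetical stand-ins for the $\bot$ operation and for evaluating the best plan under the current belief, and the uniform computation cost is an illustrative assumption.

```python
TERMINATE = "terminate"  # stands in for the termination operation (bottom)

def meta_reward(computation, belief, best_expected_return, cost=1.0):
    """R_meta sketch: each computation incurs a cost, while terminating
    deliberation pays the expected return of the plan that is best
    under the current belief state."""
    if computation == TERMINATE:
        return best_expected_return(belief)  # hypothetical belief evaluator
    return -cost  # uniform cost per computation (illustrative assumption)
```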

Resource Rationality
Discovering strategies that people can later be taught is rooted in the theory of resource rationality [27]. This theory characterizes the planning and decision-making heuristics people are capable of using as those that maximize the value of resource rationality, defined below.
Definition 3 (Resource Rationality) Resource rationality (RR) is the extent to which a strategy $h$ makes optimal use of the limited computational resources of the agent's brain $B$ for solving the problems posed by the environment $E$, that is,
$$RR(h, E, B) = \mathbb{E}\left[u(\text{result}) \mid s_0, h, E\right] - \mathbb{E}\left[\text{cost}(t_h, \rho) \mid s_0, h, E\right],$$
where $s_0 = (o, b_0)$ comprises the agent's observations about the environment ($o$) and its internal state $b_0$, $u(\text{result})$ is the agent's utility of the outcomes of the decisions that the strategy $h$ might make, and $\text{cost}(t_h, \rho)$ denotes the total opportunity cost of investing the cognitive resources $\rho$ used or blocked by strategy $h$ until it terminates deliberation at time $t_h$. Expectations are taken with respect to the posterior probability distribution of possible results given the environment $E$ and the agent's observations $o$ about its current state.
Note that the execution time and the possible results of the strategy depend on the situation in which it is applied.

Definition 4 (Resource-rational Strategy)
A strategy $h^*$ is said to be resource-rational for the environment $E$ under the limited computational resources of the agent's brain $B$ if
$$h^* = \arg\max_{h \in H_B} RR(h, E, B),$$
that is, when $h^*$ maximizes the value of resource rationality among all the strategies $H_B$ implementable by the agent's brain $B$.
Discovering resource-rational strategies can be expressed as a problem of finding the optimal policy for a metalevel MDP where the states represent the agent's beliefs, the actions represent the agent's computations, and the rewards are inherited from the costs of computations and the value of terminating under the current belief state. It is possible to solve for this strategy using dynamic programming [5] and reinforcement learning [6,28].

Imitation Learning
Our method for transforming a strategy into an interpretable form belongs to the family of imitation learning (IL) methods. IL is the problem of finding a policy $\hat\pi$ that mimics the transitions provided in a dataset of trajectories $D = \{(s_i, a_i)\}_{i=1}^{M}$, where $s_i \in S$ and $a_i \in A$ [33]. Unlike canonical applications of IL that focus on imitating behavioral policies, our application focuses on metalevel policies for selecting computations.

Boolean Logics
The formal output of the algorithm we introduce is a logical formula in disjunctive normal form (DNF).
Definition 5 (Disjunctive Normal Form) A logical formula is in DNF if it is a disjunction of conjunctions,
$$f = (p_{1,1} \wedge \cdots \wedge p_{1,k_1}) \vee \cdots \vee (p_{n,1} \wedge \cdots \wedge p_{n,k_n}),$$
in which every predicate appears only once in each conjunction.
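As a toy illustration, such a formula can be evaluated directly as a Boolean function of a state-action pair; the feature detectors `f1`, `f2`, `f3` and the dictionary-based state below are hypothetical stand-ins for the predicates of a DSL.

```python
# (f1 AND f2) OR (NOT f1 AND f3): a DNF formula in which each predicate
# appears only once per conjunction. The predicates are hypothetical.

def f1(s, a): return a in s["observed"]    # has this node been observed?
def f2(s, a): return s["depth"][a] == 3    # is the node on level 3?
def f3(s, a): return a in s["best_path"]   # does it lie on the best path?

def dnf(s, a):
    return (f1(s, a) and f2(s, a)) or (not f1(s, a) and f3(s, a))
```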

Problem Definition and Proposed Solution
The main goal of the presented research is to develop a method that takes in a model of the environment and a model of people's cognitive architecture and returns a verbal or graphical description of an effective decision strategy that people can understand and that enables them to make better decisions. This problem statement goes beyond the standard formulation of interpretable AI by requiring that when people are given an "interpretable" description of a decision strategy they can execute that strategy themselves. Our approach to discovering human-interpretable decision strategies comprises three steps: 1) formalizing the problem of strategy discovery as a metalevel MDP (see Figure 2 and Definition 2), 2) developing reinforcement learning methods for computing optimal metalevel policies, and 3) describing the learned policies by simple and human-interpretable flowcharts. Having proposed solutions to the first two sub-problems in previous work [5,6,26,27,28], we now turn to the third sub-problem.
The cited literature [5,6,26,27,28] proposed strategy discovery methods that yield black-box policies which select which piece of information should be processed at which point in the decision process, depending on what is already known (see Figure 2). These policies may exhibit complex, idiosyncratic behavior that is difficult to understand. In this article, we strive to develop a systematic method for transforming complex black-box policies into simple, human-interpretable flowcharts that allow people to approximately follow the near-optimal decision strategy expressed by the black-box policy. As a proof of concept, we study this problem in the domain of planning.
To advance the research agenda detailed above, we develop the Adaptive Imitation-Interpretation algorithm (AI-Interpret). Our algorithm captures the essence of the input policy by finding a simpler representation of it that performs almost as well. To accomplish that, AI-Interpret builds on the Logical Program Policies (LPP) method by Silver et al. [40]. We start by gathering demonstrations of the policy by running it multiple times. Then, we create a domain-specific language of predicates which captures features of the states that could be encountered and the actions that could be taken in the environment under consideration. AI-Interpret uses the constructed predicates to separate the set of demonstrations into clusters. Doing so enables it to consider increasingly smaller sets of demonstrations and to employ the Logical Program Policies method in a structured search for simple logical formulas constructed from those predicates. To improve human decision-making, we use AI-Interpret in our strategy discovery pipeline (see Figure 1). After modeling the planning problem as a metalevel MDP and using RL algorithms to compute its optimal policy, the pipeline uses AI-Interpret to find a set of candidate formulas and transforms them into decision trees. The decision trees are then visualized as flowcharts that people can follow to execute the strategy in real life.

Related Work
Historically, discovering strategies that people can use to make better decisions, and developing training programs and decision aids that help people execute such strategies, was a manual process that relied exclusively on human expertise [18,23,30]. Recent work has been increasingly concerned with discovering human decision strategies automatically [3,5,6,7,21,26,28] and with using cognitive tutors [24,25]. In our previous studies on this topic [3,5,6,7,21,26,28], we enabled the automatic discovery of optimal decision strategies by leveraging reinforcement learning. In technical terms, we formalized decision strategies that make the best possible use of the decision-maker's precious time and finite computational resources [27] within the framework of metalevel MDPs (see Figure 2). This approach, however, led to stochastic black-box metalevel policies whose behavior can be idiosyncratic.
Most approaches to interpretable AI in the domain of reinforcement learning try to predict the behavior of deep reinforcement learning (DRL) policies. Liu et al. [29], for instance, approximated a neural policy using an online mimic-learner algorithm that returns Linear Model U-Trees (LMUTs). LMUTs represent the Q-functions of a given MDP in a regression decision tree structure [4] and are learned using stochastic gradient descent. By extracting rules from a learned LMUT, it is possible to comprehend action decisions for a given state in terms of conditions imposed on its feature representation. Similar approaches that mimic neural networks with tree structures were presented in [1,11,12,19,22]. The Neurally Directed Program Search (NDPS) method, proposed by Verma et al. [47], combines DRL with an efficient program policy search. NDPS defines a domain-specific language of atomic statements with an imposed syntax and allows describing neural network-based policies with programs. These programs mimic the policies and may be found using imitation learning methods. Experiments in the TORCS car-racing environment [48] showed that this approach can learn well-performing if-else programs [47]. Similar approaches that aim to learn programmatic policies via imitation learning were introduced in [2,35,45,46].
Our approach to discovering interpretable decision strategies uses the Logical Program Policies (LPP) method [40]. LPP is a Bayesian imitation learning method that combines decision tree induction with program search. Given a set of demonstrations $D = \{(s_i, a_i)\}_{i=1}^{M}$, LPP outputs a posterior distribution over logical formulas in disjunctive normal form (DNF; see Definition 5) that best describe the generated data. For that purpose, the authors restrict the considered set of solutions to formulas $\{h_1(s, a), \ldots, h_n(s, a)\}$ defined in a domain-specific language (DSL), i.e., a set of predicates $f_i(s, a) : S \times A \to \{0, 1\}$ called (simple) programs. Programs are understood as feature detectors which assign truth values to state-action pairs, and formulas over programs are the titular logical programs. To find the best logical programs, the authors employ maximum a posteriori (MAP) estimation. Having found the $K$ programs $h_i^{MAP}$ with the highest joint probability $P(D, h_i^{MAP})$, they approximate their posterior probabilities by
$$P(h_i^{MAP} \mid D) \approx \frac{P(D, h_i^{MAP})}{\sum_{j=1}^{K} P(D, h_j^{MAP})}.$$
Each program $h_i^{MAP}$ induces a stochastic policy which is a uniform distribution over all actions $a$ for which $h_i^{MAP}$ is true in state $s$. The program-level policy $\hat\pi$ integrates out the uncertainty about the possible programs and selects the action that maximizes the posterior-weighted probability,
$$a^* = \arg\max_{a} \sum_{i=1}^{K} P(h_i^{MAP} \mid D)\, \pi_{h_i^{MAP}}(a \mid s).$$

Importantly for our setting, the DNF formulas $h_i$ are learned through decision tree induction [37]. The set of demonstrated state-action pairs $(s_i, a_i) \in D$ is used to create binary feature vectors $v_i = (f_1(s_i, a_i), \ldots, f_n(s_i, a_i))$ for the demonstrated pairs and analogous vectors $v_i^-$ for the negative examples. After applying an off-the-shelf decision tree induction method (e.g., [34,42]) to all the $v_i$s and $v_i^-$s, the formulas are extracted from the tree by treating each path leading to a positive decision as a conjunction of predicates. Note that there might also be multiple other paths which lead to a negative decision or to a decision that is supported by an equal number of positive and negative examples. They are, however, of no interest to an imitation learning algorithm, because it establishes how to generate the actions, not how to avoid generating them. Treating all the positive decision paths as alternatives results in a DNF formula. This reveals an important distinction, namely, that the decision tree from which the formula came is not always equivalent to this formula, and in reality may be much more complex. The extracted formula, on the other hand, lists the conditions that need to be satisfied for an action to be chosen by the expert in a given state, thereby allowing the user to comprehend how the expert generated the demonstrations. Having a DNF formula also allows one to map it back into the decision tree space. Because of these observations, and because decision trees are viewed as strong candidates for generating interpretable descriptions of procedures [13,18], we decided to use LPP in our quest to construct an algorithm for interpretable RL. In the remainder of the text, the notation $LPP(D)$ will stand for the formula generated by the Logical Program Policies method on the set of demonstrations $D$.
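The tree-to-formula step described above can be sketched with an off-the-shelf CART learner: fit a tree on the positive and negative feature vectors, then read each path to a positive leaf as one conjunction of the DNF. This is a simplified illustration, not the LPP implementation; the feature vectors and predicate names are assumed inputs.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def extract_dnf(pos_vectors, neg_vectors, predicate_names, max_depth=5):
    """Fit a CART tree on binary feature vectors and read every path to a
    positive leaf as one conjunction of the resulting DNF formula."""
    X = np.vstack([pos_vectors, neg_vectors])
    y = np.array([1] * len(pos_vectors) + [0] * len(neg_vectors))
    t = DecisionTreeClassifier(max_depth=max_depth).fit(X, y).tree_
    conjunctions, stack = [], [(0, [])]  # (node id, literals on the path)
    while stack:
        node, path = stack.pop()
        if t.children_left[node] == t.children_right[node]:  # leaf node
            if np.argmax(t.value[node][0]) == 1:  # majority-positive leaf
                conjunctions.append(" AND ".join(path) if path else "TRUE")
            continue
        name = predicate_names[t.feature[node]]
        # With binary features, the left branch means the predicate is false.
        stack.append((t.children_left[node], path + [f"NOT {name}"]))
        stack.append((t.children_right[node], path + [name]))
    return " OR ".join(f"({c})" for c in conjunctions)
```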

Algorithm for Interpretability
In this section, we introduce Adaptive Imitation-Interpretation (AI-Interpret), an algorithm that transforms the policy learned by a reinforcement learning agent into a human-interpretable form (see Section 3). We begin by explaining what contribution AI-Interpret makes and then provide a bird's-eye view of how its components, LPP and clustering, work together to produce human-interpretable descriptions. Afterwards, we detail our approach and present a heuristic method for choosing the number of demonstration clusters used by AI-Interpret. In the last part of this section, we analyze the whole pipeline that uses the introduced algorithm to automatically discover interpretable planning strategies.

Technical Contribution
The main drawback of LPP is that its performance is highly sensitive to how well the set of domain predicates is paired with the set of demonstrations. What does this mean formally? In practice, the set of demonstrations $D$ is divided equally into two disjoint sets $D_1, D_2$, where one is used for learning the programs $h_1, h_2, \ldots, h_n$, and the other gives an unbiased estimate of the likelihood $P(D \mid h_i) \propto P(D_2 \mid h_i)$. If the predicates are too specific or too general, many, if not all, of the considered programs could assign a likelihood of 0 to $D_2$, given that they were chosen to account well for the data in $D_1$. It can also happen that no formula can be found for $D_1$ itself because the dataset $D$ contains very rare examples of the policy's behavior (rarely encountered state-action pairs) that the predicates cannot explain. This can happen even when the predicates are sufficient to compose a reasonable solution. Despite the modeler's prior knowledge, considerable optimization might be needed to obtain a set of acceptable predicates for which neither of these issues arises.

Fig. 3 Flowchart of the AI-Interpret algorithm. Demonstrations are turned into feature vectors using the domain-specific language of predicates, and then clustered into sets encompassing some type of the policy's behavior. The clusters are then ordered based on how interpretable they are. The LPP method tries to iteratively construct a logical formula that imitates the policy on the demonstrations and meets the input criteria. After every failed iteration, AI-Interpret removes the least interpretable cluster and the process repeats.
Another drawback of the LPP method is the limited interpretability of the solutions it finds. In the original formulation of LPP, the user has no control over the final form of the output other than specifying its DSL which needs to be optimized to work well with LPP. Similarly, it is unclear how well the policy induced by LPP performs in the environment in question, and how it compares to the policy that is being interpreted.
We propose a solution that uses the virtues of LPP and overcomes its limitations. The algorithm we introduce a) makes it possible to discover interpretable descriptions of RL policies even if the predicates are insufficient to describe all demonstrations, b) guarantees that the discovered interpretable descriptions are the best possible ones from a Bayesian point of view (i.e., they are MAP estimates), and c) simplifies complex policies into simple and understandable decision rules that perform almost as well.

Overview of the Algorithm
To simplify the process of creating the domain-specific language and to address the concerns mentioned in the previous subsection, we introduce an adaptive manipulation of the dataset $D$. Algorithm 1 revolves around LPP but outputs an approximate solution even in situations in which LPP would not be able to find one. Figure 3 depicts a diagram of the workflow of AI-Interpret. Similarly to the LPP method, the computation starts with a set of demonstrations and a domain-specific language of predicates that describe the environment under consideration. The algorithm turns each of the demonstrations into a binary vector with one entry per predicate and, with this data, uses LPP to find a maximum a posteriori DNF formula that best explains the demonstrations. Contrary to vanilla LPP, however, it does not stop after an unsuccessful attempt at interpretation, i.e., one that finds no solution or finds a solution that does not meet the input constraints. Instead, it searches for a subset of demonstrations that can be described by an appropriate interpretable decision rule. Concretely, AI-Interpret clusters the binary vectors into $J$ separate sets and simultaneously assigns each a heuristic value. Intuitively, this value describes how simple it is to incorporate the demonstrations of that cluster into the final interpretable description. It then successively removes the clusters with the lowest values until LPP finds a MAP formula that is consistent with all of the remaining demonstrations and abides by the specification provided by the constraints.

Adaptive Imitation-Interpretation
In this section, we describe the algorithm in more detail. First, AI-Interpret accepts a set of parameters that affect the final quality and interpretability of the result. Second, it takes four important steps (see steps 2, 3, 7, and 9) that need to be elaborated on.
We begin with a short explication of the parameters. Note that, as stated in Section 4, a logical formula $f$ induces a policy $\pi_f$ which assumes a uniform distribution over all the actions $a$ accepted by $f$ in state $s$, that is, $\pi_f(a \mid s) = \frac{1}{|\{a : f(s,a)=1\}|}$. While describing the parameters, we will refer to this policy as the interpretable policy.
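Read literally, the interpretable policy samples uniformly from the actions the formula accepts; a minimal sketch, assuming `f` is a Boolean predicate over state-action pairs:

```python
import random

def interpretable_policy(f, state, actions, rng=random):
    """Uniform distribution over the actions a with f(state, a) = 1,
    matching pi_f(a | s) = 1 / |{a : f(s, a) = 1}|."""
    accepted = [a for a in actions if f(state, a)]
    return rng.choice(accepted) if accepted else None  # None: f accepts nothing
```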
Aspiration Value α Parameter α specifies a threshold on the return ratio. For an interpretable policy to be accepted as a solution, the ratio between its return and the return of the demonstrated policy has to be at least α (see also the tolerance parameter δ below).
Number of Rollouts L Parameter L is case-dependent and specifies how many times to initialize a new state and run the interpretable policy in order to reliably estimate its performance (within the bounds specified by the tolerance parameter; see below). L should be chosen according to the problem.
Tolerance δ The parameter δ allows the user to express how much better a more complex decision rule would have to perform than a simpler rule to be preferable. Formally, the return ratio $r_2$ of the simplified strategy is considered to be significantly better than the return ratio $r_1$ of another strategy if $r_2 - r_1 > \delta$.
Mean Reward m The mean reward of policy π is what the interpretable policy's return tries to match in expectation. The maximum deviation from m is controlled by the aspiration value, whereas the expected return of the interpretable policy is estimated by performing rollouts.

Maximum Depth d Parameter d sets an upper bound on the depth of the tree that is a graphical representation of the algorithm's output. Equivalently, the formulas returned by AI-Interpret are required to use at most d predicates in each of their conjunctions. The depth of the tree (or the size of the conjunctions) is a proxy for interpretability. Decreasing the depth parameter d can force the solution to use fewer predicates; this can make the formula less accurate but more interpretable. Increasing the depth may allow the method to use more predicates; this could result in overfitting and a decline in interpretability.
Number of Clusters N The number of clusters N determines how coarsely to divide the demonstrations in D based on the similarity of their predicate values. A proper division enables selecting a subset $D_{sub} \subseteq D$ that has a high probability of being captured with the existing predicates, and lowers the chance of the validation set $D_2 \subset D_{sub}$ being largely different from the training set $D_1$.

Cut-size for the Clusters X If a cluster contains less than X% of the demonstrations, then AI-Interpret disregards it (see step 4). Choosing representative clusters makes it possible to remove outliers. Since X can in fact be kept fixed for virtually any problem, it is treated as a hyperparameter and does not constitute an input to the algorithm.
Train and Formula-validation Set Split S Another hyperparameter of our method defines how to divide the set of demonstrations D so as to find a formula using one subset and compute its likelihood using the other. The split is applied to each cluster separately. Similarly to the cut-size, it can be kept fixed irrespective of the problem under consideration and hence does not constitute an input to AI-Interpret. We assume that the splitting is performed implicitly by LPP.

We now move on to the explanation of the steps taken by the algorithm, starting with the isolated case of step 7. In the original formulation of LPP, the authors take all the actions that were not taken in a demonstration $(s, a)$ to serve as the negative examples $(s, a')$, $a' \neq a$. Since in our problem we do have access to the policy, we use a more conservative method and select only the state-action pairs that are sub-optimal with respect to π. This helps the algorithm find more accurate solutions.
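A sketch of this conservative selection, assuming access to a hypothetical action-value function `q` for the demonstrated policy:

```python
def negative_examples(state, demo_action, actions, q, eps=1e-8):
    """Step 7 (sketch): only actions that are sub-optimal with respect to the
    demonstrated policy become negative examples, unlike vanilla LPP, which
    treats every non-demonstrated action as negative."""
    best = max(q(state, a) for a in actions)
    return [(state, a) for a in actions
            if a != demo_action and q(state, a) < best - eps]
```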

After the setup steps described below (turning the demonstrations into feature vectors, clustering them, valuing and filtering the clusters, and selecting negative examples), Algorithm 1 repeats the following loop (steps 8-23):

8: Initialize a set of candidate formulas F;
9: for tree depth t = 1, . . . , d do
10:     f ← LPP(remaining demonstrations, depth t);
11:     Compute the number of distinct predicates used by f and denote it p_f;
12:     Generate L rollouts for f and compute its mean reward m_f;
13:     F ← F ∪ {f};
14: end for
15: Remove from F every formula whose return ratio is significantly worse than the best one in F (tolerance δ);
16: Choose f_best = arg min_{f ∈ F} p_f;
17: if m_{f_best}/m ≥ α then
18:     return the simple formula f_best, which obtains a similar reward to π;
19: end if
20: C_min = arg min_{C ∈ C} V(C);
21: C ← C \ {C_min};
22: until C = ∅
23: return "No solution for the considered set of predicates".

In step 2, the algorithm uses the feature vectors corresponding to predicate values and clusters them into separate subsets. This is done through hierarchical clustering with the UPGMA method [31], as this method captures the intuition that there may exist a core of predicates which evaluate to the same value for the demonstrations forming a cluster, and that there might also exist irrelevant predicates making up the noise. The clusters identified by hierarchical clustering are hence well poised to capture different sub-behaviors of the demonstrated policy.
With step 3, the algorithm measures which of the clusters are indeed well described by the predicates. The Bayesian heuristic value of a cluster (Definition 6) is defined as the MAP estimate of its interpretable description found by Logical Program Policies, weighted by the size of the cluster relative to the size of the whole set. The larger the value, the more similar the behavior encompassed by the elements of the cluster, and the more representative it is. Note that through step 3 (and after applying the cut-size in step 4) it becomes possible to rank-order the clusters. In case of a failure to interpret the policy with the existing examples, the cluster with the lowest rank can be removed (see step 21). In this way, the algorithm may disregard a set of demonstrations that are not described by the existing predicates as well as the others, and continue with the remaining ones.
In step 9, our algorithm uses the LPP method to extract formulas from progressively deeper decision trees, up to the depth d specified by the user. It then selects the formulas which are not significantly worse than the other found ones (according to the tolerance parameter; see step 15) and eventually chooses the formula with the fewest predicates (step 16). This allows our algorithm to consider all decision rules that could be generated for the same (incomplete) demonstration dataset, and return the best and simplest among them.
The solution is output as soon as the expected reward of the interpretable policy is close enough to the expected reward of the original policy (step 17). If that never happens, the algorithm concludes that the set of predicates is insufficient to satisfy the input constraints.
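Putting steps 8-23 together, the outer loop can be paraphrased as follows. The functions `lpp`, `V`, `num_predicates`, and `evaluate_return` are stand-ins for the components described above, and the tolerance-based filtering of step 15 is omitted for brevity:

```python
def ai_interpret(clusters, V, lpp, num_predicates, evaluate_return, m, alpha, d):
    """Paraphrase of Algorithm 1: repeatedly run LPP on the remaining
    demonstrations, dropping the least interpretable cluster after a failure."""
    clusters = sorted(clusters, key=V)  # lowest Bayesian heuristic value first
    while clusters:
        demos = [x for C in clusters for x in C]
        candidates = [f for f in (lpp(demos, depth) for depth in range(1, d + 1))
                      if f is not None]
        if candidates:
            f_best = min(candidates, key=num_predicates)  # step 16
            if evaluate_return(f_best) / m >= alpha:      # step 17
                return f_best
        clusters.pop(0)  # steps 20-21: remove the lowest-value cluster
    return None  # step 23: the predicates are insufficient
```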
Definition 6 (Bayesian Heuristic Value) For a subset $C \subseteq D$ extracted from a dataset of demonstrations $D$, the Bayesian heuristic value of this set is given by
$$V(C) = \frac{|C|}{|D|} \cdot \max_{h} P(C, h),$$
where the maximum ranges over the formulas considered by LPP, i.e., $\max_h P(C, h) = P(C, h_C^{MAP})$.

Choosing the Number of Clusters
In this section, we introduce a heuristic that helps narrow down the list of candidates for the parameter N of the algorithm, that is, the number of clusters.
In more detail, we adapt the popular elbow heuristic. Our version of this heuristic (see Procedure 1) allows choosing a subset of values for the number of clusters by identifying how fine-grained the clustering needs to be to most drastically change the Clustering Value (see Definition 7). The Clustering Value of N, CV(N, X), is the sum of the Bayesian heuristic values (Definition 6) of all the clusters found by hierarchical clustering whose size is at least X% of the whole set. In practice, we use the same X that serves as the cut-size hyperparameter for AI-Interpret (see Section 5.3). A leap in the values of CV conveys that the clusters are relatively big and much better described in terms of the predicates than they were for a coarser clustering. We search for an elbow in the Clustering Values because we would like the clusters to be maximally distinct while keeping their number as small as possible. Finally, the heuristic returns a set of candidate elbows, since a priori the granularity of the data revealed by the predicates is unknown.
Definition 7 (Clustering Value) For a number of clusters N and a cut-size X, the Clustering Value is
$$CV(N, X) = \sum_{\{i \,:\, |C_i| \geq \frac{X}{100}|D|\}} V(C_i),$$
where V stands for the Bayesian heuristic value function, N denotes the number of clusters $C_1, \ldots, C_N$ identified by hierarchical clustering on the dataset D, that is, $\bigcup_{i=1}^{N} C_i = D$ and $C_i \cap C_j = \emptyset$ for $i \neq j$, and X stands for the cut-size value for the size of clusters measured proportionally to D.

Procedure 1 (Elbow Heuristic)
To decide the number of clusters, fix the cut-size hyperparameter X and evaluate the Clustering Value function CV on a set of m candidate values $N_1, \ldots, N_m$. Then inspect the sequence $CV(N_1, X), \ldots, CV(N_m, X)$ and identify the elbows, i.e., the candidate values at which the Clustering Value increases most sharply. The elbows heuristically identify clustering solutions for which the elements within each cluster are similar to one another and can be appropriately described by the predicates, while the clusters remain reasonably large. If the Clustering Value never increases substantially, then the predicates do not capture the general structure of the data. An example of how to use the elbow heuristic is shown in Figure 4.
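A sketch of Procedure 1, with `clustering_value` standing in for CV(N, X) from Definition 7 and `top_k` controlling how many candidate elbows to return:

```python
def candidate_elbows(candidate_Ns, clustering_value, X, top_k=4):
    """Evaluate CV on each candidate N and return the candidates preceded by
    the largest jumps in CV, i.e., the heuristic elbows."""
    cv = [clustering_value(N, X) for N in candidate_Ns]
    jumps = {i: cv[i] - cv[i - 1] for i in range(1, len(cv))}
    ranked = sorted(jumps, key=jumps.get, reverse=True)
    return [candidate_Ns[i] for i in ranked[:top_k]]
```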

Pipeline for Interpretable Strategy Discovery
To go from a problem statement to an interpretable description of a strategy that solves that problem, we take three main steps: 1) formulate the problem in formal terms, 2) discover the optimal strategy using this formulation, and 3) interpret the discovered strategy. The first two steps can be broken down into sub-problems that our previous work has already solved [5,6,7], and the last step is feasible through AI-Interpret. Pseudo-code implementing the pipeline for automatic discovery of interpretable planning strategies is given in Algorithm 2; its key steps include:

6: Set the remaining input parameters (e.g., the aspiration value α, tolerance δ, and maximum depth d) and choose a sufficient number of rollouts L.
7: Fix the common cut-size hyperparameter for the clusters' sizes X and the split for the demonstration set S.
8: Choose K and use the elbow heuristic to determine K candidates N_1, · · · , N_K defining how many clusters to use to separate the demonstrations.
9: for i ∈ 1, · · · , K do
10:     f_i ← AI-Interpret(demonstrations, DSL, α, δ, d, L, m, N_i);
11:     Turn f_i into a graphical representation of a decision tree dt(f_i).
12: end for

Our pipeline starts with modeling the problem as a metalevel MDP and then solving it to obtain the optimal policy. To use AI-Interpret, the found policy is used to generate a set of demonstrations. We also create a DSL of predicates that is used to provide an interpretable description of this policy. We then establish the input to AI-Interpret. The mean reward of the metalevel policy is extracted directly from its Q-function by taking the maximum at the initial state. The number of clusters N is identified automatically by the elbow heuristic (Procedure 1). Since the elbow heuristic returns K candidates for N, our pipeline outputs a set of K possible interpretable descriptions, one for each clustering. Given the resulting set of candidate interpretable descriptions (decision trees) output by the pipeline, one may use background knowledge or a pre-specified criterion to choose the most interpretable tree. Criteria include, but are not limited to, choosing the tree with the smallest number of nodes, the interpretability ratings of human judges, or the performance of people who are assisted by the alternative decision trees. Our method of extracting the final result is detailed in Section 6.2.
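The whole pipeline can be summarized in a few lines; every function named below (`formalize_as_metalevel_mdp`, `solve`, `rollout`, `elbow_candidates`, `ai_interpret`, `to_decision_tree`) is a stand-in for the corresponding component described above, not actual library code:

```python
def discover_interpretable_strategies(problem, dsl, alpha, delta, d, L, X, S, K,
                                      n_demos=64):
    """End-to-end sketch of the pipeline in Figure 1 / Algorithm 2."""
    meta_mdp = formalize_as_metalevel_mdp(problem)       # step 1: formalize
    pi_star = solve(meta_mdp)                            # step 2: optimal policy
    demos = [rollout(pi_star, meta_mdp) for _ in range(n_demos)]
    m = max(pi_star.q_values(meta_mdp.initial_belief))   # mean reward from Q
    trees = []
    for N in elbow_candidates(demos, dsl, X, K):         # Procedure 1
        f = ai_interpret(demos, dsl, alpha, delta, d, L, m, N)  # step 3
        if f is not None:
            trees.append(to_decision_tree(f))            # rendered as a flowchart
    return trees
```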

Improving Human Decision-Making
Having developed a computational pipeline for discovering high-performing and easily comprehensible decision rules, we now evaluate whether this approach meets our criteria for interpretability. As a proof of concept, we test our approach on three types of planning tasks that were developed to study human decision-making [7,25]. The central question is whether we can support decision-makers in the process of planning by providing them with flowcharts discovered through our computational pipeline for interpretable strategy discovery (see Figure 1). To answer it, we performed one large behavioral experiment for each of the three types of tasks. We find that our approach allows people to largely understand the automatically discovered strategies and to benefit from them.

Planning Problems
Human planning can be understood as a series of information-gathering operations that update the person's beliefs about the relative goodness of alternative courses of action [9]. Planning a road trip, for instance, involves gathering information about the locations one might visit, estimating the value of alternative trips, and deciding when to stop planning and execute the best plan found so far. The Mouselab-MDP paradigm [8] is a computer-based task that emulates these kinds of planning problems (see Figure 5). It asks people to choose between multiple different paths, each of which involves a series of steps. To choose between those paths, people can gather information about how much reward they will receive for visiting alternative locations by clicking on the corresponding location. Since people's time is valuable, gathering this information is costly (each click costs $1), but it can also improve the quality of their decisions. Therefore, a good planning strategy has to focus the decision-maker's attention on the most valuable pieces of information.
To test our approach to improving human decision-making, we rely on three route-planning tasks that were designed to capture important aspects of why it is difficult for people to make good decisions in real life [7]. For instance, the first task captures the fact that certain steps that are very valuable in the long run (e.g., filing taxes) are often unrewarding in the short run, whereas activities that are rewarding in the short run (e.g., watching cat videos on YouTube) often have little value in the long run. The three tasks have been previously used to study how people plan [8,7], to train people to make more far-sighted decisions [25,24], and to compare the effectiveness of different ways to improve human decision-making [25].
The route planning problems we presented to our participants used the tree environment illustrated in Figure 5. The node at the bottom of this tree served as the starting node and was connected to 3 other nodes; each of those was connected to one additional node, and each of these single connections led to 2 further nodes. We call this a 3-1-2 structure and refer to the nodes that can be reached in 1, 2, and 3 steps as level 1, level 2, and level 3, respectively. [7] defined the three environments we are using in terms of the distribution of rewards for the nodes on each level. These environments differ in that the uncertainty about the rewards either increases, decreases, or stays the same from each step to the next. They created three sequences of discrete uniform distributions, further called variance structures. Building on these structures, we defined three types of environments. Within each type of environment, the rewards of all nodes at the same level were drawn from the same discrete uniform distribution; what distinguishes the environments is the assignment of reward distributions to levels. In the increasing variance environment, the most extreme rewards (up to ±48) occur on level 3; in the decreasing variance environment, the widest reward distribution is assigned to level 1; and in the constant variance environment, every level uses the same support (up to ±10).

Prior research on the constant and increasing variance environments indicates that tasks in the Mouselab-MDP paradigm pose a challenge to many people [7,25]. In the case of the constant variance structure, even extensive practice is not sufficient for participants to arrive at nearly optimal strategies [7]. We aimed to help people adopt good approximations to those strategies by showing them a decision aid for playing a game defined in the world of Mouselab MDPs. To evaluate our method's potential for helping people make better decisions in this game, we designed a series of online experiments with real people. In each, participants made decisions with versus without the support of an interpretable flowchart. In the game we were studying, participants helped a monkey climb up a tree through a path that enables it to get the highest possible reward (see Figure 5). Their decisions concerned uncovering the hidden nodes in search of such paths, knowing that each uncovered reward takes some money away from the monkey ($1).

Designing Decision Aids with AI-Interpret and Automatic Strategy Discovery
To apply our interpretable strategy discovery pipeline (see Figure 1) to the benchmark problems described in Section 6.1, we model the optimal planning strategy for each of the three types of sequential decision problems as the solution to a metalevel MDP (see Definition 2), as was previously done by [7]. The belief state of the metalevel MDP corresponds to which rewards have been observed at which locations. The computational actions of the metalevel MDP correspond to the clicks people make to reveal additional information. The cost of those computations is the fee that participants are charged for inspecting a node. We obtained the optimal metalevel policies for the three metalevel MDPs using the dynamic programming method developed by [7].
We then generated a set of 64 demonstrations by running the optimal metalevel policies on their respective metalevel MDPs and applied AI-Interpret to this set. To allow AI-Interpret to describe the demonstrated policy in logical sentences via the LPP algorithm, we provided it with a domain-specific language (DSL). The DSL supplies LPP with the basic building blocks ("words") for describing the environment and the demonstrated information-gathering operations. The DSL in which AI-Interpret describes the recommended strategies comprised six types of predicates; the full list is given in the Supplementary Material. A probabilistic context-free grammar with 14 base PRED predicates, 15 GENERAL_PRED predicates, and 12 AMONG_PRED predicates generated the final DSL according to the abovementioned types. This resulted in a set containing a total of 14206 elements. The predicates found in the flowcharts we used for the benchmark problems were of the following types: GENERAL_PRED, AMONG(PREDS), and AMONG(PREDS, AMONG_PRED).
The other parameters necessary to employ AI-Interpret in the search for interpretable descriptions comprised the number of rollouts, the aspiration value, the tolerance, the maximum depth, the number of clusters, and the mean reward of the expert policy. Preliminary runs performed to establish the expected return of the optimal metalevel policy, or of policies which behaved similarly, revealed that L = 100000 is an appropriate number of rollouts for all the studied problems. The aspiration value α was fixed at 0.7 and the tolerance parameter δ was set to 0.025. The maximum depth d was capped at 5, and the number of clusters N was chosen by the elbow heuristic employed in Algorithm 2. Eventually, N was set to 18 for the increasing and constant variance environments and to 23 for the decreasing variance environment. Clusters were created based on the output of UPGMA hierarchical clustering with the ℓ1 distance and the average linkage function [31]. We extracted the mean rewards of the optimal policies by inspecting their Q-function in the initial belief state $b_0$. We also used 2 hyperparameters. To reject outliers, both in the elbow heuristic and in the algorithm, any cluster whose size was less than X = 2.5% of the whole set of demonstrations was disregarded. The split S of the demonstration set used to validate the formulas in each iteration was equal to 0.7.
Applying AI-Interpret with this DSL and these parameters to the demonstrations induced the formulas that were most likely to have generated the selected demonstrations, subject to the inductive constraints of the DSL and the simplicity required by the listed parameters. The formal output of the pipeline for automatic discovery of interpretable strategies which employed AI-Interpret (Algorithm 2) comprised a set of K = 4 decision trees defined in terms of logical predicates. We chose one output per decision problem by selecting the tree with the fewest nodes, breaking ties in favor of the decision tree with the lowest depth. To obtain fully comprehensible decision aids, we turned those decision trees into human-interpretable flowcharts by manually translating their logical predicates into natural language questions. The translation of a GENERAL_PRED predicate depended on the particular characteristic it was capturing. For example, the predicate is_previous_observed_max was translated as "Was the previously observed value a 48?". The translation of AMONG predicates was constructed based on the following prototypes: "Is this node/it PRED AMONG_PRED", "Is this node/it PRED and PRED", or "Is this node/it AMONG_PRED among PRED nodes". For instance, among(not(is_observed), has_largest_depth) was translated as "Is it on the highest level among unobserved nodes?". The flowcharts led to two possible high-level decisions, which we named "Click it" and "Don't click it". The termination decision was reached when all the possible actions led to the "Don't click it" decision. The flowcharts were eventually polished in pilot experiments by asking participants for their semantic preferences and by incorporating the comments that they submitted.
By applying this procedure to the three types of sequential decision problems described above, we obtained the three flowcharts shown in Figure 6. The flowchart for the increasing variance environment (Figure 6a) advises people to inspect nodes on the third level until they uncover the best possible reward (+48) and then suggests to stop planning and take action. Despite being simpler, this strategy performs almost as well as the optimal metalevel policy that it imitates (39.17 points/episode vs. 39.97 points/episode). The flowchart for the decreasing variance environment (Figure 6b) allows people to move after clicking all level 1 nodes. This strategy is also simpler than the optimal metalevel policy that it imitates, and performs almost as well (28.47 points/episode vs. 30.14 points/episode). The flowchart for the constant variance environment (Figure 6c) instructs a person to click level 1 or level 2 nodes that lie on the path with the highest expected return until observing a reward of +10. Then, it suggests to click the two level 3 nodes that are above the +10, and then either to get back to clicking on levels 1 or 2 or, if the best path passes through the +10, to stop planning and take action. This strategy, too, is simpler than the optimal metalevel policy that it imitates and once again performs almost as well (7.03 points/episode vs. 9.33 points/episode).

Evaluation in Behavioral Experiments
We evaluated the interpretability of the decision aids designed with the help of automatic strategy discovery and AI-Interpret in a series of 3 behavioral experiments. These three experiments evaluated whether the flowcharts generated by our approach were able to improve human decision-making in the Mouselab-MDP environments with increasing variance, decreasing variance, and constant variance, respectively.
In each experiment, participants were posed a series of sequential decision problems using the interface illustrated in Figure 5. In each round, the participant's task was to collect the highest possible sum of rewards, henceforth called the score, by moving a monkey up a tree along one of the six possible paths. Nodes in the tree harbored positive or negative rewards and were initially occluded; they could be made visible by clicking on them for a fee of $1 or by moving onto them after planning. The participant's score was the sum of the rewards along their chosen path minus the fees they had paid to collect information.
Our experiments focused on two outcome measures: the expected score and the click agreement. The expected score is the sum of the revealed rewards on the most promising path in the round right before the participant started to move, minus the cost of his or her clicks. This equals the participant's expected score because the expected reward of an occluded node was 0 in all of the chosen decision problems. The fact that the expected score is equal to the value of the termination operation of the corresponding metalevel MDP makes it the most principled performance metric we could choose. It is also the most reliable measure of the participant's decision quality because it is their expected performance across all possible environments that are consistent with the observed information. The total score, by contrast, includes additional noise due to the rewards underneath unobserved nodes. Our second outcome measure, click agreement, quantifies a person's understanding of the conveyed strategy by measuring how many of his or her clicks are consistent versus inconsistent with that strategy. When the participant clicked a node for which the flowchart said "Click it", this was considered a consistent click. When the participant clicked a node that the flowchart evaluated as "Don't click it", this was considered an inconsistent click. We defined the click agreement as the proportion of consistent clicks relative to all performed clicks, that is, $\text{agreement} = \frac{n_{\text{consistent}}}{n_{\text{consistent}} + n_{\text{inconsistent}}}$.
When people made fewer clicks than the flowchart suggested, the difference between the number of clicks made by the strategy shown in the flowchart and the participant's number of clicks was counted towards the number of inconsistent clicks. The number of clicks made by the strategy was estimated by averaging its number of clicks across 1000 simulations.
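The two rules above combine into the following metric (a direct transcription of the definition; the `n_strategy` argument is the simulation-based estimate of the strategy's click count):

```python
def click_agreement(n_consistent, n_inconsistent, n_strategy):
    """Proportion of consistent clicks; clicks the flowchart strategy would
    have made but the participant omitted count as inconsistent."""
    n_made = n_consistent + n_inconsistent
    missing = max(0.0, n_strategy - n_made)
    return n_consistent / (n_made + missing)
```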

Experiment 1: Improving Planning in Environments with Increasing Variance
In the first experiment, we evaluated whether the flowchart presented in Figure 6a can improve people's performance in sequential decision problems where the uncertainty about the reward increases from each step to the next.
Procedure Each participant was randomly assigned to either a control condition in which no strategy was taught or an experimental condition in which a flowchart conveyed the strategy. The control condition consisted of an introduction to the Mouselab-MDP paradigm including three exploration trials, a quiz to test the understanding of the instructions, ten test trials, and a short survey. We instructed participants to maximize their score using their own strategy and incentivized them by announcing an unspecified score-based bonus. After the experiment, they received 2 cents for each virtual dollar they had earned in the game. The experimental condition consisted of an introduction to the Mouselab-MDP paradigm including one exploration trial, an introduction to flowcharts and their terminology including two practice trials, a quiz to test the understanding of the instructions, ten test trials, and a short survey. The practice and test trials displayed a flowchart next to the path-planning problem. The flowchart used in the practice block did not convey a reward-enhancing strategy, to avoid a training effect, whereas the one in the test block did. We instructed participants to act according to the displayed flowchart. Specifically, participants were asked to first click all nodes for which following the flowchart led to the "Click it" decision and to then move the agent (a monkey) along the path with the largest sum of revealed rewards. To incentivize participants, they were told they would receive a bonus depending on how well they followed the flowchart, and they received a bonus that was proportional to their click agreement score.
Participants We recruited 172 people on Amazon Mechanical Turk (average age: 37.9 years, range: 18-69 years; 85 female). Each participant received a compensation of $0.15 plus a performance-based bonus of up to $0.65. The mean duration of the experiment was 10.3 minutes. On average, participants needed 1.4 attempts to pass the quiz. Because not clicking is highly indicative of speeding through the experiment without engaging with the task, we excluded 15 participants (i.e., 8.72%) who did not perform any click in the test block. This yielded 78 participants for the control condition and 79 participants for the experimental condition.

Results
The mean click agreement was 47% (SD = 26%) in the control condition and 67% (SD = 28%) in the experimental condition (see Figure 7a). The proportion of people who achieved a click agreement above 80% increased from 9% without the flowchart to 48% with the flowchart. Similarly, the proportion of participants who achieved a click agreement above 50% increased from 46% to 70%. We observed that the distribution of click agreement was highly left-skewed and thus used a Mann-Whitney U-test for statistical comparisons. A two-sided test revealed that the click agreements in the experimental condition were significantly higher than in the control condition (U = 1773.5, p < .001). Thus, participants confronted with the flowchart followed its intended strategy more often than participants without the flowchart. The average expected score per trial was 28.41 (SD = 13.33) in the control condition and 34.47 (SD = 11.85) in the experimental condition. This corresponds to 71% and 86% of the score of the optimal strategy, respectively. The distribution of the mean score was slightly left-skewed. A two-sided t-test showed that the expected scores in the experimental condition were significantly higher than in the control condition (t = 2.99, p = .003). Similar results were obtained using the Mann-Whitney U-test (U = 2110, p < .001). Thus, the flowchart positively affected people's planning strategies, as participants assisted by the flowchart revealed more promising paths before moving than participants who acted at their own discretion.
In sum, participants both understood the strategy conveyed by the flowchart (higher click agreement) and used it, which increased their performance (higher expected scores).

Experiment 2: Improving Planning in Environments with Decreasing Variance
In the second experiment, we evaluated whether the flowchart presented in Figure 6b can improve people's performance in environments where the uncertainty about the reward decreases from each step to the next.
Procedure The experimental procedure used in this study was identical to the one presented for Experiment 1 (see Section 6.3.1), except that the task used the decreasing variance environment instead of the increasing variance environment and that participants in the experimental condition were correspondingly shown the flowchart in Figure 6b.
Participants We recruited 60 people on Amazon Mechanical Turk (average age: 33.7 years, range: 19-68 years; 25 female). Each participant received a compensation of $0.50 plus a performance-based bonus of up to $0.50. The mean duration of the experiment was 8.7 minutes. On average, participants needed 1.5 attempts to pass the quiz. We excluded 5 participants (8.33%) who did not perform any click in the test block. This resulted in 26 participants for the control condition and 29 participants for the experimental condition.

Results
The mean click agreement was 48% (SD = 14%) in the control condition and 79% (SD = 24%) in the experimental condition (see Figure 8a). The proportion of people who achieved a click agreement above 80% increased from 0% without the flowchart to 59% with the flowchart. Similarly, the proportion of participants who achieved a click agreement above 50% increased from 42% to 76%. Participants confronted with the flowchart followed its intended strategy significantly more often than participants without the flowchart (t = 5.74, p < .001; U = 126.0, p < .001).
The average expected score per trial was 23.11 (SD = 7.11) in the control condition and 25.33 (SD = 7.54) in the experimental condition. This corresponds to 77% and 84% of the score of the optimal strategy, respectively. Although the difference between the experimental condition and the control condition was not statistically significant (t = 1.1, p = .277; U = 316, p = .154), higher click agreement was significantly correlated with a higher expected score (r(53) = .43, p = .001).
Thus, similarly to the previous experiment, participants understood the strategy conveyed by the flowchart, which resulted in significantly higher click agreement. Still, the small sample size and the less challenging environment (inspecting immediate rewards is more intuitive than inspecting distant outcomes) prevented us from detecting a significant difference in the expected rewards.

Experiment 3: Improving Planning in Environments with Constant Variance
In the third experiment, we evaluated whether the flowchart presented in Figure 6c can improve people's performance in an environment where the uncertainty about the reward is the same in each step.

Procedure Since the flowchart for the constant variance environment is more complex than the flowcharts for the other two environments, participants in Experiment 3 were trained more extensively than participants in Experiments 1 and 2. The goal of this procedure was to familiarize the experimental group with the flowchart as well as possible so that they could use it during the testing phase. To minimize differences between the experimental condition and the control condition that could lead to asymmetric attrition, both groups went through the same training procedure, but only the experimental group was supported by the flowchart during the test trials. Both conditions consisted of an introduction to the Mouselab-MDP paradigm including one exploration trial, an introduction to the terminology used in the flowcharts, a quiz and a training phase on the introduced notions, an introduction to flowcharts per se, a second quiz, and a practice phase on flowcharts. Each quiz consisted of 3 simple questions to check attentiveness. During training, participants answered three different questions about highlighted nodes in a partially revealed training tree. These questions had the same structure as the questions used in the testing flowchart but asked for different values (e.g., "Is it an unobserved node lying on a path with -8?"). Participants were given feedback on their answers and could advance to the next question only after answering correctly. In each training round, participants were sequentially quizzed about six randomly selected nodes. After a participant answered two questions about a node, the node was uncovered and the selection mark moved to another node. There were at least 3 and at most 10 training rounds. Participants were allowed to end the training after they had answered each of the flowchart's three questions at least 15 times and achieved an accuracy of at least 75% on each of them. The training phase was followed by an introduction to flowcharts and another quiz on understanding the task. The last block before the test phase comprised 2 practice rounds with a practice flowchart. This flowchart used only the questions presented in the training phase and, as in the previous experiments, did not convey a reward-enhancing strategy, so as to minimize the effect of the shared training block on people's choices in the test block. Participants were required to first select a candidate node and sequentially answer the questions that the flowchart asked about it until the flowchart reached a decision about whether or not to click on the node. According to this decision, they were either allowed to reveal the selected node or not. Participants could not move the monkey before they had revealed all nodes that the flowchart suggested clicking. Finally, because of the large number of training trials, we used an increasing variance structure throughout the non-test trials, eliminating the possibility of implicitly learning the test-block strategy. The test block presented participants with planning problems in the constant variance environment. The experimental condition differed from the control condition only in the setup of the test block: in the 10 test trials, the experimental group was assisted by the flowchart, whereas the control condition was not.
To minimize differences in the duration of the test block, the control group completed 15 additional problems after the 10 test rounds; those additional problems were not considered in the analysis.
In contrast to the previous experiments, in Experiment 3 the flowchart was not visible as a whole during the test rounds. Rather, participants had to go through the flowchart by answering two consecutive questions interactively until they reached a decision about whether or not to click the selected node. Participants did not receive feedback on their answers, nor were they bound to the flowchart's decision. In addition, when a participant in the control condition attempted to move after having revealed fewer than three nodes, a dialogue informed them that "Many people overlook some of the nodes that the flowchart allows clicking and miss the bonus." and asked them "Are you sure you want to move?". The control group was promised and paid 2 cents of bonus for each dollar they scored in the game, whereas the experimental group could earn or lose 10 cents of bonus depending on whether or not a click was congruent with the flowchart.
Participants We recruited 149 people on Amazon Mechanical Turk (average age 33.7 years, range: 18-65 years; 62 female). Each participant received a compensation of $3 plus a performance-based bonus of up to $6. The mean duration of the experiment was 51 minutes. On average, participants needed two attempts to pass a quiz. We excluded 30 participants (20.1%) who required four or more attempts on one of the quizzes. This resulted in 60 participants for the control condition and 59 participants for the experimental condition.

Results
The mean click agreement was 30.95% (SD = 15.91%) in the control condition and 58.98% (SD = 24.44%) in the experimental condition (see Figure 9a). The proportion of people who achieved a click agreement above 80% increased from 0% without the flowchart to 25.42% with the flowchart. Similarly, the proportion of participants who achieved a click agreement above 50% increased from 10% to 57.62%. Participants confronted with the flowchart followed its intended strategy significantly more often than participants without the flowchart (t = 7.36, p < .001; U = 651.0, p < .001).
The mean expected score per trial was 4.17 (SD = 2.63) in the control condition and 5.95 (SD = 2.75) in the experimental condition. This corresponds to 45% and 64% of the score of the optimal strategy, respectively. The presence of the flowchart increased the expected score significantly (t = 3.59, p < .001; U = 1097.5, p < .001). Moreover, higher click agreement was significantly correlated with a higher expected score (r(117) = .546, p < .001).
These findings indicate that participants understood the strategy conveyed by the flowchart (higher click agreement), which improved their planning behavior, as illustrated by the increased expected score.

Discussion of Findings on Improving Human Decision-Making
The results of Experiments 1-3 show that AI-Interpret succeeded in approximating the optimal planning strategies for three different sequential decision problems by simple, human-interpretable decision rules. Presenting these decision rules in the form of flowcharts aligned the way in which people arrived at their decisions more closely with those decision rules. As a consequence, the quality of people's decisions improved. This improvement was statistically significant in the increasing variance environment and in the constant variance environment. The three decision problems differ in how difficult they are for people and in the complexity of the strategy that people would have to follow to solve them optimally. Sequential decision problems with decreasing variance are easiest for people because people's intuitive tendency to inspect the immediate rewards first is optimal in this environment. As one would expect on that basis, we found that the benefits of our approach were smaller in the decreasing variance environment than in the other two environments, where people's intuitive strategies fare poorly. Moreover, the optimal strategy for the constant variance environment is much more complex than the optimal strategies for the increasing variance environment and the decreasing variance environment. Consequently, we found that people's ability to follow this strategy, with or without a flowchart, was significantly lower than their ability to follow the optimal strategies for the increasing and decreasing variance environments, respectively. Taken together, our findings suggest that interpretable strategy discovery is a promising way to leverage machine learning for designing decision aids. Our approach holds the greatest promise for problems where people's intuitive strategies fare poorly and the optimal strategy is relatively simple. The literature on cognitive biases suggests that there are numerous situations in which people's intuitive strategies perform very poorly [20], and the literature on heuristic decision-making suggests that there are simple heuristics that people could use to perform much better [14]. This makes interpretable strategy discovery a promising approach for improving human decision-making in the real world.

Benefits of AI-Interpret Over Simpler Alternatives
After showing that our relatively complex AI-Interpret algorithm generates human-interpretable decision rules, we now demonstrate that its sophisticated method for selecting a subset of the demonstrations is essential to its success. To achieve this, we compare AI-Interpret against LPP and an ablated version of AI-Interpret. We compare the approaches in terms of the proportion of benchmark problems for which they can find a solution, how consistently they find that solution, and the average performance of found decision rules. Our results support the use of AI-Interpret in the Automatic Discovery of Interpretable Planning Strategies pipeline (see Algorithm 2 and Figure 1) tested in the previous section. Before delving into this comparison, we briefly introduce the benchmark problems and the baseline methods against which AI-Interpret will be evaluated.

Benchmark Problems
To assess the performance of our algorithm, we tested it on a set of benchmark problems. In each benchmark problem, the algorithm has to find an interpretable description of a reinforcement learning policy. Specifically, we considered the optimal policy for a metalevel MDP (see Definition 2) corresponding to different versions of the planning problem introduced in Section 6, found with the dynamic programming method developed by Callaway et al. [5]. In addition to the three planning problems used in Experiments 1-3, the benchmark problems include a fourth type of planning problem where the rewards at the first, second, and third level are drawn from different discrete uniform distributions (the different variance environment). Besides different classes of MDPs, the benchmark problems described in Table 1 also vary the size of the demonstration set from x = 8 to x = 64 and x = 128 trajectories (i.e., sequences of b-c pairs where b is a belief state and c is a computation) starting in the initial belief state (b_0) and ending with the termination operation (⊥). Each of these trajectories was generated by applying one of the optimal policies to the corresponding metalevel MDP, as described in Section 6.2. This resulted in 12 benchmark problems in total (see Table 1). For simplicity, we will refer to them by the number of trajectories and the variance structure of the environment.
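For concreteness, the following sketch shows one way such a demonstration set could be represented in code; the type names and the belief encoding are illustrative assumptions.

```python
# Minimal sketch of how a demonstration set for the benchmark problems could
# be represented: each trajectory is a list of (belief state, computation)
# pairs, ending with the termination operation, here encoded as 0 following
# the convention described in the Supplementary Material. Names are illustrative.
from typing import NamedTuple

class Step(NamedTuple):
    belief: tuple      # expected value per node; None marks an unobserved node
    computation: int   # index of the node to click; 0 = terminate

# One hypothetical trajectory: click node 5, then node 7, then terminate.
trajectory = [
    Step(belief=(None, None, None, None, None, None, None, None), computation=5),
    Step(belief=(None, None, None, None, 8, None, None, None), computation=7),
    Step(belief=(None, None, None, None, 8, None, -4, None), computation=0),
]

# A demonstration set of x trajectories is then simply a list of such lists.
demonstrations = [trajectory]
```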

LPP
To show the beneficial effect of adding clustering to our algorithm, we compared AI-Interpret against vanilla LPP. Logical Program Policies has just one shot at the interpretation: the set of demonstrations is split into train and formula-validation sets and, based on the supplied DSL, the algorithm searches for a disjunctive normal form formula that provides a MAP approximation to the demonstrations. It either finds an interpretable formula or concludes that the input set of demonstrations cannot be described and returns a trivial solution equivalent to the Boolean False. We ran LPP in a loop over maximum depths up to the input depth d to make its results fully comparable to the results of AI-Interpret, which also performs this step. The main source of randomness for LPP is which demonstrations are assigned to the train set and which are assigned to the formula-validation set. AI-Interpret mitigates this source of noise by sampling the train and validation sets directly from the clusters it previously creates.
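To make the contrast with AI-Interpret concrete, here is a minimal sketch of the one-shot LPP loop just described. The Bayesian program-induction step itself is abstracted into an `induce` callable, a hypothetical stand-in rather than a real library call.

```python
# Hedged sketch of the LPP loop described above, under the stated assumptions.
import random

def run_lpp(demonstrations, induce, max_depth, split=0.7, seed=None):
    """One-shot LPP: a single random train/validation split, then a search
    over increasing maximum formula depths."""
    rng = random.Random(seed)
    demos = list(demonstrations)
    rng.shuffle(demos)
    k = int(split * len(demos))
    train, validation = demos[:k], demos[k:]
    for depth in range(1, max_depth + 1):
        # Search for a DNF formula (MAP approximation) of at most this depth.
        formula = induce(train, validation, depth)
        if formula is not None:
            return formula
    return False  # trivial solution: the demonstrations cannot be described

# Toy usage with a stub induction step that only succeeds beyond depth 1.
print(run_lpp(range(10), lambda tr, va, d: "formula" if d > 1 else None, 3))
```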

Binary-Interpret
To show that the clustering method used in AI-Interpret is the key enabler of our algorithm's success, we compared AI-Interpret against a simpler approach to selecting a subset of the demonstrations, which we call Binary Interpretation (Binary-Interpret). Binary-Interpret uses the principles of the binary search algorithm and accepts the following parameters: aspiration level α, tolerance δ, number of rollouts L, maximum depth d, mean expert reward m, and an additional parameter, patience. It also uses one hyperparameter: the train-validation split S. Binary-Interpret starts by trying to find a formula that satisfies the input constraints using all of the demonstrations, attempting to extract an interpretable formula from gradually deeper decision trees up to depth d. In case of a failure, however, it does not stop but discards half of the demonstrations at random and tries again on the remaining half. In case of a success, it increases the size of the demonstration set by half the size of the previously removed set (if any demonstrations were previously removed) and randomly re-samples the demonstrations. The process continues until the algorithm finds a solution but fails in the next step, or until the difference in size between the currently checked demonstration set and the previous one is equal to or smaller than the patience parameter. In each step of Binary-Interpret, the train and formula-validation sets are sampled from the demonstrations according to the split S = 0.7. In our tests, we used patience = 8, meaning that Binary-Interpret stopped when there were only 4 more demonstrations left to consider re-including or removing. The remaining four parameters were shared with AI-Interpret; we used the same values as those listed in Section 6.2, namely α = 0.7, δ = 0.025, L = 100000, d = 5, and m dependent on the studied problem. The pseudo-code detailing Binary-Interpret can be found in the Supplementary Material.
Binary-Interpret is built on the assumption that the more demonstrations there are, the larger their variety. Since demonstrations can include rare special cases that cannot be captured with the available predicates or cannot be incorporated into the final decision tree, Binary-Interpret checks increasingly smaller sets of demonstrations. It thereby uses the same underlying assumptions as AI-Interpret. In a degenerate sense, it can even be viewed as a clustering method: it assigns each demonstration to a separate cluster and allows more than one of them to be removed in a single step. In this light, Binary-Interpret may be viewed as an ablation of AI-Interpret that lacks the component of intelligent clustering.

Quantitative Results
The random split of the demonstrations into a training set and a formula-validation set renders the outputs of AI-Interpret, Binary-Interpret, and LPP stochastic. Nevertheless, we found the outputs of AI-Interpret to be highly consistent across runs and robust to variations in the set of demonstrations. This made it sufficient to run each algorithm 10 times on our benchmark problems. Although the error bars of the baseline methods are wide, the amount of data was sufficient to ascertain that the performance of AI-Interpret is significantly better than that of the baseline methods. The results show that AI-Interpret is very reliable and consistently induces near-optimal decision rules.
To evaluate each method's performance in the context of interpretable strategy discovery, it was inserted into our pipeline for automatically discovering interpretable planning strategies (see Algorithm 2 and Figure 1). The exact setup and parameters we used for our algorithm, the baselines, and Algorithm 2 can be found in Section 6.2. In the case of Binary-Interpret and LPP, the pipeline returned just one decision tree since there was no loop over candidate cluster sizes, whereas for our algorithm the output comprised a set. The candidates for the number of clusters that led to the creation of this set were selected by the elbow heuristic applied during the execution of the pipeline from Algorithm 2. As detailed in Section 6.2, to choose the most interpretable output for each run, we picked the tree with the fewest nodes and, in case of a tie, with the lowest depth. We describe our results in this section with the following statistics (see the sketch after this list):
- Performance ratio (PERF): assuming f is the formula that was turned into a decision tree, this parameter is the average fraction m_f / m_{π*_meta}, where m_f and m_{π*_meta} denote the mean reward after 100000 rollouts of the policy induced by the formula f and the mean reward of the expert policy π*_meta, respectively.
- Complexity: the number of nodes of the output tree.
- Support: the mean proportion of state-action pairs that were used to find the most interpretable result.
- Entropy (ENTR): the entropy of solutions (including a failure solution) generated across 10 runs in total. Lower values indicate that the method is more reliable because its outputs are less variable.
- Success rate (SUCC): the proportion of times the algorithm generated a formula out of 10 runs in total. Measuring the success rate helps us to numerically capture the effectiveness of each method.
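The following minimal sketch shows how these statistics could be computed, assuming each run yields either a canonical string form of the found solution or None for a failed run.

```python
# Sketch of the evaluation statistics defined above, under the stated assumptions.
from collections import Counter
from math import log2

def success_rate(outputs):
    """SUCC: proportion of runs that produced a formula."""
    return sum(o is not None for o in outputs) / len(outputs)

def entropy(outputs):
    """ENTR: entropy of the distinct results; failure counts as its own outcome."""
    n = len(outputs)
    return -sum((c / n) * log2(c / n) for c in Counter(outputs).values())

def performance_ratio(mean_reward_formula, mean_reward_expert):
    """PERF: m_f / m_{pi*_meta}, each estimated from 100000 rollouts."""
    return mean_reward_formula / mean_reward_expert

runs = ["tree_A", "tree_A", "tree_A", None, "tree_B"] * 2  # 10 illustrative runs
print(success_rate(runs), round(entropy(runs), 3))
```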
When LPP is unable to find a decision rule, it returns False, which entails no planning. Thus, when any of the three methods is unable to find any formula that is consistent with (any subset of) the demonstrations, the resulting decision strategy uses zero information about the environment. The performance ratio of this no-planning strategy is 0 because the expected value of the reward distributions we used in our benchmark problems is always 0. Runs that did not output any decision tree counted as unsuccessful and were considered in the calculation of the entropy metric. Table 1 presents the performance of all methods on each of the 12 benchmark problems in terms of three metrics: performance ratios with 95% confidence intervals, entropy, and success rates. Rows correspond to different benchmarks and columns to the algorithms. Additional statistics are reported in the main text.

Benchmark Evaluation
By inspecting the evaluation presented in Table 1, we see that AI-Interpret consistently managed to discover simple, high-performing decision rules. While AI-Interpret succeeded in finding an interpretable decision rule on every single one of its 120 runs on the benchmark problems, LPP and Binary-Interpret succeeded on only 26/120 and 40/120 runs, respectively. Two χ²-tests for contingency tables confirmed that our algorithm succeeds in finding interpretable descriptions significantly more often than Binary-Interpret (χ²(4) = 31.82, p < .0001) and LPP (χ²(4) = 31.82, p < .0001). Secondly, we compared AI-Interpret with the baselines on the basis of the performance ratio (PERF). On average across all benchmark problems, the decision rules induced by AI-Interpret achieved 87.3% ± 2.14% of the return of the optimal metalevel policy. By contrast, the performance ratios of LPP and Binary-Interpret were merely 19.5% ± 6.85% and 31.1% ± 7.96%, respectively. Mann-Whitney U-tests confirmed that these differences are statistically significant (AI-Interpret vs. Binary-Interpret: U = 3173.0, p < .0001; AI-Interpret vs. LPP: U = 2061.0, p < .0001). As shown in Table 1, the performance benefit of AI-Interpret was consistently present across all of the benchmark problems.

Table 1 Statistics of the solutions found by the tested interpretation algorithms run on variable-sized demonstration datasets and Mouselab MDPs with increasing, decreasing, constant, and different variances. The performance metric PERF gives the estimated proportion of the reward a decision-tree-induced policy obtains relative to the optimal policy it is imitating. ENTR specifies the entropy of the distinct results generated over 10 test runs (failure in interpretation is also counted as a distinct result); entropy reflects how stochastic the outputs of the method are across multiple runs on the same dataset, and the lower the ENTR measure, the more consistent the algorithm. SUCC denotes the proportion of successful vs. failed runs over the recorded 10 attempts. The error bars enclose 95% confidence intervals. The best results for each benchmark and statistic are bolded.
The entropy metrics shown in Table 1 suggest that AI-Interpret always outputs the same solution when 64 or 128 demonstrations are provided but is less stable when only 8 demonstrations are provided. On average, the descriptions generated by Binary-Interpret and LPP also had reasonably low entropy. But while AI-Interpret achieved its consistency by reliably finding good decision rules (100% success rate, 85.16% lower confidence bound on performance), LPP and Binary-Interpret consistently failed to find any solution on the majority of the benchmark problems (Binary-Interpret with a 31.3% ± 29.7% success rate and a 39.05% upper confidence bound on performance; LPP with a 13.75% ± 18.5% success rate and a 26.35% upper confidence bound on performance).
Having established that our algorithm is largely superior to both LPP and Binary-Interpret, we turned to describing the formulas it finds (or the decision trees, when used in Algorithm 2). The smallest decision trees induced from AI-Interpret's output had merely 1 node, whereas the biggest ones needed 8 nodes. Still, the mean complexity was very low and equaled 2.75 ± 0.31. All but one of the found interpretable descriptions were discovered on a modified input dataset that excluded some of the demonstrations; on average, AI-Interpret had a support of 59.42% ± 3.44% of all state-action pairs. Moreover, the variation within each environment type was small. For the constant variance benchmark problems, the support equaled 52.43% ± 6.08%; for the different variance problems it was 48.64% ± 5.47%; for the decreasing variance problems it was 52.83% ± 1.56%; and for the increasing variance problems it was 83.77% ± 4.22%. These measures indicate that AI-Interpret chose the proportion of retained demonstrations adaptively depending on the environment. Furthermore, inspecting the clustering value as a function of the number of clusters revealed that the elbow heuristic is a useful criterion for choosing that number to work well with AI-Interpret (see Supplementary Material).
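For illustration, a simple variant of the elbow heuristic picks the number of clusters at which the curvature (second difference) of the clustering value is largest; AI-Interpret's exact criterion may differ in detail.

```python
# Sketch of a second-difference elbow heuristic; the data are illustrative.
def elbow(ks, vals):
    """Return the k that maximizes the curvature of the clustering value."""
    best_k, best_curv = None, float("-inf")
    for i in range(1, len(ks) - 1):
        curv = vals[i - 1] - 2 * vals[i] + vals[i + 1]
        if curv > best_curv:
            best_k, best_curv = ks[i], curv
    return best_k

# Example: the clustering objective flattens out after k = 2.
print(elbow([1, 2, 3, 4, 5], [10.0, 6.0, 4.5, 4.0, 3.8]))  # -> 2
```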
These findings highlight that AI-Interpret is clearly superior to LPP and Binary-Interpret. Intelligent clustering enables it to find solutions where LPP and Binary-Interpret fail. The performance ratios of these solutions indicate that AI-Interpret discovers policies of high quality. The low entropy of the outputs and the high success rate indicate that the clusters indeed capture similar demonstrations and make the output reliable and trustworthy. We therefore conclude that the innovations of AI-Interpret were critical to the success of interpretable strategy discovery at improving human decision-making reported in Section 6 and that our algorithm is ready to be applied to other reinforcement learning problems requiring human interpretability.

General Discussion and Future Directions
Decision aids, such as decision trees and flowcharts, help professionals (e.g., medical doctors) make better decisions by guiding them through a more systematic decision strategy that prioritizes the most important information. Recent advances in cognitive science make it possible to leverage reinforcement learning methods to discover optimal versions of such strategies automatically [5,17,27,28]. In this article, we extended a reinforcement learning method that automatically discovers near-optimal decision-making strategies by adding the interpretable reinforcement learning algorithm AI-Interpret. This extension enables the method to automatically generate near-optimal decision aids instead of outputting complex RL policies, as it did before. The pipeline for the automated discovery of interpretable strategies takes four main steps: 1) it models a decision problem as a metalevel MDP, 2) it solves for the optimal metalevel policy, 3) it interprets this policy with AI-Interpret, and 4) it turns the resulting formula into a human-interpretable flowchart. Our proof-of-concept demonstrations showed that the decision aids generated by this method can improve people's planning strategies and the quality of the resulting decisions. While AI-Interpret builds on a promising Bayesian program induction approach to imitation learning (i.e., LPP [40]), we found that its innovations are in fact critical. The original version of LPP and simpler extensions were not robust enough to tackle the real-world challenges of approximating complex, stochastic, and irregular policies with simple decision rules that can be readily understood and applied by people. AI-Interpret achieves this robustness by clustering the set of demonstrations and identifying the largest possible set of behaviors that can be captured by an easily comprehensible logical formula. The results of our quantitative experiments clearly indicate a beneficial effect of clustering the set of demonstrations. The ablated version of AI-Interpret managed to find reliable decision rules for only one out of four types of sequential decision problems, whereas AI-Interpret consistently found well-performing, simple rules for all of them.
These findings indicate that AI-Interpret is an important step towards leveraging reinforcement learning to boost people's decision-making skills in real life. This illustrates how interpretable machine learning can be used to help people perform better instead of replacing them entirely.

Figure 10 A multi-alternative risky choice task modeling real-life investment decisions [36]. To decide which company to invest in, the decision-maker can compare the companies on multiple different attributes, such as market share, financial flexibility, image, efficiency, and management. AI-Interpret can be applied to discover optimal decision strategies for this investment task as part of our automatic planning strategy discovery pipeline (see Figure 1).

Directions for Future Work
AI-Interpret is a very general method with a broad range of possible real-life applications. These applications include improving human decision-making, understanding the decisions of artificial intelligent systems, and understanding human decisions. Each of these applications can be pursued in a wide range of domains including planning, decision-making, reasoning, vision, robotics, and learning.
Future work will extend our approach of helping people make better decisions to increasingly more realistic scenarios, such as purchasing, hiring, (college) admissions, investing, and medical diagnosis. For example, one direction we plan to explore in the near future is discovering human-interpretable decision strategies for multi-alternative risky choice problems that model real-life investment decisions (e.g., [36]; see Figure 10). A natural extension of multi-alternative risky choice is the topic of product selection illustrated in Figure 2a. Furthermore, we will also apply AI-Interpret to partially automate the process of scientific discovery in cognitive science by assisting cognitive scientists in their efforts to derive people's decision strategies from the order in which they inspect different pieces of information (see Figure 2a). Future work could also explore applying this approach to explain the decisions of deep neural networks that perform at a super-human level (e.g., [32,38,39]) and to transfer their expertise to people.
To establish a solid foundation for these real-world applications, future research will rigorously analyze the AI-Interpret algorithm in the theoretical framework of statistical learning theory [43]. We also plan to explore translating the decision tree that our method generates into a program in linear temporal logic [10,44] that specifies which operation should be performed next. We predict that flowchart representations of such programs, which tell people which operation to perform next, will be much more helpful than flowcharts for determining whether a given planning operation is consistent with the recommended strategy.
Interpretable flowcharts can be used not only as decision aids but also for teaching effective decision strategies. Existing cognitive tutors teach decision strategies primarily by giving people feedback about how they make their decisions while they practice decision-making in a simulated environment [24,25]. Since this pedagogical approach primarily relies on implicit learning, people's conscious understanding of the taught strategy tends to be limited to its application in the training environment [25]. Interpretable flowcharts, by contrast, represent strategies in general terms that are directly applicable to decision-making in real-life. Adding interpretable flowcharts to cognitive tutors might therefore make it much easier for people to transfer what they were taught to decision-making in everyday life. This makes augmenting cognitive tutors with human interpretable flowcharts discovered with the help of AI-Interpret another promising direction for future work.
In the long run, this line of research may lead to deep insights into decision-making, clever cognitive strategies, and practical tools that help people make better decisions in many important real-life situations. In this way, advances in artificial intelligence can enhance human intelligence instead of replacing people. This is an important antidote to people losing their jobs to robots and algorithms.

Supplementary Material

Description of the Predicates
Every predicate in the Domain Specific Language we defined accepts (at least) two arguments: the belief state of the environment b and a computation/action c. In our case, the belief state corresponds to the list of expected values of the nodes in the MDP, whereas the computation is the number of the node to click, with 0 reserved for termination. The MDPs we used in our tests had the form of a tree, hence many of the predicates make use of notions defined for tree graph structures. The meaning of the predicates, presented below in alphabetical order, is the following (a sketch of two example implementations follows the list):

all(b, c, pred1, pred2): All the nodes in the MDP that satisfy pred1 also satisfy pred2.
among(b, c, pred1, pred2): This node is among all the nodes in the MDP that satisfy pred1, and within that set it also satisfies pred2.
are_branch_leaves_observed(b, c): All of this node's successor leaves are observed.
are_leaves_observed(b, c): All leaf nodes have been observed.
are_roots_observed(b, c): All nodes on level 1 have been observed.
has_child_highest_value(b, c, list): This node has a child with an observed value that is higher than any other observed child's value for the nodes from list.
has_child_highest_level_value(b, c): This node's child has the maximum possible value on its level.
has_child_lowest_value(b, c, list): This node has a child with an observed value that is lower than any other observed child's value for the nodes from list.
has_child_lowest_level_value(b, c): This node's child has the minimum possible value on its level.
has_largest_depth(b, c, list): This node is the deepest in the tree among the nodes from list.
has_leaf_highest_value(b, c, list): This node has a successor that is a leaf with an observed value that is higher than any other observed successor-leaf's value for the nodes from list.
has_leaf_highest_level_value(b, c): This node leads to an uncovered leaf that has the maximum possible value on its level.
has_leaf_lowest_value(b, c, list): This node has a successor that is a leaf with an observed value that is lower than any other observed successor-leaf's value for the nodes from list.
has_leaf_lowest_level_value(b, c): This node leads to an uncovered leaf that has the minimum possible value on its level.
has_most_branches(b, c, list): This node belongs to the largest number of paths among the nodes from list.
has_parent_highest_value(b, c, list): This node has a parent with an observed value that is higher than any other observed parent's value for the nodes from list.
has_parent_highest_level_value(b, c): This node's parent has the maximum possible value on its level.
has_parent_lowest_value(b, c, list): This node has a parent with an observed value that is lower than any other observed parent's value for the nodes from list.
has_parent_lowest_level_value(b, c): This node's parent has the minimum possible value on its level.
has_root_highest_value(b, c, list): This node has an ancestor on level 1 with an observed value that is higher than any other observed 1st-level ancestor's value for the nodes from list.
has_root_highest_level_value(b, c): This node can be accessed through an observed node on level 1 which has the highest value on level 1.
has_root_lowest_value(b, c, list): This node has an ancestor on level 1 with an observed value that is lower than any other observed 1st-level ancestor's value for the nodes from list.
has_root_lowest_level_value(b, c): This node can be accessed through an observed node on level 1 which has the minimum value on level 1.
has_smallest_depth(b, c, list): This node is the shallowest in the tree among the nodes from list.
is_observed(b, c): This node was already clicked and is observed.
is_on_highest_expected_value_path(b, c): This node lies on a path that has the highest expected value.
is_positive_observed(b, c): There is an observed node with a positive value.
is_previous_observed_max(b, c): The previously observed node uncovered the maximum possible value in the MDP.
is_previous_observed_max_leaf(b, c): The previously observed node is a leaf and it uncovered the maximum possible value in the MDP.
is_previous_observed_max_level(b, c): The previously observed node uncovered the maximum possible value on its level.
is_previous_observed_max_nonleaf(b, c): The previously observed node is not a leaf and it uncovered the maximum possible value in the MDP.
is_previous_observed_max_root(b, c): The previously observed node lies on level 1 and it uncovered the maximum possible value in the MDP.
is_previous_observed_min(b, c): The previously observed node uncovered the minimum possible value in the MDP.
is_previous_observed_min_level(b, c): The previously observed node uncovered the minimum possible value on its level.
is_previous_observed_parent(b, c): The previously observed node is the parent of this node.
is_previous_observed_sibling(b, c): The previously observed node is one of the siblings of this node.
is_root(b, c): This node is one of the nodes on level 1.
is_successor_max_val(b, c): One of the successors of this node is uncovered and has the maximum possible value in the MDP.
observed_count(b, c, n): There are at least n observed nodes.
termination_return(b, c, e): The expected reward after stopping now is ≥ e.
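To make the semantics concrete, the sketch below implements two of the simpler predicates under an assumed belief-state encoding; the encoding is illustrative and not necessarily the one used in our code.

```python
# Minimal sketch of two of the predicates above, assuming the belief state b
# is a list whose entry i holds the observed value of node i (None if
# unobserved; index 0 is unused) and c is the candidate computation.
def observed_count(b, c, n):
    """There are at least n observed nodes."""
    return sum(v is not None for v in b[1:]) >= n

def is_positive_observed(b, c):
    """There is an observed node with a positive value."""
    return any(v is not None and v > 0 for v in b[1:])

# Example: nodes 1, 3, and 5 are observed in a seven-node tree.
b = [None, 4, None, -8, None, 2, None, None]
print(observed_count(b, 0, 2), is_positive_observed(b, 0))  # -> True True
```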

Domain Specific Language
The DSL we used for studying Mouselab MDP policies was generated through a probabilistic context-free grammar.
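As an illustration, a probabilistic context-free grammar over such predicates can be encoded and sampled as sketched below; the nonterminals, productions, and probabilities are illustrative assumptions rather than the actual grammar.

```python
# Hedged sketch of a PCFG over predicates; structure and weights are invented.
import random

PCFG = {
    # nonterminal: list of (production, probability)
    "FORMULA": [(["CONJ"], 0.5), (["CONJ", "OR", "FORMULA"], 0.5)],
    "CONJ":    [(["PRED"], 0.5), (["PRED", "AND", "CONJ"], 0.5)],
    "PRED":    [(["is_observed(b,c)"], 0.4),
                (["is_root(b,c)"], 0.3),
                (["NOT", "PRED"], 0.3)],
}

def sample(symbol="FORMULA"):
    """Recursively expand a symbol; symbols without productions are terminals."""
    if symbol not in PCFG:
        return symbol
    productions, weights = zip(*PCFG[symbol])
    production = random.choices(productions, weights=weights)[0]
    return " ".join(sample(s) for s in production)

random.seed(1)
print(sample())  # e.g. "NOT is_root(b,c) AND is_observed(b,c)"
```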

Pseudocode for Binary Interpretation
Binary Interpretation (Binary-Interpret) is an ablation of AI-Interpret that lacks the step of intelligent clustering. It accepts the following parameters: aspiration level α, tolerance δ, number of rollouts L, maximum depth d, mean expert reward m, and patience. Binary-Interpret starts by trying to find a formula that satisfies the input constraints using all of the demonstrations. In case of a failure, it discards half of the demonstrations at random and tries again. In case of a success, it increases the size of the demonstration set by bringing back half of the previously removed demonstrations (if any demonstrations were previously removed). The process continues until the algorithm finds a solution but fails in the next step, or until the difference in size between the currently checked demonstration set and the previous one is equal to or smaller than the patience parameter.
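The following Python sketch mirrors this description; the LPP-based formula search (including the loop over decision-tree depths up to d and the parameters α, δ, L, and m) is abstracted into a `try_interpret` callable, which is a hypothetical stand-in.

```python
# Hedged sketch of Binary-Interpret; the stopping logic follows the text above.
import random

def binary_interpret(demos, try_interpret, patience, seed=None):
    rng = random.Random(seed)
    n = len(demos)
    removed = 0                       # demonstrations currently excluded
    best = None
    while True:
        size = n - removed
        formula = try_interpret(rng.sample(demos, size))
        if formula is not None:
            best = formula
            if removed == 0:
                return best           # the full demonstration set is describable
            change = removed // 2     # bring back half of the removed demonstrations
            removed -= change
        else:
            if best is not None:
                return best           # a success followed by a failure: stop
            change = size // 2        # discard half of the demonstrations at random
            removed += change
        if change <= patience:        # the set size barely changes any more: stop
            return best

# Toy usage: the stub search succeeds whenever at most 48 demonstrations remain.
demos = list(range(128))
stub = lambda subset: "formula" if len(subset) <= 48 else None
print(binary_interpret(demos, stub, patience=8, seed=0))  # -> formula
```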

Formulas Outputted by AI-Interpret and the Baselines
In this section, we present the exact formulas outputted by Binary-Interpret, LPP, and AI-Interpret whose statistics we reported in Table 1 in the paper. Each testing environment is described by the number of trajectories used to establish the dataset of demonstrations and by the type of variance structure of the Mouselab MDP. AND and OR written in capital letters indicate the connectives between separate predicates. We omit the environments for which the algorithms did not output any formula. Note that to construct the flowcharts we used the output of AI-Interpret trained on a dataset with 64 demonstrations.