Automatic discovery and description of human planning strategies

Scientific discovery concerns finding patterns in data and creating insightful hypotheses that explain these patterns. Traditionally, each step of this process required human ingenuity. But the rapid development of computer chips and advances in artificial intelligence (AI) make it increasingly feasible to automate parts of scientific discovery. Understanding human planning is one field in which AI has not yet been utilized: state-of-the-art methods for discovering new planning strategies still rely on manual data analysis. Data about the process of human planning is often used to group similar behaviors together, and researchers then formulate verbal descriptions of the strategies that might underlie those groups of behaviors. In this work, we leverage AI to automate these two steps of scientific discovery. We introduce a method for the automatic discovery and description of human planning strategies from process-tracing data collected with the Mouselab-MDP paradigm. Our method utilizes a new algorithm, called Human-Interpret, that performs imitation learning to describe sequences of planning operations in terms of a procedural formula and then translates that formula into natural language. We test our method on a benchmark data set that researchers had previously scrutinized manually. We find that the automatically obtained descriptions of human planning strategies are about as understandable as human-generated descriptions, and that they cover a substantial proportion of the relevant types of human planning strategies that had been discovered manually. Our method saves scientists' time and effort, as all the reasoning about human planning is done automatically. This might make it feasible to more rapidly scale up the search for yet undiscovered cognitive strategies that people use for planning and decision-making to many new decision environments, populations, tasks, and domains.
Given these results, we believe that the presented work may accelerate scientific discovery in psychology, and due to its generality, extend to problems from other fields.


Introduction
Scientific discovery is a product of scientific inquiry that allows generating and corroborating new insightful hypotheses. In the early days, scientific discovery was seen as a prescriptive method for arriving at new knowledge by gathering information and refining it with new experiments (see Bacon's Novum Organum (Bacon and Fowler, 1878) or Newton's Philosophiae Naturalis Principia Mathematica (Newton, 1687)). A more established philosophical tradition argued that there exists a clear demarcation between the so-called context of discovery and the context of justification (Whewell, 1840; Reichenbach, 1938; Popper, 1935). The former would be the product of a logically unfathomable mental process, the "eureka" moment, which, if anything, could rather be studied by psychology. The latter would concern the proper verification and justification of the discovered theory and would indeed have a formal structure. However, such an account leaves scientists at the mercy of having a eureka moment, and this division was thus challenged. Kuhn (1962), for instance, saw scientific discovery as a complex process of paradigm changes in which an increasing number of findings that disagree with the current paradigm leads to its change. The discovery involves gathering the anomalous data and finding a hypothesis that explains it, for instance through abduction (Hanson, 1965), that is, reasoning from data to the most probable, simplest explanation. Importantly for this work, the division was also challenged by early work on AI for problem solving (Simon and Newell, 1971; Simon, 1973). Therein, discovery was understood as searching the problem space from an initial state representing current knowledge to a desirable goal state. The states transition from one to another through the application of simple operators with a predefined meaning. The process of finding the shortest sequence of applied operators would create rules that a human could later inspect to determine what search heuristic had been used. The found search heuristic would determine the method for scientific discovery. Regardless of whether one agrees with the distinction between discovery and justification, these works showed that some steps involved in scientific discovery can be automated. Further work on such automation moved beyond the philosophical debate, the main motivation becoming to aid scientists in their research (Addis et al., 2016).
One line of research involves the field of computational scientific discovery (Džeroski et al., 2007; Sozou et al., 2017), which models the discovery problem mathematically and advances the discovery of laws or relations using artificial intelligence (AI). BACON (Langley et al., 1987), for instance, was a system for inducing numeric laws from experimental data. Given dependent and independent variables, it created taxonomies of these variables by clustering equally-valued dependent variables and defining new variables as products or ratios of independent variables. Later, Langley et al. (1983) created GLAUBER, which formulated qualitative laws over the categories in the taxonomy, and STAHL, which produced structural theories based on the data leading to anomalous behavior of existing theories (cf. Kuhn (1962)). Addis et al. (2016) suggested representing theories as programs and using genetic search to find the best theories. Other systems such as PHINEAS, COAST or IDS are described at length in a review by Shrager and Langley (1990). An overview by Džeroski et al. (2007) shows an extension of this field to mathematical modeling: the background knowledge is represented as generic processes, the data take the form of time series, and discovery takes place by deriving sets of explanatory differential equations.
Another line of research that used AI to help scientists in their endeavors was automatic experimental design. This approach follows the mathematical foundations of design optimization, in which the expected information gain of an experiment defines its usefulness in testing a hypothesis (Myung et al., 2013). Vincent and Rainforth (2017) used this approach to automate the creation of intertemporal choice experiments and risky choice experiments. Ouyang et al. (2016; 2018) went a step further and created a system to automatically find informative scientific experiments in general. Their method expected a formal experiment space, expressed as a probabilistic program, as input and returned a list of experiments ranked by their expected information gain. The experiment that was highest on that list would provably provide the most information to differentiate between competing hypotheses. Foster et al. (2019) further refined the idea by introducing efficient estimators of the expected information gain.
In this article, we introduce a computational method for assisting scientists in studying human planning. Understanding how people plan is a difficult and time-consuming endeavor. Previous efforts to figure out which strategies and processes people use to make decisions and plan usually entailed manual analysis of the data (Payne et al., 1993; Willemsen and Johnson, 2011; Callaway et al., 2017, 2020). The very act of finding the strategies has thus been left to researchers' ingenuity in discovering the right patterns in the data. Moreover, previous work has largely failed to characterize people's strategies in detail (but see Jain et al., 2021; Agrawal et al., 2020; Peterson et al., 2021). The case for using AI in the quest to understand human planning has been made for one-shot decision-making scenarios (Agrawal et al., 2020; Bhatia and He, 2021; Peterson et al., 2021). Agrawal et al. (2020) fit simple models to the data and optimized them with respect to the regret obtained by comparing the simple models' predictions to those of overly complex models. After the simple models' predictions converged to those of the complex models, they were used as a proper formalization of one-shot human decision strategies. Peterson et al. (2021) employed artificial neural networks to search for theories of one-shot decision-making. First, they created a taxonomy of theories that expressed relations between available decision items (such as gambles, and whether the gambles are dependent or independent) and that covered the entire space of possible decision-making theories. Subsequently, they used neural networks to express those theories by imposing different constraints on the networks. The authors gathered a very large data set of human decisions and determined which theory is the best fit based on the networks' performance. To study human planning behavior, Jain et al. (2021) used a similar approach of first collecting human data and then finding models that best explain it. To address the challenge that the cognitive operations of planning (or decision-making in general) cannot be observed directly, they used the Mouselab-MDP process-tracing paradigm, which externalizes human planning operations as information-gathering operations (Callaway et al., 2017, 2020). Their approach to modeling the observed planning behavior was to first manually go through all the click sequences participants generated in the paradigm and then create a set of planning strategies that the researchers subjectively found the participants to be using. These planning strategies were then used as input to their method, known as the Computational Microscope, which infers which strategy was used by which participant in which trial. Here, we go two steps further than the papers mentioned above. We consider the problem of i) discovering detailed human multi-step decision strategies, that is, planning strategies, and ii) discovering them from data automatically, without creating an initial model, be it a complex black-box model (Agrawal et al., 2020), a taxonomy of decision-making theories (Peterson et al., 2021), or a set of possible planning strategies (Jain et al., 2021).
In more detail, we introduce a new computational method for automatically discovering human planning strategies from the data collected in experiments that externalize planning operations as information gathering¹. We call this algorithm Human-Interpret. At a high level, Human-Interpret tries to imitate behaviors from the data gathered in experiments measuring planning, given candidates for the number of strategies to be found (from x to y strategies) and a vocabulary of logical primitives for constructing descriptions of those strategies. Specifically, it performs probabilistic clustering on the set of planning operations observed in the experiment(s) to obtain n probability distributions that assign probabilities to planning actions conditioned on knowledge states. It then samples demonstrations of planning operations from those clusters, creates a set of predicates, and uses a method called AI-Interpret (Skirzyński et al., 2021) to describe each demonstrated strategy by a logical formula. In the next step, Human-Interpret uses the method from Becker et al. (2021) to increase the interpretability of the output by transforming it into a procedural formula written in a special type of logic. In the penultimate step, Human-Interpret shortens the resulting formulas with a greedy algorithm, and in the last step it translates each formula into natural language using a predefined predicate-to-expression dictionary. To evaluate our approach and, particularly, the Human-Interpret algorithm, we applied it to a planning task in which one has to spend money on clicking nodes in a graph to find the most rewarding path to traverse. Our method generated descriptions of human planning strategies that are on par with the descriptions created through laborious manual analysis by Jain et al. (2021). It automatically discovered roughly the same set of relevant planning strategies and generated straightforward, informative descriptions of those strategies, while saving time and effort. Our approach can easily be extended to other problems, such as multi-alternative risky choice (Lieder et al., 2017; Peterson et al., 2021), by simply running new studies and creating a new set of logical primitives. Given these results, we believe that the presented work has the potential to greatly facilitate scientific discovery in psychology and perhaps revolutionize scientific discovery in general. Scientists can now use the help of AI not only for testing their hypotheses about how people make decisions but also for generating them.
The outline of the article is as follows: We begin by providing background information and summarizing related work pertaining to our method and the benchmark problem in Section 2. In the next section, we describe the whole pipeline of our method for the automatic discovery and description of human planning strategies. Section 4 shows the results of our test on the benchmark problem, where we compare our automated pipeline with a standard manual approach. Lastly, Section 5 discusses opportunities for applying our method and directions for future work.

¹ For a detailed description of this experimental paradigm, see Section 2.2.

Background and related work
In this section, we present the methodology for measuring planning adopted in our studies, and a state-of-the-art approach for discovering human planning strategies against which we compare. Additionally, we describe an important part of our framework for automatic scientific discovery, namely an algorithm that describes planning strategies in an interpretable way using only demonstrations of these strategies.

Definitions
Here, we introduce definitions of important notions used in the next sections. The first definition presents a foundational concept for both our methodology and the benchmark problem we used. Definitions 2-4 connect to the formal definitions of planning strategies that we employed in our research. The last three definitions refer to Human-Interpret: what it means to be an imitation learning algorithm, how Human-Interpret's intermediate output is formalized, and how its final output is formalized.
Definition 1 (Markov Decision Process) A Markov decision process (MDP) is a tuple (S, A, T, R, γ) where S is a set of states; A is a set of actions; T : S × A × S → [0, 1] is a transition function, with T(s, a, s′) giving the probability of reaching state s′ after taking action a in state s; R : S × A × S → ℝ is a reward function; and γ ∈ [0, 1] is a discount factor.

Definition 2 (Policy) A deterministic policy π is a function π : S → A that controls the agent's behavior in an MDP, and a nondeterministic policy π is a function π : S → Prob(A) that defines a probability distribution over the actions in the MDP.
Definition 3 (Expected reward) The reward r_t represents the quality of performing action a_t in state s_t. The cumulative return of a policy is the sum of its discounted rewards obtained in each step of interacting with the MDP, i.e.

G(π) = Σ_{t=0}^{∞} γ^t r_t.

The expected reward J(π) of policy π is equal to

J(π) = E[ Σ_{t=0}^{∞} γ^t r_t ].

Definition 4 (Reinforcement learning) Reinforcement learning (RL) is a class of methods that iterate between trials and evaluation on a given MDP in order to find the optimal policy π* that maximizes the expected reward (Sutton and Barto, 2018).
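As a concrete illustration of Definitions 3 and 4, the discounted return and a Monte Carlo estimate of the expected reward can be computed in a few lines; the reward sequences and discount factor below are arbitrary illustrative values, not data from the paper.

```python
# Discounted cumulative return (Definition 3): G = sum_t gamma^t * r_t.
def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Expected reward J(pi) (Definition 4), estimated by averaging the
# returns of sampled trajectories of a policy.
def expected_reward(reward_trajectories, gamma):
    total = sum(discounted_return(tr, gamma) for tr in reward_trajectories)
    return total / len(reward_trajectories)

# Two illustrative reward trajectories with gamma = 0.9.
returns = expected_reward([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]], gamma=0.9)
```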
Definition 5 (Imitation learning) Imitation learning (IL) is the problem of finding a policy π that mimics the transitions provided in a data set of trajectories D = {(s_i, a_i)}_{i=1}^{M}, where s_i ∈ S, a_i ∈ A (Osa et al., 2018).
Definition 6 (Disjunctive Normal Form) Let f_{i,j}, h : X → {0, 1}, i, j ∈ ℕ, be binary-valued functions (predicates) on domain X. We say that h is in disjunctive normal form (DNF) if

h = ⋁_{i=1}^{n} ⋀_{j=1}^{m_i} f_{i,j},

and ∀i, j_1 ≠ j_2 : f_{i,j_1} ≠ f_{i,j_2}. In other words, h is a disjunction of conjunctions of distinct predicates.
Definition 7 (Linear Temporal Logic) Let P be the set of propositional variables p (variables that can be either true or false), let ¬, ∧, ∨ be the standard logical operators for negation, AND, and OR, respectively, and let X, U, W be modal operators for NEXT, UNTIL, and UNLESS, respectively. Linear temporal logic (LTL) is a logic defined on (potentially infinite) sequences of truth-assignments of propositional variables. LTL formulas are expressions that state which of the variables are true, and when they are true in the sequences. Whenever this agrees with the actual truth-assignments in an input sequence, we say that the formula is true. Formally, for α and β being LTL formulas, we define LTL formulas inductively: ψ is an LTL formula if ψ ∈ P (ψ states that one of the variables is true in the first truth-assignment in the sequence), ψ = ¬α (ψ is the negation of an LTL formula), ψ = α ∨ β (ψ is a disjunction of two LTL formulas), ψ = α ∧ β (ψ is a conjunction of two LTL formulas), ψ = Xα (ψ states that LTL formula α is true starting from the next truth-assignment in the sequence), or ψ = αUβ (ψ states that LTL formula α is true until some truth-assignment in the sequence at which LTL formula β becomes true).
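To make Definition 7 concrete, the semantics of the ¬/∧/∨/X/U fragment on finite sequences can be sketched as a small recursive evaluator. The tuple encoding of formulas and the example predicate names (clicking_leaf, all_leaves_observed) are illustrative choices for this sketch, not the representation used by Human-Interpret.

```python
# Minimal evaluator for a fragment of LTL on finite sequences of
# truth-assignments (dicts from variable name to bool).
def holds(formula, seq, t=0):
    if t >= len(seq):
        return False
    op = formula[0]
    if op == "var":                       # propositional variable
        return seq[t].get(formula[1], False)
    if op == "not":
        return not holds(formula[1], seq, t)
    if op == "and":
        return holds(formula[1], seq, t) and holds(formula[2], seq, t)
    if op == "or":
        return holds(formula[1], seq, t) or holds(formula[2], seq, t)
    if op == "X":                         # NEXT: alpha holds one step later
        return holds(formula[1], seq, t + 1)
    if op == "U":                         # alpha UNTIL beta (strong until)
        for k in range(t, len(seq)):
            if holds(formula[2], seq, k):
                return all(holds(formula[1], seq, j) for j in range(t, k))
        return False
    raise ValueError("unknown operator: %r" % op)

# "click leaves UNTIL all leaves are observed", as an LTL-style formula
phi = ("U", ("var", "clicking_leaf"), ("var", "all_leaves_observed"))
trace = [{"clicking_leaf": True}, {"clicking_leaf": True},
         {"all_leaves_observed": True}]
```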

Methods for measuring how people plan
For years, people's planning was an evanescent process that lacked principled analysis tools. To study planning, psychologists first focused on one-step decisions and most often relied on educated guesses, i.e., self-constructed mathematical models of human behavior that captured the relationship between the inputs and outputs of decision-making (Abelson and Levi, 1985; Westenberg and Koele, 1994; Ford et al., 1989). These methods were not error-proof, because they sometimes fit conflicting models equally well to the same data or were too limited to capture the whole decision-making process (Ford et al., 1989; Riedl et al., 2008). To mitigate these drawbacks, scientists developed process-tracing methods that capture the decision-making process by also analyzing the context in which each step was taken before reaching a decision (Payne et al., 1978; Svenson, 1979). Among the many process-tracing paradigms, such as verbal protocols (e.g., Newell et al. (1972)) or conversational protocols (e.g., Huber et al. (1997)), some were computerized and followed human choices by presenting people with artificial tasks that required them to take a series of actions before making the final decision. One such process-tracing paradigm, Mouselab (Payne et al., 1988), which is used for one-step decision-making tasks, was later adapted for studying planning under the name Mouselab-MDP (Callaway et al., 2017). Since we used this methodology in our framework for automatic scientific discovery for planning, we now introduce the Mouselab and Mouselab-MDP paradigms in more detail. At the end, we also introduce a state-of-the-art approach to scientific discovery for planning that uses Mouselab-MDP. Since it relies on manual analysis of data, we treat it as a baseline throughout this paper.

Mouselab
The Mouselab paradigm (Payne et al., 1988) is one of the first computerized process-tracing paradigms. Its goal is to externalize some aspects of the cognitive processes taking part in decision-making by engaging people in information acquisition that helps them reach a decision. Mouselab was developed to study multi-alternative risky choice. To do so, it presents participants with an initially occluded payoff matrix whose entries can be (temporarily) revealed by clicking on them. The (i, j) entry of the n × m payoff matrix either hides a value for bet i under outcome j or, in the case of the first row, hides the probability of outcome j. The values are expressed in dollars and indicate the payoff a participant can expect to obtain after selecting gamble i if outcome j occurs. The participant's task is to choose one of the available n − 1 gambles based on the information they gathered by revealing the entries of the matrix. The sequence of clicks externalizes a participant's decision-making process by showing which information they considered. Using the Mouselab paradigm, scientists were able to discover that people choose strategies adaptively (Payne et al., 1988). Despite its usefulness, however, Mouselab is inappropriate for studying planning, as selecting one gamble does not affect future gambles. Mouselab-MDP was developed to mitigate this shortcoming.

Mouselab-MDP
The Mouselab-MDP paradigm is a generalization of the Mouselab process-tracing paradigm in which a participant's information-acquisition actions affect the availability of their future choices (Callaway et al., 2017). By doing so, while still offering a way to externalize information acquisition, Mouselab-MDP is suitable for studying human planning. Concretely, the single decision about choosing a gamble is replaced with a Markov decision process (MDP), and the payoff matrix is replaced with a graphical representation of the MDP: a directed graph with initially occluded nodes. Clicking on a node reveals a numerical reward or punishment hidden underneath it (see Figure 1). In the most commonly used setting of the Mouselab-MDP paradigm, a participant's goal is to find the most rewarding path for an agent to traverse from the start node to one of the terminal nodes (nodes without any outgoing connections), while minimizing the number of clicks (each click has an associated cost). This formulation was used in a number of papers that studied human planning strategies (Griffiths et al., 2019; Callaway et al., 2020), led to the creation of cognitive tutors that help people plan better (Lieder et al., 2019, 2020) and of new, scalable and robust algorithms for hierarchical reinforcement learning (Kemtur et al., 2020; Consul et al., 2021), and helped in creating one of the first tools for analyzing human planning in more detail (Jain et al., 2021).
Fig. 1: The Mouselab-MDP environment we used in this paper. The environment is a connected graph of nodes (grey circles) that hide rewards or punishments. The number hidden underneath a node can be uncovered by clicking on it and paying a small fee. The goal is to traverse a path starting at the black node and ending at a node in the highest level such that the sum of rewards along the way, minus the cost of clicking, is as high as possible.

Categorizing planning strategies via the computational microscope
The Mouselab-MDP paradigm gives information about the planning processes used by people (Callaway et al., 2017, 2020). Similarly to research on describing human decision-making, past research employed formal planning models (i.e., strategies) to describe people's planning processes and evaluated those models on data from experiments with human participants (e.g., Botvinick et al., 2009; Huys et al., 2012, 2015; Callaway et al., 2018, 2020). Researchers have come up with a number of planning models people could hypothetically use, including classic models of planning such as Breadth First Search, Depth First Search, models that use satisficing, etc. Recently, Jain et al. (2021) created a new computational pipeline in which a set of manually created planning strategies is given as input to a computational method, called the computational microscope, which fits them to human planning data. In more detail, the method takes process-tracing data generated with the Mouselab-MDP paradigm and categorizes the planning operations from each trial into a sequence of these planning strategies. One of its features is that, to categorize a trial, it incorporates all the information from all the trials before and after it. The authors of the paper manually inspected all the process-tracing data (clicks) that participants produced in each trial. They first identified similar click sequences based on the trials and their features, and then created generalized strategies that replicated these click sequences. This led to the creation of a set of 79 strategies.
One drawback of their approach is the amount of time it took to manually go through all the participants' data, which limits its scalability. Since we aim to develop a method for the automatic interpretation of process-tracing data, which includes the automatic discovery of planning strategies, we treat their set of planning strategies and their approach as a baseline to compare our framework against.

Finding interpretable descriptions of formal planning strategies (policies): AI-Interpret
Part of our framework utilizes a variant of an algorithm developed to interpret reinforcement learning (RL) policies: AI-Interpret (Skirzyński et al., 2021). In contrast to existing methods for interpretability in RL, which generate complex outputs to represent policies, such as decision trees with algebraic constraints (Liu et al., 2018), finite-state automata (Araki et al., 2019), or programs (Verma et al., 2018), AI-Interpret generates simple and shallow disjunctive normal form formulas (DNFs; equivalent to decision trees) that express the strategy in terms of predefined logical predicates. Studies presented in Skirzyński et al. (2021) show that transforming this output into flowcharts, which use natural language instead of the predicates, makes it easily understood by people and can even help improve their planning skills. Moreover, AI-Interpret is an imitation learning method and interprets policies via their demonstrations. For these reasons, we decided to use it to achieve our goal: finding descriptions of human planning strategies using data from process-tracing experiments.
At a high level, AI-Interpret takes four inputs. If S is a set of states and A is a set of actions in a given environment, AI-Interpret accepts a data set of demonstrations (state-action pairs) D = {(s_i, a_i)}_{i=1}^{N}, s_i ∈ S, a_i ∈ A, generated by some policy π; a set of predicates L that act as feature detectors f : S × A → {0, 1}; a maximum depth d of the DNF (decision tree); and a ratio of expected rewards α. AI-Interpret uses D and L to find a DNF formula ψ of depth at most d that induces a policy π_ψ whose expected reward is at least α times π's expected reward. It does so by transforming each state-action pair in D into a vector of predicate valuations and clustering the set of these vectors into coherent groups of behaviors. In each iteration, the clustered vectors are used as positive examples for a DNF learning method, and suboptimal state-action pairs, generated by considering the possible actions in the existing states, serve as negative examples. The DNF learning method is Logical Program Policies (LPP; Silver et al., 2020), which defines a prior distribution over the predicates in L and uses MAP estimation and decision-tree learning to find the most probable DNF formulas that accept the positive examples and reject the negative examples. AI-Interpret uses LPP to find a DNF ψ that achieves the expected reward defined by α and has depth limited by d; in case of failure, it removes the least promising cluster (the one with the smallest weighted posterior) and tries to describe the remaining data.
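The iterative outer loop described above can be sketched in a few lines. The helper functions passed in (cluster_demonstrations, learn_dnf, evaluate_policy) are hypothetical stand-ins for the clustering, LPP, and policy-evaluation components of Skirzyński et al. (2021), not their actual implementations.

```python
# Hedged sketch of AI-Interpret's outer loop: try to find a DNF formula
# covering the demonstration clusters; on failure, drop the least
# promising cluster (smallest weighted posterior) and retry.
def ai_interpret(demos, predicates, max_depth, alpha, target_reward,
                 cluster_demonstrations, learn_dnf, evaluate_policy):
    # Each cluster is assumed to carry its state-action pairs and a weight.
    clusters = cluster_demonstrations(demos, predicates)
    while clusters:
        positives = [pair for c in clusters for pair in c["pairs"]]
        formula = learn_dnf(positives, predicates, max_depth)
        if formula is not None and \
                evaluate_policy(formula) >= alpha * target_reward:
            return formula
        # Remove the cluster with the smallest weighted posterior.
        clusters.remove(min(clusters, key=lambda c: c["weight"]))
    return None  # no interpretable description found
```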

A new method for discovering and describing human planning strategies
We created a method that enables (cognitive) scientists to generate descriptions of human planning strategies using data from sequential decision-making experiments conducted with the Mouselab-MDP paradigm (Callaway et al., 2017, 2020). As illustrated in Figure 2, our method comprises the following steps: 1) collecting and pre-processing process-tracing data on human planning, 2) setting up a vocabulary of logical predicates that can be used to describe people's strategies, and 3) running our new algorithm to automatically discover how many different strategies people used and what those strategies are. Our method automatically describes the discovered strategies as step-by-step procedures.

Fig. 2: Diagram representing our method for discovering and describing human planning strategies. The method assumes first externalizing human planning and gathering human planning operations in an experiment. It then assumes creating a Domain Specific Language (DSL) to describe the environment in which participants made decisions, and running a clustering algorithm on the gathered data to create mathematical representations of planning strategies: policies. Afterwards, the pipeline assumes running the AI-Interpret algorithm (Skirzyński et al., 2021) with the constructed DSL and the found policies, and finally transforming the found formulaic descriptions into procedural instructions expressed in linear temporal logic.

The first three subsections describe each of these three steps in turn. We also detail the algorithm itself. The last subsection reports on the technical details of setting up the initial code base for our pipeline and can be skipped.

Data collection and data preparation
To use our method for discovering human strategies, the data collected in the experiments must meet certain criteria. Firstly, the experiment has to externalize people's planning by using a computerized process-tracing paradigm for planning. In our benchmark studies we used the Mouselab-MDP paradigm (Callaway et al., 2017, 2020), since, to our knowledge, it is the only such paradigm available so far. To study one-shot decision-making, on the other hand, one could use the standard Mouselab paradigm (Payne et al., 1988), ISLab (Cook and Swain, 1993), MouseTrace (Jasper and Shapiro, 2002), or other similar environments. Secondly, the experiments need to gather participants' planning operations, which we define as sequences of state-action pairs generated by each of the participants. The states are defined in terms of the information that the participant has collected about the task environment. The actions are the information-gathering operations that the participant performs to arrive at their plan. Thirdly, the gathered planning operations should be saved in a CSV file that additionally contains labels for the experimental block they came from (e.g., 'train' or 'test') and labels corresponding to the participant id. In our case this data was separated into distinct CSVs, and custom Mouselab-MDP functions extracted the states, actions, blocks and ids from them into a new (Python) object.
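A minimal loader for data in this layout might look as follows. The column names (participant_id, block, state, action) and the JSON encoding of states are assumptions for illustration, not the exact schema used in our experiments.

```python
import csv
import json
from collections import defaultdict

def load_planning_operations(path, block="test"):
    """Group (state, action) pairs into one trajectory per participant.

    Assumed CSV columns: participant_id, block, state (JSON-encoded
    revealed-node information), action (clicked node index).
    """
    trajectories = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["block"] != block:
                continue  # keep only the requested experimental block
            state = json.loads(row["state"])
            action = int(row["action"])
            trajectories[row["participant_id"]].append((state, action))
    return dict(trajectories)
```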

Creating the Domain Specific Language
Humans build sentences by using words (and build words by using phonemes). In order to automatically build descriptions of the strategies used by people in planning process-tracing experiments, we likewise need a vocabulary of primitives. Due to the algorithmic setup of the method introduced in this section, the required set of primitives comprises logical predicates. Following Skirzyński et al. (2021), whose method is an important part of our pipeline, the predicates serve as feature detectors and are formally defined as mappings from the set of state-action pairs to booleans, that is, f : S × A → {0, 1}. In practice, the predicates describe the state s ∈ S, the action a ∈ A, or a particular characteristic that action a has in state s. For instance, the predicate is_observed(s,a) might denote that node number a in the Mouselab-MDP paradigm has already been clicked and its value is revealed in state s. This predicate could then be used to define a very simple DNF that allows clicking all the nodes that have not yet been clicked, i.e., not(is_observed(s,a)). The second step in our pipeline thus consists of creating a set of predicates that describe the process-tracing environment. In the remainder of the text we will call that set of predicates the Domain Specific Language (DSL).
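To illustrate, a toy DSL for the Mouselab-MDP setting could be written as a handful of Python predicates. The state encoding (a dict mapping node indices to revealed values, with None for occluded nodes) and the example leaf indices are illustrative assumptions, not the representation used in our code base.

```python
# Each predicate maps a (state, action) pair to a boolean, mirroring
# the feature-detector signature f : S x A -> {0, 1}.
def is_observed(state, action):
    """True iff the node targeted by the action has already been clicked."""
    return state.get(action) is not None

def is_leaf(state, action, leaves=frozenset({3, 4, 5})):
    """True iff the targeted node is terminal (example node indices)."""
    return action in leaves

def is_positive_observed(state, action):
    """True iff the targeted node is revealed and hides a reward > 0."""
    return is_observed(state, action) and state[action] > 0

# Predicates compose into DNF clauses, e.g. "click unobserved leaves":
def click_unobserved_leaf(state, action):
    return (not is_observed(state, action)) and is_leaf(state, action)
```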

Obtaining interpretable descriptions of human planning strategies via Human-Interpret
The third and final step of our method is to use the algorithm for discovering interpretable descriptions of people's planning strategies. This algorithm finds the set of planning strategies people used based on the data from process-tracing experiments and then generates interpretable descriptions of those strategies in the form of procedural formulas. Since the AI-Interpret method of Skirzyński et al. (2021) plays a vital role in our pipeline, we call this new algorithm Human-Interpret.

Big picture
In contrast to AI-Interpret (Skirzyński et al., 2021), which accepts demonstrations of RL policies as input, Human-Interpret uses human data to generate demonstrations on its own before processing them. In order to do that, Human-Interpret first extracts a representative subset of the sequences of information-gathering actions (clicks) along with the knowledge states in which they were performed (see the sequence of planning operations ((s_i, a_i))_{i=1}^N in Figure 2); the size of this subset (K) is an input to the algorithm. This subset is selected by using a clustering technique that represents clusters as policies. The policies obtained through clustering are formalizations of planning strategies that initially still lack interpretability. To make those policies understandable to humans, the Human-Interpret algorithm utilizes AI-Interpret. AI-Interpret requires demonstrations of what the policies do. Those demonstrations are generated by executing discretized versions of the policies. AI-Interpret describes those demonstrations using the previously created interpretable DSL. The outputs of the AI-Interpret algorithm are K DNF formulas (represented by the flowcharts in Figure 2). In the last step, Human-Interpret enhances the interpretability of these formulas using the method from Becker et al. (2021) and turns the formulas into procedural descriptions mimicking the linear temporal logic formalism (see the output in Figure 2). The pseudocode for Human-Interpret can be found in the Algorithm 1 box; the explanation of its parameters is shown in Table 1.
Algorithm 1: Pseudocode for the Human-Interpret algorithm.

Clustering planning operations into planning strategies
Human-Interpret begins the process by extracting planning strategies from the sequences of planning operations gathered in the process-tracing experiment. Let D = {τ_ji} denote the set of planning operations belonging to M participants, where τ_ji = ((s_{ji,l}, a_{ji,l}))_{l=1}^{L_ji} is the i-th sequence of planning operations generated by participant j, and L_ji is its length. Human-Interpret utilizes the Expectation Maximization (EM) algorithm (Moon, 1996) to fit a probabilistic clustering model to the state-action sequences in D and extract k planning strategies (the clusters π_1, π_2, ..., π_k). Each planning strategy corresponds to a softmax policy of the form given in Equation 2.
Each softmax policy π_i is represented by weights w_i assigned to P different features f = [φ_1, φ_2, ..., φ_P] of the state-action pair, where φ_i : S × A → X_φ. These features were partially derived from the DSL and partially hand-designed. There were 19 of them and they can be found in Appendix A.4. The aim of the EM algorithm is to find the planning strategies by clustering the click sequences in D into k clusters, with each cluster represented by a softmax policy π_i. It does this by optimizing the set of weights W = (w_1, w_2, ..., w_k) that maximizes the total likelihood M of the click sequences under all the clusters. M is described in Equation 3.
After the policies represented by the weights W are obtained, they are discretized to form new uniform policies π'_i, as described in Equation 4, which assign uniform probability to the actions that have the highest probability according to π_i, that is, to optimal actions. The policies π'_i are then used to create a data set of demonstrations D' = {(s_i, a_i)}_{i=1}^L, which contains L planning operations generated through some fixed number of rollouts.
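The two operations above can be sketched as follows. This is a minimal illustration in the spirit of Equations 2 and 4, not the paper's implementation; the feature values and action sets are toy assumptions.

```python
import math

def softmax_policy(weights, features, state, actions):
    """pi(a|s) proportional to exp(w . f(s, a)) over the available actions."""
    scores = {a: math.exp(sum(w * phi(state, a)
                              for w, phi in zip(weights, features)))
              for a in actions}
    z = sum(scores.values())
    return {a: score / z for a, score in scores.items()}

def discretize(probs):
    """Uniform distribution over the most probable (i.e. optimal) actions;
    all other actions get probability zero."""
    best = max(probs.values())
    optimal = [a for a, p in probs.items() if math.isclose(p, best)]
    return {a: (1 / len(optimal) if a in optimal else 0.0) for a in probs}
```

Rolling out the discretized policy then yields the demonstration set that is handed to AI-Interpret.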

Finding formulaic descriptions of planning strategies
After computing the discretized policies π'_i and generating the data set of demonstrations D', Human-Interpret essentially runs AI-Interpret (Skirzyński et al., 2021). The only modification to the AI-Interpret algorithm introduced by Human-Interpret relates to the fact that the input demonstrations no longer represent the optimal policy, but some, often imperfect, policy mimicking humans, π_i (henceforth called the interpreted policy). Because of that, requiring the policies induced by candidate formulas to achieve at least a fraction α of the expected reward of the interpreted policy (see the Background section) leads to errors. For instance, in a situation where the interpreted policy achieves a reward of R and the optimal policy achieves a reward of 100R, a formula that induces a policy with reward 50R meets the criterion defined by α but is a poor approximation to π_i. Because of that, Human-Interpret uses divergence instead of the expected-reward ratio. If π_f is the policy induced by a formula f found by AI-Interpret and π_opt is the optimal policy for the studied environment (here, the Mouselab MDP), Human-Interpret computes the divergence of f as the ratio between the difference in rewards for the interpreted policy π_i and π_f, and the expected reward of the optimal policy, i.e., div(π_f) = |E[R(π_i)] − E[R(π_f)]| / E[R(π_opt)]. Consequently, Human-Interpret searches for solutions whose size is limited by d and whose divergence is at most α (see the Background section). Note that introducing this modification requires the modeler, or the algorithm, to compute the optimal policy for the environment, or at least to know its approximation.
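The divergence criterion can be sketched in a few lines. Expected rewards would normally be estimated from rollouts; plain numbers stand in for them here, and taking the absolute difference (so that a formula far above the interpreted policy's reward is also rejected) is our reading of the paper's 50R example.

```python
def divergence(reward_interpreted, reward_formula, reward_optimal):
    """div(pi_f) = |E[R(pi_i)] - E[R(pi_f)]| / E[R(pi_opt)]."""
    return abs(reward_interpreted - reward_formula) / reward_optimal

def acceptable(reward_interpreted, reward_formula, reward_optimal, alpha):
    """Accept a candidate formula only if its divergence is at most alpha."""
    return divergence(reward_interpreted, reward_formula, reward_optimal) <= alpha
```

With the text's example (interpreted reward R, formula reward 50R, optimal reward 100R), the divergence is 0.49, so a small α correctly rejects the formula even though it beats the interpreted policy's reward.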

Extracting procedural descriptions from logical formulas
The output produced by the modified AI-Interpret algorithm defined above is a DNF formula f*. Following the finding that procedural descriptions are easier for people to grasp than flowcharts (Becker et al., 2021), Human-Interpret uses the method presented in Becker et al. (2021) to transform f* into a logical expression written in linear temporal logic. The DNF2LTL algorithm described in Becker et al. (2021) produces such an expression by separating the DNF into conjunctions and finding the dynamics of the changes in their truth valuations using the initial set of demonstrations inputted to AI-Interpret. The output, which we will call a procedural formula, separates the conjunctions with NEXT commands and instructs the reader to follow each conjunction until some condition occurs, unless another condition occurs. The conditions are chosen from among the predicates, or 2-element disjunctions of predicates, from the DSL introduced in Section 3.2, or simply read (for the UNTIL condition) "until it applies", which denotes a special logical operator. Since the output formula might be overly complex after this process, in the next step the algorithm prunes some of the predicates appearing in the conjunctions. Concretely, predicates are greedily removed one by one so as to increase the probability of people's planning operations under the strategy described by the shortened formula. More details regarding the algorithm can be found in Appendix A.5.
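The greedy pruning step can be sketched as follows. The `likelihood` argument is a placeholder for the actual scoring function that evaluates people's planning operations under a (possibly shortened) conjunction; the whole routine is an illustration of the greedy scheme, not the released implementation.

```python
def prune(predicates, operations, likelihood):
    """Greedily drop predicates one at a time while doing so increases the
    likelihood of the planning operations under the shortened formula."""
    current = list(predicates)
    improved = True
    while improved and len(current) > 1:
        improved = False
        base = likelihood(current, operations)
        # Try dropping each predicate in turn; keep the single best candidate.
        candidates = [[p for p in current if p != q] for q in current]
        best = max(candidates, key=lambda c: likelihood(c, operations))
        if likelihood(best, operations) > base:
            current, improved = best, True
    return current
```

Because removals are accepted only when they strictly improve the likelihood, the procedure terminates and never empties the formula.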

Translating to natural language
Once the procedural formulas are generated, it is possible to obtain fully understandable descriptions of human planning strategies by transforming the predicates and the operators appearing in the formulas into natural language. This is an optional step that we have accomplished for the Mouselab-MDP paradigm. Our procedural formulas are expressed in natural language by using a pre-defined predicate-to-expression dictionary.
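A minimal sketch of such a dictionary-based translator is shown below. The entries are examples in the spirit of translation.py, not its actual contents, and the rendering of a single conjunction stands in for the full procedural-formula translator.

```python
# Hypothetical predicate-to-expression entries (not the real translation.py).
TRANSLATIONS = {
    "is_leaf(s, a)": "it is a leaf",
    "not(is_observed(s, a))": "it has not been observed yet",
}

def translate(conjunction):
    """Render one conjunction of predicates as a natural-language instruction."""
    conditions = [TRANSLATIONS[p] for p in conjunction]
    return ("Click on the destinations satisfying all of the following "
            "conditions: " + "; ".join(conditions) + ".")
```

The full translator additionally handles the NEXT and UNTIL operators and breaks complex predicates into several conditions, as described in Section 4.1.2.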

Technical details regarding using our method
Here, we present technical details connected to installing and using our method for discovering and describing human planning strategies. We equipped the initial code base, written in Python 3.6, with: 1) data from 4 planning experiments run in different versions of the Mouselab-MDP paradigm, 2) a Domain Specific Language (DSL) of logical primitives used to generate procedural formulas, and 3) a dictionary of predicate-to-expression entries for transforming a formula into natural language. Each of these elements is either a parameter or a hyperparameter of Human-Interpret. A description of the experiments can be found in (Jain et al., 2021), whereas the DSL is detailed in Appendix A.2 and contained in one of the files in the code base, as is the dictionary. Thanks to the initial values for those parameters, it is possible to use our method without performing prior research. Moreover, one may extend our research by slightly modifying the DSL or by running Human-Interpret on a different data set. The steps involved in setting up the whole method are as follows:

1. Download the data needed in the pipeline and the source code for Human-Interpret by cloning the appropriate GitHub repository using the command:

   git clone https://github.com/RationalityEnhancement/InterpretableHumanPlanning.git

   The repository includes four data sets that are contained in the folder data/human. Refer to (Jain et al., 2021) for a detailed description of the experiments they come from.

2. Access the root directory of the downloaded source code and install the needed Python dependencies:

   pip3 install -r requirements.txt

3. Run Human-Interpret on any of the available data sets by typing:

   python3 pipeline.py --experiment <name> --max_num_strategies <max> --num_participants <num_p> --expert_reward <exp_rew> --num_demos <demos> --begin <b>

   The parameter experiment corresponds to the name of the experiment that resulted in one of the four data sets. As written in Table 1, max_num_strategies quantifies the maximum number of strategies that could exist in the data set and should be described; expert_reward defines the maximum reward obtainable in the Mouselab MDP defining each of the experiments; num_participants states that the data of the first <num_p> participants should be extracted from the data set; num_demos controls how many strategy demonstrations to use in Human-Interpret; begin controls which model to begin with (how many clusters to consider in the first model, to then incrementally increase that number, thus defining consecutive models). The available names, the number of all tested participants, and the corresponding expert rewards are provided in the README file included in the source code. The output of this command is saved in the interprets_procedure folder in a text file named according to the following structure: strategies_<name>_<max>_<num_p>_<demos>. The files contain procedural formulas describing the clusters, their natural language descriptions, and a set of statistics associated with those descriptions (e.g., how well each fits the experimental data, how big it is, etc.).

4. To reproduce the exact set of strategies from this paper, set the parameters like this:

   python3 pipeline.py --experiment v1.0 --max_num_strategies 17 --begin 17 --num_participants 0 --expert_reward 39.97 --num_demos 128

Information regarding other parameters available for tuning the Human-Interpret algorithm can be found in Table 1 and in the pipeline.py file. The set of logical primitives serving as our DSL may be accessed by navigating to RL2DT/PLP/DSL.py. Finally, the dictionary is included in the file translation.py. To apply our method to different problems, consult pipeline.py, README.md, Section 3.1, and Section 3.2.

Evaluating our method for discovering human planning strategies
In this section, we evaluate our method on data from an experiment that used the Mouselab-MDP paradigm to externalize planning. We show the strategies that were automatically discovered and list their descriptive statistics. We also compare the output of our method to the strategies that Jain et al. (2021) discovered through a laborious manual inspection of the same data set. The first subsection focuses on each of the three steps of our method: it describes the planning experiment, the vocabulary of logical primitives, and the parameters for Human-Interpret. The second subsection describes the strategies discovered by Human-Interpret and compares them to the strategies discovered manually by Jain et al. (2021).

Planning experiment
As our benchmark problem we used a process-tracing experiment on human planning conducted according to the Mouselab-MDP paradigm (see Section 2.2.2), namely the first experiment presented in Jain et al. (2021). This experiment (which we will refer to as the increasing variance experiment) used the Mouselab-MDP environment shown in Figure 1, where the rewards hidden underneath the nodes were different in every trial but were always drawn from the same probability distribution. The nodes closest to the spider's starting position at the center of the web had rewards sampled uniformly from the set {−4, −2, 2, 4}. Nodes one step further away from the spider's starting position had rewards sampled uniformly from the set {−8, −4, 4, 8}. Finally, the nodes furthest away from the starting position of the spider harbored rewards sampled uniformly from the set {−48, −24, 24, 48}. The fee for clicking a node to uncover its reward was 1. The 180 participants who took part in this experiment were divided into 3 groups that differed in what kind of feedback was provided during the trials. The control group received no feedback. The second group received feedback on their first move. The third group received feedback on every click they made, designed to teach them the planning strategy that is optimal for the task environment. There were always 2 blocks: a training block with 20 trials and a test block with 10 trials. As stated in the previous sections, the experiment utilized the Mouselab-MDP process-tracing paradigm and operationalized people's planning strategies in terms of the clicks (planning operations) they performed.
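The increasing-variance reward structure described above can be sketched directly from the stated sets; the depth indices (1 = adjacent to the start, 3 = outermost) are our own labeling convention.

```python
import random

# Possible rewards by distance from the spider's starting position,
# exactly as listed for the increasing variance experiment.
REWARDS_BY_DEPTH = {
    1: [-4, -2, 2, 4],      # nodes adjacent to the start
    2: [-8, -4, 4, 8],      # one step further away
    3: [-48, -24, 24, 48],  # outermost nodes
}

def sample_reward(depth, rng=random):
    """Draw a node's hidden reward uniformly from its depth's value set."""
    return rng.choice(REWARDS_BY_DEPTH[depth])
```

Because the variance of the reward distribution grows with depth, the optimal strategy in this environment inspects the high-stakes outermost nodes first.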

Domain Specific Language (DSL) and translation dictionary
We adopted the DSL of predicates from our work on AI-Interpret, which was also conducted on Mouselab-MDP environments (Skirzyński et al., 2021). In total, the DSL consisted of over 14,000 predicates generated according to a hand-made context-free grammar. The predicates were defined on state-action pairs, where states were represented as Python objects capturing the uncovered and covered rewards in the Mouselab-MDP, and each action denoted the ID of the node to be clicked. The predicates described the current state of the Mouselab-MDP or the actions available in that state. The state is described in terms of the clicked nodes, the termination reward, and other properties. The actions are described in graph-theoretic terms, such as the depth of the clicked node, whether it is a leaf node, whether its parents or children have been observed, and so on. The DSL included two crucial second-order predicates: the among predicate asserts that a certain condition (given by a first-order predicate) holds among a set of nodes defined by another first-order predicate, and the all predicate asserts that all of the nodes satisfying a certain condition also satisfy another condition. Detailed descriptions of the predicates in the DSL we used are available in Appendix A.2.
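One possible reading of the two second-order predicates can be sketched as follows. The simplified state representation and the precise semantics (in particular, how among relates the clicked node to the selected set) are our assumptions, not the DSL's actual definitions in RL2DT/PLP/DSL.py.

```python
def among(state, action, condition, selector):
    """The clicked node lies in the set picked out by `selector` and
    satisfies `condition` there (both are first-order predicates)."""
    candidates = [n for n in state["nodes"] if selector(state, n)]
    return action in candidates and condition(state, action)

def all_satisfy(state, condition, implied):
    """Every node satisfying `condition` also satisfies `implied`."""
    return all(implied(state, n)
               for n in state["nodes"] if condition(state, n))
```

Second-order predicates like these let a short formula express relational properties, e.g. "all observed nodes are leaves", that would otherwise need many first-order clauses.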
The dictionary we used for translating the resulting procedural formulas is an adapted version of the dictionary from Becker et al. (2021). In our dictionary, we changed the natural language translations of most predicates to use graph-theoretic jargon (such as leaves, roots, etc.). Moreover, our translations always begin with "Click on the destinations satisfying all of the following conditions:" if there are any non-negated predicates in the formula, and list the non-negated predicates as conditions. Another difference with the original dictionary is that more complex predicates (such as those which include other predicates as their arguments) are broken down into 2 or more conditions, whereas in the original dictionary each predicate has its own unique translation. More details on how the translation was created can be found in our project's repository in the translation.py file.

Parameters for the Human-Interpret algorithm
To run Human-Interpret, we used the default parameters for the AI-Interpret algorithm, the DSL described in the previous section, the data from all of the participants in the experiment described in Section 4.1.1 (that is, 180), and data from both blocks of the experiment (i.e., the training block and the test block). We ran Human-Interpret with 1-20 clusters and then performed model-order selection according to Bayesian inference with a uniform prior on the number of clusters (Kass and Raftery, 1995). Each cluster was defined as a mixture of two policies. The first policy was induced by the procedural description constructed for the cluster. The second policy served as an error model. These two policies assigned uniform probability to the actions allowed and disallowed by the procedural description constructed for the cluster, respectively. The weights assigned to both policies were cluster-dependent free parameters (i.e., ε_i ∈ [0, 1]). Mathematically, the clustering model P(τ) for a sequence of planning operations τ = ((s_i, a_i))_{i=1}^T, for K clusters represented by K policies π_1, ..., π_K, and for K error models π'_1, ..., π'_K, took the following form:

P(τ) = Σ_{k=1}^{K} p_k Π_{i=1}^{T} [ ε_k π_k(a_i | s_i) + (1 − ε_k) π'_k(a_i | s_i) ],

where p_k is the mixture weight of cluster k. We performed model selection among the clustering models for which Human-Interpret was able to produce a description for all clusters. Our rationale for doing so was that models that do not describe all the clusters are not useful for the purpose of understanding human planning. After imposing this constraint, we chose the model under which the data set had the greatest marginal likelihood, which we approximated for each model using the Bayesian Information Criterion (BIC) (Konishi and Kitagawa, 2008). We discovered that the model with 10 clusters achieved the highest marginal likelihood (the marginal likelihoods computed for the models that described all the clusters can be found in Table 2). Corroborating this result, the chosen number of clusters was identical when we used the Akaike Information Criterion (Vrieze, 2012) as our way of selecting the best model. The DSL we used and the predicates we used for the DNF2LTL algorithm can be found in Appendices A.2 and A.3. All of the parameters are listed in Table A1 in Appendix A.1. The more elusive parameters, such as interpret_size or num_trajs, were selected based on the simulations and interpretability experiments presented in Skirzyński et al. (2021).

Table 2: Logarithm of the marginal likelihood, BIC and AIC of the clustering models analyzed by Human-Interpret.
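The model-order selection step can be sketched as follows: compute the BIC for each admissible model (one that described all of its clusters) and keep the best one. The standard BIC formula is used; the model records below are toy inputs, not the paper's actual fits.

```python
import math

def bic(log_likelihood, num_params, num_observations):
    """Bayesian Information Criterion; lower is better."""
    return num_params * math.log(num_observations) - 2 * log_likelihood

def select_model(models):
    """models: dicts with keys 'k', 'loglik', 'params', 'n', 'all_described'.
    Only models whose every cluster received a description are admissible."""
    admissible = [m for m in models if m["all_described"]]
    return min(admissible, key=lambda m: bic(m["loglik"], m["params"], m["n"]))
```

In the paper this procedure, applied to models with 1-20 clusters, selected the 10-cluster model, and the AIC agreed with that choice.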

Evaluating and comparing automatically discovered strategies to strategies discovered through manual inspection
In this section we show the discovered descriptions of human planning strategies, list their statistics, and compare them to the strategies that Jain et al. (2021) found through manual inspection.
With the selected clustering model, Human-Interpret found descriptions for 10 clusters. For some of those clusters the automatically generated descriptions were equivalent to those of one or more other clusters. This reduced the number of discovered strategies to 7 unique descriptions. The discovered strategies are described in Table 4. According to the manual analysis of Jain et al. (2021), our automated method discovered 3 of the 4 most common strategies (i.e., strategies used in at least 3% of all trials). These strategies account for 53.4% of all planning operations. In total, our method rediscovered 6 of the strategies discovered by Jain et al. (2021), and those strategies jointly account for 63.2% of people's planning operations. Other strategies discovered by Jain et al. (2021) were assimilated into one of the 7 unique strategies discovered by our method. These strategies were also diverse. Namely, our 7 unique strategies represent 7 out of the 13 strategy types identified by Jain et al. (2021), and hence cover over 50% of the existing variability in planning strategies. Finally, our automatic method resulted in a set of strategies that describe people's planning operations 3.03 times better than chance. In other words, the probability of a planning operation under the strategy it was ascribed to is on average 3.03 times as high as its probability under a random model that always assigns the same probability to all possible actions (including termination). In the case of the manual inspection performed by Jain et al. (2021), the results are very similar: the average improvement is 3.6 times over the random model. This means that the automatically created descriptions are almost as accurate as the manually found descriptions, despite being more coarse. A summary of the comparison between the performance of our automatic strategy discovery and description method and the manual approach by Jain et al. (2021) can be found in Table 3.
In the previously mentioned Table 4 we also report descriptive statistics about the complexity of the automatically generated descriptions, the frequencies of the discovered strategies, how well they corresponded to the EM clusters, and how well they explain the sequences of planning operations they are meant to describe. These statistics suggest that our automatic method generates reasonable solutions of a very promising quality, especially considering that our method discovered those strategies without any labelled training examples or human feedback. On average, the discovered descriptions agreed with the softmax models of the clusters on 77% of the planning operations. That quantity, which we call the formula-cluster fit (FCF), is defined as the average of two proportions. The first is the proportion of the planning operations generated according to the inferred description that agreed with the choices of the corresponding softmax model. The second is the equivalent proportion obtained by generating the demonstrations according to the softmax models and then evaluating them according to the descriptions. In both cases we performed 100,000 rollouts. Further, the average likelihood of the planning operations within the clusters reached 76% of the likelihood that would have been achieved if all planning operations perfectly followed the descriptions. We call that proportion the fit per operation (FPO). The quality of the individual cluster descriptions, as measured in terms of the FPO, is depicted in Figure 3. To globally visualize the quality of our method, we estimated the distribution of FPO values across all clusters. This revealed that almost 85% of clusters would achieve an FPO of over 60% (see Figure 4).
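The two-way agreement underlying the FCF can be sketched as below. Short action lists stand in for the 100,000 rollouts used in the paper; the data are toy values, not the reported measurements.

```python
def agreement(choices_a, choices_b):
    """Proportion of planning operations on which two policies agree."""
    matches = sum(a == b for a, b in zip(choices_a, choices_b))
    return matches / len(choices_a)

def formula_cluster_fit(formula_demos, cluster_on_formula,
                        cluster_demos, formula_on_cluster):
    """FCF: average of formula-to-cluster and cluster-to-formula agreement."""
    return 0.5 * (agreement(formula_demos, cluster_on_formula)
                  + agreement(cluster_demos, formula_on_cluster))
```

Averaging the two directions keeps the measure symmetric: a description cannot score well by merely covering a fraction of the cluster's behavior, nor by permitting actions the cluster never takes.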
The runtime of our method, that is, the number of days for which the code ran, was 45 days. In contrast, it took Jain et al. (2021) about 120 days to finish their manual analysis. This means that our method could have accelerated the process of strategy discovery by 75 days compared to manual inspection. This suggests that our method can be used to speed up research on the strategies of human planning and decision-making.

Table 3: Comparison between our automatic method for finding and describing human planning strategies and the manual approach by Jain et al. (2021). The success rate is the proportion of sequences of planning operations that the method was able to describe. Runtime is the number of days it took to generate the result. The likelihood per click is the average likelihood per planning operation. Times better than random quantifies the improvement in the average likelihood per operation over the random model, which assigns equal probability to all actions that are possible in a given step.

Table 4: The results of applying our computational method to the benchmark problem. Human-Interpret with 10 clusters found 7 unique strategies, listed along with their ID, their automatically generated description, the summary we created by hand, and the name (or numbers) of the corresponding strategy (or strategies) discovered manually by Jain et al. (2021). FR denotes the frequency of the strategy; FCF (fit cluster-formula) averages two proportions, formula demonstrations agreeing with the softmax clusters and vice versa, measured using 100,000 demonstrations; FON (fit optimal-non-optimal) quantifies how often people's planning operations in the cluster agreed with the description; FPO (fit per operation) is the ratio between the average likelihood per planning operation belonging to the cluster and the average likelihood per planning operation for the (policy induced by the) cluster's description in general; C, which stands for complexity, is the number of individual predicates in the description; N is the number of clusters with the given description. Frequency (FR) is the only statistic measured for the manually found strategies.
Since averaging the measurements over all clusters does not take into account the importance of the clusters, we decided to perform the same computations using cluster-size-dependent weighted averages. As larger clusters represent the most often used strategies, and these are the strategies we are predominantly interested in from a psychological standpoint, weighted averages capture the measured statistics with higher accuracy. After weighting our statistics, we found that the descriptions achieved a formula-cluster fit (FCF) of 89% and a fit per planning operation (FPO) of 82%. The weighted average complexity of the descriptions was 7 predicates. To present a clearer picture, we excluded the unique cluster describing the random policy and then computed the weighted statistics again. This cluster contains around 12% of all sequences of planning operations and trivially achieves an FPO of 1. Having removed that cluster, our descriptions obtained an FCF of 93% and an FPO of 79%. The first number marks an increase in comparison to the previous fit between the descriptions and the softmax models, since the random policy advocated by the removed cluster was a bad fit for a substantial part of the planning operations. In the second case, the new fit per operation marks a slight decrease in comparison to the previous fit, but still amounts to a reasonable quantity. The weighted average complexity of the descriptions increased to 8 predicates.

Fig. 3: Plot showing the quality of the clusters found by our pipeline, measured as the fit per operation (FPO), depending on their size. Clusters are labeled on the x-axis by their unique cluster ID (see Table 4 for reference). The width of each bar corresponds to the size of the cluster. The height of each bar corresponds to the FPO measure. The red line indicates the average FPO across all clusters.

General discussion
The main contribution of our research is automating the process of scientific discovery in the area of human planning. By using our method, scientists are no longer left at the mercy of their own ingenuity to notice and correctly interpret the right patterns in human behavior; instead they can rely on computational methods that do so reliably. Concretely, we developed the first method for the automatic discovery and description of human planning strategies. Our method's pipeline comprises 3 steps. The first step is to run an experiment that externalizes human planning operations. The second step is to create a Domain Specific Language (DSL) of logical predicates that describe the task environment and the planning operations. Finally, the third step is to run the generative imitation learning algorithm that we created and present in this paper, called Human-Interpret. Human-Interpret discovers the strategies externalized in the experiment by creating generative softmax models (the generative step) and describes them by procedural rules formulated in the DSL by imitating their rollouts (the imitation learning step).

Fig. 4: Density of the fit per operation measure for clusters discovered by our pipeline. 93.5% of the clusters achieve a fit per operation above 0.6.

Advantages
The tests we ran on a benchmark planning problem revealed that the accuracy of our method is akin to that of a manual human analysis. Firstly, our automated approach found the majority of all frequently used strategies (3 out of 4), if we assume that a strategy appears "frequently" if it is used in at least 3% of the trials according to the manual analysis (Jain et al., 2021). Secondly, almost all of the unique automatically discovered strategies (6 out of 7) had either an exact manually discovered counterpart or a class of manually found counterparts (see Table 4). These two properties indicate that our method discovered virtually the same strategies as scientists did, without any human supervision. Thirdly, the average fit of the automatically found strategies with respect to human planning operations was moderately high (an FPO of 76%), meaning the descriptions were accurate representations of the experimental data. Fourthly, these descriptions were clear and understandable (see Table 4), demonstrating the effectiveness of our method. Last but not least, applying our computational method was faster than manual human analysis. It took us 45 days to obtain the final output via our pipeline, whereas studying human planning without the aid of AI took about 120 days.
In light of our results, the main benefit of using the introduced pipeline is accelerating scientific discovery in research on decision-making and planning. Rather than trying to manually discover one strategy at a time, our method makes it possible to run many large experiments with many different environments and discover all of the strategies people use across those environments at once. Further, our method is more objective than the subjective and potentially biased person-centered manual approach. Strategies and their descriptions are assigned based on mathematical likelihood and are provably optimal under such probabilistic criteria, whereas people could introduce strategies based on their preconceptions and imperfect knowledge. Employing our pipeline in research could save precious time. A scientist interested in human planning could invest his or her resources in other components of their research while waiting for the computations to finish, unlike in the manual approach, where he or she needs to perform all the steps alone. Finally, our method sometimes discovers strategies that scientists might have otherwise overlooked; one example thereof is Strategy 3 in Table 4, which does not have a counterpart in the broad list of strategies found manually.

Limitations
The above, however, brings us to the limitations of our method for the automatic discovery and description of human planning strategies. The softmax clusters did not always contain planning operations that fit well together. For instance, Strategy 3 had a poor fit score of 0.4 with respect to people's planning operations (FPO). Similarly, the No planning strategy (Strategy 2 in Table 4) had an FPO score of 0.61 despite a perfect 1.0 fit with respect to the cluster's softmax model (FCF). This discrepancy occurred because the softmax model of the cluster for Strategy 2 performed zero planning even though the cluster included a substantial number of human planning operations that gathered additional information. Moreover, our method discovered many random planning clusters (3 out of 10). A few of those were not random before pruning. Since pruning optimized for the likelihood of the operations under the description, this shows that the planning operations clustered together by the generative part of Human-Interpret were too diverse. Furthermore, our DSL was incapable of capturing all the strategies. One of the often used planning strategies that our method did not discover is the "Consecutive second maximum strategy" (Jain et al., 2021), which was used in around 6% of the trials of the planning experiment. This strategy observes the final outcomes in a random order but stops planning after it encounters two outcomes with the second-highest reward in a row. It was most likely not discovered by our method due to the lack of a predicate describing the second-highest reward.
Table 4 shows differences in coverage (FR) between the same strategies found by our pipeline and by manual analysis. The main discrepancy is that the automatically discovered strategies have a higher frequency than the human-discovered strategies. We believe those differences stem from the generalization inherent to our method and used by AI-Interpret: it finds a description that fits some subset of the elements. Note that since the softmax clusters represent human planning operations, when a demonstration (planning operation) of the softmax cluster is rejected by the algorithm, a human operation that corresponded to this demonstration (e.g., one that followed the same planning principle) is also rejected. A rejection of a cluster-generated planning operation in AI-Interpret might occur in two situations: 1) either it is indescribable due to the imperfect DSL, or 2) it differs from other operations generated by the softmax cluster due to the imperfect clustering. In fact, we computed that AI-Interpret utilized only 56% of the cluster-generated planning operations to find the descriptions. Some human planning operations were thus also rejected when finding a description, but still counted towards the coverage of the strategy. However, the 44% of omitted cluster planning operations do not necessarily represent the same proportion of omitted human planning operations. Given that the FPO of the discovered strategies, which is our measure of the quality of descriptions with respect to human planning operations, was as high as 76%, the upper bound for the proportion of human planning operations that were completely disregarded by Human-Interpret is probably around 24%.
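The bound above follows from a simple calculation; a minimal sketch using the numbers reported in the text (the variable names are, of course, illustrative):

```python
# Numbers reported in the text.
cluster_ops_used = 0.56   # fraction of cluster-generated operations AI-Interpret kept
fpo = 0.76                # average fit of descriptions to human planning operations

# FPO already measures how well the descriptions account for human operations,
# so the fraction of human operations completely disregarded is bounded by 1 - FPO,
# not by the 44% of rejected cluster-generated operations.
upper_bound_disregarded = round(1 - fpo, 2)
print(upper_bound_disregarded)  # 0.24
```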
Finally, computational limitations make it necessary to limit the maximum number of strategies that our method can currently consider. The main computational bottleneck lies in Human-Interpret, and within it, in the AI-Interpret method. Because of the considerable computation time required to run AI-Interpret on the data of one softmax cluster (Skirzyński et al., 2021), we could not find strategies that are as fine-grained as the manually found ones. This is the main reason why we only considered up to 20 strategies that could be used by people, instead of up to 80, to match the results of the manual analysis.

Future work
As stated above, the main limitations of our method are: imperfect softmax clustering, an incomplete DSL, and the computational bottlenecks of AI-Interpret. We could improve the clustering by penalizing a lack of diversity among the models. Hypothetically, this would make the softmax models as different as possible, and they could thus capture more diverse kinds of behaviors, for instance as in (Eysenbach et al., 2018). Alternatively, we could obtain a better clustering by incorporating a different method for choosing the optimal number of softmax clusters. Here, we used the maximum marginal likelihood or the best (lowest) AIC to differentiate between competing models (understood as numbers of clusters), but there might be better ways of accounting for model complexity, such as cross-validation. We see improving the DSL as an iterative refinement process in which we would add the necessary predicates to the DSL, check the results of running our computational method with the new DSL against the manual analysis, identify missing predicates, if any, and then return to the first step. Just a few iterations of this process should render a DSL capable of describing most of the strategies used by people in the Mouselab-MDP environment. Both of those enhancements, however, would not be feasible without first making our computational method run faster. Thus, upgrading the implementation of AI-Interpret is one of our next steps.
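The AIC-based model selection mentioned above can be sketched as follows; this is a minimal illustration, assuming a fixed number of free parameters per cluster (the function name and numbers are hypothetical, not taken from the paper):

```python
def select_num_clusters(log_likelihoods, n_params_per_cluster):
    """Pick the number of softmax clusters with the lowest AIC.

    log_likelihoods: dict mapping a candidate number of clusters k to the
    maximized log-likelihood of the data under a k-cluster model.
    AIC = 2 * (number of parameters) - 2 * log-likelihood; lower is better.
    """
    aic = {k: 2 * k * n_params_per_cluster - 2 * ll
           for k, ll in log_likelihoods.items()}
    return min(aic, key=aic.get)

# The likelihood keeps improving with k, but AIC penalizes the extra parameters,
# so the middle model wins here.
candidates = {5: -1200.0, 10: -1100.0, 20: -1090.0}
print(select_num_clusters(candidates, n_params_per_cluster=8))  # 10
```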
Besides improving the performance of our method, a worthwhile direction for future work is to combine the strategy discovery method presented in this article with our computational process-tracing method for measuring which planning strategy each participant used on each trial of the experiment (Jain et al., 2021). We hypothesize that our method could serve to establish the basic set of strategies one could use as input to the microscope. Then, a human planning sequence could be automatically assigned to one of the automatically discovered strategies, instead of to the manually discovered ones. In this way, automatic strategy discovery might make it possible to speed up the development of equivalent computational process-tracing methods for other tasks and domains and to improve the repertoire of strategies that those methods use to describe the temporal evolution of people's cognitive strategies.
Finally, future work should seek to apply our method for the automatic discovery and description of human strategies to other scenarios. The methodology we presented in this paper can be easily extended to decision-making in other domains, such as risky choice, intertemporal choice, and multi-attribute decision-making (Lieder et al., 2017). Because our approach is rather general, we believe it has the potential to accelerate scientific discovery in several areas of the cognitive and behavioral sciences.

Table A1: Values for the parameters of Human-Interpret that were used in the benchmark test.

A.2 Defining the Domain Specific Language
Our Domain-Specific Language consisted of a number of logical predicates. Every predicate we defined accepts (at least) two arguments: a (belief) state of the environment b and a computation/action c. In our case, the state corresponds to the list of expected values of the nodes in the Mouselab MDP, whereas the computation is the number of the node to click, with 0 reserved for termination. The Mouselab MDP we used in the benchmark test had the form of a tree, hence many of the predicates make use of notions defined for tree graph structures. The meaning of the predicates, presented below in alphabetical order, is the following:

all(b,c,pred1,pred2): All the nodes in the MDP that satisfy pred1 also satisfy pred2.
among(b,c,pred1,pred2): This node is among all the nodes in the MDP that satisfy pred1, and within that set it also satisfies pred2.
are_branch_leaves_observed(b,c): All the successor leaves of this node are observed.
are_leaves_observed(b,c): All leaf nodes have been observed.
has_parent_lowest_level_value(b,c): This node's parent has the minimum possible value on its level.
has_root_highest_value(b,c,list): This node has an ancestor on level 1 with an observed value that is higher than any other observed 1st-level ancestor's value for the nodes from list.
has_root_highest_level_value(b,c): This node can be accessed through an observed node on level 1 which has the highest value on level 1.
has_root_lowest_value(b,c,list): This node has an ancestor on level 1 with an observed value that is lower than any other observed 1st-level ancestor's value for the nodes from list.
has_root_lowest_level_value(b,c): This node can be accessed through an observed node on level 1 which has the minimum value on level 1.
has_smallest_depth(b,c,list): This node is the shallowest in the tree among the nodes from list.
is_previous_observed_sibling(b,c): The previously observed node is one of the siblings of this node.
is_root(b,c): This node is one of the nodes on level 1.
is_successor_max_val(b,c): One of the successors of this node is uncovered and has the maximum possible value in the MDP.
observed_count(b,c,n): There are at least n observed nodes.
termination_return(b,c,e): The expected reward after stopping now is ≥ e.
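To make the predicate interface concrete, here is a minimal sketch of how a few of the predicates above could be implemented. The belief-state encoding (a `Belief` class holding the tree structure and the observed values, with `None` marking unobserved nodes) is our own illustrative assumption, not the paper's actual data structure:

```python
from dataclasses import dataclass, field

@dataclass
class Belief:
    """Hypothetical belief state: tree structure plus observed node values."""
    children: dict                              # node -> list of child nodes
    values: dict                                # node -> observed value, or None
    roots: list = field(default_factory=list)   # level-1 nodes

def is_root(b, c):
    """This node is one of the nodes on level 1."""
    return c in b.roots

def is_leaf(b, c):
    """This node has no children."""
    return not b.children.get(c)

def are_leaves_observed(b, c):
    """All leaf nodes have been observed (c is unused, as in the DSL)."""
    leaves = [n for n in b.values if is_leaf(b, n)]
    return all(b.values[n] is not None for n in leaves)

def observed_count(b, c, n):
    """There are at least n observed nodes."""
    return sum(v is not None for v in b.values.values()) >= n

# A two-node chain: root 1 -> leaf 2, with only the root observed so far.
b = Belief(children={1: [2], 2: []}, values={1: 5, 2: None}, roots=[1])
print(is_root(b, 1), are_leaves_observed(b, 0), observed_count(b, 0, 1))
# True False True
```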
The DSL we used for studying Mouselab MDP policies was generated through a probabilistic context-free grammar with the following format:

Listing 1: Probabilistic context-free grammar that generates the predicates used by AI-Interpret and, in consequence, by Human-Interpret. The probability of each production is uniform with respect to the non-terminal symbol on its left-hand side.

A.3 Defining redundant and allowed predicates
The redundant_predicates parameter of Human-Interpret controls which predicates are to be removed from the DNF formula before it is turned into a linear temporal logic formula. The allowed_predicates parameter specifies which predicates are considered for the until and unless conditions. We used the following values:

A.4 Defining features for the softmax models

The generative models created in the generative step of the Human-Interpret algorithm take the form of the softmax functions shown in Equation 2. Those functions are defined on a set of features derived from the Mouselab MDP in a given (belief) state b while taking a given computation/action c (node to click). We used the following features in the benchmark test:

is_root(b,c): Is the given node a root or not.
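A feature-based softmax policy of the kind referenced above can be sketched as follows; this is a generic illustration of the functional form (the weights, features, and belief-state encoding are hypothetical stand-ins, not the fitted models from the paper):

```python
import math

def softmax_policy(weights, features, b, actions, temperature=1.0):
    """Probability of each computation c given belief state b.

    features: list of functions f(b, c) -> float, one weight per feature.
    Scores are weighted feature sums; probabilities are their softmax.
    """
    scores = [sum(w * f(b, c) for w, f in zip(weights, features)) / temperature
              for c in actions]
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Two toy features over a belief state represented simply as the set of root nodes:
feats = [lambda b, c: 1.0 if c in b else 0.0,   # an is_root-style indicator
         lambda b, c: float(c == 0)]            # termination indicator (node 0)
probs = softmax_policy([2.0, -1.0], feats, b={1, 2}, actions=[0, 1, 2])
print(round(sum(probs), 6))  # 1.0
```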

A.5.2 Transforming disjunctive normal form formulas into procedural formulas
In the first phase, DNF2LTL generates a procedural formula out of a DNF formula. To do so, our algorithm accepts four main inputs: the set of trajectories that led to the creation of the DNF formula, a set of predicates that could serve as the until or unless conditions, a set of predicates that are unwanted in the procedural formula, and, naturally, the DNF formula itself.
The trajectories play the role of the sequences of truth assignments from Definition 9, whereas the set of predicates for until/unless conditions and the DNF formula define the building blocks out of which the procedural formula is constructed. The remaining parameter is optional, and in case of a failure to produce an output, the algorithm is run again without removing the redundant predicates. On a high level, our algorithm exploits the idea that a DNF formula is satisfied when at least one of its conjunctions is satisfied. It iterates over the trajectories to discover the dynamics of changes in the truth values of the conjunctions, and uses the conjunctions, the found dynamics, and the candidate until/unless conditions to generate procedural formulas.
During the first phase, DNF2LTL generates an initial procedural description in four steps. In the first step, the algorithm extracts potential subroutines from the inputted DNF formula. In the second step, it determines the order in which those subroutines should be performed. In the third step, it computes the logical conditions for transitioning from each step to the next. Finally, in the fourth step, our method connects the subroutines with the appropriate conditions into a complete procedural description and outputs the result. Algorithm 2 presents pseudocode that implements the first phase of DNF2LTL, and the following paragraphs provide a technical description of each of these four steps. Readers who are primarily interested in the big picture and the application to boosting human planning can skip these technical details.
Step 1: DNF2LTL starts by dividing the DNF formula into a set of conjunctions and removing all the unwanted predicates.
Step 2: Then, it iterates over the trajectories and, for each trajectory, records the sequence of conjunctions that were true for that trajectory so that the whole DNF formula could be true across all the state-action pairs within it. Our algorithm then creates a transition graph where conjunction c_i is connected with conjunction c_j if there is at least one trajectory τ where the value of c_i changed from true to false at the same moment the value of c_j changed from false to true. The transition graph is used to generate maximum-length sequences of conjunctions c_{i_1} c_{i_2} ... c_{i_n} in order to capture the fullest transition evidenced in the data. The last conjunction in this sequence (i.e., c_{i_n}) either has no outgoing connections in the transition graph or connects to one of the c_{i_j}'s, in which case the sequence ends with a special loop symbol that indicates which i_j that is. The resulting maximum-length sequences (with potential loop symbols at the end) are used to define equivalence classes for the trajectories. These equivalence classes represent potential dynamics of how the conjunctions change their truth values so that the full DNF formula is satisfied. To find a small subset of equivalence classes that is sufficient to describe all of the trajectories, each trajectory, treated as a sequence of conjunctions of the DNF formula, is assigned to one equivalence class. Namely, for trajectory τ, the first encountered equivalence class is chosen whose sequence contains a subsequence representing τ. For instance, if τ is represented by sequence c_1 c_3, it could be assigned to equivalence class c_1 c_2 c_3 c_4 LOOP c_2.
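The transition-graph construction in Step 2 can be sketched as follows. This is a simplified illustration, assuming exactly one conjunction is true at each step of a trajectory (the real algorithm tracks truth-value changes per conjunction) and resolving choices deterministically:

```python
from collections import defaultdict

def transition_graph(conjunction_sequences):
    """Edges c_i -> c_j when some trajectory switches from conjunction i to j.

    conjunction_sequences: one list per trajectory, giving at each step the
    index of the conjunction that was true at that step.
    """
    graph = defaultdict(set)
    for seq in conjunction_sequences:
        for cur, nxt in zip(seq, seq[1:]):
            if cur != nxt:
                graph[cur].add(nxt)
    return dict(graph)

def maximal_sequence(graph, start):
    """Follow outgoing edges from `start`; emit a LOOP marker on a revisit."""
    seq, seen = [start], {start}
    node = start
    while graph.get(node):
        node = sorted(graph[node])[0]     # deterministic choice for this sketch
        if node in seen:
            seq.append(("LOOP", node))    # the sequence ends with a loop symbol
            break
        seq.append(node)
        seen.add(node)
    return seq

g = transition_graph([[0, 0, 1, 2], [1, 2, 0]])
print(maximal_sequence(g, 0))  # [0, 1, 2, ('LOOP', 0)]
```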
Step 3: Then, the algorithm selects the non-empty equivalence classes, which represent dynamics that occur in the trajectories, and tries to transform the sequences representing the equivalence classes into procedural formulas. It does so by using the trajectories in the class to iteratively find conditions for the UNTIL operator (UNTIL conditions) that could separate each of the elements in the sequence representing the class. During one iteration, DNF2LTL searches for the UNTIL condition separating a pair of subsequent conjunctions. The candidate UNTIL conditions are the allowed predicates provided as an input to the method and 2-element disjunctions of those predicates, which are created on the fly. The candidates are generated by searching for conditions whose truth value changes from constantly false, while one element of the sequence of conjunctions is true, to true, when the next element becomes true. Since there may be multiple conditions that conform to this criterion, we use probabilistic reasoning to choose the best one. Concretely, we test all candidate conditions and build a simplified procedural formula for each. The first part of that formula connects the consecutive conjunctions from the sequence according to the conditions selected in the previous iterations, then connects the currently considered pair of conjunctions with one of the candidate conditions, and ends the formula with an always-TRUE predicate to capture that the remaining steps are yet unspecified. Each such partial formula serves to induce a probabilistic policy. DNF2LTL uses these policies to compute the likelihood of each trajectory of planning operations under the alternative procedural formulas. Then, the algorithm applies the maximum likelihood principle and selects the candidate UNTIL condition that explains the trajectories best. If some of the trajectories have a conjunction representation shorter than the representation of the class, the algorithm also adds an UNLESS operator after the UNTIL operator and searches for the UNLESS condition in a similar way. If it fails to find the UNTIL condition, it runs the algorithm again without removing the redundant predicates or, if no predicates had been removed in the first pass, it adds the HOLD operator instead. Similarly, if it fails to find the UNLESS condition, it either runs the algorithm again without removing the predicates or, if nothing was removed, adds FALSE instead. This makes it possible to find an imperfect procedural formula, which allows for excessive planning, in the case when the UNLESS condition cannot be expressed in the current DSL. After finding the UNTIL (and perhaps the UNLESS) condition, the subroutines represented by the two conjunctions from the sequence are separated with the found condition(s) and then connected via the NEXT operator. This process then continues. If the last pair of elements in the sequence representing the equivalence class contains a conjunction and a loop symbol, this symbol is transformed into the LOOP operator (see Definition 11), and the conjunction and the loop operator are joined through the NEXT operator. If there are demonstrations that end before the loop, the UNLESS operator is added in the same way as before.
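The maximum-likelihood selection of an UNTIL condition can be sketched generically as follows. Here `loglik` is a hypothetical stand-in for the likelihood of a trajectory under the policy induced by the partial procedural formula; the real method constructs that policy explicitly:

```python
import math

def best_until_condition(candidates, trajectories, loglik):
    """Pick the candidate UNTIL condition under which the trajectories are most likely.

    loglik(condition, trajectory) -> log-probability of the trajectory under the
    simplified procedural formula that uses `condition` at the current split.
    """
    totals = {cond: sum(loglik(cond, t) for t in trajectories)
              for cond in candidates}
    return max(totals, key=totals.get)

# Toy scoring: a condition "explains" a trajectory well if it occurs in it.
trajs = [["a", "b"], ["a", "c"], ["a"]]
score = lambda cond, t: 0.0 if cond in t else math.log(0.1)
print(best_until_condition(["a", "b", "c"], trajs, score))  # a
```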
Step 4: After generating the procedural formulas for each of the equivalence classes, the final procedural formula is returned as a disjunction of these formulas. Note, however, that only one of the elements in the disjunction is returned by DNF2LTL after it performs pruning (see below).
Our algorithm captures a very special type of procedural formula. For a DNF formula with only one conjunction, the structure of the output can be described by a regular expression in which P may be substituted with any of the input allowed predicates or their 2-element disjunctions, and Φ may be substituted with an arbitrary conjunction of those predicates. The expression given in Equation 7 thus generates procedural formulas in the form of a sequence of NEXT operators, where subsequent conjunctions are separated with UNTIL conditions (and/or UNLESS conditions) or accompanied by the HOLD operator. The formula ends with the last NEXT operator or with a LOOP operator.

A.5.3 Pruning
After our algorithm generates a procedural formula Ψ in the first phase, it enters the second phase. During the second phase, DNF2LTL prunes the predicates appearing in the conjunctions of Ψ. Recall, however, that Ψ is a disjunction of procedural formulas. For this reason, pruning occurs for each element of that disjunction separately. To do so, DNF2LTL maps each procedural formula ψ_i of the disjunction Ψ onto a distinct binary vector b_i. Each element of b_i is the truth value of one of the predicates appearing in the conjunctions making up ψ_i. Our algorithm iterates over the b_i's and in each step performs a greedy optimization. Concretely, for each consecutive predicate of ψ_i, the corresponding entry of b_i is set to zero if and only if removing that predicate increases the likelihood of the trajectories under the pruned description relative to the unpruned description. After performing this optimization for each b_i (and ψ_i), the algorithm outputs the pruned ψ_i for which the likelihood was the highest as the final procedural description.
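The greedy pruning pass can be sketched as follows. The likelihood function here is a hypothetical stand-in for the policy-based likelihood that DNF2LTL actually computes over the trajectories:

```python
def prune_predicates(predicates, loglik):
    """Greedily drop predicates whose removal increases the likelihood.

    loglik(mask) -> log-likelihood of the trajectories under the description
    restricted to the predicates whose mask entry is True.
    """
    mask = [True] * len(predicates)
    for i in range(len(predicates)):
        trial = list(mask)
        trial[i] = False
        if loglik(trial) > loglik(mask):   # keep the removal only if it helps
            mask = trial
    return [p for p, keep in zip(predicates, mask) if keep]

# Toy likelihood: the second predicate is spurious and hurts the fit.
toy_loglik = lambda mask: -5.0 if mask[1] else -2.0
print(prune_predicates(["is_leaf", "depth(2)", "is_observed"], toy_loglik))
# ['is_leaf', 'is_observed']
```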
Algorithm 2: First phase of DNF2LTL which generates procedural formulas out of DNF formulas.
Input: DNF formula f; a set of N trajectories D = {((s_{1j}, a_{1j}))_{j=1}^{l_1}, ..., ((s_{Nj}, a_{Nj}))_{j=1}^{l_N}}.

Summary: DFS until a positive leaf (optionally observe its sibling) or until +48.

Description:
1. Click on a node satisfying all of the following:
- it is a node on an arbitrary level other than 2 and is a non-leaf
- it is an unobserved node that has a child with a non-highest or unobserved value on its level
- it lies on a best path.
2. Click on the nodes satisfying all of the following:
- they are unobserved non-roots
- they have parents with the lowest values considering the parents of other unobserved non-roots.
Click in this way as long as all the nodes with a 48 on their path have a parent with the same observed value. Repeat this step as long as possible.
3. GOTO step 1.

LTL formula: among(not(depth(2)) and not(is_leaf)) and among(not(has_child_highest_level_value) and not(is_observed) : has_best_path) AND NEXT among(not(is_observed) and not(is_root) : has_parent_smallest_value) and all_(is_max_in_branch : has_parent_smallest_value) UNTIL IT STOPS APPLYING GO TO among(not(depth(2)) and not(is_leaf)) and among(not(has_child_highest_level_value) and not(is_observed) : has_best_path)

FR: 4.23%; FCF: 0.93; FON: 0.7; FPO: 0.67; C: 21.

Table A2: The results of applying our pipeline to the benchmark problem. The pipeline with 10 clusters found 7 unique strategies, listed with their IDs. For each unique strategy, we provide the automatically generated descriptions that represent that strategy and a summary of that strategy that we created by hand. FR denotes the frequency of the strategy; FCF (fit cluster-formula) averages two proportions: formula demonstrations agreeing with the softmax clusters and vice versa, measured using 100,000 demonstrations; FON (fit optimal-non-optimal) quantifies how often people's planning operations in the cluster agreed with the description; FPO (fit per operation) is the ratio between the average likelihood per planning operation belonging to the cluster and the average likelihood per planning operation for the (policy induced by the) cluster's description in general; C, which stands for complexity, is the number of individual predicates in the description.
No planning. Manual counterpart: No planning.

is_ancestor_max_val(b,c): One of the ancestors of this node is uncovered and has the maximum possible value in the MDP.
is_leaf(b,c): This node is a leaf.
is_max_in_branch(b,c): This node lies on a path with an uncovered maximum possible value in the MDP.
is_2max_in_branch(b,c): This node lies on a path with 2 uncovered maximum possible values in the MDP.
is_observed(b,c): This node was already clicked and is observed.
is_on_highest_expected_value_path(b,c): This node lies on a path that has the highest expected value.
is_positive_observed(b,c): There is a node with a positive value observed.
is_previous_observed_max(b,c): The previously observed node uncovered the maximum possible value in the MDP.
is_previous_observed_max_leaf(b,c): The previously observed node is a leaf and it uncovered the maximum possible value in the MDP.
is_previous_observed_max_level(b,c): The previously observed node uncovered a maximum possible value on that level.
is_previous_observed_max_nonleaf(b,c): The previously observed node isn't a leaf and it uncovered the maximum possible value in the MDP.
is_previous_observed_max_root(b,c): The previously observed node lies on level 1 and it uncovered the maximum possible value in the MDP.
is_previous_observed_min(b,c): The previously observed node uncovered the minimum possible value in the MDP.
is_previous_observed_min_level(b,c): The previously observed node uncovered a minimum possible value on that level.
is_previous_observed_parent(b,c): The previously observed node is the parent of this node.

count_observed_node_branch(b,c): The minimum number of observed nodes over the branches that pass through the given node.
depth_count(b,c): The number of observed nodes at the same depth as the given node.
depth(b,c): The depth of the given node.
get_level_observed_std(b,c): The standard deviation of the values of the observed nodes at the same level as the given node.
hp_0(b,c): Does the path that the node lies on have a value greater than 0?
immediate_successor_count(b,c): The number of observed children of the given node.
is_leaf(b,c): Is the given node a leaf or not.
is_previous_max(b,c): Did the previously observed node uncover the maximum possible value in the MDP or not.

Table 1: Explanation of the parameters used by Human-Interpret and other methods utilized by Human-Interpret.
Table A1 lists the values of the parameters of the Human-Interpret method that we used in our benchmark planning environment (the Mouselab MDP). The goal of the benchmark test was to discover and describe the strategies used in this environment by people.