
1 Introduction

Partially Observable Markov Decision Processes (POMDPs) combine non-determinism, probability, and partial observability. Consequently, they have gained popularity in various applications as a model of planning in an unsafe and only partially observable environment. Originating in the machine learning community [30], they have also gained interest in the formal methods community [11, 14, 22, 25]. They are a very powerful model, able to faithfully capture real-life scenarios where we cannot assume perfect knowledge, which is often the case. Unfortunately, this great power comes at the cost of hard analysis. Typical objectives of interest such as reachability or total reward already result in undecidable problems [25]. Namely, the resolution of the non-determinism (a.k.a. synthesis of a strategy, policy, scheduler, or controller) cannot be done algorithmically while guaranteeing optimality w.r.t. the objective. Consequently, heuristics to synthesize practically well-performing strategies became of significant interest. Let us name several aspects playing a key role in the applicability of such synthesis procedures:

(1) quality of the synthesized strategies,

(2) size and explainability of the representation of the synthesized strategies,

(3) scalability of the computation method.

Strategy Representation. While (1) and (3) are of obvious importance, it is important to note aspect (2). A strategy is a function mapping the current history (sequence of observations so far) to an action available in the current state. When written as a list of history-action pairs, it results in a large and incomprehensible table. In contrast, when equivalently written as a Mealy machine transducing the stream of observations into a stream of actions, its size may be dramatically lower (making it easier to implement and more efficient to execute) and its representation more explainable (making it easier to certify). Besides, better understandability allows for easier maintenance and modification. By comparison, explicit (table-like) or, e.g., neural-network representations of the function can hardly be hoped to be understandable by any human (even a domain expert). Compact and understandable representations of strategies have recently gained attention, e.g., [12, 27], also for POMDPs [3, 21], and even tool support [7] and [4], respectively. See [6] for detailed aspects of motivation for compact representations.

Current Approaches For POMDPs, the state of the art is split into two streams.

On the one hand, tools such as Storm [11] feature a classic belief-based analysis, which essentially blows up the state space, making it easier to analyze. Consequently, it is still reasonably scalable (3), but the size of the resulting strategy is even larger than that of the state space of the POMDP and is simply given as a table, i.e., not doing well w.r.t. the representation (2). Moreover, to achieve the scalability (and in fact even termination), the analysis has to be stopped at some places (“cut-offs”), resulting in poorer performance (1). On the other hand, the exhaustive bounded synthesis as in PAYNT [4] tries to synthesize a small Mealy machine representing a good strategy (thus solving the POMDP) and, if it fails, tries again with an increased allowed size of the automaton. While this approach typically achieves better quality (1) and, by principle, better size and explainability (2), it is extremely expensive and does not scale at all (3) if the strategy requires a larger automaton. While symbiotic approaches are emerging [2], the best of both worlds has not been achieved yet.

Our Contribution We design a highly scalable postprocessing step, which improves the quality and the representation of the strategy. It is compatible with any framework producing any strategy representation, requiring only that we can query the strategy function (which action corresponds to a given observation sequence). In particular, Storm, which itself is scalable, can thus profit from improving the quality and the representation of the produced strategies. Our procedure learns a compact representation of the given strategy as a Mealy machine using automata-learning techniques, in two different ways. First, by learning the complete strategy, we obtain its automaton representation, which is fully equivalent and thus achieves the same value. Second, we provide heuristics that learn small modifications of the strategy. Indeed, for some inputs (observation sequences), we ignore what the strategy suggests, in particular when the strategy is not defined, but also when it explicitly states that it is unsure about its choice (such as at the cut-off points, where the sequences become too long and the strategy was not optimised well at these later points). Whenever we ignore the strategy, we try to come up with a possibly better decision. For instance, we can adopt the decision that the currently learnt automaton suggests, or we can mirror decisions made in similar situations. This way, we produce a simpler (and thus also comparatively smaller) strategy, which can, in principle, fix suboptimal decisions of the strategy stemming from the limitations of the original analysis (such as bounds on the exploration) or any other irregularities. Of course, this only works well if the true optimal strategy is “sensible”, i.e., has an inner structure allowing for a simple automaton representation. For practical, hence sensible, problems, this is typically the case.

Summary of our contribution:

  • We provide a method to take any POMDP strategy and transform it into an equivalent or similar (upon choice) automaton, yielding small size and potential for explainability.

  • Thereby we often improve the quality of the strategy.

  • The experiments confirm the improvements and frequent proximity to best known values (typically of PAYNT) on the simpler benchmarks.

  • The experiments indicate great scalability even on harder benchmarks where the comparison tool times out. The favourable comparison on simpler benchmarks warrants trust in good absolute quality and size on the harder ones.

Related Work Methods to solve planning problems on POMDPs have been studied extensively in the literature [18, 32, 34]. Many state-of-the-art solvers use point-based methods like PBVI [29], Perseus [35] and SARSOP [23] to treat bounded and unbounded discounted properties. For these methods, strategies are typically represented using so-called \(\alpha \)-vectors. Apart from a significant overhead in the analysis, they completely lack explainability. Notably, while the SARSOP implementation provides an export of its computed strategies in an automaton format, we have not been able to find an explanation of how it is generated.

Methods based on the (partial) exploration and solving of the belief MDP underlying the POMDP [10, 11, 28] have been implemented in the probabilistic model checkers Storm [20] and Prism [24]. The focus of these methods is optimizing infinite-horizon objectives without discounting. Recent work [2] describes how strategies are extracted from the results of these belief exploration methods. The resulting strategy representation, however, is rather large and contains potentially redundant information.

Orthogonal to the methods above, there are approaches that directly synthesize strategies from a space of candidates [16, 26]. The synthesized strategy is then applied to the POMDP to yield a Markov chain. Analyzing this Markov chain yields the objective value achieved by the strategy. Methods used for searching policies include inductive synthesis [3], gradient descent [19] and convex optimization [1, 15, 21]. [2] describes an integration of a belief exploration approach [11] with inductive synthesis [3].

Our approach is orthogonal to the solution methods in that it uses an existing strategy representation and learns a new, potentially more concise finite-state controller representation. Furthermore, our modifications of learned strategy representations share similarities with approaches for strategy improvement [13, 33, 36].

2 Preliminaries

For a countable set S, we denote its power set by \(2^S\). A (discrete) probability distribution on a countable set S is a function \(d:S\rightarrow [0,1]\) such that \(\sum _{s\in S} d(s)=1\). We denote the set of all probability distributions on the set S as \(\textsf{Dist}(S)\). For \(d \in \textsf{Dist}(S)\), the support of d is \(\textsf{supp}(d) = \{s \in S \mid d(s) > 0\}\). We use the Iverson bracket notation where \([x]=1\) if the expression x is true and 0 otherwise. For two sets S and T, we define the set of concatenations of S with T as \(S \cdot T = \{s \cdot t \mid s \in S, t \in T\}\). We analogously define the set of n-fold concatenations of S with itself as \(S^n\) for \(n \ge 1\), and \(S^0 =\{\epsilon \}\) is the set containing only the empty string. We denote by \(S^* = \bigcup _{n=0}^\infty S^n\) the set of all finite strings over S and by \(S^+ = \bigcup _{n=1}^\infty S^n\) the set of all non-empty finite strings over S. For a finite string \(w=w_1w_2\ldots w_n\), the string w[0, i] with \(w[0,0] = \epsilon \) and \(w[0,i] = w_1\ldots w_i\) for \(0 < i \le n\) is a prefix of w. The string \(w[i,n] = w_i\ldots w_n\) with \(0 < i \le n\) is a suffix of w. A set \(W \subseteq S^*\) is prefix-closed if for all \(w = w_1 \ldots w_n \in W\), \(w[0, i] \in W\) for all \(0 \le i \le n\). A set \(W \subseteq S^*\) is suffix-closed if \(\epsilon \notin W\) and for all \(w = w_1 \ldots w_n \in W\), \(w[i, n] \in W\) for all \(0 < i \le n\).

Definition 1 (MDP)

A Markov decision process (MDP) is a tuple \(\mathcal {M}= (S, A, P, s_0)\) where \(S\) is a countable set of states, \(A\) is a finite set of actions, \(P: S\times A\rightharpoonup \textsf{Dist}(S)\) is a partial transition function, and \(s_0\in S\) is the initial state.

For an MDP \(\mathcal {M}= (S, A, P, s_0)\), \(s\in S\) and \(a\in A\), let \(\textsf{Post}^{\mathcal {M}}(s,a)=\{s'\mid P(s,a,s')>0\}\) be the set of successor states of s in \(\mathcal {M}\) that can be reached by taking the action a. We also define the set of enabled actions in \(s\in S\) by \(A(s) = \{a\in A\mid P(s,a)\ne \bot \}\). A Markov chain (MC) is an MDP with \(|A(s)|=1\) for all \(s\in S\). For an MDP \(\mathcal {M}\), a finite path \(\rho = s_0a_0s_1\ldots s_i\) of length \(i\ge 0\) is a sequence of states and actions such that for all \(t\in [0,i-1]\), \(a_t\in A(s_t)\) and \(s_{t+1}\in \textsf{Post}^{\mathcal {M}}(s_t,a_t)\). Similarly, an infinite path is an infinite sequence \(\rho = s_0a_0s_1a_1s_2\ldots \) such that for all \(t\in \mathbb {N}\), \(a_t\in A(s_t)\) and \(s_{t+1}\in \textsf{Post}^{\mathcal {M}}(s_t,a_t)\). For an MDP \(\mathcal {M}\), we denote the set of all finite paths by \(\textsf{FPaths}_\mathcal {M}\), and of all infinite paths by \(\textsf{IPaths}_\mathcal {M}\).

Definition 2 (POMDP)

A partially observable MDP (POMDP) is a tuple \(\mathcal {P}=(\mathcal {M}, Z, \mathcal {O})\) where \(\mathcal {M}=(S, A, P, s_0)\) is the underlying MDP with a finite set of states, \(Z\) is a finite set of observations, and \(\mathcal {O}:S\rightarrow Z\) is an observation function that maps each state to an observation.

For POMDPs, we require that states with the same observation have the same set of enabled actions, i.e., \(\mathcal {O}(s) = \mathcal {O}(s')\) implies \(A(s) = A(s')\) for all \(s,s'\in S\). This way, we can lift the notion of enabled actions to an observation \(z\in Z\) by setting \(A(z)=A(s)\) for some state \(s\in S\) with \(\mathcal {O}(s)=z\). The notion of observation \(\mathcal {O}\) for states can be lifted to paths: for a path \(\rho = s_0 a_0 s_1 a_1 \ldots \), we define \(\mathcal {O}(\rho ) = \mathcal {O}(s_0)a_0\mathcal {O}(s_1)a_1\ldots \). Two paths \(\rho _1\) and \(\rho _2\) are called observation-equivalent if \(\mathcal {O}(\rho _1) = \mathcal {O}(\rho _2)\). We call an element \(\bar{o}\in Z^*\) an observation sequence and denote the observation sequence of a path \(\rho = s_0a_0s_1\ldots \) by \(\overline{\mathcal {O}}(\rho ) = \mathcal {O}(s_0)\mathcal {O}(s_1)\ldots \) .

Fig. 1. Running example: POMDP

Example 1

Consider the POMDP graphically depicted in Fig. 1, modeling a basic robot planning task. A robot is dropped uniformly at random in one of four grid cells. Its goal is to reach cell 3. The robot’s sensors cannot distinguish cells 0 and 2, while cells 1 and 3 provide unique information. For the POMDP model, we use states 0, 1, 2, and 3 to indicate the robot’s position. We mimic the random initialization by introducing a unique initial state \(s_0\) with a unique observation \(\texttt {i}\) (init). \(s_0\) has a single action that reaches any of the other four states with equal probability 0.25. Thus, the state space of the POMDP is \(S=\{s_0,0,1,2,3\}\). To represent the observations of the robot, we use three further observations \(\texttt {b}\), \(\texttt {y}\) and \(\texttt {g}\), so \(Z=\{\texttt {i},\texttt {b},\texttt {y},\texttt {g}\}\). States 0 and 2 have the same observation, while states 1 and 3 are uniquely identifiable, formally \(\mathcal {O}=\{(s_0\rightarrow \texttt {i}), (0\rightarrow \texttt {b}),(1\rightarrow \texttt {y}), (2\rightarrow \texttt {b}), (3\rightarrow \texttt {g})\}\). The goal is for the robot to reach state 3. In each grid cell, it can choose to move up, down, left, or right; together with the single initialization action of \(s_0\), this yields \(A=\{s,u,d,l,r\}\). In each step, executing the chosen movement may fail with probability \(p=0.5\), in which case the robot remains in its current cell.
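
For concreteness, the following is a minimal sketch of how the running example could be encoded as plain Python dictionaries. The 2×2 grid layout (cells 0 and 1 on top, 2 and 3 below), the convention that bumping into a wall leaves the robot in place, and all identifier names are assumptions we make for illustration.

```python
# Sketch of the running-example POMDP from Example 1 as plain Python dictionaries.
P_FAIL = 0.5                                             # slip probability from the text

observation = {"s0": "i", 0: "b", 1: "y", 2: "b", 3: "g"}  # O: S -> Z

# successor cell for each movement action in the assumed 2x2 grid layout
move = {
    0: {"u": 0, "d": 2, "l": 0, "r": 1},
    1: {"u": 1, "d": 3, "l": 0, "r": 1},
    2: {"u": 0, "d": 2, "l": 2, "r": 3},
    3: {"u": 1, "d": 3, "l": 2, "r": 3},
}

# partial transition function P: (state, action) -> distribution over states
transition = {("s0", "s"): {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}}
for cell, successors in move.items():
    for action, target in successors.items():
        if target == cell:                               # wall: the move has no effect
            transition[(cell, action)] = {cell: 1.0}
        else:                                            # the move fails with probability P_FAIL
            transition[(cell, action)] = {target: 1.0 - P_FAIL, cell: P_FAIL}
```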

Definition 3 (Strategy)

A strategy for an MDP \(\mathcal {M}\) is a function \(\pi : \textsf{FPaths}_\mathcal {M}\rightarrow \textsf{Dist}(A)\) such that for all paths \(\rho \in \textsf{FPaths}_\mathcal {M}\), \(\textsf{supp}(\pi (\rho ))\subseteq A(\textsf{last}(\rho ))\).

A strategy \(\pi \) is deterministic if \(|\textsf{supp}(\pi (\rho ))| = 1\) for all paths \(\rho \in \textsf{FPaths}_\mathcal {M}\). Otherwise, it is randomized. A strategy \(\pi \) is called memoryless if it depends only on \(\textsf{last}(\rho )\), i.e., for any two paths \(\rho _1,\rho _2 \in \textsf{FPaths}_\mathcal {M}\), if \(\textsf{last}(\rho _1) = \textsf{last}(\rho _2)\) then \(\pi (\rho _1) = \pi (\rho _2)\). As general strategies have access to full state information, they are unsuitable for partially observable domains. Therefore, POMDPs require a notion of strategies based only on observations. For a POMDP \(\mathcal {P}\), we call a strategy observation-based if for any \(\rho _1,\rho _2 \in \textsf{FPaths}_\mathcal {M}\), \(\mathcal {O}(\rho _1) = \mathcal {O}(\rho _2)\) implies \(\pi (\rho _1) = \pi (\rho _2)\), i.e., the strategy gives the same output on observation-equivalent paths.

We are interested in representing observation-based strategies approximating optimal objective values for infinite-horizon objectives without discounting, also called indefinite-horizon objectives, i.e., maximum/minimum reachability probabilities and expected total reward objectives. We emphasize that our learning framework also generalizes straightforwardly to strategies for other objectives. In contrast to fully observable MDPs, deciding whether a given strategy is optimal for an indefinite-horizon objective on a POMDP is generally undecidable [25]. In fact, optimal behavior requires access to the full history of observations, necessitating an arbitrary amount of memory. As such, our goal is to learn a small representation of a strategy using only a finite amount of memory that approximates optimal values as well as possible.

We represent these strategies as finite-state controllers (FSCs) – automata that compactly encode strategies with access to memory and randomization in a POMDP.

Definition 4 (Finite-State Controller)

A finite-state controller (FSC) is a tuple \(\mathcal {F}= (N, \gamma , \delta , n_0)\) where \(N\) is a finite set of nodes, \(\gamma : N\times Z\rightarrow \textsf{Dist}(A)\) is an action mapping, \(\delta :N\times Z\rightarrow N\) is the transition function, and \(n_0\) is the initial node.

We denote by \(\pi _{\mathcal {F}}\) the strategy represented by the FSC \(\mathcal {F}\) and use \(\mathfrak {F}\) for the set of all FSCs for a POMDP \(\mathcal {P}\). Given an FSC \(\mathcal {F}= (N, \gamma , \delta , n_0)\) that is currently in node n, and a POMDP \(\mathcal {P}\) with underlying MDP \(\mathcal {M}= (S, A, P, s_0)\) in state s, the action to play by an agent following \(\mathcal {F}\) is chosen randomly from the distribution \(\gamma (n,\mathcal {O}(s))\). \(\mathcal {F}\) then updates its current node to \(n' = \delta (n,\mathcal {O}(s))\). The state of the POMDP is updated according to \(P\). As such, an FSC induces a Markov chain \(\mathcal {M}_{\mathcal {F}} = (S\times N, \{\alpha \}, P^{\mathcal {F}}, (s_0,n_0))\) where \(P^{\mathcal {F}}((s,n),\alpha ,(s',n'))\) is \([\delta (n,\mathcal {O}(s)) = n'] \cdot \sum _{a\in A(s)}\gamma (n,\mathcal {O}(s))(a) \cdot P(s,a,s')\).

An FSC can be interpreted as a Mealy machine: nodes correspond directly to states of the Mealy machine, which takes observations as input. The set of output symbols is the set of all distributions over actions occurring in the FSC.
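
The following sketch shows one possible encoding of an FSC and of the transition probabilities of the induced Markov chain \(\mathcal {M}_{\mathcal {F}}\) given by the formula above. The dictionary-based encoding and all names are our own choices, not part of the paper's formalism.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Tuple

Node, Obs, Act, State = Hashable, Hashable, Hashable, Hashable
Dist = Dict[Hashable, float]              # finite-support probability distribution

@dataclass
class FSC:
    """Finite-state controller (N, gamma, delta, n0) in the sense of Definition 4."""
    gamma: Dict[Tuple[Node, Obs], Dist]   # action mapping gamma(n, z)
    delta: Dict[Tuple[Node, Obs], Node]   # memory update delta(n, z)
    n0: Node                              # initial node

def induced_mc_prob(fsc: FSC,
                    P: Dict[Tuple[State, Act], Dist],
                    O: Dict[State, Obs],
                    s: State, n: Node, s2: State, n2: Node) -> float:
    """Transition probability P^F((s, n), (s', n')) of the induced Markov chain."""
    z = O[s]
    if fsc.delta[(n, z)] != n2:           # Iverson bracket [delta(n, O(s)) = n']
        return 0.0
    return sum(p_a * P.get((s, a), {}).get(s2, 0.0)   # sum over actions in gamma's support
               for a, p_a in fsc.gamma[(n, z)].items())
```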

Fig. 2. Depiction of the FSC learning framework

3 Learning a Finite-State Controller

We present a framework for learning a concise finite-state controller representation from a given strategy for a POMDP. Our approach mimics an extension of the L* automaton learning approach [5] for learning Mealy machines [31]. The main difference in our approach is that we have a sparse learning space: not every observation of a POMDP is reachable from every state. Thus, there are many observation sequences that can never occur in the POMDP. To mark situations where this occurs, i.e., where a learned FSC has complete freedom to decide what to do, we introduce a “don’t-care” symbol \(\dag \).

Furthermore, for some policy computation methods, the strategy we receive as input may be incomplete: although an observation sequence can occur in the POMDP, the strategy does not specify what to do when it does. This can, for example, be caused by reaching the depth limit in an exploration-based approach. We use a “don’t-know” symbol \(\chi \) to mark such cases. While such sequences do not directly influence the learning process, they cannot be ignored completely: the \(\chi \) entries need to be replaced by actual actions using heuristics for the final FSC to yield a complete strategy (see Section 3.4).

An overview of the learning process is depicted in Fig. 2. We expect as input a (partially defined) strategy in the form of a table that maps observation sequences in the POMDP to a distribution over actions.

Definition 5 (Strategy Table)

A strategy table \(\mathcal {S}\) for a POMDP \(\mathcal {P}\) is a relation \(\mathcal {S}\subseteq Z^* \times (\textsf{Dist}(A) \cup \{\chi \})\). A row of \(\mathcal {S}\) is an element \((\bar{o},d) \in \mathcal {S}\).

For \((\bar{o},d) \in \mathcal {S}\), if \(\textsf{supp}(d)\) contains only a single action a, we write it as \((\bar{o},a)\). We say a strategy table \(\mathcal {S}\) is consistent if and only if for all \(\bar{o}\in Z^*\), \((\bar{o},d_1) \in \mathcal {S}\) and \((\bar{o},d_2) \in \mathcal {S}\) imply \(d_1 = d_2\), i.e., each observation sequence has at most one output. A consistent strategy table \(\mathcal {S}\) (partially) defines an observation-based strategy \(\pi _{\mathcal {S}}\) with \(\pi _{\mathcal {S}}(\rho ) = d\) if and only if \((\overline{\mathcal {O}}(\rho ), d) \in \mathcal {S}\) and \(d \ne \chi \). For consistent strategy tables, the FSC resulting from our approach correctly represents the partially defined strategy.
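
A minimal sketch of the consistency check on a strategy table stored as a list of rows; the Python representation, the string marker for \(\chi \), and all names are our assumptions.

```python
from typing import Dict, Hashable, List, Tuple, Union

Obs, Act = Hashable, Hashable
Dist = Dict[Act, float]
CHI = "chi"                                            # the "don't-know" marker

# a strategy table: a list of rows (observation sequence, output)
StrategyTable = List[Tuple[Tuple[Obs, ...], Union[Dist, str]]]

def is_consistent(table: StrategyTable) -> bool:
    """Definition 5: every observation sequence has at most one output."""
    seen: Dict[Tuple[Obs, ...], Union[Dist, str]] = {}
    for obs_seq, out in table:
        if obs_seq in seen and seen[obs_seq] != out:
            return False
        seen[obs_seq] = out
    return True
```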

Example 2

Table 1 depicts a strategy table for the POMDP described in Example 1. The table does not specify what to do in state 3 since at that point the robot has already reached its target; the action chosen there is irrelevant. Intuitively, the strategy table describes that the robot should go right as long as it sees b, and go down once it sees y. The FSC in Fig. 3 fully captures the behaviour described by the strategy table and thus accurately represents it.

Table 1. Example strategy table for the POMDP in Example 1. It only contains observation sequences of length at most 2.

In our framework, the input strategy table is used to build an initial FSC which is then compared to the input. If the initial FSC is already equivalent to the given strategy table, we are done and we output the FSC. Otherwise, we get a counterexample and use it to update the FSC. This process of checking for equivalence and updating the FSC is repeated until the FSC is equivalent to the table.

Fig. 3. FSC representing the strategy table of Table 1.
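
For illustration, one possible single-node controller capturing the behaviour described in Example 2 can be written down directly; this hand-written sketch need not coincide node-for-node with the FSC depicted in Fig. 3, and the dictionary encoding is ours.

```python
# A single memory node suffices for the intuitive strategy of Example 2: independent
# of the history, play the initialization action on i, move right on b, move down on y.
# The observation g (target reached) is a "don't care" and is simply omitted here.
gamma = {("n0", "i"): {"s": 1.0},
         ("n0", "b"): {"r": 1.0},
         ("n0", "y"): {"d": 1.0}}
delta = {("n0", z): "n0" for z in ["i", "b", "y", "g"]}   # always stay in node n0
initial_node = "n0"
```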

In the sequel, we first explain how our learning approach works on general input of the form described above. Then we show how the learning approach is integrated with an existing POMDP solution method by means of the belief exploration framework from [11]. Lastly, we introduce heuristics for improvement of the learned policies when the information in the table is incomplete.

3.1 Automaton Learning

The regular L* approach is used to learn a DFA for a regular language. It is intuitively described as: a teacher has information about an automaton and a student wants to learn that automaton. The student asks the teacher whether specific words are part of the language (membership query). At some point, the student proposes a solution candidate (in case of L*, a DFA) and asks the teacher whether it is correct, i.e. whether the proposed automaton accepts the language (equivalence query). Instead of the membership query of standard L*, the extension to Mealy machines [31] uses an output query, since we are not interested in the membership of a word in a language but rather the output of the Mealy machine corresponding to a specific word. As such, our learning approach needs access to an output query, specifying the output of the strategy table for a given observation sequence, and an equivalence query, checking whether an FSC accurately represents the strategy table. We formally define the two types of queries.

Definition 6 (Output Query (OQ))

The output query for a strategy table \(\mathcal {S}\) is the function \(OQ_{\mathcal {S}}: Z^* \rightarrow \textsf{Dist}(A) \cup \{\chi , \dag \}\) with \(OQ_{\mathcal {S}}(\bar{o}) = d\) if \((\bar{o},d) \in \mathcal {S}\) and \(OQ_{\mathcal {S}}(\bar{o}) = \dag \) otherwise.

Definition 7 (Equivalence Query (EQ))

The equivalence query for a strategy table \(\mathcal {S}\) is a function \(EQ_{\mathcal {S}}: \mathfrak {F}\rightarrow Z^*\) defined as follows: \(EQ_{\mathcal {S}}(\mathcal {F}) = \epsilon \) if for all \((\bar{o}, d) \in \mathcal {S}\) and for all \(\rho \) with \(\overline{\mathcal {O}}(\rho ) = \bar{o}\), \(\pi _\mathcal {F}(\rho ) = d \). Otherwise, \(EQ_{\mathcal {S}}(\mathcal {F}) = c\) where \(c \in \{ \bar{o}\mid (\bar{o}, d) \in \mathcal {S}, \exists \rho \in \textsf{FPaths}_\mathcal {M}(\mathcal {P}): \overline{\mathcal {O}}(\rho ) = \bar{o}\wedge \pi _\mathcal {F}(\rho ) \ne d\}\) is a counterexample where \(\mathcal {S}\) and \(\mathcal {F}\) have different output.

The output query (OQ) takes an observation sequence \(\bar{o}\) and outputs the distribution (or the \(\chi \) symbol) suggested by the strategy table. If the given observation sequence is not present in the strategy table, it returns the \(\dag \) symbol, i.e., the “don’t care” symbol. The equivalence query (EQ) takes a hypothesis FSC \(\mathcal {F}_{hyp}\) and asks whether it accurately represents \(\mathcal {S}\). In case it does not, an observation sequence where \(\mathcal {F}_{hyp}\) and \(\mathcal {S}\) differ is generated as a counterexample.
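
To make the two queries concrete, here is a small sketch of how they can be answered directly from a strategy table stored as a dictionary. Returning None instead of \(\epsilon \) for a successful equivalence check, and all names, are our own choices.

```python
from typing import Callable, Dict, Hashable, Optional, Tuple, Union

Obs, Act = Hashable, Hashable
Dist = Dict[Act, float]
CHI, DAGGER = "chi", "dagger"              # "don't-know" / "don't-care" markers

def output_query(table: Dict[Tuple[Obs, ...], Union[Dist, str]],
                 obs_seq: Tuple[Obs, ...]) -> Union[Dist, str]:
    """OQ_S: the table's output for obs_seq, or "don't care" if there is no such row."""
    return table.get(obs_seq, DAGGER)

def equivalence_query(table: Dict[Tuple[Obs, ...], Union[Dist, str]],
                      fsc_output: Callable[[Tuple[Obs, ...]], Union[Dist, str]]
                      ) -> Optional[Tuple[Obs, ...]]:
    """EQ_S: None if the hypothesis agrees with every row of the table, otherwise
    some observation sequence on which they differ (a counterexample). Since the
    compared strategies are observation-based, comparing per observation sequence
    suffices in this sketch."""
    for obs_seq, out in table.items():
        if fsc_output(obs_seq) != out:
            return obs_seq
    return None
```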

Using the definitions of these two queries, we formalise our problem statement as follows:

Problem statement: Given a consistent strategy table \(\mathcal {S}\) for a POMDP \(\mathcal {P}\), learn an FSC \(\mathcal {F}\in \mathfrak {F}\) such that \(EQ_{\mathcal {S}}(\mathcal {F}) = \epsilon \).

Learning Table We aim at solving the problem using a learning framework similar to \(L^*\). We learn an FSC by creating a learning table which keeps track of observation sequences and the outputs that the learner assumes the strategy should yield for them. Formally, it is defined as follows:

Definition 8 (Learning Table)

A learning table for a POMDP \(\mathcal {P}\) is a tuple \(\mathcal {T}= (R, C, \mathcal {E})\) where \(R\subseteq Z^*\) is a prefix-closed finite set of finite strings over the observations representing the upper row indices, the set \(R\cdot Z\) contains the lower row indices, and \(C\subseteq Z^+\) is a suffix-closed finite set of non-empty finite strings over \(Z\) – the columns. \(\mathcal {E}: (R\cup R\cdot Z) \times C\rightarrow \textsf{Dist}(A) \cup \{\chi , \dag \}\) is a mapping that represents the entries of the table.

Table 2. Running example - initial table

Intuitively speaking, the table is divided into upper and lower rows. Initially, the columns of the learning table are the observations in the POMDP. Additional columns may be added in the learning process to further refine the behavior of the learned FSC. Upper rows effectively result in nodes of the learned FSC, while lower rows specify destinations of the transitions. For a row in the upper rows, each entry represents the output of the FSC corresponding to its respective observation (column). For an upper row, if a column is labelled only with an observation, the corresponding entry represents the output of the FSC on that observation. As an example, Table 2 contains the initial learning table for our running example. We do not include observation g for the target state as we are not interested in the behavior of the strategy after the target has been reached.

We say that two rows \(r_1,r_2\in R\cup R\cdot Z\) are equivalent (\(r_1\equiv r_2\)) if they fully agree on their entries, i.e., \(r_1\equiv r_2\) if and only if \(\mathcal {E}(r_1,c) = \mathcal {E}(r_2,c)\) for all \(c\in C\). The equivalence class of a row \(r \in R\cup R\cdot Z\) is \([r] = \{ r' \mid r \equiv r' \}\).

From Learning Table to FSC To transform a learning table into an FSC, the table needs to be of a specific form. In particular, it needs to be closed and consistent. A learning table is closed if for each lower row \(l \in R\cdot Z\), there is an upper row \(u\in R\) such that \(l \equiv u\).

Fig. 4. Transformation of a learning table to an FSC.

A learning table is consistent if for each \(r_1,r_2\in R\) such that \(r_1\equiv r_2\), we have \(r_1\cdot e \equiv r_2\cdot e\) for all \(e\in Z\). Closure of a learning table guarantees that each transition – defined in the FSC by a lower row – leads to a valid node, i.e. the node corresponding to the equivalent upper row. Consistency, on the other hand, guarantees that the table unambiguously defines the output of a node in the FSC given an observation.
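
A sketch of how the two conditions can be checked, assuming the table entries are stored in a dictionary E indexed by (row, column) pairs, with rows and columns represented as tuples of observations; all names are ours and reappear in the learning-loop sketch at the end of Section 3.2.

```python
from typing import Dict, Hashable, Iterable, Tuple

Obs = Hashable
Row = Tuple[Obs, ...]                          # a row or column index: a string over Z

def row_signature(E: Dict[Tuple[Row, Row], object], r: Row, C: Iterable[Row]):
    """All entries of row r, ordered by the columns C."""
    return tuple(E[(r, c)] for c in C)

def is_closed(R, Z, C, E) -> bool:
    """Every lower row r.z agrees with some upper row on all columns."""
    upper_signatures = {row_signature(E, u, C) for u in R}
    return all(row_signature(E, r + (z,), C) in upper_signatures
               for r in R for z in Z)

def is_table_consistent(R, Z, C, E) -> bool:
    """Equivalent upper rows remain equivalent after appending any observation."""
    for r1 in R:
        for r2 in R:
            if row_signature(E, r1, C) == row_signature(E, r2, C):
                if any(row_signature(E, r1 + (z,), C) != row_signature(E, r2 + (z,), C)
                       for z in Z):
                    return False
    return True
```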

Using the notions of closure and consistency, we can define the transformation of a learning table into the learned FSC:

Definition 9 (Learned FSC)

Given a closed and consistent learning table \(\mathcal {T}= (R, C, \mathcal {E})\), we obtain a learned FSC \(\mathcal {F}_\mathcal {T}= (N_\mathcal {T}, \gamma _\mathcal {T}, \delta _\mathcal {T}, n_{0,\mathcal {T}})\) where:

  • \(N_\mathcal {T}= \{[r] \mid r\in R\}\), i.e., the nodes are the equivalence classes of upper rows of the table;

  • \(\gamma _\mathcal {T}([r],o) = \mathcal {E}(r,o)\) for all \(o\in Z\), i.e., the output of a transition is defined by its entry in the table;

  • \(\delta _\mathcal {T}([r],o) = [r \cdot o]\) for all \(r \in R, o\in Z\), i.e., the destination of a transition from node [r] with observation o is the node corresponding to the upper row equivalent to the lower row \(r \cdot o\);

  • \(n_{0,\mathcal {T}}= [\epsilon ]\), i.e., the initial node is \([\epsilon ]\).

Example 3

We demonstrate how to transform a table into an FSC in Fig. 4. The upper rows become nodes, the lower rows define the transitions. In this example, on both observations we stay in the same node and play actions a and b, respectively.
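
Under the same dictionary-based encoding as in the previous sketch, the construction of Definition 9 can be written as follows; representing a node by the signature of its upper row is our own implementation choice.

```python
def table_to_fsc(R, Z, C, E):
    """Build the learned FSC of Definition 9 from a closed and consistent table.
    A node [r] is represented by the signature (tuple of entries) of the row r."""
    def node(r):                                       # the equivalence class [r]
        return tuple(E[(r, c)] for c in C)

    nodes = {node(r) for r in R}
    gamma = {(node(r), z): E[(r, (z,))] for r in R for z in Z}    # outputs
    # by closedness, node(r + (z,)) equals the signature of some upper row
    delta = {(node(r), z): node(r + (z,)) for r in R for z in Z}  # transitions
    initial = node(())                                            # the class of epsilon
    return nodes, gamma, delta, initial
```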

3.2 Algorithm

We present our algorithm for learning an FSC from a strategy table. We have already seen the abstract view of the approach in Fig. 2. Algorithm 1 contains the pseudo-code for our learning algorithm. It consists of four main parts, also pictured in Fig. 2: initialisation of the learning table, closedness check, equivalence check, and minimisation.

First, we initialise the learning table. The columns are initially filled with all available observations \(Z\), i.e. we set \(C\leftarrow Z\). We start with a single upper row \(\epsilon \), representing the empty observation sequence. In the lower rows, we add the observation sequences of length 1. The entries of the table are then filled using output queries. For example, consider the strategy table in Table 1. The learning table after initialisation is shown in Table 2. The strategy table only contains observation sequences starting with i. Thus, for any sequence starting with b or y, all entries are \(\dag \).

After initialising the table, we check whether it is closed. If the table is not closed, all rows in the lower part of the table that do not occur in the upper part are moved to the upper part. Formally, we set \(R\leftarrow R\cup \{l\}\) for all \(l\in R\cdot Z\) with \(l \not \equiv u\) for all \(u\in R\). In our example, this means that we move the lower rows that have no equivalent upper row to the upper part of the table.

Algorithm 1. Pseudo-code of the FSC learning algorithm.

Once the table is closed (and naturally consistent), we check for each row of the given strategy table \(\mathcal {S}\) whether it coincides with the action provided for this observation sequence by our hypothesis FSC \(\mathcal {F}_{hyp}\). This is done formally using the equivalence query, i.e., we check if \(EQ_\mathcal {S}(\mathcal {F}_{hyp}) = \epsilon \). If our hypothesis is not correct, we get a counterexample \(c \in Z^+\) on which the outputs of \(\mathcal {S}\) and \(\mathcal {F}_{hyp}\) differ. We add all non-empty suffixes of c to \(C\) (keeping \(C\) suffix-closed) and fill the table. We repeat this until \(\mathcal {F}_{hyp}\) is equivalent to the strategy table \(\mathcal {S}\).

After the equivalence has been established, we use the “don’t-care” entries \(\dag \) to further minimise the FSC. These entries only appear for observation sequences that do not occur in the strategy table. Thus, changing them to any action does not change the FSC’s behaviour with respect to the strategy table. We use this fact to merge nodes of the FSC to obtain a smaller one that still captures the behaviour of the strategy table. It is not trivial to exploit “don’t-care” entries already during the learning phase: two upper rows that are compatible in terms of the outputs they suggest, i.e., they either agree or have a \(\dag \) where the other suggests an output, might be split when a new counterexample is added. As such, we postpone minimisation of the FSC until the learning is finished.
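
Putting the pieces together, the overall loop can be sketched as follows. It reuses row_signature, is_closed and table_to_fsc from the earlier sketches, leaves out the final \(\dag \)-based minimisation, and all implementation details beyond the structure of Section 3.2 are our own assumptions.

```python
def learn_fsc(Z, output_query, equivalence_query):
    """L*-style learning of an FSC from a strategy table (sketch of Section 3.2).
    output_query(seq) returns a distribution, "don't-know" or "don't-care";
    equivalence_query(fsc) returns None or a counterexample observation sequence."""
    C = [(z,) for z in Z]                    # columns: the single observations
    R = {()}                                 # upper rows: initially only epsilon
    E = {}                                   # table entries

    def fill():
        rows = set(R) | {r + (z,) for r in R for z in Z}   # upper and lower rows
        for r in rows:
            for c in C:
                E.setdefault((r, c), output_query(r + c))

    while True:
        fill()
        while not is_closed(R, Z, C, E):     # move unmatched lower rows upward
            R |= {r + (z,) for r in R for z in Z
                  if all(row_signature(E, r + (z,), C) != row_signature(E, u, C)
                         for u in R)}
            fill()
        hypothesis = table_to_fsc(R, Z, C, E)
        counterexample = equivalence_query(hypothesis)
        if counterexample is None:
            return hypothesis                # equivalent to the strategy table
        # refine: add all non-empty suffixes of the counterexample as new columns
        C += [counterexample[i:] for i in range(len(counterexample))
              if counterexample[i:] not in C]
```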

3.3 Proof of Concept: Belief Exploration

For integrating our learning approach with an existing POMDP solution framework, we need to consider how the strategy table is constructed. Assume that the solution method outputs some representation of a strategy. For strategies that are equivalent to some FSC, one possibility is to pre-compute the strategy table. However, it is not clear how to determine the length of observation sequences that need to be considered. A more reasonable view is to treat the strategy representation as a symbolic representation of the strategy table, as long as it permits computable output and equivalence queries.

We demonstrate how this works by considering the belief exploration framework of [11]. The idea of belief exploration is to explore (a fragment of) the belief MDP corresponding to the POMDP. Then, model checking techniques are used on this finite MDP to optimise objectives and find a strategy. States of the belief MDP are beliefs – distributions over states of the POMDP that describe the likelihood of being in a state given the observation history. The strategy output of the belief exploration is a memoryless deterministic strategy \(\pi _{bel}\) that maps each belief to the optimal action. It is well-known that there is a direct correspondence between strategies on the belief MDP and its POMDP [34]. A decision in a belief corresponds to a decision in the POMDP for all observation sequences that lead to the belief in the belief MDP. Thus, \(\pi _{bel}\) can also be interpreted as a strategy for the POMDP that we want to learn using our approach.

First, assume that the belief MDP is finite. Defining the computation of the output query is conceptually straightforward: during each output query, we search for the belief b that corresponds to the observation sequence in the belief MDP. If we find it, the output is \(\pi _{bel}(b)\); otherwise, the query outputs “don’t care” (\(\dag \)). For the equivalence query, we consider one representative observation sequence for each belief b. We compare whether \(\pi _{bel}(b)\) coincides with the output of the hypothesis FSC on the corresponding observation sequence. If not, this sequence is a counterexample. To deal with infinite belief MDPs, [11] employs a partial exploration of the reachable belief space of the POMDP. At the points where the exploration has been stopped (cut-off states), they use approximations based on pre-computed, small strategies on the POMDP to yield a finite abstraction of the belief MDP. The strategy \(\pi _{bel}\) computed on this abstraction, however, does not output valid actions for the POMDP in the cut-off states. We therefore modify the output query described above and introduce a set of \(\chi \) symbols, \(\chi _0,\ldots ,\chi _n\). On observation sequences of cut-off states, the output query returns the “don’t-know” symbol corresponding to that cut-off, i.e., \(\chi _i\) for cut-off strategy i. This allows us to later integrate the strategies used for approximation into our learned FSC or even substitute them by different ones.
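
As an illustration of how the output query can be answered symbolically in this setting, the following sketch walks the observation sequence through the belief space. The data layout (pi_bel as a mapping from explored beliefs to actions, a separate set of cut-off beliefs, a single \(\chi \) marker instead of the indexed \(\chi _i\)) is a simplifying assumption of ours.

```python
CHI, DAGGER = "chi", "dagger"

def freeze(belief):
    """Hashable representation of a belief (a distribution over POMDP states)."""
    return tuple(sorted(belief.items()))

def belief_output_query(obs_seq, P, O, s0, pi_bel, cutoff_beliefs):
    """Return pi_bel's action for the belief reached by obs_seq, chi at cut-offs,
    and dagger if the sequence cannot occur or leaves the explored fragment."""
    belief = {s0: 1.0}                       # obs_seq[0] is the initial observation
    for z in obs_seq[1:]:
        b = freeze(belief)
        if b in cutoff_beliefs:
            return CHI
        if b not in pi_bel:
            return DAGGER
        a = pi_bel[b]
        # Bayesian update: b'(s') proportional to sum_s b(s) * P(s, a, s') * [O(s') = z]
        new_belief = {}
        for s, p in belief.items():
            for s2, q in P.get((s, a), {}).items():
                if O[s2] == z:
                    new_belief[s2] = new_belief.get(s2, 0.0) + p * q
        total = sum(new_belief.values())
        if total == 0.0:
            return DAGGER                    # the observation sequence cannot occur
        belief = {s: p / total for s, p in new_belief.items()}
    b = freeze(belief)
    return CHI if b in cutoff_beliefs else pi_bel.get(b, DAGGER)
```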

3.4 Improving Learned FSCs for Incomplete Information

FSCs learned using the approach described in Section 3.2 may still contain transitions with output “don’t-know” (\(\chi \)). To make the FSC applicable to a POMDP, these outputs need to be replaced by distributions over actions of the POMDP. For this purpose, we suggest two heuristics. They are designed to be general, i.e., they do not use any information that the underlying POMDP solution method provides. Furthermore, they rely on the idea that already learned behavior can offer a basis for generalization. As a result, the information already present in the FSC is used to replace the “don’t-know” outputs. We note that additional heuristics could, for example, take the structure of the POMDP or information available in the POMDP solution method used to generate the strategy table into account. For illustrating the heuristics, we assume that all output distributions are Dirac. We denote the number of transitions in the FSC with observation o whose output is not \(\dag \) or \(\chi _i\) for some i by \(\#(o)\), and the number of transitions with output action a for o by \(\#(o,a)\).

  • Heuristic 1 – Distribution: Intuitively, this heuristic replaces “don’t know” by a distribution over all actions that the FSC already chooses for an observation (a code sketch follows after this list). The resulting FSC therefore represents a randomized strategy, i.e., the strategy may probabilistically choose between actions. This happens only in nodes of the FSC where “don’t know” occurs. Furthermore, this does not mean that the FSC itself is randomized; its structure remains deterministic. Only some outputs represent randomization over actions. In this method, we replace the ith “don’t know” \(\chi _i\) by an action distribution where the probability of action a under observation o is given by \(\frac{\#(o,a)}{\#(o)}\). If \(\#(o) = 0\), we keep \(\chi _i\) instead, which, in the belief exploration approach of Storm, represents a precomputed cut-off strategy. In approaches where the strategy does not provide any information at all, it can be replaced by \(\dag \). Intuitively, we try to copy the behavior of the FSC for an observation and, since the optimal action is unknown, we use a distribution over all possible actions.

  • Heuristic 2 – Minimizing Using \(\dag \)-transitions: As smaller FSCs are preferable for ease of implementation and explainability, this heuristic aims at replacing \(\chi _i\) outputs such that we can minimise the FSC as much as possible. For this purpose, we simply replace all occurrences of \(\chi _i\) by \(\dag \), i.e., we replace “don’t-know” by “don’t-care” outputs. This allows the FSC to behave arbitrarily on these transitions. By then applying an additional minimisation step, we can potentially reduce the size of our FSC. Intuitively, this allows for a smaller FSC that might be able to generalize better than specifying all actions directly. Note that this heuristic transforms any deterministic FSC into a smaller representation that is still deterministic and does not introduce any randomization.
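
The following is a sketch of Heuristic 1 applied to the learned FSC's action mapping, assuming Dirac outputs as above; collapsing the indexed \(\chi _i\) into a single marker, and all names, are our simplifications.

```python
from collections import Counter, defaultdict

def apply_heuristic_1(gamma, CHI="chi", DAGGER="dagger"):
    """Heuristic 1 (sketch): replace each "don't-know" output by the empirical
    distribution of actions the FSC already plays for the same observation.
    gamma maps (node, observation) to a Dirac action, CHI or DAGGER."""
    counts = defaultdict(Counter)                 # observation -> action -> #occurrences
    for (node, obs), out in gamma.items():
        if out not in (CHI, DAGGER):
            counts[obs][out] += 1

    new_gamma = {}
    for (node, obs), out in gamma.items():
        if out == CHI and counts[obs]:
            total = sum(counts[obs].values())               # = #(o)
            new_gamma[(node, obs)] = {a: n / total          # = #(o, a) / #(o)
                                      for a, n in counts[obs].items()}
        else:
            new_gamma[(node, obs)] = out                    # keep CHI if #(o) = 0
    return new_gamma
```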

4 Experimental Evaluation

We implemented a prototype of the policy automaton learning framework on top of version 1.8.1 of the probabilistic model checker Storm [20]. As input, our implementation takes the belief MC induced by the optimal policy on the belief MDP abstraction computed by Storm’s belief exploration for POMDPs [11]. This Markov chain, labeled with observations and actions chosen by the computed strategy, encodes all information necessary for our approach as described in Section 3.3. We apply our learning techniques to obtain a finite-state controller representation of a policy. This FSC can be exported into a human-readable format or analyzed by building the Markov chain induced by the learned policy directly on the POMDP. As a baseline comparison for the learned FSC, we use the tool PAYNT [4]. Recall that PAYNT uses a technique called inductive synthesis to directly synthesize FSCs with respect to a given objective.

Recent research has shown that PAYNT’s performance greatly improves when working in tandem with belief exploration [2]. As such, the comparison made here does not show the full capabilities of PAYNT. The tandem approach is likely to outperform our approach in many cases. We want to, however, show a comparison of our approach with a more basic, and thus more comparable, method. We emphasize furthermore that integrating our approach in the framework of [2] is a promising prospect for future work.

Setup The experiments are run on two cores of an Intel® Xeon® Platinum 8160 CPU using 64GB RAM and a time limit of 1 hour. We run Storm’s POMDP model checking framework using default parameters. In particular, we use the heuristically determined exploration depth for the belief MDP approximation and apply cut-offs where we choose not to explore further. We refer to [11] for more information. For PAYNT, we use abstraction-refinement with multi-core support [3]. We run experiments for the two heuristics described in Section 3.4. Additionally, we provide another result described as the “base” approach. This is specific to the input given by Storm and encodes the strategy obtained from Storm exactly by keeping the cut-off strategies, represented as \(\chi _i\) (see the extended version of this paper [8] for more technical details).

Benchmarks As benchmarks for our evaluation, we consider the models from [2]. The benchmark set contains models taken from the literature [3, 10, 11, 17] meant to illustrate the strengths and weaknesses of the belief exploration and inductive synthesis approaches. As such, they also showcase how our learning approach transforms the output of the belief exploration concerning the size and quality of the computed FSC. An overview of the used benchmarks is available as part of the extended version of this paper [8].

Fig. 5. Comparison of the resulting FSC size

4.1 Results

Our approach is general and meant to be used on top of other algorithms to transform possibly big and hardly explainable strategies into small FSCs. However, we want to explore whether our results are comparable to state-of-the-art work for directly learning FSCs. Therefore, we compare our FSCs to PAYNT.

First, we talk about the size of the FSC generated by our method compared to the MC generated by Storm and the FSCs generated by PAYNT. Secondly, we show the scalability of our approach by comparing the runtime with PAYNT. Lastly, we discuss the quality of the synthesized FSCs compared to PAYNT and also discuss the trade-off between runtime and quality of the FSC.

Fig. 6. Runtime comparison: our approach vs. PAYNT

Small and Explainable FSCs Given a strategy table, our approach results in the smallest possible FSC for the represented strategy. As an overview, in Fig. 5a, we show a comparison of the sizes of the belief MC from Storm to the sizes of our FSCs. The dashed line corresponds to a 10-fold reduction in size, showing our approach’s usefulness. We generate FSCs of sizes 1 to 64; however, more than 80% of the FSCs have fewer than ten nodes, and only two are bigger than 60. More than half of the generated FSCs have fewer than four nodes. In one case, we reduce an MC with 4517 states to an FSC of size 12.

We claim that these concise representations can generally be considered explainable, in particular when compared to huge original strategy representations. When the given strategy is deterministic, our learning approach constructs a deterministic FSC, which is easy to explain. While improving the FSC by replacing the “don’t know” actions (Section 3.4), heuristic 2 keeps the FSC deterministic as it only replaces the \(\chi \) actions with \(\dag \) actions before minimisation. Heuristic 1 often introduces some randomization when it replaces the \(\chi \) actions with a distribution. But even in that case, randomization appears only in selected sink states, which does not impede explainability.

In Fig. 5b, we provide the size comparison of PAYNT’s FSCs and ours. Our FSCs are slightly bigger than PAYNT’s in general, but our approach also returns smaller FSCs in some cases. This is to be expected since the approach of PAYNT iteratively searches through the space of FSCs, starting with only one memory node and adding memory only once it is necessary. Therefore, it is geared towards finding the smallest possible FSC. However, PAYNT times out much more often because of its exhaustive search over small FSCs. Additionally, our FSCs are necessarily as big as needed to represent the given strategy. Let us consider the benchmark grid-avoid-4-0. In this model, a robot moves in a grid of size four by four with no knowledge about its position. It starts randomly at any place in the grid and has to move towards a goal without falling into a “hole”. PAYNT produces a strategy of size 3, which moves right once and then alternates between moving right and down. The nature of Storm’s exploration leads to a strategy that moves right three times and then down forever, which requires an FSC of size at least 5.

Scalability Regarding scalability, Fig. 6 shows that our approach outperforms PAYNT in almost all cases. The dotted lines show differences by a factor of 10. There are only two benchmarks for which our approach times out and PAYNT does not. In one of these cases, PAYNT also takes more than 2000s to produce a result.

Table 3. Comparison to PAYNT on value, size, and time (in that order) on selected benchmarks. The reported time for our approach includes the time of Storm for producing the strategy table and the time for learning the FSC.

Runtime and Quality of FSCs When comparing the quality of results, we need to take into account that our approach often runs within a fraction of the available time. We run Storm with its default values to get a strategy. As demonstrated in [3], running Storm with non-default parameters, specifically larger exploration thresholds, results in better strategies at the cost of longer runtimes. Our approach directly profits from such better input strategies.

Since the learning is done in far less than a second for most of the benchmarks, we suggest using a portfolio of the heuristics. This allows us to output the best solution among all our heuristics with negligible computational overhead. To simplify the presentation of our results, we categorize the benchmarks into three groups, A, B, and C, based on the overall performance of our method. Due to space constraints, we provide detailed results for only a selection of benchmarks in each category and do not discuss benchmarks for which both approaches experienced timeouts. The complete set of results is given in [8].

Category A. This category represents benchmarks where our approach is arguably favored, assuming the portfolio approach. There are a total of 19 benchmarks in this category, and we observe that we can improve all variants of properties using the heuristics. Only once does PAYNT produce a slightly better probability value (0.93 vs 0.9), but it takes significantly more time (726s vs \(<1\)s). There are 7 cases where we can generate FSCs while PAYNT times out, and in 6 out of these 7 cases, we obtain the smallest FSCs reported in the state of the art [2]. In this category, we also include benchmarks on which the heuristics improved on Storm’s strategy to achieve the same value as PAYNT while being more efficient, e.g., problem-paynt-storm-combined. Also, for the benchmark problem-storm-extended, designed to be difficult for Storm, we reduce the approximate total reward from 3009 to 98, resulting in an FSC of size 1 in \(<1\)s.

Category B. This category contains benchmarks on which there is no clear front-runner. There are a total of 7 benchmarks in this category. Three of these benchmarks are similar to posterior-awareness, where the results produced and the time taken are quite similar for both approaches. The other 4 benchmarks (similar to 4x5x2-95) show that the value generated by our approach is significantly worse; however, it takes significantly less time. Depending on the situation, this trade-off between quality and runtime may favor either approach.

Category C. This category shows the weakness of our method compared to PAYNT. In this category, there are a total of 3 benchmarks, out of which our approach times out 2 times. It is notable that the drone-benchmarks seem to be generally hard: PAYNT needs 2250s for drone-4-1, and both approaches time out for the bigger instances. There is only one benchmark, query-s2, where we produce a worse value without any significant time advantage over PAYNT.

5 Conclusion

In this paper, we present an approach to learn an FSC representing a given POMDP strategy. Our FSCs are (i) always smaller than the given representation and (ii) structurally simple, which together increases the strategy’s explainability. The structure of the FSC is always deterministic. Additionally, one of our heuristics only generates deterministic output actions (without randomization); the other typically represents a randomized strategy. However, only the output actions are randomized, not the FSC structure, and this randomization happens only in a very restricted form. Further, our heuristics achieved notable improvements in the performance of many strategies produced by Storm and provably perform at least as well as the baseline, while retaining negligible resource consumption. Altogether, our comparison against PAYNT underscores the competitiveness of our method, frequently yielding FSCs of comparable quality with significant improvements in terms of runtime and size.

This attests to the scalability and efficiency of our approach and also highlights its applicability in scenarios challenging for other tools.

Concerning future work, several directions open up. Further heuristics can be designed to address some of the patterns occurring in the cases where our approach could not match the size achieved by PAYNT. Furthermore, we would like to integrate our approach into other approaches in order to improve them; in particular, the tandem synthesis approach from [2] is a suitable candidate.