Search and Explore: Symbiotic Policy Synthesis in POMDPs

Andriushchenko, Roman; Bork, Alexander; Češka, Milan; Junges, Sebastian; Katoen, Joost-Pieter; Macák, Filip

doi:10.1007/978-3-031-37709-9_6

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13966))

Included in the following conference series:

International Conference on Computer Aided Verification

2130 Accesses
3 Citations

Abstract

This paper marries two state-of-the-art controller synthesis methods for partially observable Markov decision processes (POMDPs), a prominent model in sequential decision making under uncertainty. A central issue is to find a POMDP controller—that solely decides based on the observations seen so far—to achieve a total expected reward objective. As finding optimal controllers is undecidable, we concentrate on synthesising good finite-state controllers (FSCs). We do so by tightly integrating two modern, orthogonal methods for POMDP controller synthesis: a belief-based and an inductive approach. The former method obtains an FSC from a finite fragment of the so-called belief MDP, an MDP that keeps track of the probabilities of equally observable POMDP states. The latter is an inductive search technique over a set of FSCs, e.g., controllers with a fixed memory size. The key result of this paper is a symbiotic anytime algorithm that tightly integrates both approaches such that each profits from the controllers constructed by the other. Experimental results indicate a substantial improvement in the value of the controllers while significantly reducing the synthesis time and memory footprint.

This work has been supported by the Czech Science Foundation grant GA23-06963S (VESCAA), the ERC AdG Grant 787914 (FRAPPANT) and the DFG RTG 2236/2 (UnRAVeL).

You have full access to this open access chapter, Download conference paper PDF

Enforcing Almost-Sure Reachability in POMDPs

POMDP Controllers with Optimal Budget

Strategy Synthesis in Markov Decision Processes Under Limited Sampling Access

1 Introduction

A formidable synthesis challenge is to find a decision-making policy that satisfies temporal constraints even in the presence of stochastic noise. Markov decision processes (MDPs) [26] are a prominent model to reason about such policies under stochastic uncertainty. The underlying decision problems are efficiently solvable and probabilistic model checkers such as PRISM [22] and Storm [13] are well-equipped to synthesise policies that provably (and optimally) satisfy a given specification. However, a major shortcoming of MDPs is the assumption that the policy can depend on the precise state of a system. This assumption is unrealistic whenever the state of the system is only observable via sensors. Partially observable MDPs (POMDPs) overcome this shortcoming, but policy synthesis for POMDPs and specifications such as the probability to reach the exit is larger than 50$\%$ requires solving undecidable problems [23]. Nevertheless, in recent years, a variety of approaches have been successfully applied to a variety of challenging benchmarks, but the approaches also fail somewhat spectacularly on seemingly tiny problem instances. From a user perspective, it is hard to pick the right approach without detailed knowledge of the underlying methods. This paper sets out to develop a framework in which conceptually orthogonal approaches symbiotically alleviate each other’s weaknesses and find policies that maximise, e.g., the expected reward before a target is reached. We show empirically that the combined approach can find compact policies achieving a significantly higher reward than the policies that either individual approach constructs.

Belief Exploration. Several approaches for solving POMDPs use the notion of beliefs [27]. The key idea is that each sequence of observations and actions induces a belief—a distribution over POMDP states that reflects the probability to be in a state conditioned on the observations. POMDP policies can decide optimally solely based on the belief. The evolution of beliefs can be captured by a fully observable, yet possibly infinite belief MDP. A practical approach (see the lower part of Fig. 1) is to unfold a finite fragment of this belief MDP and make its frontier absorbing. This finite fragment can be analysed with off-the-shelf MDP model checkers. Its accuracy can be improved by using an arbitrary but fixed cut-off policy from the frontier onwards. Crucially, the probability to reach the target under such a policy can be efficiently pre-computed for all beliefs. This paper considers the belief exploration method from [8] realised in Storm [13].

Policy Search. An orthogonal approach searches a (finite) space of policies [14, 24] and evaluates these policies by verifying the induced Markov chain. To ensure scalability, sets of policies must be efficiently analysed. However, policy spaces explode whenever they require memory. The open challenge is to adequately define the space of policies to search in. In this paper, we consider the policy-search method from [5] as implemented in Paynt [6] that explores spaces of finite-state controllers (FSCs), represented as deterministic Mealy machines [2], using a combination of abstraction-refinement, counterexamples (to prune sets of policies), and increasing a controller’s memory, see the upper part of Fig. 1.

Our Symbiotic Approach. In essence, our idea relies on the fact that a policy found via one approach can boost the other approach. The key observation is that such a policy is beneficial even when it is sub-optimal in terms of the objective at hand. Figure 1 sketches the symbiotic approach. The FSCs $F_{\mathcal {I}}$ obtained by policy search are used to guide the partial belief MDP to the target. Vice versa, the FSCs $F_{\mathcal {B}}$ obtained from belief exploration are used to shrinken the set of policies and to steer the abstraction. Our experimental evaluation, using a large set of POMDP benchmarks, reveals that (a) belief exploration can yield better FSCs (sometimes also faster) using FSCs $F_{\mathcal {I}}$ from Paynt—even if the latter FSCs are far from optimal, (b) policy search can find much better FSCs when using FSCs from belief exploration, and (c) the FSCs from the symbiotic approach are superior in value to the ones obtained by the standalone approaches.

Beyond Exploration and Policy Search. In this work, we focus on two powerful orthogonal methods from the set of belief-based and search-based methods. Alternatives exist. Exploration can also be done using a fixed set of beliefs [25]. Prominently, HSVI [18] and SARSOP [20] are belief-based policy synthesis approaches typically used for discounted properties. They also support undiscounted properties, but represent policies with $\alpha $-vectors. Bounded policy synthesis [29] uses a combination of belief-exploration and inductive synthesis over paths and addresses finite horizon reachability. $\alpha $-vector policies lead to more complex analysis downstream: the resulting policies must track the belief and do floating-point computations to select actions. For policy search, prominent alternatives are to search for randomised controllers via gradient descent [17] or via convex optimization [1, 12, 19]. Alternatively, FSCs can be extracted via deep reinforcement learning [9]. However, randomised policies limit predictability, which hampers testing and explainability. The area of programmatic reinforcement learning [28] combines inductive synthesis ideas with RL. While our empirical evaluation is method-specific, the lessons carry over to integrating other methods.

Contributions. The key contribution of this paper is the symbiosis of belief exploration [8] and policy search [5]. Though this seems natural, various technical obstacles had to be addressed, e.g., obtaining $F_{\mathcal {B}} $ from the finite fragment of the belief MDP and the policies for its frontier and developing an interplay between the exploration and search phases that minimises the overhead. The benefits of the symbiotic algorithm are manifold, as we show by a thorough empirical evaluation. It can solve POMDPs that cannot be tackled with either of the two approaches alone. It outputs FSCs that are superior in value (with relative improvements of up to 40%) as well as FSCs that are more succinct (with reduction of a factor of up to two orders of magnitude) with only a small penalty in their values. Additionally, the integration reduces the memory footprint compared to belief exploration by a factor of 4. In conclusion, the proposed symbiosis offers a powerful push-button, anytime synthesis algorithm producing, in the given time, superior and/or more succinct FSCs compared to the state-of-the-art methods.

2 Motivating Examples

We give a sample POMDP that is hard for the belief exploration, a POMDP that challenges the policy search approach, and indicate why a symbiotic approach overcomes this. A third sample POMDP is shown to be unsolvable by either approach alone but can be treated by the symbiotic one.

A Challenging POMDP for Belief-Based Exploration. Consider POMDP $\mathcal {M}_a$ in Fig. 2a. The objective is to minimise the expected number of steps to the target $T_a$. An optimal policy is to always take action $\alpha $ yielding 4 expected steps. An FSC realising this policy can be found by a policy search under 1s.

Belief MDPs. States in the belief MDP $\mathcal {M}^{\mathcal {B}}_a$ are beliefs, probability distributions over POMDP states with equal observations. The initial belief is $\{S \mapsto 1\}$. By taking action $\alpha $, ‘yellow’ is observed and the belief becomes $\{L \mapsto \frac{1}{2},\, R \mapsto \frac{1}{2}\}$. Closer inspection shows that the set of reachable beliefs is infinite rendering $\mathcal {M}^{\mathcal {B}}_a$ to be infinite. Belief exploration constructs a finite fragment $\overline{\mathcal {M}^{\mathcal {B}}_a}$ by exploring $\mathcal {M}^{\mathcal {B}}_a$ up to some depth while cutting off the frontier states. From cut-off states, a shortcut is taken directly to the target. These shortcuts are heuristic over-approximations of the true number of expected steps from the cut-off state to the target. The finite MDP $\overline{\mathcal {M}^{\mathcal {B}}_a}$ can be analysed using off-the-shelf tools yielding the minimising policy ${\sigma _{\mathcal {B}}}$ assigning to each belief state the optimal action.

Admissible Heuristics. A simple way to over-approximate the minimal number of the expected number of steps to the target is to use an arbitrary controller $\overline{F}$ and use the expected number of steps under $\overline{F}$. The latter is cheap if $\overline{F}$ is compact, as detailed in Sect. 4.2. Figure 2c shows a Markov chain induced by ${\sigma _{\mathcal {B}}}$ in $\overline{\mathcal {M}^{\mathcal {B}}_a}$, where the belief $\{L \mapsto \frac{7}{8}, R \mapsto \frac{1}{8} \}$ is cut off using $\overline{F}$. The belief exploration in Storm [8] unfolds 1000 states of $\mathcal {M}^{\mathcal {B}}_a$ and finds controller $\overline{F}$ that uniformly randomises over all actions in the rightmost state. The resulting sub-optimal controller $F_{\mathcal {B}}$ reaches the target in ${\approx } 4.1$ steps. Exploring only a few states suffices when replacing $\overline{F}$ by a (not necessarily optimal) FSC provided by a policy search.

A Challenging POMDP for Policy Search. Consider POMDP $\mathcal {M}_b$ in Fig. 2b. The objective is to minimise the expected number of steps to $T_b$. Its 9-state belief MDP $\mathcal {M}^{\mathcal {B}}_b$ is trivial for the belief-based method. Its optimal controller ${\sigma _{\mathcal {B}}}$ first picks action $\gamma $; on observing ‘yellow’ it plays $\beta $ twice, otherwise it always picks $\alpha $. This is realised by an FSC with 3 memory states. The inductive policy search in Paynt [5] explores families of FSCs of increasing complexity, i.e., of increasing memory size. It finds the optimal FSC after consulting about 20 billion candidate policies. This requires 545 model-checking queries; the optimal one is found after 105 queries while the remaining queries prove that no better 3-state FSC exists.

Reference Policies. The policy search is guided by a reference policy, in this case the fully observable MDP policy that picks (senseless) action $\delta $ in $B_1$ first. Using policy ${\sigma _{\mathcal {B}}}$—obtained by the belief method—instead, $\delta $ is never taken. As ${\sigma _{\mathcal {B}}}$ picks in each ‘blue’ state a different action, mimicking this requires at least three memory states. Using ${\sigma _{\mathcal {B}}}$ reduces the total number of required model-checking queries by a factor of ten; the optimal 3-state FSC is found after 23 queries.

The Potential of Symbiosis. To further exemplify the limitation of the two approaches and the potential of their symbiosis, we consider a synthetic POMDP, called Lanes+, combining a Lane model with larger variants of the POMDPs in Fig. 2; see Table 2 on page 14 for the model statistics and Appendix C of [3] for the model description. We consider minimisation of the expected number of steps and a 15-min timeout. The belief-based approach by Storm yields the value 18870. The policy search method by Paynt finds an FSC with 2 memory states achieving the value 8223. This sub-optimal FSC significantly improves the belief MDP approximation and enables Storm to find an FSC with value 6471. The symbiotic synthesis loop finds the optimal FSC with value 4805.

3 Preliminaries and Problem Statement

A (discrete) distribution over a countable set A is a function $\mu :A \rightarrow [0,1]$ s.t. $\sum _a \mu (a) = 1$. The set $\textrm{supp}(\mu ) {:}{=}\left\{ a \in A \mid \mu (a) > 0\right\} $ is the support of $\mu $. The set Distr(A) contains all distributions over A. We use Iverson bracket notation, where $\left[ x\right] = 1$ if the Boolean expression x evaluates to true and $\left[ x\right] = 0$ otherwise.

Definition 1 (MDP)

A Markov decision process (MDP) is a tuple $M = (S,s_{0},Act,\mathcal {P})$ with a countable set S of states, an initial state $s_{0}\in S$, a finite set $Act$ of actions, and a partial transition function $\mathcal {P}:S \times Act\nrightarrow Distr(S)$. $Act(s) {:}{=}\left\{ \alpha \in Act\mid \mathcal {P}(s,\alpha ) \ne \bot \right\} $ denotes the set of actions available in state $s \in S$. An MDP with $|Act(s)|=1$ for each $s \in S$ is a Markov chain (MC).

Unless stated otherwise, we assume $Act(s) = Act$ for each $s \in S$ for conciseness. We denote $\mathcal {P}(s,\alpha ,s') {:}{=}\mathcal {P}(s,\alpha )(s')$. A (finite) path of an MDP M is a sequence $\pi = s_0\alpha _0s_1\alpha _1\dots s_n$ where ${\mathcal {P}(s_i,\alpha _i,s_{i+1}) > 0}$ for $0 \le i < n$. We use $last(\pi )$ to denote the last state of path $\pi $. Let $Paths^M$ denote the set of all finite paths of M. State s is absorbing if $\textrm{supp}(\mathcal {P}(s, \alpha )) = \{ s \}$ for all $\alpha \in Act$.

Definition 2 (POMDP)

A partially observable MDP (POMDP) is a tuple $\mathcal {M}= (M,Z,O)$, where M is the underlying MDP, Z is a finite set of observations and $O :S \rightarrow Z$ is a (deterministic) observation function.

For POMDP $\mathcal {M}$ with underlying MDP M, an observation trace of path $\pi = s_0\alpha _0s_1\alpha _1\dots s_n$ is a sequence $O(\pi ) {:}{=}O(s_0)\alpha _0 O(s_1)\alpha _1 \dots O(s_n)$. Every MDP can be interpreted as a POMDP with $Z= S$ and $O(s) = s$ for all $s \in S$.

A (deterministic) policy is a function $\sigma :Paths^M \rightarrow Act$. Policy $\sigma $ is memoryless if $last(\pi ) = last(\pi ') \Longrightarrow \sigma (\pi ) = \sigma (\pi ')$ for all $\pi ,\pi ' \in Paths^M$. A memoryless policy $\sigma $ maps a state $s \in S$ to action $\sigma (s)$. Policy $\sigma $ is observation-based if $O(\pi ) = O(\pi ') \Longrightarrow \sigma (\pi ) = \sigma (\pi ')$ for all $\pi ,\pi ' \in Paths^M$. For POMDPs, we always consider observation-based policies. We denote by $\varSigma _{ obs }$ the set of all observation-based policies. A policy $\sigma \in \varSigma _{ obs }$ induces the MC $\mathcal {M}^{\sigma }$.

We consider indefinite-horizon reachability or expected total reward properties. Formally, let $M=(S,s_{0},Act,\mathcal {P})$ be an MC, and let $T \subseteq S$ be a set of target states. $\mathbb {P}^{M}\left[ s \models \Diamond T\right] $ denotes the probability of reaching T from state $s \in S$. We use $\mathbb {P}^{M}\left[ \Diamond T\right] $ to denote $\mathbb {P}^{M}\left[ s_{0}\models \Diamond T\right] $ and omit the superscript if the MC is clear from context. Now assume POMDP $\mathcal {M}$ with underlying MDP $M = (S,s_{0},Act,\mathcal {P})$, and a set $T \subseteq S$ of absorbing target states. Without loss of generality, we assume that the target states are associated with the unique observation $z^T \in Z$, i.e. $s \in T $ iff $O(s) = z^T$. For a POMDP $\mathcal {M}$ and $T \subseteq S$, the maximal reachability probability of T for state $s \in S$ in $\mathcal {M}$ is $\mathbb {P}^{\mathcal {M}}_{\max }\left[ s \models \Diamond T\right] {:}{=}\sup _{\sigma \in \varSigma _{ obs }} \mathbb {P}^{\mathcal {M}^{\sigma }}\!\left[ s \models \Diamond {T}\right] $. The minimal reachability probability $\mathbb {P}^{\mathcal {M}}_{\min }\left[ s \models \Diamond T\right] $ is defined analogously.

Finite-state controllers are automata that compactly encode policies.

Definition 3 (FSC)

A finite-state controller (FSC) is a tuple $F = (N,n_{0},\gamma ,\delta )$, with a finite set N of nodes, the initial node $n_{0}\in N$, the action function $\gamma :N \times Z \rightarrow Act$ and the update function $\delta :N \times Z \times Z \rightarrow N$.

A k-FSC is an FSC with $|N|=k$. If $k{=}1$, the FSC encodes a memoryless policy. We use $\mathcal {F}^\mathcal {M}$ (${\mathcal {F}^\mathcal {M}_k}$) to denote the family of all (k-)FSCs for POMDP $\mathcal {M}$. For a POMDP in state s, an agent receives observation $z= O(s)$. An agent following an FSC F executes action $\alpha = \gamma (n,z)$ associated with the current node n and the current (prior) observation z. The POMDP state is updated accordingly to some $s'$ with $\mathcal {P}(s, \alpha , s') > 0$. Based on the next (posterior) observation $z' = O(s')$, the FSC evolves to node $n'=\delta (n,z,z')$. The induced MC for FSC F is $\mathcal {M}^F= (S\times N, (s_{0},n_{0}), \{\alpha \}, \mathcal {P}^F)$, where for all $(s,n),(s',n') \in S \times N$ we have

$$\begin{aligned} \mathcal {P}^F \left( (s,n),\alpha ,(s',n') \right) = \left[ n' = \delta \left( n,O(s),O(s')\right) \right] \cdot \mathcal {P}(s,\gamma (n,O(s)),s'). \end{aligned}$$

We emphasise that for MDPs with infinite state space and POMDPs, an FSC realising the maximal reachability probability generally does not exist. For FSC $F \in \mathcal {F}^\mathcal {M}$ with the set N of memory nodes, let $\mathbb {P}^{\mathcal {M}^{F}}\!\left[ (s,n) \models \Diamond {T}\right] {:}{=}\mathbb {P}^{\mathcal {M}^F}\left[ {(s,n) \models \Diamond (T\times N)}\right] $ denote the probability of reaching target states T from state $(s,n) \in S \times N$. Analogously, $\mathbb {P}^{\mathcal {M}^{F}}\!\left[ \Diamond {T}\right] {:}{=}\mathbb {P}^{\mathcal {M}^F}\left[ \Diamond (T\times N)\right] $ denotes the probability of reaching target states T in the MC $\mathcal {M}^F$ induced on $\mathcal {M}$ by F.

Problem Statement. The classical synthesis problem [23] for POMDPs asks: given POMDP $\mathcal {M}$, a set T of targets, and a threshold $\lambda $, find an FSC F such that $\mathbb {P}^{\mathcal {M}^{F}}\!\left[ \Diamond {T}\right] \ge \lambda $, if one exists. We take a more practical stance and aim instead to optimise the value $\mathbb {P}^{\mathcal {M}^{F}}\!\left[ \Diamond {T}\right] $ in an anytime fashion: the faster we can find FSCs with a high value, the better.

Remark 1

Variants of the maximising synthesis problem for the expected total reward and minimisation are defined analogously. For conciseness, in this paper, we always assume that we want to maximise the value.

In addition to the value of the FSC F, another key characteristic of the controller is its size, which we treat as a secondary objective and discuss in detail in Sect. 6.

4 FSCs for and from Belief Exploration

We consider belief exploration as described in [8]. A schematic overview is given in the lower part of Fig. 1. We recap the key concepts of belief exploration. This section explains two contributions: we discuss how arbitrary FSCs are included and present an approach to export the associated POMDP policies as FSCs.

4.1 Belief Exploration with Explicit FSC Construction

Finite-state controllers for a POMDP can be obtained by analysing the (fully observable) belief MDP [27]. The state space of this MDP consists of beliefs: probability distributions over states of the POMDP $\mathcal {M}$ having the same observation. Let $S_z{:}{=}\{s \in S \mid O(s) = z\}$ denote the set of all states of $\mathcal {M}$ with observation $z\in Z$. Let the set of all beliefs $\mathcal {B}_\mathcal {M}{:}{=}\bigcup _{z \in Z} Distr(S_z)$ and denote for $b \in \mathcal {B}_\mathcal {M}$ by $O(b) \in Z$ the unique observation O(s) of any $s \in \textrm{supp}(b)$.

In a belief b, taking action $\alpha $ yields an updated belief as follows: let $\mathcal {P}(b,\alpha ,z') {:}{=}\sum _{s \in S_{O(b)}}b(s) \cdot \sum _{s'\in S_{z'}}\mathcal {P}(s,\alpha ,s')$ denote the probability of observing $z' \in Z$ upon taking action $\alpha \in Act$ in belief $b \in \mathcal {B}_\mathcal {M}$. If $\mathcal {P}(b,\alpha ,z')>0$, the corresponding successor belief $b'=\llbracket b {\mid } \alpha ,z'\rrbracket $ with $O(b')=z'$ is defined component-wise as

$$ \llbracket b {\mid } \alpha ,z'\rrbracket (s') {:}{=}\frac{\sum _{s \in S_{O(b)}}b(s)\cdot \mathcal {P}(s,\alpha ,s')}{\mathcal {P}(b,\alpha ,z')} $$

for all $s' \in S_{z'}$. Otherwise, $\llbracket b {\mid } \alpha ,z'\rrbracket $ is undefined.

Definition 4 (Belief MDP)

The belief MDP of POMDP $\mathcal {M}$ is the MDP $\mathcal {M}^{\mathcal {B}}= (\mathcal {B}_\mathcal {M}, b_{0}, Act, \mathcal {P}^\mathcal {B})$, with initial belief $b_{0}{:}{=}\{s_{0}\mapsto 1\}$ and transition function $\mathcal {P}^\mathcal {B}(b,\alpha ,b') {:}{=}\left[ b'=\llbracket b {\mid } \alpha ,z'\rrbracket \right] \cdot \mathcal {P}(b,\alpha ,z')$ where $z'=O(b')$.

The belief MDP captures the behaviour of its POMDP. It can be unfolded by starting in the initial belief and computing all successor beliefs.

Deriving FSCs from Finite Belief MDPs. Let $T^\mathcal {B}{:}{=}\left\{ b \in \mathcal {B}_\mathcal {M}\mid O(b) = z^T\right\} $ denote the set of target beliefs. If the reachable state space of the belief MDP $\mathcal {M}^{\mathcal {B}}$ is finite, e.g. because the POMDP is acyclic, standard model checking techniques can be applied to compute the memoryless policy ${\sigma _{\mathcal {B}}}:\mathcal {B}_\mathcal {M}\rightarrow Act$ that selects in each belief state ${b \in \mathcal {B}_\mathcal {M}}$ the action that maximises $\mathbb {P}^{}\left[ b \models \Diamond T^\mathcal {B}\right] $^{Footnote 1}. We can translate the deterministic, memoryless policy ${\sigma _{\mathcal {B}}}$ into the corresponding FSC $F_{\mathcal {B}} = \left( \mathcal {B}_\mathcal {M}, b_{0}, \gamma , \delta \right) $ with action function $\gamma (b, z) = {\sigma _{\mathcal {B}}}(b)$ and update function $\delta (b, z, z') = \llbracket b {\mid } {\sigma _{\mathcal {B}}}(b),z'\rrbracket $ for all $z,z' \in Z$.^{Footnote 2}

Handling Large and Infinite Belief MDPs. In case the reachable state space of the belief MDP $\mathcal {M}^{\mathcal {B}}$ is infinite or too large for a complete unfolding, a finite approximation $\overline{\mathcal {M}^{\mathcal {B}}}$ is used instead [8]. Assuming $\mathcal {M}^{\mathcal {B}}$ is unfolded up to some depth, let $\mathcal {E}\subset \mathcal {B}_\mathcal {M}$ denote the set of explored beliefs and let $\mathcal {U}\subset \mathcal {B}_\mathcal {M}{\setminus } \mathcal {E}$ denote the frontier: the set of unexplored beliefs reachable from $\mathcal {E}$ in one step. To complete the finite abstraction, we require handling of the frontier beliefs. The idea is to use for each $b \in \mathcal {U}$ a cut-off value $\underline{V}(b)$: an under-approximation of the maximal reachability probability $\mathbb {P}^{\mathcal {M}^{\mathcal {B}}}_{\max }\left[ b \models \Diamond T^\mathcal {B}\right] $ for b in the belief MDP. We explain how to compute cut-off values systematically given an FSC in Sect. 4.2.

Ultimately, we define a finite MDP $\overline{\mathcal {M}^{\mathcal {B}}}= (\mathcal {E}\cup \mathcal {U}\cup \{b_{\top },b_{\bot }\}, b_{0}, Act, \overline{\mathcal {P}^\mathcal {B}})$ with the transition function: $\overline{\mathcal {P}^\mathcal {B}}(b,\alpha ) {:}{=}\mathcal {P}^\mathcal {B}(b,\alpha )$ for explored beliefs $b \in \mathcal {E}$ and all $\alpha \in Act$, and $\overline{\mathcal {P}^\mathcal {B}}(b,\alpha ) {:}{=}\{b_{\top }\mapsto \underline{V}(b), b_{\bot }\mapsto 1-\underline{V}(b)\}$ for frontier beliefs $b \in \mathcal {U}$ and all $\alpha \in Act$, where $b_{\top }$ and $b_{\bot }$ are fresh sink states, i.e. $\overline{\mathcal {P}^\mathcal {B}}(b_{\top },\alpha ) {:}{=}\{b_{\top }\mapsto 1\}$ and $\overline{\mathcal {P}^\mathcal {B}}(b_{\bot },\alpha ) {:}{=}\{b_{\bot }\mapsto 1\}$ for all $\alpha \in Act$. The reachable state space of $\overline{\mathcal {M}^{\mathcal {B}}}$ is finite, enabling its automated analysis; since our method to compute cut-off values emulates an FSC, a policy maximising $\mathbb {P}^{\overline{\mathcal {M}^{\mathcal {B}}}}_{\max }\left[ \Diamond (T^\mathcal {B}\cup \{b_{\top }\})\right] $ induces an FSC for the original POMDP $\mathcal {M}$. We discuss how to obtain this FSC in Sect. 4.3.

4.2 Using FSCs for Cut-Off Values

A crucial aspect when applying the belief exploration with cut-offs is the choice of suitable cut-off values. The closer the cut-off value is to the actual optimum in a belief, the better the approximation we obtain. In particular, if the cut-off values coincide with the optimal value, cutting off the initial state is optimal. However, finding optimal values is as hard as solving the original POMDP. We consider under-approximative value functions induced by applying any^{Footnote 3} FSC to the POMDP and lifting the results to the belief MDP. The better the FSC, the better the cut-off value. We generalise belief exploration with cut-offs such that the approach supports arbitrary sets of FSCs with additional flexibility.

Let $F_{\mathcal {I}} \in \mathcal {F}^\mathcal {M}$ be an arbitrary, but fixed FSC for POMDP $\mathcal {M}$. Let $p_{s,n} {:}{=}\mathbb {P}^{\mathcal {M}^{F_{\mathcal {I}}}}\left[ (s,n) \models \Diamond T\right] $ for state $(s,n) \in S \times N$ in the corresponding induced MC. For fixed $n \in N$, $V(b,n) {:}{=}\sum _{s \in S_{O(b)}} b(s) \cdot p_{s,n}$ denotes the cut-off value for belief b and memory node n. It corresponds to the probability of reaching a target state in $\mathcal {M}^{F_{\mathcal {I}}}$ when starting in memory node $n \in N$ and state $s \in S$ according to the probability distribution b. We define the overall cut-off value for b induced by F as $\underline{V}(b) {:}{=}\max _{n \in N} V(b,n)$. It follows straightforwardly that $\underline{V}(b) \le \mathbb {P}^{\mathcal {M}^{\mathcal {B}}}_{\max }\left[ b \models \Diamond T^\mathcal {B}\right] $. As values $p_{s,n}$ only need to be computed once, computing $\underline{V}(b)$ for a given belief b is relatively simple. However, the complexity of the FSC-based cut-off approach depends on the size of the induced MC. Therefore, it is essential that the FSCs used to compute cut-off values are concise.

4.3 Extracting FSC from Belief Exploration

Model checking the finite approximation MDP $\overline{\mathcal {M}^{\mathcal {B}}}$ with cut-off values induced by an FSC $F_{\mathcal {I}} $ yields a maximising memoryless policy ${\sigma _{\mathcal {B}}}$. Our goal is to represent this policy as an FSC $F_{\mathcal {B}} $. We construct $F_{\mathcal {B}} $ by considering both $F_{\mathcal {I}} $ and the necessary memory nodes for each explored belief $b \in \mathcal {E}$. Concretely, for each explored belief, we introduce a corresponding memory node. In each such node, the action ${\sigma _{\mathcal {B}}}(b)$ is selected. For the memory update, we distinguish between two cases based on the next belief after executing ${\sigma _{\mathcal {B}}}(b)$ in $\overline{\mathcal {M}^{\mathcal {B}}}$. If for observation $z' \in Z$, the successor belief $b' = \llbracket b {\mid } {\sigma _{\mathcal {B}}}(b),z'\rrbracket \in \mathcal {E}$, the memory is updated to the corresponding node. Otherwise, $b' \in \mathcal {U}$ holds, i.e., the successor is part of the frontier. The memory is then updated to the memory node n of FSC $F_{\mathcal {I}} $ that maximises the cut-off value $V(b',n)$. This corresponds to the notion that if the frontier is encountered, we switch from acting according to policy ${\sigma _{\mathcal {B}}}$ to following $F_{\mathcal {I}} $ (initialised in the correct memory node). This is formalised as:

Definition 5 (Belief-based FSC with cut-offs)

Let $F_{\mathcal {I}} = (N,n_0,\gamma _{\mathcal {I}}, \delta _{\mathcal {I}})$ and $\overline{\mathcal {M}^{\mathcal {B}}}$ as before. The belief-based FSC with cut-offs is $F_{\mathcal {B}} = (\mathcal {E}\cup N, b_{0}, \gamma , \delta )$ with action function $\gamma (b,z) = {\sigma _{\mathcal {B}}}(b)$ for $b \in \mathcal {E}$ and $\gamma (n,z) = \gamma _{\mathcal {I}}(n,z)$ for $n \in N$ and arbitrary $z \in Z$. The update function $\delta $ is defined for all $z, z' \in Z$ by $\delta (n,z,z') = \delta _{\mathcal {I}}(n,z,z')$ if $n \in N$, and for $b \in \mathcal {E}$ with $b' = \llbracket b {\mid } {\sigma _{\mathcal {B}}}(b),z'\rrbracket $ by:

$$ \delta (b,z,z') = b' \text { if } b' \in \mathcal {E}, \text { and } \delta (b,z,z') = \textrm{argmax}_{n \in N}V(b',n) \text { otherwise.} $$

5 Accelerated Inductive Synthesis

In this section, we consider inductive synthesis [5], an approach for finding controllers for POMDPs in a set of FSCs. We briefly recap the main idea, then first explain how to use a reference policy. Finally, we introduce and discuss a novel search space for the controllers that we consider in this paper in detail.

5.1 Inductive Synthesis with k-FSCs

In the scope of this paper, inductive synthesis [4] considers a finite family of FSCs ${\mathcal {F}^\mathcal {M}_k}$ of k-FSCs with memory nodes $N = \{n_0,\dots ,n_{k-1}\}$, and the family $\mathcal {M}^{{\mathcal {F}^\mathcal {M}_k}}{:}{=}\{\mathcal {M}^F\mid F \in {\mathcal {F}^\mathcal {M}_k}\}$ of associated induced MCs. The states for each MC are tuples $(s, n) \in S \times N$. For conciseness, we only discuss the abstraction-refinement framework [10] within the inductive synthesis loop. The overall image is as in Fig. 1. Informally, the MDP abstraction of the family $\mathcal {M}^{{\mathcal {F}^\mathcal {M}_k}}$ of MCs is an MDP $\textsf {MDP}({{\mathcal {F}^\mathcal {M}_k}})$ with the set $S \times N$ of states such that, if some MC $M \in \mathcal {M}^{{\mathcal {F}^\mathcal {M}_k}}$ executes action $\alpha $ in state $(s,n) \in S \times N$, then this action (with the same effect) is also enabled in state (s, n) of $\textsf {MDP}({{\mathcal {F}^\mathcal {M}_k}})$. Essentially, $\textsf {MDP}({{\mathcal {F}^\mathcal {M}_k}})$ over-approximates the behaviour of all the MCs in the family $\mathcal {M}^{{\mathcal {F}^\mathcal {M}_k}}$: it simulates an arbitrary family member in every step, but it may switch between steps.^{Footnote 4}

Definition 6

MDP abstraction for POMDP $\mathcal {M}$ and family ${\mathcal {F}^\mathcal {M}_k}= \{ F_1, \ldots , F_m \}$ of k-FSCs is the MDP $\textsf {MDP}({{\mathcal {F}^\mathcal {M}_k}}){:}{=}\big (S\times N, (s_{0},n_{0}), \{ 1, \ldots , m \}, \mathcal {P}^{\mathcal {F}^\mathcal {M}_k}\big )$ with

$$\begin{aligned} \mathcal {P}^{\mathcal {F}^\mathcal {M}_k}( (s,n), i ) = \mathcal {P}^{F_i}. \end{aligned}$$

While this MDP has m actions, practically, many actions coincide. Below, we see how to utilise the structure of the FSCs. Here, we finish by observing that the MDP is a proper abstraction:

Lemma 1

[10] For all $F\in {\mathcal {F}^\mathcal {M}_k}$, $\mathbb {P}^{\textsf {MDP}({{\mathcal {F}^\mathcal {M}_k}})}_{\min }\left[ \Diamond T\right] \le \mathbb {P}^{\mathcal {M}^F}\left[ \Diamond T\right] \le \mathbb {P}^{\textsf {MDP}({{\mathcal {F}^\mathcal {M}_k}})}_{\max }\left[ \Diamond T\right] $.

With that result, we can naturally start with the set of all k-FSCs and search through this family by selecting suitable subsets [10]. Since the number k of memory nodes necessary is not known in advance, one can iteratively explore the sequence $\mathcal {F}^\mathcal {M}_{1},\mathcal {F}^\mathcal {M}_{2},\dots $ of families of FSCs of increasing complexity.

5.2 Using Reference Policies to Accelerate Inductive Synthesis

Consider the synthesis process of the optimal k-FSC $F \in {\mathcal {F}^\mathcal {M}_k}$ for POMDP $\mathcal {M}$. To accelerate the search for F within this family, we consider a reference policy, e.g., a policy ${\sigma _{\mathcal {B}}}$ extracted from an (approximation of the) belief MDP, and shrink the FSC family. For each observation $z \in Z$, we collect the set $Act[{\sigma _{\mathcal {B}}}](z) {:}{=}\left\{ {\sigma _{\mathcal {B}}}(b) \mid b \in \mathcal {B}_\mathcal {M}, O(b) = z \right\} $ of actions that were selected by ${\sigma _{\mathcal {B}}}$ in beliefs with observation z. The set $Act[{\sigma _{\mathcal {B}}}](z)$ contains the actions used by the reference policy when in observation z. We focus the search on these actions by constructing a subset of FSCs $\{\ (N, n_{0}, \gamma , \delta ) \in {\mathcal {F}^\mathcal {M}_k}\mid \forall n \in N, z\in Z. \gamma (n, z) \in Act[{\sigma _{\mathcal {B}}}](z) \}$.

Restricting the action selection may exclude the optimal k-FSC. It also does not guarantee that the optimal FSC in the restricted family achieves the same value as the reference policy ${\sigma _{\mathcal {B}}}$ as ${\sigma _{\mathcal {B}}}$ may have more memory nodes. We first search the restricted space of FSCs before searching the complete space. This also accelerates the search: The earlier a good policy is found, the easier it is to discard other candidates (because they are provably not optimal). Furthermore, in case the algorithm terminates earlier (notice the anytime aspect of our problem statement), we are more likely to have found a reasonable policy.

Additionally, we could use sets $Act[{\sigma _{\mathcal {B}}}]$ to determine with which k to search. If in some observation $z \in Z$ the belief policy ${\sigma _{\mathcal {B}}}$ uses $|Act[{\sigma _{\mathcal {B}}}](z)|$ distinct actions, then in order to enable the use of all of these actions, we require at least $k = \max _{z\in Z} |Act[{\sigma _{\mathcal {B}}}](z)|$ memory states. However, this may lead to families that are too large and thus we use a more refined view discussed below.

5.3 Inductive Synthesis with Adequate FSCs

In this section, we discuss the set of candidate FSCs in more detail. In particular, we take a more refined look at the families that we consider.

More Granular FSCs. We consider memory models [5] that describe per-observation how much memory may be used:

Definition 7

($\mu $-FSC). A memory model for POMDP $\mathcal {M}$ is a function $\mu :Z \rightarrow \mathbb {N}$. Let $k = \max _{z \in Z}\mu (z)$. The k-FSC $F \in {\mathcal {F}^\mathcal {M}_k}$ with nodes $N = \{n_0,\dots ,n_{k-1}\}$ is a $\mu $-FSC iff for all $z \in Z$ and for all $i > \mu (z)$ it holds: $\gamma (n_i,z) = \gamma (n_{0},z)$ and $\delta (n_i,z,z') = \delta (n_{0},z,z')$ for any $z' \in Z$.

${\mathcal {F}^\mathcal {M}_{\mu }}$ denotes the family of all $\mu $-FSCs. Essentially, memory model $\mu $ dictates that for prior observation z only $\mu (z)$ memory nodes are utilised, while the rest behave exactly as the default memory node $n_{0}$. Using memory model $\mu $ with $\mu (z)<k$ for some observations $z \in Z$ greatly reduces the number of candidate controllers. For example, if $|S_z|=1$ for some $z\in Z$, then upon reaching this state, the history becomes irrelevant. It is thus sufficient to set $\mu (z)=1$ (for the specifications in this paper). It also significantly reduces the size of the abstraction, see Appendix A of [3].

Posterior-aware or Posterior-unaware. The technique outlined in [5] considers posterior-unaware FSCs [2]. An FSC with update function $\delta $ is posterior-unaware if the posterior observation is not taken into account when updating the memory node of the FSC, i.e. $\delta (n,z,z') = \delta (n,z,z'')$ for all $n\in N, z, z',z'' \in Z$. This restriction reduces the policy space and thus the MDP abstraction $\textsf {MDP}({{\mathcal {F}^\mathcal {M}_k}})$. On the other hand, general (posterior-aware) FSCs can utilise information about the next observation to make an informed decision about the next memory node. As a result, fewer memory nodes are needed to encode complex policies. Consider Fig. 3a which depicts a simple POMDP. First, notice that in yellow states $Y_i$ we want to be able to execute two different actions, implying that we need at least two memory nodes to distinguish between the two states, and the same is true for the blue states $B_i$. Second, notice that in each state the visible action always leads to states having different observations, implying that the posterior observation $z'$ is crucial for the optimal decision making. If $z'$ is ignored, it is impossible to optimally update the memory node. Figure 3b depicts the optimal posterior-aware 2-FSC allowing to reach the target within 12 steps on expectation. The optimal posterior-unaware FSC has at least 4 memory nodes and the optimal posterior-unaware 2-FSC uses 14 steps.

MDP Abstraction. To efficiently and precisely create and analyse MDP abstractions, Definition 6 is overly simplified. In Appendix A of [3], we present the construction for general, posterior-aware FSCs including memory models.

6 Integrating Belief Exploration with Inductive Synthesis

We clarify the symbiotic approach from Fig. 1 and review FSC sizes.

Symbiosis by Closing the Loop. Section 4 shows the potential to improve belief exploration using FSCs, e.g., obtained from an inductive synthesis loop, whereas Sect. 5 shows the potential to improve inductive synthesis using policies from, e.g., belief exploration. A natural next step is to use improved inductive synthesis for belief exploration and improved belief exploration for inductive synthesis, i.e., to alternate between both techniques. This section briefly clarifies the symbiotic approach from Fig. 1 using Algorithm 1.

Table 1. Sizes of different types of FSCs.

Full size table

We iterate until a global timeout t: in each iteration, we make both controllers available to the user as soon as they are computed (Algorithm 1, l. 13). We start in the inductive mode (l. 3-8), where we initially consider the 1-FSCs represented in ${\mathcal {F}^\mathcal {M}_{\mu }}$. Method (l. 8) investigates $\mathcal {F}$ and outputs the new maximising FSC $F_{\mathcal {I}}$ (if it exists). If the timeout $t_{\mathcal {I}}$ interrupts the synthesis process, the method additionally returns yet unexplored parameter assignments. If $\mathcal {F}$ is fully explored within the timeout $t_{\mathcal {I}}$ (l. 4), we increase k and repeat the process. After the timeout $t_{\mathcal {I}}$, we run belief exploration for $t_{\mathcal {B}}$ seconds, where we use $F_{\mathcal {I}} $ as backup controllers (l. 9). After the timeout $t_{\mathcal {B}}$ (exploration will continue from a stored configuration in the next belief phase), we use $F_{\mathcal {I}}$ to obtain cut-off values at unexplored states, compute the optimal policy $\sigma ^{\mathcal {M}^{\mathcal {B}}}$ (see Sect. 4) and extract the FSC $F_{\mathcal {B}}$ which incorporates $F_{\mathcal {I}}$. Before we continue the search, we check whether the belief-based FSC is better and whether that FSC gives any reason to update the memory model (l. 10). If so, we update $\mu $ and reset the $\mathcal {F}$ (l. 11-12).

The Size of an FSC. We have considered several sub-classes of FSCs and wish to compare the sizes of these controllers. For FSC $F = (N,n_{0},\gamma ,\delta )$, we define its size $size(F) {:}{=}size(\gamma ) +size(\delta )$ as the memory required to encode functions $\gamma $ and $\delta $. Encoding $\gamma :N \times Z \rightarrow Act$ of a general k-FSC requires $size(\gamma ) = \sum _{n \in N} \sum _{z \in Z} 1 = k \cdot |Z|$ memory. Encoding $\delta :N \times Z \times Z \rightarrow N$ requires $k \cdot |Z|^2$ memory. However, it is uncommon that in each state-memory pair (s, n) all posterior observations can be observed. We therefore encode $\delta (n,z,\cdot )$ as a sparse adjacency list, i.e., as a list of pairs $(z',\delta (n,z,z'))$. To define the size of such a list properly, consider the induced MC $\mathcal {M}^F = (S \times N, (s_{0},n_{0}), \{\alpha \}, \mathcal {P}^F)$. Let $post(n,z) {:}{=}\left\{ O(s') \mid \exists s \in S_z :(s',\cdot ) \in \textrm{supp}(\mathcal {P}^F((s,n),\alpha )) \right\} $ denote the set of posterior observations reachable when taking a transition in a state (s, n) of $\mathcal {M}^F$ with $O(s) = z$. Table 1 summarises the resulting sizes of FSCs of various sub-classes. The derivation is included in Appendix B of [3]. Table 4 on p. 18 shows that we typically find much smaller $\mu $-FSCs ($F_{\mathcal {I}} $) than belief-based FSCs ($F_{\mathcal {B}} $).

7 Experiments

Our evaluation focuses on the following three questions:

Q1:
Do the FSCs from inductive synthesis raise the accuracy of the belief MDP?
Q2:
Does exploiting the belief MDP boost the inductive synthesis of FSCs?
Q3:
Is the symbiotic approach improving run time, controller’s value and size?

Table 2. Information about the benchmark POMDPs.

Full size table

Selected Benchmarks and Setup. Our baseline are the recent belief exploration technique [8] implemented in Storm [13] and the inductive (policy) synthesis method [5] implemented in Paynt [6]. Paynt uses Storm for parsing and model checking of MDPs, but not for solving POMDPs. Our symbiotic framework (Algorithm 1) has been implemented on top of Paynt and Storm. In the following, we use Storm and Paynt to refer to the implementation of belief exploration and inductive synthesis respectively, and Saynt to refer to the symbiotic framework. The implementation of Saynt and all benchmarks are publicly available^{Footnote 5}. Additionally, the implementation and the benchmarks in the form of an artifact are also available at https://doi.org/10.5281/zenodo.7874513.

Setup. The experiments are run on a single core of a machine equipped with an Intel i5-12600KF @4.9GHz CPU and 64GB of RAM. Paynt searches for posterior-unaware FSCs using abstraction-refinement, as suggested by [5]. By default, Storm applies the cut-offs as presented in Sect. 4.1. Saynt uses the default settings for Paynt and Storm while $t_{\mathcal {I}} = 60s$ and $t_{\mathcal {B}} = 10s$ were taken for Algorithm 1. Under Q3, we discuss the effect of changing these values.

Benchmarks. We evaluate the methods on a selection of models from [5, 7, 8] supplemented by larger variants of these models (Drone-8-2 and Refuel-20), by one model from [16] (Milos-97) and by the synthetic model (Lanes+) described in Appendix C of [3]. We excluded benchmarks for which Paynt or Storm finds the (expected) optimal solution in a matter of seconds. The benchmarks were selected to illustrate advantages as well as drawbacks of all three synthesis approaches: belief exploration, inductive (policy) search, and the symbiotic technique. Table 2 lists for each POMDP the number |S| of states, the total number $\sum Act{:}{=}\sum _{s}|Act(s)|$ of actions, the number |Z| of observations, the specification (either maximising or minimising a reachability probability P or expected reward R), and a known over-approximation on the optimal value computed using the technique from [7]. These over-approximations are solely used as rough estimates of the optimal values. Table 5 on p. 20 reports the quality of the resulting FSCs on a broader range of benchmarks and demonstrates the impact of the non-default settings.

Q1: FSCs provide better approximations of the belief MDP

In these experiments, Paynt is used to obtain a sub-optimal $F_{\mathcal {I}}$ within 10s which is then used by Storm. Table 3 (left) lists the results. Our main finding is that . For instance, Storm with provided $F_{\mathcal {I}} $ finds an FSC with value 0.97 for the Drone-4-2 benchmark within a total of 10s (1s+9s for obtaining $F_{\mathcal {I}}$), compared to obtaining an FSC of value 0.95 in 56s on its own. A value improvement is also obtained if Storm runs longer. For the Network model, the value improves with 37% (short-term) and 47% (long-term) respectively, at the expense of investing 3s to find $F_{\mathcal {I}}$. For the other models, the relative improvement ranges from 3% to 25%. A further value improvement can be achieved when using better FSCs $F_{\mathcal {I}}$ from Paynt; see Q3. Sometimes, belief exploration does not profit from $F_{\mathcal {I}}$. For Hallway, the unexplored part of the belief MDP becomes insignificant rather quickly, and so does the impact of $F_{\mathcal {I}}$. Clipping [8], a computationally expensive extension of cut-offs, is beneficial only for Rocks-12, rendering $F_{\mathcal {I}}$ useless. Though even in this case, using $F_{\mathcal {I}}$ significantly improves Short Storm that did not have enough time to apply clipping.

Q2: Belief-based FSCs improve inductive synthesis

In this experiment, we run Storm for at most 1s, and use the result in Paynt. Table 3 (right) lists the results. Our main finding is that For instance, for the $4\,\times \,5\,\times \,2$ benchmark, an FSC is obtained about six times faster while improving the value by 116%. On some larger models, Paynt alone struggles to find any good $F_{\mathcal {I}}$ and using $F_{\mathcal {B}}$ boosts this; e.g., the value for the Refuel-20 model is raised by a factor 20 at almost no run time penalty. For the Tiger benchmark, a value improvement of 860% is achieved (albeit not as good as $F_{\mathcal {B}}$ itself) at the expense of doubling the run time. Thus: even a shallow exploration of the belief MDP pays off in the inductive synthesis. The inductive search typically profits even more when exploring the belief MDP further. This is demonstrated, e.g., in the Rocks-12 model: using the FSC $F_{\mathcal {B}}$ computed using clipping (see Table 3 (left)) enables Paynt to find FSC $F_{\mathcal {I}}$ with the same (optimal) value 20 as $F_{\mathcal {B}}$ within 1s. Similarly, for the Milos-97 model, running Storm for 45s (producing a more precise $F_{\mathcal {B}}$) enables Paynt to find an FSC $F_{\mathcal {I}}$ achieving a better value than controllers found by Storm or Paynt alone within the timeout. (These results are not reported in the tables.) However, as opposed to Q1, where a better FSC $F_{\mathcal {I}}$ naturally improves the belief MDP, longer exploring the belief MDP does not always yield a better $F_{\mathcal {I}}$: a larger $\overline{\mathcal {M}^{\mathcal {B}}}$ with a better $F_{\mathcal {B}}$ may yield a larger memory model $\mu $, thus inducing a significantly larger family where Paynt struggles to identify good FSCs.

Table 3. Left (Q1): Experimental results on how a (quite sub-optimal) FSC $F_{\mathcal {I}}$ computed by Paynt within 10s impacts Storm. (For Drone-8-2, the largest model in our benchmark, we use 30s). The “Paynt ” column indicates the value of $F_{\mathcal {I}}$ and its run time. The “Short Storm ” column runs storm for 1s and compares the value of FSC $F_{\mathcal {B}}$ found by Storm alone to Storm using $F_{\mathcal {I}}$. The “Long Storm ” column is analogous, but with a 300s timeout for Storm. In the last row, * indicates that clipping was used. Right (Q2): Experimental results on how an FSC $F_{\mathcal {B}}$ obtained by a shallow exploration of the belief MDP impacts the inductive synthesis by Paynt. The “Storm ” column reports the value of $F_{\mathcal {B}}$ computed within 1s. The “Paynt ” column compares the values of the FSCs $F_{\mathcal {I}}$ obtained by Paynt itself to Paynt using the FSCs $F_{\mathcal {B}}$ within a 300s timeout.

Full size table

Q3: The practical benefits of the symbiotic approach

The goals of these experiments are to investigate whether the symbiotic approach improves the run time (can FSCs of a certain value be obtained faster?), the memory footprint (how is the total memory consumption affected?), the controller’s value (can better FSCs be obtained with the same computational resources?) and the controller’s size (are more compact FSCs obtained?).

Value of the Synthesised FSCs. Figure 4 plots the value of the FSCs produced by Storm, Paynt, and Saynt versus the computation time. Note that for maximal objectives, the aim is to obtain a high value (the first 4 plots) whereas for minimal objectives a lower value prevails. From the plots, it follows that The relative improvement of the value of the resulting FSCs differs across individual models, similar to the trends in Q1 and Q2. When comparing the best FSC found by Storm or Paynt alone with the best FSC found by Saynt, the improvement ranges from negligible (4$\,\times \,$3-95) to around 3%-7% (Netw-3-8-20, Milos-97, Query-s3) and sometimes goes over 40% (Refuel-20, Lines+). We note that the distance to the (unknown) optimal values remains unclear. The FSC value never decreases but sometimes does also not increase, as indicated by Hallway and Rocks-12 (see also Q2). Our experiments (see Table 5) also indicate that the improvement over the baseline algorithms is typically more significant in the larger variants of the models. Furthermore, the plots in Fig. 4 also include the FSC value by the one-shot combination of Storm and Paynt. We see that This is illustrated in, e.g., the 4$\,\times \,$3-95 and Lanes+ benchmarks, see the 1st and 3rd plots in Fig. 4 (left).

Table 4. Trade-offs between the value and size in the resulting FSCs $F_{\mathcal {I}}$ and $F_{\mathcal {B}}$ found by Saynt. Each cell reports value/size. The first three models have a minimising objective. $\diamond $ indicates that Saynt ran with $t_{\mathcal {I}}=$90s.

Full size table

Total Synthesis Time. Saynt initially needs some time for the first iteration (one inductive and one belief phase) in Algorithm 1 and thus during the beginning of the synthesis process, the standalone tools may provide FSCs of a certain value faster. For instance, for the Refuel-20 benchmark Saynt swiftly overtakes Storm after the first iteration. The only exception is Rocks-12 (discussed before), where Saynt with the default settings needs significantly more time than Storm to obtain an FSC of the same value.

Memory Footprint. Belief exploration typically has a large memory footprint: Storm quickly hits the 64GB memory limit on exploring the belief MDP. , see the bottom right plot of Fig. 4. The average memory footprint of running Paynt standalone quickly stabilises around 700MB. The memory footprint of Saynt is thus dominated by the restricted exploration of the belief MDP.

The Size of the Synthesised FSCs. For selected models, Table 4 shows the trade-offs between the value and size of the resulting FSCs $F_{\mathcal {I}}$ and $F_{\mathcal {B}}$ found by Saynt. The experiments show that . There are models (e.g. Refuel-06) where a very small $F_{\mathcal {B}}$, having even slightly smaller size than $F_{\mathcal {I}}$, does exist. The integration mostly reduces the size of $F_{\mathcal {B}}$ due to the better approximation of the belief MDP by up to a factor of two. This reduction has a negligible effect on the size of $F_{\mathcal {I}}$. This observation further strengthens the usefulness of Saynt that jointly improves the value of $F_{\mathcal {I}}$ and $F_{\mathcal {B}}$. Hence, Saynt gives users a unique opportunity to run a single, time-efficient synthesis and select the FSC according to the trade-off between its value and size.

Customising the Saynt Setup. In contrast to the standalone approaches as well as to the one-way integrations presented in Q1 and Q2, . Naturally, adjusting the parameters to individual benchmarks can further improve the quality of the computed controllers: captions of Fig. 4 and Table 4 describe which non-default settings were used for selected models.

Additional Results

In Table 5, we compare values and sizes of FSCs synthesised by the particular methods on a broader range of benchmarks. We can see that FSCs $F_{\mathcal {I}}$ obtained by Saynt achieve better values than the controllers computed by Paynt; size-wise, these better FSCs of Saynt are similar or only slightly bigger. Meanwhile, for FSCs $F_{\mathcal {B}}$ obtained by Saynt, we sometimes observe a significant size reduction while still improving the value compared to the FSCs produced by Storm. Two models are notable: On Drone-8-2, Saynt obtains 50% smaller $F_{\mathcal {B}}$ while having a 41% better value. On Network-3-8-20, the size of $F_{\mathcal {B}}$ is reduced by 40% while again providing better value.

Table 5. The quality and size of resulting FSCs provided by Paynt, Storm, and Saynt within the 15-min timeout. The run times indicate the time needed to find the best FSC. Non-default settings: $*$ marks experiments where clipping was enabled, $\bullet $ marks experiments where PAYNT synthesised posterior-aware FSCs, $\diamond $ marks experiments where integration parameter $t_{\mathcal {I}}$ was set to 90 s.

Full size table

In the following, we further discuss the impact of non-default settings for selected benchmarks, as presented in Table 5. For instance, using posterior-aware FSCs generally significantly slows down the synthesis process, however, for Network and 4$\,\times \,$3-95, it helps improve the value of the default posterior-unaware FSCs by 2% and 4%, respectively. For the former model, a better $F_{\mathcal {I}}$ also improves $F_{\mathcal {B}}$ by about a similar value. In some cases, e.g. for Query-s3, it is beneficial to increase the parameter $t_{\mathcal {I}}$, giving Paynt enough time to search for a good FSC $F_{\mathcal {I}}$ (the relative improvement is 6%), which also improves the value of the resulting FSC $F_{\mathcal {B}}$ by about a similar value. Tuning $t_{\mathcal {I}}$ and $t_{\mathcal {B}}$ can also have an impact on the value-size trade-off, as seen in the Milos-97 model, where setting longer timeout $t_{\mathcal {I}}$ results in finding a 2% better $F_{\mathcal {B}}$ with 130% size increase. A detailed analysis of the experimental results suggests that usually, it is more beneficial to invest time into searching for good $F_{\mathcal {I}}$ that is used to compute better cut-off values, rather than into deeper exploration of belief MDP. However, the timeouts still need to allow for multiple subsequent iterations of the algorithm in order to utilise the full potential of the symbiosis.

8 Conclusion and Future Work

We proposed Saynt, a symbiotic integration of the two main approaches for controller synthesis in POMDPs. Using a wide class of models, we demonstrated that Saynt substantially improves the value of the resulting controllers and provides an any-time, push-button synthesis algorithm allowing users to select the controller based on the trade-off between its value and size, and the synthesis time.

In future work, we plan to explore if the inductive policy synthesis can also be successfully combined with point-based approximation methods, such as SARSOP, and on discounted reward properties. A preliminary comparison on discounting properties provides two interesting observations: 1) For models with large reachable belief space and discount factors (very) close to one, SARSOP typically fails to update its initial alpha-vectors and thus produces low-quality controllers. In these cases, SAYNT outperforms SARSOP. 2) For common discount factors, SARSOP beats SAYNT on the majority of benchmarks. This is not surprising, as the MDP engine underlying SAYNT does not natively support discounting and instead computes a much harder fixed point. See [15], for a recent discussion on the differences between discounting and not discounting.

Notes

1.
Memoryless policies suffice to maximise the value in a fully observable MDP [26].
2.
The assignments of missing combinations where $z \ne O(b)$ are irrelevant.
3.
We remark that [8] considers memoryless FSCs only.
4.
The MDP is an game-based abstraction [21] of the all-in-one MC [11].
5.
https://github.com/randriu/synthesis.

References

Amato, C., Bernstein, D.S., Zilberstein, S.: Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs. Auton. Agent. Multi-Agent Syst. 21(3), 293–320 (2010)
Article Google Scholar
Amato, C., Bonet, B., Zilberstein, S.: Finite-state controllers based on Mealy machines for centralized and decentralized POMDPs. In: AAAI, pp. 1052–1058. AAAI Press (2010)
Google Scholar
Andriushchenko, R., Bork, A., Češka, M., Junges, S., Katoen, J.P., Macák, F.: Search and explore: symbiotic policy synthesis in POMDPs. arXiv preprint arXiv:2305.14149 (2023)
Andriushchenko, R., Češka, M., Junges, S., Katoen, J.-P.: Inductive synthesis for probabilistic programs reaches new horizons. In: TACAS 2021. LNCS, vol. 12651, pp. 191–209. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72016-2_11
Chapter MATH Google Scholar
Andriushchenko, R., Češka, M., Junges, S., Katoen, J.P.: Inductive synthesis of finite-state controllers for POMDPs. In: UAI, vol. 180, pp. 85–95. PMRL (2022)
Google Scholar
Andriushchenko, R., Češka, M., Junges, S., Katoen, J.-P., Stupinský, Š: PAYNT: a tool for inductive synthesis of probabilistic programs. In: Silva, A., Leino, K.R.M. (eds.) CAV 2021. LNCS, vol. 12759, pp. 856–869. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-81685-8_40
Chapter Google Scholar
Bork, A., Junges, S., Katoen, J.-P., Quatmann, T.: Verification of indefinite-horizon POMDPs. In: Hung, D.V., Sokolsky, O. (eds.) ATVA 2020. LNCS, vol. 12302, pp. 288–304. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59152-6_16
Chapter MATH Google Scholar
Bork, A., Katoen, J.-P., Quatmann, T.: Under-approximating expected total rewards in POMDPs. In: TACAS 2022. LNCS, vol. 13244, pp. 22–40. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-99527-0_2
Chapter Google Scholar
Carr, S., Jansen, N., Topcu, U.: Task-aware verifiable RNN-based policies for partially observable Markov decision processes. J. Artif. Intell. Res. 72, 819–847 (2021)
Article MathSciNet MATH Google Scholar
Češka, M., Jansen, N., Junges, S., Katoen, J.-P.: Shepherding hordes of Markov chains. In: Vojnar, T., Zhang, L. (eds.) TACAS 2019. LNCS, vol. 11428, pp. 172–190. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-17465-1_10
Chapter Google Scholar
Chrszon, P., Dubslaff, C., Klüppelholz, S., Baier, C.: ProFeat: feature-oriented engineering for family-based probabilistic model checking. Formal Aspects Comput. 30(1), 45–75 (2018)
Article MathSciNet Google Scholar
Cubuktepe, M., Jansen, N., Junges, S., Marandi, A., Suilen, M., Topcu, U.: Robust finite-state controllers for uncertain POMDPs. In: AAAI, pp. 11792–11800. AAAI Press (2021)
Google Scholar
Dehnert, C., Junges, S., Katoen, J.-P., Volk, M.: A storm is coming: a modern probabilistic model checker. In: Majumdar, R., Kunčak, V. (eds.) CAV 2017. LNCS, vol. 10427, pp. 592–600. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-63390-9_31
Chapter Google Scholar
Hansen, E.A.: Solving POMDPs by searching in policy space. In: UAI, pp. 211–219. Morgan Kaufmann (1998)
Google Scholar
Hartmanns, A., Junges, S., Quatmann, T., Weininger, M.: A practitioner’s guide to MDP model checking algorithms. In: Sankaranarayanan, S., Sharygina, N. (eds.) Tools and Algorithms for the Construction and Analysis of Systems. TACAS 2023. Lecture Notes in Computer Science, vol. 13993, pp. 469–488. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-30823-9_24
Hauskrecht, M.: Incremental methods for computing bounds in partially observable Markov decision processes. In: AAAI/IAAI, pp. 734–739 (1997)
Google Scholar
Heck, L., Spel, J., Junges, S., Moerman, J., Katoen, J.-P.: Gradient-descent for randomized controllers under partial observability. In: Finkbeiner, B., Wies, T. (eds.) VMCAI 2022. LNCS, vol. 13182, pp. 127–150. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-94583-1_7
Chapter Google Scholar
Horak, K., Bosansky, B., Chatterjee, K.: Goal-HSVI: heuristic search value iteration for Goal POMDPs. In: IJCAI, pp. 4764–4770. AAAI Press (2018)
Google Scholar
Junges, S., et al.: Finite-state controllers of POMDPs via parameter synthesis. In: UAI, pp. 519–529 (2018)
Google Scholar
Kurniawati, H., Hsu, D., Lee, W.S.: SARSOP: efficient point-based POMDP planning by approximating optimally reachable belief spaces. In: Robotics: Science and Systems. MIT Press (2008)
Google Scholar
Kwiatkowska, M.Z., Norman, G., Parker, D.: Game-based abstraction for Markov decision processes. In: QEST, pp. 157–166. IEEE Computer Society (2006)
Google Scholar
Kwiatkowska, M., Norman, G., Parker, D.: PRISM 4.0: verification of probabilistic real-time systems. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS, vol. 6806, pp. 585–591. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22110-1_47
Chapter Google Scholar
Madani, O., Hanks, S., Condon, A.: On the undecidability of probabilistic planning and related stochastic optimization problems. Artif. Intell. 147(1), 5–34 (2003)
Article MathSciNet MATH Google Scholar
Meuleau, N., Kim, K., Kaelbling, L.P., Cassandra, A.R.: Solving POMDPs by searching the space of finite policies. In: UAI, pp. 417–426. Morgan Kaufmann (1999)
Google Scholar
Norman, G., Parker, D., Zou, X.: Verification and control of partially observable probabilistic systems. Real-Time Syst. 53(3), 354–402 (2017). https://doi.org/10.1007/s11241-017-9269-4
Article MATH Google Scholar
Puterman, M.L.: Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons (1994)
Google Scholar
Smallwood, R.D., Sondik, E.J.: The optimal control of partially observable Markov processes over a finite horizon. Oper. Res. 21(5), 1071–1088 (1973)
Article MATH Google Scholar
Verma, A., Murali, V., Singh, R., Kohli, P., Chaudhuri, S.: Programmatically interpretable reinforcement learning. In: ICML, vol. 80, pp. 5052–5061. PMLR (2018)
Google Scholar
Wang, Y., Chaudhuri, S., Kavraki, L.E.: Bounded policy synthesis for pomdps with safe-reachability objectives. In: AAMAS, pp. 238–246. International Foundation for Autonomous Agents and Multiagent Systems Richland, SC, USA/ACM (2018)
Google Scholar

Download references

Author information

Authors and Affiliations

Brno University of Technology, Brno, Czech Republic
Roman Andriushchenko, Milan Češka & Filip Macák
RWTH Aachen University, Aachen, Germany
Alexander Bork & Joost-Pieter Katoen
Radboud University, Nijmegen, The Netherlands
Sebastian Junges

Authors

Roman Andriushchenko
View author publications
You can also search for this author in PubMed Google Scholar
Alexander Bork
View author publications
You can also search for this author in PubMed Google Scholar
Milan Češka
View author publications
You can also search for this author in PubMed Google Scholar
Sebastian Junges
View author publications
You can also search for this author in PubMed Google Scholar
Joost-Pieter Katoen
View author publications
You can also search for this author in PubMed Google Scholar
Filip Macák
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Milan Češka .

Editor information

Editors and Affiliations

LIX, Ecole Polytechnique, CNRS and Institut Polytechnique de Paris, Palaiseau, France
Constantin Enea
Microsoft Research, Bangalore, India
Akash Lal

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Reprints and permissions

Copyright information

About this paper

Cite this paper

Andriushchenko, R., Bork, A., Češka, M., Junges, S., Katoen, JP., Macák, F. (2023). Search and Explore: Symbiotic Policy Synthesis in POMDPs. In: Enea, C., Lal, A. (eds) Computer Aided Verification. CAV 2023. Lecture Notes in Computer Science, vol 13966. Springer, Cham. https://doi.org/10.1007/978-3-031-37709-9_6

Download citation

DOI: https://doi.org/10.1007/978-3-031-37709-9_6
Published: 17 July 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-37708-2
Online ISBN: 978-3-031-37709-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Search and Explore: Symbiotic Policy Synthesis in POMDPs

Abstract

Similar content being viewed by others

Enforcing Almost-Sure Reachability in POMDPs

POMDP Controllers with Optimal Budget

Strategy Synthesis in Markov Decision Processes Under Limited Sampling Access

1 Introduction

2 Motivating Examples

3 Preliminaries and Problem Statement

Definition 1 (MDP)

Definition 2 (POMDP)

Definition 3 (FSC)

Remark 1

4 FSCs for and from Belief Exploration

4.1 Belief Exploration with Explicit FSC Construction

Definition 4 (Belief MDP)

4.2 Using FSCs for Cut-Off Values

4.3 Extracting FSC from Belief Exploration

Definition 5 (Belief-based FSC with cut-offs)

5 Accelerated Inductive Synthesis

5.1 Inductive Synthesis with k-FSCs

Definition 6

Lemma 1

5.2 Using Reference Policies to Accelerate Inductive Synthesis

5.3 Inductive Synthesis with Adequate FSCs

Definition 7

6 Integrating Belief Exploration with Inductive Synthesis

7 Experiments

8 Conclusion and Future Work

Notes

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation