
1 Introduction

Markov decision processes (MDPs) are the standard formalism to model sequential decision making under uncertainty. A typical goal is to find a policy that satisfies a temporal logic specification [5]. Probabilistic model checkers such as Storm [22] and Prism [30] efficiently compute such policies. A concern, however, is the robustness against potential perturbations in the environment. MDPs cannot capture such uncertainty about the shape of the environment.

Multi-environment MDPs (MEMDPs) [14, 36] contain a set of MDPs, called environments, over the same state space. The goal in MEMDPs is to find a single policy that satisfies a given specification in all environments. MEMDPs are, for instance, a natural model for MDPs with unknown system dynamics, where several domain experts provide their interpretation of the dynamics [11]. These different MDPs together form a MEMDP. MEMDPs also arise in other domains: Guessing a (static) password is a natural example from security. In robotics, a MEMDP captures unknown positions of a static obstacle. One can interpret MEMDPs as a (disjoint) union of MDPs in which an agent only has partial observation, i.e., every MEMDP can be cast into a linearly larger partially observable MDP (POMDP) [27]. Indeed, some well-known POMDP examples are in fact MEMDPs, such as RockSample [39] and Hallway [31]. Solving POMDPs is notoriously hard [32], and thus, it is worthwhile to investigate natural subclasses.

We consider almost-sure specifications, where a set of target states must be reached with probability one. In MDPs, it suffices to consider memoryless policies. Constructing such policies can be efficiently implemented by means of a graph search [5]. For MEMDPs, we consider the following problem:

Compute one policy that almost-surely reaches the target in all environments.

Such a policy robustly satisfies an almost-sure specification for a set of MDPs.

Our approach. Inspired by work on POMDPs, we construct a belief-observation MDP (BOMDP) [16] that tracks the states of the MDPs and the (support of the) belief over potential environments. We show that a policy satisfying the almost-sure property in the BOMDP also satisfies the property in the MEMDP.

Although the BOMDP is exponentially larger than the MEMDP, we exploit its particular structure to create a PSPACE algorithm to decide whether such a robust policy exists. The essence of the algorithm is a recursive construction of a fragment of the BOMDP, restricted to a setting in which the belief-support is fixed. Such an approach is possible, as the belief in a MEMDP behaves monotonically: Once we know that we are not in a particular environment, we never lose this knowledge. This behavior is in contrast to POMDPs, where there is no monotonic behavior in belief-supports. The difference is essential: Deciding almost-sure reachability in POMDPs is EXPTIME-complete [19, 37]. In contrast, the problem of deciding whether a policy for almost-sure reachability in a MEMDP exists is indeed PSPACE-complete. We show the hardness using a reduction from the true quantified Boolean formula problem. Finally, we cannot hope to extract a policy with such an algorithm, as the smallest policy for MEMDPs may require exponential memory in the number of environments.

The PSPACE algorithm itself recomputes many results. For practical purposes, we create an algorithm that iteratively explores parts of the BOMDP. The algorithm additionally uses the MEMDP structure to generalize the set of states from which a winning policy exists and deduce efficient heuristics for guiding the exploration. The combination of these ingredients leads to an efficient and competitive prototype on top of the model checker Storm.

Related work. We categorize related work in three areas.

MEMDPs. Almost-sure reachability for MEMDPs with exactly two environments has been studied in [36]. We extend the results to arbitrarily many environments. This is nontrivial: For two environments, the decision problem admits a polynomial-time routine [36], whereas we show that the problem is PSPACE-complete for an arbitrary number of environments. MEMDPs and closely related models such as hidden-model MDPs, hidden-parameter MDPs, multi-model MDPs, and concurrent MDPs [2, 10, 11, 40] have been considered for quantitative properties. The typical approach is to consider approximative algorithms for the undecidable problem in POMDPs [14] or to adapt reinforcement learning algorithms [3, 28]. These approximations are not applicable to almost-sure properties.

POMDPs. One can build the underlying, potentially infinite belief-MDP [27] that corresponds to the POMDP; verifying this MDP with model checkers [7, 8, 35] answers the question for MEMDPs. For POMDPs, almost-sure reachability is decidable in exponential time [19, 37] via a construction similar to ours. Most qualitative properties beyond almost-sure reachability are undecidable [4, 15]. Two dedicated algorithms that limit the search to policies with small memory requirements and employ a SAT-based approach [12, 26] to this NP-hard problem [19] are implemented in Storm. We use them as baselines.

Robust models. The high-level representation of MEMDPs is structurally similar to featured MDPs [1, 18] that represent sets of MDPs. The proposed techniques are called family-based model checking and compute policies for every MDP in the family, whereas we aim to find one policy for all MDPs. Interval MDPs [23, 25, 43] and SGs [38] do not allow for dependencies between states and thus cannot model features such as various obstacle positions. Parametric MDPs [2, 24, 44] assume controllable uncertainty and do not consider robustness of policies.

Contributions. We establish PSPACE-completeness for deciding almost-sure reachability in MEMDPs and show that the policies may be exponentially large. Our iterative algorithm, which is the first specific to almost-sure reachability in MEMDPs, builds fragments of the BOMDP. An empirical evaluation shows that the iterative algorithm outperforms approaches dedicated to POMDPs.

2 Problem Statement

In this section, we provide some background and formalize the problem statement.

For a set X, \( Dist (X)\) denotes the set of probability distributions over X. For a given distribution \(d \in Dist (X)\), we denote its support as \( Supp (d)\). For a finite set X, let \(\textsf{unif}(X)\) denote the uniform distribution. \(\textsf{dirac}(x)\) denotes the Dirac distribution on \(x\in X\). We use short-hand notation for functions and distributions: \(f = [ x \mapsto a, y \mapsto b ]\) means that \(f(x) = a\) and \(f(y) = b\). We write \(\mathcal {P}\left( X\right) \) for the powerset of X. For \(n \in \mathbb {N}\), we write \([n] = \{ i \in \mathbb {N} \mid 1 \le i \le n \}\).

Definition 1 (MDP)

A Markov Decision Process is a tuple \(\mathcal {M}= \langle S, A, \iota _{\text {init}}, p \rangle \) where \(S\) is the finite set of states, \(A\) is the finite set of actions, \(\iota _{\text {init}}\in Dist (S)\) is the initial state distribution, and \(p:S\times A\rightarrow Dist (S)\) is the transition function.

The transition function is total; that is, for notational convenience, MDPs are input-enabled. This requirement does not affect the generality of our results. A path of an MDP is a sequence \(\pi = s_{0}a_{0}s_1a_{1}\ldots s_{n}\) such that \(\iota _{\text {init}}(s_{0}) > 0\) and \(p(s_{i},a_{i})(s_{i+1}) > 0\) for all \(0 \le i < n\). The last state of \(\pi \) is \( last (\pi )=s_n\). The set of all finite paths is \(\textsc {Path}\), and \(\textsc {Path}(S')\) denotes the paths starting in a state from \(S'\subseteq S\). The set of states reachable from \(S'\) is \(\textsf{Reachable}(S')\). If \(S' = Supp (\iota _{\text {init}})\), we simply call them the reachable states. The MDP restricted to the states reachable from a distribution \(d \in Dist (S)\) is \(\textsf{ReachFragment}(\mathcal {M}, d)\), where d is the new initial distribution. A state \(s\in S\) is absorbing if \(\textsf{Reachable}(\{s\})=\{s\}\). An MDP is acyclic if each state is absorbing or not reachable from its successor states.

Action choices are resolved by a policy \(\sigma :\textsc {Path}\rightarrow Dist (A)\) that maps paths to distributions over actions. A policy is memoryless if it has the form \(\sigma :S\rightarrow Dist (A)\), deterministic if it has the form \(\sigma :\textsc {Path}\rightarrow A\), and memoryless deterministic if it has the form \(\sigma :S\rightarrow A\). For an MDP \(\mathcal {M}\), we denote the probability of a policy \(\sigma \) reaching some target set \(T \subseteq S\) starting in state s as \({\Pr }_{\mathcal {M}}(s \rightarrow T \mid \sigma )\). More precisely, \({\Pr }_{\mathcal {M}}(s \rightarrow T \mid \sigma )\) denotes the probability of all paths from s reaching T under \(\sigma \). We use \({\Pr }_{\mathcal {M}}(T \mid \sigma )\) if s is distributed according to \(\iota _{\text {init}}\).
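For illustration, the states from which T can be reached with probability one in a single MDP can be computed by the standard graph-based fixpoint mentioned in Section 1. The following is a minimal sketch in Python, assuming an encoding of the transition function as nested dictionaries p[s][a] = {s': probability}; it is not the implementation used later in the paper.

```python
# Sketch (illustration only): almost-sure reachability in one MDP.
# The MDP is encoded as p[s][a] = {s': prob}; T is the set of target states.

def almost_sure_reach(states, p, T):
    """Return the set of states from which T is reached with probability 1."""
    U = set(states)                      # candidate winning set
    while True:
        # Backward search: s joins R if some action keeps all probability
        # mass inside U and reaches the current R with positive probability.
        R = set(T) & U
        changed = True
        while changed:
            changed = False
            for s in U - R:
                for dist in p.get(s, {}).values():
                    succ = set(dist)
                    if succ <= U and succ & R:
                        R.add(s)
                        changed = True
                        break
        if R == U:                       # greatest fixpoint reached
            return U
        U = R                            # shrink the candidate set and repeat

# Tiny example: from s0, action 'a' risks falling into the sink s2.
p = {
    's0': {'a': {'t': 0.5, 's2': 0.5}, 'b': {'s0': 0.5, 't': 0.5}},
    's2': {'a': {'s2': 1.0}},
    't':  {'a': {'t': 1.0}},
}
print(almost_sure_reach(['s0', 's2', 't'], p, {'t'}))   # {'s0', 't'}
```

A winning memoryless policy can then be obtained, for instance, by randomizing uniformly over all actions of a winning state whose support stays inside the returned set.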

Definition 2 (MEMDP)

A Multiple Environment MDP is a tuple \(\mathcal {N}= \langle S, A, \iota _{\text {init}}, \{p_i\}_{i \in I} \rangle \) with \(S, A, \iota _{\text {init}}\) as for MDPs, and \(\{p_i\}_{i \in I}\) is a set of transition functions, where I is a finite set of environment indices.

Intuitively, MEMDPs form sets of MDPs (environments) that share states and actions but differ in the transition probabilities. For a MEMDP \(\mathcal {N}\) with index set I and a set \(I' \subseteq I\), we define the restriction of environments as the MEMDP \({\mathcal {N}}_{\downarrow I'} = \langle S, A, \iota _{\text {init}}, \{p_{i}\}_{i \in I'} \rangle \). Given an environment \(i \in I\), we denote its corresponding MDP as \(\mathcal {N}_{i}=\langle S, A, \iota _{\text {init}}, p_{i} \rangle \). A MEMDP with only one environment is an MDP. Paths and policies of a MEMDP are defined on its states and actions and thus coincide with paths and policies of its MDPs. A MEMDP is acyclic if each of its MDPs is acyclic.

Fig. 1. Example MEMDP (figure omitted).

Example 1

Figure 1 shows an MEMDP with three environments \(\mathcal {N}_i\). An agent can ask two questions, \(q_{1}\) and \(q_{2}\). The response is either ‘switch’ (\(s_{1} \leftrightarrow s_{2}\)), or ‘stay’ (loop). In \(\mathcal {N}_1\), the response to \(q_1\) and \(q_2\) is to switch. In \(\mathcal {N}_2\), the response to \(q_1\) is stay, and to \(q_2\) is switch. The agent can guess the environment using \(a_{1}, a_{2}, a_{3}\). Guessing \(a_i\) leads to the target only in environment i. Thus, an agent must deduce the environment via \(q_{1}, q_{2}\) to surely reach the target. \(\blacksquare \)

Definition 3 (Almost-Sure Reachability)

An almost-sure reachability property is defined by a set \(T \subseteq S\) of target states. A policy \(\sigma \) satisfies the property T for MEMDP \(\mathcal {N}= \langle S, A, \iota _{\text {init}}, \{p_{i}\}_{i \in I} \rangle \) iff \(\forall i \in I:{\Pr }_{\mathcal {N}_{i}}(T \mid \sigma )=1\).

In other words, a policy \(\sigma \) is winning for an almost-sure reachability property T if and only if the probability of reaching T within each MDP is one. By extension, a state \(s \in S\) is winning if there exists a winning policy when starting in state s. Policies and states that are not winning are losing.

We will now define both the decision and policy problem:

Decision problem: Given a MEMDP \(\mathcal {N}\) and a set of target states T, decide whether a winning policy for T exists.

Policy problem: Given a MEMDP \(\mathcal {N}\) and a set of target states T, compute a winning policy for T, if one exists.

In Section 4 we discuss the computational complexity of the decision problem. Following up, in Section 5 we present our algorithm for solving the policy problem. Details on its implementation and evaluation will be presented in Section 6.

3 A Reduction To Belief-Observation MDPs

In this section, we reduce the policy problem, and thus also the decision problem, to finding a policy in an exponentially larger belief-observation MDP. This reduction is an elementary building block for the construction of our PSPACE algorithm and the practical implementation. Additional information such as proofs for statements throughout the paper are available in the technical report [41].

3.1 Interpretation of MEMDPs as Partially Observable MDPs

Definition 4 (POMDP)

A partially observable MDP (POMDP) is a tuple \(\langle \mathcal {M}, Z, O \rangle \) with an MDP \(\mathcal {M}= \langle S, A, \iota _{\text {init}}, p \rangle \), a set \(Z\) of observations, and an observation function \(O:S \rightarrow Z\).

A POMDP is an MDP where states are labelled with observations. We lift \(O\) to paths and use \(O(\pi ) = O(s_1) a_{1} O(s_{2}) \ldots O(s_n)\). We use observation-based policies \(\sigma \), i.e., policies such that for all \(\pi , \pi ' \in \textsc {Path}\), \( O(\pi ) = O(\pi ') \text { implies } \sigma (\pi ) = \sigma (\pi '). \) A MEMDP can be cast into a POMDP built as the disjoint union of its environments:

Definition 5 (Union-POMDP)

Given an MEMDP \(\mathcal {N}= \langle S, A, \iota _{\text {init}}, \{p_i\}_{i \in I} \rangle \) we define its union-POMDP \({\mathcal {N}}_{\sqcup } = \langle \langle S', A, \iota _{\text {init}}', p' \rangle , Z, O \rangle \), with states \(S' = S \times I\), initial distribution \(\iota _{\text {init}}'(\langle s, i \rangle ) = \iota _{\text {init}}(s) \cdot |I|^{-1}\), transitions \(p'(\langle s, i \rangle , a)(\langle s', i \rangle ) = p_{i}(s, a)(s')\), observations \(Z= S\), and observation function \(O(\langle s, i \rangle ) = s\).

A policy may observe the state s, but not which MDP it operates in. This forces any observation-based policy to take the same choice in all environments.

Lemma 1

Given MEMDP \(\mathcal {N}\), there exists a winning policy iff there exists an observation-based policy \(\sigma \) such that \({\Pr }_{{\mathcal {N}}_{\sqcup }}(T \mid \sigma )=1\).

The statement follows since, first, any observation-based policy of the POMDP can be applied to the MEMDP; second, vice versa, any MEMDP policy is observation-based; and third, the Markov chains induced by these policies are isomorphic.

3.2 Belief-observation MDPs

For POMDPs, memoryless policies are not sufficient, which makes computing policies intricate. We therefore add the information that the history, i.e., the path up to the current point, contains. In MEMDPs, this information is the (environment-)belief (support) \(J \subseteq I\), i.e., the set of environments that are consistent with a path in the MEMDP. Given a belief \(J \subseteq I\) and a state-action-state transition \(s \xrightarrow {a} s'\), we define \(\textsf{Up}(J, s, a, s') = \{ i \in J \mid p_{i}(s, a,s') > 0\}\), i.e., the subset of environments in which the transition exists. For a path \(\pi \in \textsc {Path}\), we define its corresponding belief \(\mathcal {B}(\pi ) \subseteq I\) recursively as:

$$\begin{aligned} \mathcal {B}(s_{0})&= I \quad \text { and }\quad \mathcal {B}(\pi \cdot s a s') = \textsf{Up}(\mathcal {B}(\pi \cdot s), s, a, s') \end{aligned}$$

The belief in a MEMDP monotonically decreases along a path, i.e., if we know that we are not in a particular environment, this remains true indefinitely.
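For concreteness, the following sketch is a direct transcription of the update operator \(\textsf{Up}\) and of \(\mathcal {B}(\pi )\). The nested-dictionary encoding p[i][s][a] = {s': probability} of the transition functions is an assumption made for this illustration.

```python
# Sketch of the belief-support update Up and of the belief B(pi) of a path.
# p[i][s][a] = {s': prob} is the transition function of environment i.

def update(p, J, s, a, s_next):
    """Up(J, s, a, s'): the environments in J in which the transition exists."""
    return {i for i in J if p[i][s][a].get(s_next, 0.0) > 0.0}

def belief(p, I, path):
    """B(pi) for a path [s0, a0, s1, a1, ..., sn]; it starts as the full set I
    and can only shrink along the path (monotonicity)."""
    J = set(I)
    for k in range(0, len(path) - 2, 2):
        s, a, s_next = path[k], path[k + 1], path[k + 2]
        J = update(p, J, s, a, s_next)
    return J
```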

We aim to use a model where memoryless policies suffice. To that end, we cast MEMDPs into the exponentially larger belief-observation MDPs [16].

Definition 6 (BOMDP)

For a MEMDP \(\mathcal {N}= \langle S, A, \iota _{\text {init}}, \{p_i\}_{i \in I} \rangle \), we define its belief-observation MDP (BOMDP) as a POMDP \(\mathcal {G}_{\mathcal {N}}= \langle \langle S', A, \iota _{\text {init}}', p' \rangle , Z, O \rangle \) with states \(S' = S \times I \times \mathcal {P}\left( I\right) \), initial distribution \(\iota _{\text {init}}'(\langle s, j, I \rangle ) = \iota _{\text {init}}(s) \cdot |I|^{-1}\), transition relation \(p'(\langle s, j, J \rangle , a)(\langle s', j, J' \rangle ) = p_{j}(s, a, s')\) with \(J' = \textsf{Up}(J, s, a, s')\), observations \(Z= S \times \mathcal {P}\left( I\right) \), and observation function \(O(\langle s, j, J \rangle ) = \langle s, J \rangle \).

Compared to the union-POMDP, BOMDPs also track the belief by updating it accordingly. We clarify the correspondence between paths of the BOMDP and the MEMDP. For a path \(\pi \) through the MEMDP, we can mimic this path exactly in the MDPs \(\mathcal {N}_j\) for \(j \in \mathcal {B}(\pi )\). As we track \(\mathcal {B}(\pi )\) in the state, we can deduce from the BOMDP state in which environments we can be.
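The following sketch makes the construction of Definition 6 concrete: a BOMDP state is a triple (s, j, J), and its successors update the belief component with \(\textsf{Up}\). It reuses the illustrative dictionary encoding from the previous sketch and is not the data structure of the actual implementation.

```python
# Sketch of the BOMDP transition and observation functions (Definition 6).
# A BOMDP state is (s, j, J): MEMDP state s, true environment j, belief support J.

def bomdp_successors(p, state, a):
    """Distribution over successor BOMDP states for action a."""
    s, j, J = state
    dist = {}
    for s_next, prob in p[j][s][a].items():
        # If j is in J, then j is also in J_next, so J_next is nonempty (Lemma 2).
        J_next = frozenset(i for i in J if p[i][s][a].get(s_next, 0.0) > 0.0)
        dist[(s_next, j, J_next)] = prob
    return dist

def observation(state):
    """O((s, j, J)) = (s, J): the true environment j is hidden."""
    s, _, J = state
    return (s, J)
```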

Lemma 2

For MEMDP \(\mathcal {N}\) and the path \(\langle s_1,j,J_1 \rangle a_1\langle s_2,j,J_2 \rangle \ldots \langle s_n,j,J_n \rangle \) of the BOMDP \(\mathcal {G}_{\mathcal {N}}\), let \(j \in J_{1}\). Then: \(J_n \ne \emptyset \) and the path \(s_1a_1 \ldots s_n\) exists in MDP \(\mathcal {N}_i\) iff \(i \in J_1 \cap J_n\).

Consequently, the belief of a path can be uniquely determined by the observation of the last state reached, hence the name belief-observation MDPs.

Lemma 3

For every pair of paths \(\pi , \pi '\) in a BOMDP, we have:

$$\begin{aligned} \mathcal {B}(\pi ) = \mathcal {B}(\pi ') \quad \text { implies }\quad O( last (\pi )) = O( last (\pi ')). \end{aligned}$$

For notation, we define \(S_J = \{ \langle s, j, J \rangle \mid j \in J, s \in S \}\), and analogously write \(Z_J = \{ \langle s, J \rangle \mid s \in S \}\). We lift the target states T to states in the BOMDP: \(T_{\mathcal {G}_{\mathcal {N}}} = \{ \langle s, j, J \rangle \mid s \in T, J \subseteq I, j \in J\}\) and define target observations \(T_Z= O(T_{\mathcal {G}_{\mathcal {N}}})\).

Definition 7 (Winning in a BOMDP)

Let \(\mathcal {G}_{\mathcal {N}}\) be a BOMDP with target observations \(T_Z\). An observation-based policy \(\sigma \) is winning from some observation \(z\in Z\), if for all \(s \in O^{-1}(z)\) it holds that \(\Pr _{\mathcal {G}_{\mathcal {N}}}( s \rightarrow O^{-1}(T_Z) \mid \sigma ) = 1\).

Furthermore, a policy \(\sigma \) is winning if it is winning for the initial distribution \(\iota _{\text {init}}\). An observation \(z\) is winning if there exists a winning policy for \(z\). The winning region \(\textsf{Win}_{\mathcal {G}_{\mathcal {N}}}^{T}\) is the set of all winning observations.

Almost-sure winning in the BOMDP corresponds to winning in the MEMDP.

Theorem 1

There exists a winning policy for a MEMDP \(\mathcal {N}\) with target states T iff there exists a winning policy in the BOMDP \(\mathcal {G}_{\mathcal {N}}\) with target states \(T_{\mathcal {G}_{\mathcal {N}}}\).

Intuitively, the important aspect is that for almost-sure reachability, observation-based memoryless policies are sufficient [13]. For any such policy, the induced Markov chains on the union-POMDP and the BOMDP are bisimilar [16].

BOMDPs make policy search conceptually easier. First, as memoryless policies suffice for almost-sure reachability, winning regions are independent of fixed policies: For policies \(\sigma \) and \(\sigma '\) that are winning in observation \(z\) and \(z'\), respectively, there must exist a policy \(\hat{\sigma }\) that is winning for both \(z\) and \(z'\). Second, winning regions can be determined in polynomial time in the size of the BOMDP [16].

3.3 Fragments of BOMDPs

To avoid storing the exponentially sized BOMDP, we only build fragments: We may select any set of observations as frontier observations and make the states with those observations absorbing. We later discuss the selection of frontiers.

Definition 8 (Sliced BOMDP)

For a BOMDP \(\mathcal {G}_{\mathcal {N}}= \langle \langle S, A, \iota _{\text {init}}, p \rangle , Z, O \rangle \) and a set of frontier observations \(F \subseteq Z\), we define a BOMDP \({\mathcal {G}_{\mathcal {N}}}{\mid \!F} = \langle \langle S, A, \iota _{\text {init}}, p' \rangle , Z, O \rangle \) with:

$$ \forall s \in S, a \in A:p'(s, a) = {\left\{ \begin{array}{ll} \textsf{dirac}(s) &{} \text {if } O(s) \in F,\\ p(s,a) &{} \text {otherwise.}\\ \end{array}\right. } $$

We exploit this sliced BOMDP to derive constraints on the set of winning states.
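A minimal sketch of the slicing operation of Definition 8: every state whose observation lies in the frontier F receives a Dirac self-loop. The successor and observation functions are assumed to be given, e.g., obtained by fixing p in the BOMDP sketch above.

```python
# Sketch of Definition 8: make all states with a frontier observation absorbing.
# 'successors(state, a)' returns a successor distribution and 'observation(state)'
# an observation; both are assumed to be provided.

def sliced_successors(successors, observation, frontier):
    def p_sliced(state, a):
        if observation(state) in frontier:
            return {state: 1.0}          # Dirac self-loop: the state is absorbing
        return successors(state, a)
    return p_sliced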

Lemma 4

For every BOMDP \(\mathcal {G}_{\mathcal {N}}\) with states S and targets T and for all frontier observations \(F \subseteq Z\) it holds that: \(\textsf{Win}_{{\mathcal {G}_{\mathcal {N}}}{\mid \!F}}^{T} \subseteq \textsf{Win}_{\mathcal {G}_{\mathcal {N}}}^{T} \subseteq \textsf{Win}_{{\mathcal {G}_{\mathcal {N}}}{\mid \!F}}^{T \cup F}\).

Making (non-target) observations absorbing extends the set of losing observations, while adding target states extends the set of winning observations.

4 Computational Complexity

The BOMDP \(\mathcal {G}_{\mathcal {N}}\) above yields an exponential time and space algorithm via Theorem 1. We can avoid the exponential memory requirement. This section shows the PSPACE-completeness of deciding whether a winning policy exists.

Theorem 2

The almost-sure reachability decision problem is PSPACE-complete.

The result follows from Lemmas 10 and 11 below. In Section 4.3, we show that representing the winning policy itself may, however, require exponential space.

4.1 Deciding Almost-Sure Winning for MEMDPs in PSPACE

We develop an algorithm with a polynomial memory footprint. The algorithm exploits locality of cyclic behavior in the BOMDP, as formalized by an acyclic environment graph and local BOMDPs that match the nodes in the environment graph. The algorithm recurses on the environment graph while memorizing results from polynomially many local BOMDPs.

The graph-structure of BOMDPs. First, along a path of the MEMDP, we only gain information and are thus able to rule out certain environments [14]. Due to the monotonicity of the update operator, we have for any BOMDP that \(\langle s,j,J \rangle \in \textsf{Reachable}(\{\langle s', j, J' \rangle \})\) implies \(J \subseteq J'\). We define a graph over environment sets that describes how the belief-support can update along a run.

Fig. 2. The environment graph for our running example (figure omitted).

Definition 9 (Environment graph)

Let \(\mathcal {N}\) be a MEMDP and p the transition function of \(\mathcal {G}_{\mathcal {N}}\). The environment graph \( GE _\mathcal {N}=( V _\mathcal {N}, E _\mathcal {N})\) for \(\mathcal {N}\) is a directed graph with vertices \( V _\mathcal {N}=\mathcal {P}\left( I\right) \) and edges

$$\begin{aligned} E _\mathcal {N}= \{ \langle J, J' \rangle \mid \exists s, s' \in S, a \in A, j \in I. p(\langle s,j,J \rangle , a, \langle s', j, J' \rangle ) > 0 \text { and } J \ne J' \}. \end{aligned}$$

Example 2

Figure 2 shows the environment graph for the MEMDP in Ex. 1. It consists of the different belief-supports. For example, the transitions from \(\{1,2,3\}\) to \(\{2,3\}\) and to \(\{1\}\) are due to the action \(q_1\) in state \(s_0\), as shown in Fig. 1. \(\blacksquare \)
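The outgoing edges of a belief support J in the environment graph can be enumerated directly on the MEMDP, as in the following sketch. We restrict attention to successor states that occur in some environment of J, which corresponds to the reachable part of the BOMDP (cf. Lemma 2); the dictionary encoding is the same illustrative assumption as before.

```python
# Sketch: edges of the environment graph leaving a belief support J (Definition 9).
from itertools import chain

def environment_edges(p, states, actions, J):
    edges = set()
    for s in states:
        for a in actions:
            # Successor states that occur in at least one environment of J.
            succs = set(chain.from_iterable(p[i][s][a] for i in J))
            for s_next in succs:
                J_next = frozenset(i for i in J if p[i][s][a].get(s_next, 0.0) > 0.0)
                if J_next and J_next != frozenset(J):
                    edges.add((frozenset(J), J_next))
    return edges
```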

Paths in the environment graph abstract paths in the BOMDP. Path fragments in which the belief-support remains unchanged are summarized into one step, as we do not create edges of the form \(\langle J,J \rangle \). We formalize this idea: Let \(\pi =\langle s_1, j, J_1 \rangle a_1\langle s_2, j, J_2 \rangle \dots \langle s_n, j, J_n \rangle \) be a path in the BOMDP. For any \(J \subseteq I\), we call \(\pi \) a J-local path if \(J_i = J\) for all \(i \in [n]\).

Lemma 5

For a MEMDP \(\mathcal {N}\) with environment graph \( GE _\mathcal {N}\), there is a path \(J_1 \dots J_n\) iff there is a path \(\pi = \pi _1\dots \pi _n\) in \(\mathcal {G}_{\mathcal {N}}\) s.t. every \(\pi _i\) is \(J_i\)-local.

The shape of the environment graph is crucial for the algorithm we develop.

Lemma 6

Let \( GE _\mathcal {N}= ( V _\mathcal {N}, E _\mathcal {N})\) be an environment graph for MEMDP \(\mathcal {N}\). First, \( E _\mathcal {N}(J,J')\) implies \(J' \subsetneq J\). Thus, \( GE _\mathcal {N}\) is acyclic and has maximal path length |I|. The outdegree of the graph is at most \(|S|^2|A|\).

The monotonicity regarding \(J, J'\) follows from the definition of the belief update. The bound on the outdegree is a consequence of Lemma 9 below.

Local belief-support BOMDPs. Before we continue, we remark that the (future) dynamics in a BOMDP only depend on the current state and set of environments. More formally, we capture this intuition as follows.

Lemma 7

Let \(\mathcal {G}_{\mathcal {N}}\) be a BOMDP with states \(S'\). For any state \(\langle s, j, J \rangle \in S'\), let \(\mathcal {N}' = \textsf{ReachFragment}({\mathcal {N}}_{\downarrow J}, \textsf{dirac}(s))\) and \(Y = \{ \langle s, i, J \rangle \mid i \in J\}\). Then:

$$\begin{aligned} \textsf{ReachFragment}(\mathcal {G}_{\mathcal {N}}, \textsf{unif}(Y)) = \mathcal {G}_{\mathcal {N}'}. \end{aligned}$$

The key insight is that restricting the MEMDP does not change the transition functions for the environments \(j \in J\). Furthermore, using monotonicity of the update, we only reach BOMDP-states whose behavior is determined by the environments in J.

This intuition allows us to analyze the BOMDP locally and lift the results to the complete BOMDP. We define a local BOMDP as the part of a BOMDP starting in any state in \(S_J\). All observations not in \(Z_J\) are made absorbing.

Definition 10 (Local BOMDP)

Given a MEMDP \(\mathcal {N}\) with BOMDP \(\mathcal {G}_{\mathcal {N}}\) and a set of environments J, the local BOMDP for environments J is the fragment

$$\begin{aligned} \textsc {Loc}\mathcal {G}({J}) = \textsf{ReachFragment}({\mathcal {G}_{{\mathcal {N}}_{\downarrow J}}}{\mid \!F}, \textsf{unif}(S_J))\quad \text { where }\quad F = Z\setminus Z_J\ . \end{aligned}$$

This definition of a local BOMDP coincides with a fragment of the complete BOMDP. We then mark exactly the winning observations restricted to the environment sets \(J' \subsetneq J\) as winning in the local BOMDP and compute all winning observations in the local BOMDP. These observations are winning in the complete BOMDP. The following concretization of Lemma 4 formalizes this.

Lemma 8

Consider a MEMDP \(\mathcal {N}\) and a subset of environments J.

$$\begin{aligned} \textsf{Win}_{\textsc {Loc}\mathcal {G}({J})}^{T'_{\mathcal {G}_{\mathcal {N}}}} \cap Z_J ~=~ \textsf{Win}_{\mathcal {G}_{\mathcal {N}}}^{T_{\mathcal {G}_{\mathcal {N}}}} \cap Z_J \quad \text { with }\quad T'_{\mathcal {G}_{\mathcal {N}}} = T_{\mathcal {G}_{\mathcal {N}}} \cup (\textsf{Win}_{\mathcal {G}_{\mathcal {N}}}^{T_{\mathcal {G}_{\mathcal {N}}}} \setminus Z_J). \end{aligned}$$

Furthermore, local BOMDPs are polynomially bounded in the size of the MEMDP.

Lemma 9

Let \(\mathcal {N}\) be a MEMDP with states S and actions A. \(\textsc {Loc}\mathcal {G}({J})\) has at most \(\mathcal {O}(|S|^2\cdot |A|\cdot |J|)\) states and \(\mathcal {O}(|S|^{2}\cdot |A|\cdot |J|^2)\) transitions.

A PSPACE algorithm. We present Algorithm 1 for the MEMDP decision problem, which recurses depth-first over the paths in the environment graph. We first state the correctness and the space complexity of this algorithm.

Algorithm 1 (ASWinning; listing omitted).

Lemma 10

ASWinning in Alg. 1 solves the decision problem in PSPACE.

To prove correctness, we first note that \(\textsc {Search}(\mathcal {N}, J, T)\) computes \(\textsf{Win}_{\mathcal {G}_{\mathcal {N}}}^{T_{\mathcal {G}_{\mathcal {N}}}} \cap Z_J\). We show this by induction over the structure of the environment graph. For all J without outgoing edges, the local BOMDP coincides with a BOMDP just for the environments J (Lemma 7). Otherwise, observe that \(T'\) in line 5 coincides with its definition in Lemma 8 and thus, by the same lemma, we return \(\textsf{Win}_{\mathcal {G}_{\mathcal {N}}}^{T_{\mathcal {G}_{\mathcal {N}}}} \cap Z_J\). To finalize the proof, a winning policy exists in the MEMDP iff the observations of the initial states of the BOMDP are winning (Theorem 1). The algorithm terminates as it recurses over all paths of a finite acyclic graph, see Lemma 6. Following Lemma 9, the number of frontier states is bounded by \(|S|^2\cdot |A|\). The main body of the algorithm therefore requires polynomial space, and the maximal recursion depth (stack height) is |I| (Lemma 6). Together, this yields a space complexity in \(\mathcal {O}(|S|^{2}\cdot |A|\cdot |I|^2)\).
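Since the listing of Algorithm 1 is not reproduced here, the following structural sketch mirrors the recursion described above. The helpers succ_beliefs (the outgoing edges of the environment graph) and solve_local (model checking the local BOMDP \(\textsc {Loc}\mathcal {G}({J})\) against a set of already-winning frontier observations, cf. Lemma 8) are assumed interfaces. The cache trades the polynomial space bound for speed; dropping it recovers the depth-first recomputation of the PSPACE algorithm.

```python
# Structural sketch of the recursion over the environment graph (Lemma 10).
# succ_beliefs(J) yields the belief supports J' strictly below J reachable in
# one step; solve_local(J, winning_frontier) model-checks LocG(J) and returns
# its winning observations in Z_J. Both are assumed to be provided.

def search(J, succ_beliefs, solve_local, cache=None):
    """Return the winning observations of the full BOMDP restricted to Z_J."""
    if cache is None:
        cache = {}
    J = frozenset(J)
    if J in cache:
        return cache[J]
    winning_frontier = set()
    for J_next in succ_beliefs(J):        # recurse on strictly smaller supports
        winning_frontier |= search(J_next, succ_beliefs, solve_local, cache)
    result = solve_local(J, winning_frontier)
    cache[J] = result
    return result

def as_winning(I, initial_observations, succ_beliefs, solve_local):
    """Decide whether all initial observations (s, I) are winning."""
    win = search(I, succ_beliefs, solve_local)
    return all(obs in win for obs in initial_observations)
```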

4.2 Deciding Almost-Sure Winning for MEMDPs Is PSPACE-hard

It is not possible to improve the algorithm beyond PSPACE.

Lemma 11

The MEMDP decision problem is PSPACE-hard.

Hardness holds even for acyclic MEMDPs; the proof uses the following fact.

Lemma 12

If a winning policy exists for an acyclic MEMDP, there also exists a winning policy that is deterministic.

In acyclic MEMDPs, almost-sure reachability coincides with avoiding the sink states, which is a safety property. For safety, deterministic policies are sufficient: randomization only visits additional states, which is not beneficial for safety.

Regarding Lemma 11, we sketch a polynomial-time reduction from the PSPACE-complete TQBF problem [20] to the MEMDP decision problem. Let \(\Psi \) be a QBF formula, \( \Psi = \exists x_{1} \forall y_{1} \exists x_{2} \forall y_{2} \ldots \exists x_{n} \forall y_{n}\big [\Phi \big ] \), with \(\Phi \) a Boolean formula in conjunctive normal form. The problem is to decide whether \(\Psi \) is true.

Fig. 3. Constructed MEMDP for the QBF formula \(\forall x \exists y \big [ (x \vee y) \wedge (\lnot x \vee \lnot y) \big ]\) (figure omitted).

Example 3

Consider the QBF formula \(\Psi = \forall x \exists y \big [ (x \vee y) \wedge (\lnot x \vee \lnot y) \big ]\). We construct a MEMDP with an environment for every clause, see Figure 3. The state space consists of three states for each variable \(v \in V\): the state v and the states \(v\top \) and \(v\bot \) that encode its assignment. Additionally, we have a dedicated target state W and sink state F. We consider three actions: The actions true (\(\top \)) and false (\(\bot \)) semantically describe the assignment to existentially quantified variables. The action any (\(\alpha _\otimes \)) is used in all other states. Every environment reaches the target state iff one literal in its clause is assigned true.

In the example, intuitively, a policy should assign the negation of x to y. Formally, the policy \(\sigma \), characterized by \( \sigma (\pi \cdot y) = \top \) iff \(x\bot \in \pi \), is winning. \(\blacksquare \)

As a consequence of this construction, we may also deduce the following theorem.

Theorem 3

Deciding whether a memoryless winning policy exists is NP-complete.

The proof of NP-hardness uses a similar construction for the propositional SAT fragment of QBF, i.e., without universal quantifiers. Additionally, the problem for memoryless policies is in NP, because one can nondeterministically guess a (polynomially sized) memoryless policy and verify it in each environment independently.
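The verification step can be sketched as follows: under a fixed memoryless deterministic policy, each environment induces a Markov chain, and that chain reaches T almost surely iff every state reachable before hitting T can itself reach T. The dictionary encoding and the helper names are assumptions of this illustration.

```python
# Sketch of the NP verification step: check a guessed memoryless deterministic
# policy sigma (a dict from states to actions) in every environment.
# Encoding as before: p[i][s][a] = {s': prob}; T is the target set.

def reaches_almost_surely(p_i, sigma, init, T):
    def forward(start):
        # Reachable states in the induced chain, treating T as absorbing.
        seen, stack = set(start), [s for s in start if s not in T]
        while stack:
            s = stack.pop()
            for s_next in p_i[s][sigma[s]]:
                if s_next not in seen:
                    seen.add(s_next)
                    if s_next not in T:
                        stack.append(s_next)
        return seen
    reachable = forward(set(init))
    # Pr(reach T) = 1 iff every reachable state can still reach T.
    return all(forward({s}) & T for s in reachable)

def verify_memoryless(p, I, sigma, init, T):
    return all(reaches_almost_surely(p[i], sigma, init, T) for i in I)
```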

4.3 Policy Problem

Policies, mapping histories to actions, are generally infinite objects. However, we may extract winning policies from the BOMDP, which is (only) exponential in the size of the MEMDP. Finite-state controllers [34] are a suitable and widespread representation of policies that require only a finite amount of memory. Intuitively, the number of memory states reflects the number of equivalence classes of histories that a policy can distinguish. In general, we cannot hope to find smaller policies than those obtained via the BOMDP.

Theorem 4

There is a family of MEMDPs \(\{ \mathcal {N}^n \}_{n \ge 1}\) where for each n, \(\mathcal {N}^n\) has 2n environments and \(\mathcal {O}(n)\) states and where every winning policy for \(\mathcal {N}^n\) requires at least \(2^n\) memory states.

Fig. 4. Witness for exponential memory requirement for winning policies (figure omitted).

We illustrate the witness. Consider a family of MEMDPs \(\{\mathcal {N}^n\}_n\), where \(\mathcal {N}^n\) has 2n MDPs, 4n states partitioned into two parts, and at most 2n outgoing actions per state. We outline the MEMDP family in Figure 4. In the first part, there is only one action per state. The notation is as follows: in state \(s_0\) and MDP \(\mathcal {N}^n_1\), we transition with probability one to state \(a_1\), whereas in \(\mathcal {N}^n_2\) we transition with probability one to state \(b_1\). In every other MDP, we transition with probability one half to either state. In state \(s_1\), we apply the analogous construction for environments 3, 4, and all others. A path \(s_0b_1 \ldots \) is thus consistent with every MDP except \(\mathcal {N}^n_1\). The first part ends in state \(s_n\). By construction, there are \(2^n\) paths ending in \(s_n\), and each of them is (in)consistent with a unique set of n environments. In the second part, a policy may guess an environment n times by selecting an action \(\alpha _i\) with \(i \in [2n]\). Only in MDP \(\mathcal {N}^n_i\) does action \(\alpha _i\) lead to a target state. In all other MDPs, the transition leads from state \(g_j\) to \(g_{j+1}\). The state \(g_{n+1}\) is absorbing in all MDPs. Importantly, after taking an action \(\alpha _i\) and arriving in \(g_{j+1}\), at most one additional MDP is inconsistent with the path.

Every MEMDP \(\mathcal {N}^n\) in this family has a winning policy which takes \(\sigma (\pi \cdot g_i) = \alpha _{2i-1}\) if \(a_{i} \in \pi \) and \(\sigma (\pi \cdot g_i) = \alpha _{2i}\) otherwise. Furthermore, when arriving in state \(s_n\), the state of a finite memory controller must reflect the precise set of environments consistent with the history. There are \(2^{n}\) such sets. The proof shows that if we store less information, two paths will lead to the same memory state, but with different sets of environments being consistent with these paths. As we can rule out only n environments using the n actions in the second part of the MEMDP, we cannot ensure winning in every environment.

5 A Partial Game Exploration Algorithm

In this section, we present an algorithm for the policy problem. We tune the algorithm towards runtime instead of memory complexity, but aim to avoid running out of memory. We use several key ingredients to create a pragmatic variation of Alg. 1, with support for extracting the winning policy.

First, we use an abstraction from BOMDPs to a belief stochastic game (BSG), similar to [45], that reduces the number of states and simplifies the iterative construction. Second, we tailor and generalize ideas from bounded model checking [6] to build and model check only a fragment of the BSG, using explicit partial-exploration approaches as in, e.g., [9, 29, 33, 42]. Third, our exploration does not only continuously extend the fragment, but can also prune this fragment using the model-checking results obtained so far. The structure of the BSG as captured by the environment graph makes the approach promising and yields some natural heuristics. Fourth, the structure of the winning region allows us to generalize results to unseen states. We thereby operationalize an idea from [26] in a partial-exploration context. Finally, we analyze individual MDPs as an efficient and significant preprocessing step. In the following, we discuss these ingredients.

Abstraction to Belief Support Games. We briefly recap stochastic games (SGs). See [17, 38] for more details.

Definition 11 (SG)

A stochastic game is a tuple \(\mathcal {B}= \langle \mathcal {M}, S_{1}, S_{2} \rangle \), where \(\mathcal {M}= \langle S, A, \iota _{\text {init}}, p \rangle \) is an MDP and \((S_{1}, S_{2})\) is a partition of S.

\(S_{1}\) are Player 1 states, and \(S_{2}\) are Player 2 states. As common, we also ‘partition’ (memoryless deterministic) policies into two functions \({\sigma _{1} :S_{1} \rightarrow A}\) and \({\sigma _{2} :S_{2} \rightarrow A}\). A Player 1 policy \(\sigma _{1}\) is winning for state s if \(\Pr (s \rightarrow T \mid \sigma _1, \sigma _2) = 1\) for all \(\sigma _2\). We (re)use \(\textsf{Win}_{\mathcal {B}_{\mathcal {N}}}^{T}\) to denote the set of states with a winning policy.

We apply a game-based abstraction to group states that have the same observation. Player 1 states capture the observations of the BOMDP, i.e., tuples \(\langle s, J \rangle \) of MEMDP states s and subsets J of the environments. Player 1 selects an action a; the result is the Player 2 state \(\langle \langle s, J \rangle , a \rangle \). Then Player 2 chooses an environment \(j\in J\), and the game mimics the outgoing transition from \(\langle s, j, J \rangle \), i.e., it mimics the transition from s in \(\mathcal {N}_j\). Formally:

Definition 12 (BSG)

Let \(\mathcal {G}_{\mathcal {N}}\) be a BOMDP with \(\mathcal {G}_{\mathcal {N}}= \langle \langle S, A, \iota _{\text {init}}, p \rangle , Z, O \rangle \). A belief support game \(\mathcal {B}_{\mathcal {N}}\) for \(\mathcal {G}_{\mathcal {N}}\) is an SG \(\mathcal {B}_{\mathcal {N}}= \langle \langle S', A', \iota _{\text {init}}', p' \rangle , S_{1}, S_{2} \rangle \) with \(S' = S_{1} \cup S_{2}\) as usual, Player 1 states \(S_{1} = Z\), Player 2 states \(S_2= Z \times A\), actions \(A' = A \cup I\), initial distribution \(\iota _{\text {init}}'(\langle s, I \rangle ) = \sum _{i \in I} \iota _{\text {init}}(\langle s, i, I \rangle )\), and the (partial) transition function \(p'\) defined separately for Player 1 and Player 2:

$$\begin{aligned} p'(z, a)&= \textsf{dirac}({\langle z, a \rangle })&\text {(Player~1)}\\ p'(\langle z, a \rangle , j, z')&= p(\langle s, j, J \rangle , a, \langle s', j, J' \rangle ) \text { with }z= \langle s,J \rangle , z' = \langle s',J' \rangle&\text {(Player~2)} \end{aligned}$$
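A sketch of the two move functions of Definition 12, again over the illustrative dictionary encoding: Player 1 turns an observation into an observation-action pair, and Player 2 resolves it to the successor distribution of one environment, projected to observations.

```python
# Sketch of the belief support game (Definition 12): Player 1 picks an action
# in an observation (s, J); Player 2 then picks an environment j in J.

def player1_moves(actions, z):
    """Player 1 state z = (s, J): action a leads to the Player 2 state (z, a)."""
    return {a: (z, a) for a in actions}

def player2_moves(p, z_and_a):
    """Player 2 state ((s, J), a): choosing j in J yields the distribution of the
    BOMDP transition under environment j, projected to observations."""
    (s, J), a = z_and_a
    moves = {}
    for j in J:
        dist = {}
        for s_next, prob in p[j][s][a].items():
            J_next = frozenset(i for i in J if p[i][s][a].get(s_next, 0.0) > 0.0)
            dist[(s_next, J_next)] = dist.get((s_next, J_next), 0.0) + prob
        moves[j] = dist
    return moves
```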

Lemma 13

An (acyclic) MEMDP \(\mathcal {N}\) with target states T has a winning policy if(f) there exists a winning policy in the BSG \(\mathcal {B}_{\mathcal {N}}\) with target states \(T_Z\).

Thus, on acyclic MEMDPs, a BSG-based algorithm is sound and complete; on cyclic MEMDPs, it may fail to find a winning policy. The remainder of the algorithm is formulated on the BSG. We use sliced BSGs, i.e., the BSG of a sliced BOMDP or, equivalently, a BSG in which some states are made absorbing.

Main algorithm.

Algorithm 2 (main loop; listing omitted).

We outline Algorithm 2 for the policy problem. We track the set of almost-surely winning observations and the set of losing observations (states in the BSG). Initially, the target states are winning. Furthermore, via a simple preprocessing, we determine some winning and losing states on the individual MDPs.

We iterate until the initial state is winning or losing. Our algorithm constructs a sliced BSG and decides on-the-fly whether a state should be a frontier state, returning the sliced BSG and the used frontier states. We discuss the implementation below. For the sliced BSG, we compute the winning region twice: once assuming that the frontier states are winning, once assuming they are losing. This yields an approximation of the winning and losing states, see Lemma 4. From the winning states, we can extract a randomized winning policy [13].
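The following structural sketch captures this loop (the listing of Algorithm 2 is not reproduced here). The helpers build_slice and winning, as well as the states attribute of the returned game, are assumed interfaces; winning(game, targets) stands for the polynomial-time computation of the winning region of a finite BSG.

```python
# Structural sketch of the main loop. W and L are the sets of observations
# known to be winning and losing, respectively.

def page_main(initial, targets, build_slice, winning, max_iters=1000):
    W, L = set(targets), set()
    for i in range(max_iters):
        if initial in W:
            return True                  # a winning policy can be extracted from W
        if initial in L:
            return False
        game, frontier = build_slice(W, L, i)
        # Lower bound: frontier states assumed losing, so only W counts as target.
        W |= winning(game, W)
        # Upper bound: frontier states assumed winning; explored states that are
        # not winning even then are losing in the full game (Lemmas 4 and 14).
        maybe = winning(game, W | frontier)
        L |= {z for z in game.states if z not in maybe}
    raise RuntimeError("iteration bound reached")
```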

Soundness. Assume that \(\mathcal {B}_{\mathcal {N}}\) is indeed a sliced BSG with frontier F. Then the following invariant holds: \( W \subseteq \textsf{Win}_{\mathcal {B}_{\mathcal {N}}}^{T} \text { and } L \cap \textsf{Win}_{\mathcal {B}_{\mathcal {N}}}^{T} = \emptyset . \) This invariant exploits that from a sliced BSG we can (implicitly) slice the complete BSG while preserving the winning status of every state, as formalized below. In future iterations, we only explore the implicitly sliced BSG.

Lemma 14

Given \(W \subseteq \textsf{Win}_{\mathcal {B}_{\mathcal {N}}}^{T_{\mathcal {B}_{\mathcal {N}}}}\) and \(L \subseteq S \setminus \textsf{Win}_{\mathcal {B}_{\mathcal {N}}}^{T_{\mathcal {B}_{\mathcal {N}}}}\): \( \textsf{Win}_{\mathcal {B}_{\mathcal {N}}}^{T_{\mathcal {B}_{\mathcal {N}}}} = \textsf{Win}_{{\mathcal {B}_{\mathcal {N}}}{\mid \!W \cup L}}^{T_{\mathcal {B}_{\mathcal {N}}} \cup W} \)

Termination depends on the generation of the sliced game. It suffices to ensure that, in the long run, either W or L grows, as there are only finitely many states. If W and L remain unchanged for more than some number of iterations, \(W \cup L\) is used as the frontier. Then, the new game suffices to determine in one shot whether the initial state is winning.

Algorithm 3 (generation of the sliced BSG; listing omitted).

Generating the sliced BSG. Algorithm 3 outlines the generation of the sliced BSG. In particular, we explore the implicit BSG from the initial state, but make every state that we do not explicitly explore absorbing. In every iteration, we first check whether there are states in Q left to explore and whether the number of explored states in E is below a threshold \(\textsf {Bound}[i]\). Then, we take a state from the priority queue and add it to E. We find new reachable states and add them to the queue Q.
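A sketch of the state-selection part of this exploration, with an explicit priority queue and exploration bound; the successor and priority functions (e.g., an entropy-based score, see Section 6) are assumed to be given. The sliced game is then built over the explored states, with the returned frontier made absorbing.

```python
# Sketch of the bounded exploration behind Algorithm 3.
import heapq
from itertools import count

def explore_slice(initial, successors, priority, bound):
    explored = set()
    tie = count()                         # tie-breaker so heapq never compares states
    queue = [(priority(initial), next(tie), initial)]
    seen = {initial}
    while queue and len(explored) < bound:
        _, _, z = heapq.heappop(queue)    # take the most promising state
        explored.add(z)
        for z_next in successors(z):
            if z_next not in seen:
                seen.add(z_next)
                heapq.heappush(queue, (priority(z_next), next(tie), z_next))
    frontier = {z for _, _, z in queue}   # discovered but not explored: made absorbing
    return explored, frontier
```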

Generalizing the winning and losing states. We aim to determine that a state in the game \(\mathcal {B}_{\mathcal {N}}\) is winning without ever exploring it. First, observe:

Lemma 15

A winning policy in MEMDP \(\mathcal {N}\) is winning in \({\mathcal {N}}_{\downarrow J}\) for any J.

A direct consequence is the following statement for two environment sets \(J_1 \subseteq J_2\):

$$\begin{aligned} \langle s, J_{2} \rangle \in \textsf{Win}_{\mathcal {B}_{\mathcal {N}}}^{T} \quad \text {implies}\quad \langle s, J_{1} \rangle \in \textsf{Win}_{\mathcal {B}_{\mathcal {N}}}^{T} . \end{aligned}$$

Consequently, we can store W (and, symmetrically, L) as follows. For every MEMDP state \(s \in S\), the set \(W_s = \{ J \mid \langle s, J \rangle \in W\}\) is downward closed with respect to the partial order \((\mathcal {P}\left( I\right) , \subseteq )\). This allows for efficient storage: We only have to store the set of pairwise maximal elements, i.e., the antichain,

$$\begin{aligned} W_s^{\max } = \{ J \in W_s \mid \nexists J' \in W_s :J \subsetneq J' \}. \end{aligned}$$

To determine whether \(\langle s,J \rangle \) is winning, we check whether \(J \subseteq J'\) for some \(J' \in W_s^{\max }\). Adding J to \(W_s^{\max }\) requires removing all \(J' \subseteq J\) and then adding J. Note, however, that \(|W_s^{\max }|\) is still exponential in |I| in the worst case.
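A small sketch of this antichain representation for a fixed state s; membership and insertion follow exactly the two rules above.

```python
# Sketch of the antichain representation of W_s (downward closed under subset).
# Only the maximal belief supports are stored.

class Antichain:
    def __init__(self):
        self.max_elems = set()            # pairwise incomparable frozensets

    def contains(self, J):
        """Is (s, J) known winning, i.e., is J a subset of some stored element?"""
        J = frozenset(J)
        return any(J <= M for M in self.max_elems)

    def insert(self, J):
        """Add J; drop stored elements that J subsumes."""
        J = frozenset(J)
        if self.contains(J):
            return
        self.max_elems = {M for M in self.max_elems if not M <= J}
        self.max_elems.add(J)

W_s = Antichain()
W_s.insert({1, 2, 3})
print(W_s.contains({2, 3}))               # True: winning generalizes to subsets
```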

Selection of heuristics. The algorithm allows some degrees of freedom. We evaluate the following aspects empirically. (1) The maximal size \(\texttt {bound}[i]\) of a sliced BSG at iteration i is critical. If it is too small, the sets W and L will grow slowly in every iteration. The trade-off is further complicated by the fact that the sets W and L may generalize to unseen states. (2) For a fixed \(\texttt {bound}[i]\), it is unclear how to prioritize the exploration of states. The PSPACE algorithm suggests that going deep is good, whereas the potential for generalization to unseen states is largest when going broad. (3) Finally, there is overhead in computing both W and L. If there is a winning policy, we only need to compute W. However, computing L may ensure that we can prune parts of the state space. A similar observation holds for computing W on unsatisfiable instances.

Remark 1

Algorithm 2 can be mildly tweaked to match the PSPACE algorithm in Algorithm 1. The priority queue must ensure that complete (reachable) local BSGs are always included and that states \(\langle s, J \rangle \) with small J are explored first. Furthermore, W and L require regular pruning, and we cannot extract a policy if we prune W to a polynomial size bound. Practically, we may write pruned parts of W to disk.

6 Experiments

We highlight two aspects: (1) A comparison of our prototype to existing baselines for POMDPs, and (2) an examination of the exploration heuristics. The technical report [41] contains details on the implementation, the benchmarks, and more results.

Implementation. We provide a novel PArtial Game Exploration (PaGE) prototype, based on Algorithm 2, on top of the probabilistic model checker Storm  [22]. We represent MEMDPs using the Prism language with integer constants. Every assignment to these constants induces an explicit MDP. SGs are constructed and solved using existing data structures and graph algorithms.

Setup. We create a set of benchmarks inspired by the POMDP and MEMDP literature [12, 21, 26]. We consider a combination of satisfiable and unsatisfiable benchmarks. In the latter case, a winning policy does not exist. We construct POMDPs from MEMDPs as in Definition 5. As baselines, we use the following two existing POMDP algorithms. For almost-sure properties, a belief-MDP construction [7] acts similarly to an efficiently engineered variant of our game construction, but is tailored towards more general quantitative properties. A SAT-based approach [26] aims to find increasingly larger policies. We evaluate all benchmarks on a system with a 3GHz Intel Core i9-10980XE processor. We use a time limit of 30 minutes and a memory limit of 32 GB.

Fig. 5. Performance of baselines and novel PaGE algorithm (plots omitted).

Results. Figure 5 shows the (log-scale) performance comparisons between different configurations. Green circles reflect satisfiable and red crosses unsatisfiable benchmarks. On the x-axis is PaGE in its default configuration. The first plot compares to the belief-MDP construction. The tailored heuristics and representation of the belief-support give a significant edge in almost all cases. The few points below the line are due to a higher exploration rate when building the state space. The second plot compares to the SAT-based approach, which is only suitable for finding policies, not for disproving their existence. This approach implicitly searches for a particular class of policies, whose structure is not appropriate for some MEMDPs. The third plot compares PaGE in its default configuration, with negative entropy as the priority function, against PaGE using positive entropy. As expected, different priorities have a significant impact on the performance.

Table 1. Satisfiable and unsatisfiable benchmark results

Table 1 shows an overview of satisfiable and unsatisfiable benchmarks. Each part of the table shows the number of environments, states, and actions per state in the MEMDP. For PaGE, we include both the default configuration (negative entropy) and the variation (positive entropy). For both configurations, we provide columns with the runtime and the maximum size of the constructed BSG. We also include the runtimes for the two baselines. Unsurprisingly, the number of states to be explored is a good predictor of the performance, and the relative performance is as in Fig. 5.

7 Conclusion

This paper considers multi-environment MDPs with an arbitrary number of environments and an almost-sure reachability objective. We show novel and tight complexity bounds and use these insights to derive a new algorithm. This algorithm outperforms approaches for POMDPs on a broad set of benchmarks. For future work, we will apply an algorithm directly on the BOMDP [16].