## Abstract

Multiple-environment MDPs (MEMDPs) capture finite sets of MDPs that share the states but differ in the transition dynamics. These models form a proper subclass of partially observable MDPs (POMDPs). We consider the synthesis of policies that robustly satisfy an almost-sure reachability property in MEMDPs, that is, *one* policy that satisfies a property *for all* environments. For POMDPs, deciding the existence of robust policies is an EXPTIME-complete problem. We show that this problem is PSPACE-complete for MEMDPs, while the policies require exponential memory in general. We exploit the theoretical results to develop and implement an algorithm that shows promising results in synthesizing robust policies for various benchmarks.

## 1 Introduction

Markov decision processes (MDPs) are the standard formalism to model sequential decision making under uncertainty. A typical goal is to find a policy that satisfies a temporal logic specification [5]. Probabilistic model checkers such as Storm [22] and Prism [30] efficiently compute such policies. A concern, however, is the robustness against potential perturbations in the environment. MDPs cannot capture such uncertainty about the shape of the environment.

Multi-environment MDPs (MEMDPs) [14, 36] contain a set of MDPs, called environments, over the same state space. The goal in MEMDPs is to find a single policy that satisfies a given specification in *all* environments. MEMDPs are, for instance, a natural model for MDPs with unknown system dynamics, where several domain experts provide their interpretation of the dynamics [11]. These different MDPs together form a MEMDP. MEMDPs also arise in other domains: The guessing of a (static) password is a natural example in security. In robotics, a MEMDP captures unknown positions of some static obstacle. One can interpret MEMDPs as a (disjoint) union of MDPs in which an agent only has partial observation, i.e., every MEMDP can be cast into a linearly larger partially observable MDP (POMDP) [27]. Indeed, some famous examples for POMDPs are in fact MEMDPs, such as *RockSample* [39] and *Hallway* [31]. Solving POMDPs is notoriously hard [32], and thus, it is worthwhile to investigate natural subclasses.

We consider *almost-sure specifications* where the probability needs to be one to reach a set of target states. In MDPs, it suffices to consider memoryless policies. Constructing such policies can be efficiently implemented by means of a graph search [5]. For MEMDPs, we consider the following problem:

Compute *one* policy that almost-surely reaches the target in *all* environments.

Such a policy robustly satisfies an almost-sure specification for a set of MDPs.

*Our approach.* Inspired by work on POMDPs, we construct a belief-observation MDP (BOMDP) [16] that tracks the states of the MDPs and the (support of the) belief over potential environments. We show that a policy satisfying the almost-sure property in the BOMDP also satisfies the property in the MEMDP.

Although the BOMDP is exponentially larger than the MEMDP, we exploit its particular structure to create a PSPACE algorithm that decides whether such a robust policy exists. The essence of the algorithm is a recursive construction of a fragment of the BOMDP, restricted to a setting in which the belief-support is fixed. Such an approach is possible, as the belief in a MEMDP behaves monotonically: Once we know that we are not in a particular environment, we never lose this knowledge. This behavior is in contrast to POMDPs, where there is no monotonic behavior in belief-supports. The difference is essential: Deciding almost-sure reachability in POMDPs is EXPTIME-complete [19, 37]. In contrast, deciding whether a policy for almost-sure reachability in a MEMDP exists is PSPACE-*complete*. We show the hardness using a reduction from the *true quantified Boolean formula problem*. Finally, we cannot hope to extract a policy with such an algorithm, as the smallest policy for MEMDPs may require memory exponential in the number of environments.

The PSPACE algorithm itself recomputes many results. For practical purposes, we create an algorithm that iteratively explores parts of the BOMDP. The algorithm additionally uses the MEMDP structure to generalize the set of states from which a winning policy exists and deduce efficient heuristics for guiding the exploration. The combination of these ingredients leads to an efficient and competitive prototype on top of the model checker Storm.

**Related work.** We categorize related work into three areas.

*MEMDPs.* Almost-sure reachability for MEMDPs with exactly two environments has been studied in [36]. We extend the results to arbitrarily many environments. This is nontrivial: For two environments, the decision problem admits a polynomial-time routine [36], whereas we show that the problem is PSPACE-complete for an arbitrary number of environments. MEMDPs and closely related models such as hidden-model MDPs, hidden-parameter MDPs, multi-model MDPs, and concurrent MDPs [2, 10, 11, 40] have been considered for quantitative properties^{Footnote 1}. The typical approach is to consider approximative algorithms for the undecidable problem in POMDPs [14] or to adapt reinforcement learning algorithms [3, 28]. These approximations are not applicable to almost-sure properties.

*POMDPs.* One can build an underlying, potentially infinite belief-MDP [27] that corresponds to the POMDP; using model checkers [7, 8, 35] to verify this MDP can answer the question for MEMDPs. For POMDPs, almost-sure reachability is decidable in exponential time [19, 37] via a construction similar to ours. Most qualitative properties beyond almost-sure reachability are undecidable [4, 15]. Two dedicated algorithms that limit the search to policies with small memory requirements and employ a SAT-based approach [12, 26] to this NP-hard problem [19] are implemented in Storm. We use them as baselines.

*Robust models.* The high-level representation of MEMDPs is structurally similar to featured MDPs [1, 18] that represent sets of MDPs. The proposed techniques are called family-based model checking and compute policies for every MDP in the family, whereas we aim to find one policy for all MDPs. Interval MDPs [23, 25, 43] and stochastic games (SGs) [38] do not allow for dependencies between states and thus cannot model features such as various obstacle positions. Parametric MDPs [2, 24, 44] assume controllable uncertainty and do not consider robustness of policies.

**Contributions.** We establish PSPACE-completeness for deciding almost-sure reachability in MEMDPs and show that the policies may be exponentially large. Our iterative algorithm, which is the first specific to almost-sure reachability in MEMDPs, builds fragments of the BOMDP. An empirical evaluation shows that the iterative algorithm outperforms approaches dedicated to POMDPs.

## 2 Problem Statement

In this section, we provide some background and formalize the problem statement.

For a set *X*, \( Dist (X)\) denotes the set of probability distributions over *X*. For a given distribution \(d \in Dist (X)\), we denote its support as \( Supp (d)\). For a finite set *X*, let \(\textsf{unif}(X)\) denote the uniform distribution. \(\textsf{dirac}(x)\) denotes the Dirac distribution on \(x\in X\). We use short-hand notation for functions and distributions, \(f = [ x \mapsto a, y \mapsto b ]\) means that \(f(x) = a\) and \(f(y) = b\). We write \(\mathcal {P}\left( X\right) \) for the powerset of *X*. For \(n \in \mathbb {N}\) we write \([n] = \{ i \in \mathbb {N} \mid 1 \le i \le n \}\).
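The notation above can be mirrored concretely. Below is a minimal Python sketch (our own illustration; the dictionary encoding of distributions is an assumption, not taken from the paper) of \(\textsf{unif}\), \(\textsf{dirac}\), and \( Supp \):

```python
# Distributions over a finite set X as dicts element -> probability.
# This encoding is an illustrative assumption; later sketches reuse it.

def unif(xs):
    """The uniform distribution unif(X) over a finite collection."""
    xs = list(xs)
    return {x: 1.0 / len(xs) for x in xs}

def dirac(x):
    """The Dirac distribution dirac(x): all probability mass on x."""
    return {x: 1.0}

def supp(d):
    """The support Supp(d): all elements with positive probability."""
    return {x for x, prob in d.items() if prob > 0}

assert unif(["a", "b"]) == {"a": 0.5, "b": 0.5}
assert supp({"x": 0.0, "y": 1.0}) == {"y"}
```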

### Definition 1 (MDP)

A *Markov Decision Process* is a tuple \(\mathcal {M}= \langle S, A, \iota _{\text {init}}, p \rangle \) where \(S\) is the finite set of states, \(A\) is the finite set of actions, \(\iota _{\text {init}}\in Dist (S)\) is the initial state distribution, and \(p:S\times A\rightarrow Dist (S)\) is the transition function.

The transition function is total; that is, for notational convenience, MDPs are *input-enabled*. This requirement does not affect the generality of our results. A *path* of an MDP is a sequence \(\pi = s_{0}a_{0}s_1a_{1}\ldots s_{n}\) such that \(\iota _{\text {init}}(s_{0}) > 0\) and \(p(s_{i},a_{i})(s_{i+1}) > 0\) for all \(0 \le i < n\). The last state of \(\pi \) is \( last (\pi )=s_n\). The set of all finite paths is \(\textsc {Path}\) and \(\textsc {Path}(S')\) denotes the paths starting in a state from \(S'\subseteq S\). The set of *reachable states* from \(S'\) is \(\textsf{Reachable}(S')\). If \(S' = Supp (\iota _{\text {init}})\), we just call them *the* reachable states. The MDP restricted to the reachable states from a distribution \(d \in Dist (S)\) is \(\textsf{ReachFragment}(\mathcal {M}, d)\), where *d* is the new initial distribution. A state \(s\in S\) is *absorbing* if \(\textsf{Reachable}(\{s\})=\{s\}\). An MDP is *acyclic* if each state is absorbing or not reachable from its successor states.
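As an illustration of these graph-based notions, the following sketch (dictionary-based encoding and names are our own assumptions) computes \(\textsf{Reachable}(S')\) by a breadth-first search over the positive-probability transitions:

```python
from collections import deque

# Illustrative sketch: the transition function p of an MDP is a dict
# (state, action) -> {successor: probability}; input-enabledness means every
# (state, action) pair is present. Reachable(S') is a plain graph search.

def reachable(p, actions, start_states):
    seen = set(start_states)
    queue = deque(start_states)
    while queue:
        s = queue.popleft()
        for a in actions:
            for s2, prob in p[(s, a)].items():
                if prob > 0 and s2 not in seen:
                    seen.add(s2)
                    queue.append(s2)
    return seen

# Toy MDP: s0 moves to s1 under a; s1 is absorbing.
p = {("s0", "a"): {"s1": 1.0}, ("s1", "a"): {"s1": 1.0}}
assert reachable(p, ["a"], {"s0"}) == {"s0", "s1"}
assert reachable(p, ["a"], {"s1"}) == {"s1"}  # Reachable({s1}) = {s1}
```

The last assertion illustrates the definition of an absorbing state: \(s_1\) is absorbing since \(\textsf{Reachable}(\{s_1\}) = \{s_1\}\).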

Action choices are resolved by a *policy* \(\sigma :\textsc {Path}\rightarrow Dist (A)\) that maps paths to distributions over actions. A policy is *memoryless* if it is of the form \(\sigma :S\rightarrow Dist (A)\); *deterministic* if \(\sigma :\textsc {Path}\rightarrow A\); and *memoryless deterministic* if \(\sigma :S\rightarrow A\). For an MDP \(\mathcal {M}\), we denote the probability of a policy \(\sigma \) reaching some target set \(T \subseteq S\) starting in state *s* as \({\Pr }_{\mathcal {M}}(s \rightarrow T \mid \sigma )\). More precisely, \({\Pr }_{\mathcal {M}}(s \rightarrow T \mid \sigma )\) denotes the probability of all paths from *s* reaching *T* under \(\sigma \). We use \({\Pr }_{\mathcal {M}}(T \mid \sigma )\) if *s* is distributed according to \(\iota _{\text {init}}\).

### Definition 2 (MEMDP)

A *Multiple Environment MDP* is a tuple \(\mathcal {N}= \langle S, A, \iota _{\text {init}}, \{p_i\}_{i \in I} \rangle \) with \(S, A, \iota _{\text {init}}\) as for MDPs, and \(\{p_i\}_{i \in I}\) is a *set of transition functions*, where *I* is a finite set of *environment* indices.

Intuitively, MEMDPs form sets of MDPs (environments) that share states and actions but differ in the transition probabilities. For MEMDP \(\mathcal {N}\) with index set *I* and a set \(I' \subseteq I\), we define the restriction of environments as the MEMDP \({\mathcal {N}}_{\downarrow I'} = \langle S, A, \iota _{\text {init}}, \{p_{i}\}_{i \in I'} \rangle \). Given an environment \(i \in I\), we denote its corresponding MDP as \(\mathcal {N}_{i}=\langle S, A, \iota _{\text {init}}, p_{i} \rangle \). A MEMDP with only one environment is an MDP. Paths and policies are defined on the states and actions of MEMDPs and thus do not differ from their MDP counterparts. A MEMDP is acyclic if each of its MDPs is acyclic.
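A minimal sketch of Definition 2 (the encoding and names are our own illustration, not the paper's): a MEMDP stores one transition function per environment index, and restriction simply keeps a subset of them:

```python
from dataclasses import dataclass

# Illustrative encoding of Definition 2: a MEMDP bundles one transition
# function per environment; all environments share S, A, and iota_init.

@dataclass
class MEMDP:
    states: set
    actions: set
    init: dict   # initial distribution over states
    p: dict      # environment index -> transition function

    def restrict(self, J):
        """The restriction to the environments in J."""
        return MEMDP(self.states, self.actions, self.init,
                     {i: self.p[i] for i in J})

    def mdp(self, i):
        """The transition function of the single-environment MDP N_i."""
        return self.p[i]

p1 = {("s0", "a"): {"s1": 1.0}}
p2 = {("s0", "a"): {"s2": 1.0}}
m = MEMDP({"s0", "s1", "s2"}, {"a"}, {"s0": 1.0}, {1: p1, 2: p2})
assert set(m.restrict({1}).p) == {1}  # a singleton restriction is an MDP
```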

### Example 1

Figure 1 shows a MEMDP with three environments \(\mathcal {N}_i\). An agent can ask two questions, \(q_{1}\) and \(q_{2}\). The response is either 'switch' (\(s_{1} \leftrightarrow s_{2}\)) or 'stay' (self-loop). In \(\mathcal {N}_1\), the response to both \(q_1\) and \(q_2\) is to switch. In \(\mathcal {N}_2\), the response to \(q_1\) is to stay, and to \(q_2\) is to switch. The agent can guess the environment using \(a_{1}, a_{2}, a_{3}\). Guessing \(a_i\) leads to the target only in environment *i*. Thus, an agent must deduce the environment via \(q_{1}, q_{2}\) to surely reach the target. \(\blacksquare \)

### Definition 3 (Almost-Sure Reachability)

An almost-sure reachability property is defined by a set \(T \subseteq S\) of target states. A policy \(\sigma \) satisfies the property *T* for MEMDP \(\mathcal {N}= \langle S, A, \iota _{\text {init}}, \{p_{i}\}_{i \in I} \rangle \) iff \(\forall i \in I:{\Pr }_{\mathcal {N}_{i}}(T \mid \sigma )=1\).

In other words, a policy \(\sigma \) satisfies an almost-sure reachability property *T*, and is then called *winning*, if and only if the probability of reaching *T* *within each MDP* is one. By extension, a state \(s \in S\) is winning if there exists a winning policy when starting in state *s*. Policies and states that are not winning are called *losing*.

We now define both the decision and the policy problem: the *decision problem* asks, given a MEMDP \(\mathcal {N}\) and target states *T*, whether a winning policy exists; the *policy problem* asks to compute a winning policy if one exists.

In Section 4, we discuss the computational complexity of the decision problem. Following up, in Section 5, we present our algorithm for solving the policy problem. Details on its implementation and evaluation are presented in Section 6.

## 3 A Reduction To Belief-Observation MDPs

In this section, we reduce the policy problem, and thus also the decision problem, to finding a policy in an exponentially larger belief-observation MDP. This reduction is an elementary building block for the construction of our PSPACE algorithm and the practical implementation. Additional information, such as proofs of the statements throughout the paper, is available in the technical report [41].

### 3.1 Interpretation of MEMDPs as Partially Observable MDPs

### Definition 4 (POMDP)

A partially observable MDP (POMDP) is a tuple \(\langle \mathcal {M}, Z, O \rangle \) with an MDP \(\mathcal {M}= \langle S, A, \iota _{\text {init}}, p \rangle \), a set \(Z\) of *observations*, and an *observation function* \(O:S \rightarrow Z\).

A POMDP is an MDP where states are labelled with observations. We lift \(O\) to paths and use \(O(\pi ) = O(s_0) a_{0} O(s_{1}) \ldots O(s_n)\). We use observation-based policies \(\sigma \), i.e., policies such that for all \(\pi , \pi ' \in \textsc {Path}\), \( O(\pi ) = O(\pi ') \text { implies } \sigma (\pi ) = \sigma (\pi '). \) A MEMDP can be cast into a POMDP, built as the disjoint union of its environments:

### Definition 5 (Union-POMDP)

Given an MEMDP \(\mathcal {N}= \langle S, A, \iota _{\text {init}}, \{p_i\}_{i \in I} \rangle \) we define its *union-POMDP* \({\mathcal {N}}_{\sqcup } = \langle \langle S', A, \iota _{\text {init}}', p' \rangle , Z, O \rangle \), with states \(S' = S \times I\), initial distribution \(\iota _{\text {init}}'(\langle s, i \rangle ) = \iota _{\text {init}}(s) \cdot |I|^{-1}\), transitions \(p'(\langle s, i \rangle , a)(\langle s', i \rangle ) = p_{i}(s, a)(s')\), observations \(Z= S\), and observation function \(O(\langle s, i \rangle ) = s\).

A policy may observe the state *s* but not which MDP we are in. This forces any observation-based policy to take the same choice in all environments.

### Lemma 1

Given MEMDP \(\mathcal {N}\), there exists a winning policy iff there exists an observation-based policy \(\sigma \) such that \({\Pr }_{{\mathcal {N}}_{\sqcup }}(T \mid \sigma )=1\).

The statement follows since, first, any observation-based policy of the POMDP can be applied to the MEMDP; second, vice versa, any MEMDP policy is observation-based; and third, the induced Markov chains under these policies are isomorphic.

### 3.2 Belief-Observation MDPs

For POMDPs, memoryless policies are not sufficient, which makes computing policies intricate. We therefore make explicit the information that the history, i.e., the path until some point, contains. In MEMDPs, this information is the *(environment-)belief (support)* \(J \subseteq I\), the set of environments that are consistent with a path in the MEMDP. Given a belief \(J \subseteq I\) and a state-action-state transition \(s \xrightarrow {a} s'\), we define \(\textsf{Up}(J, s, a, s') = \{ i \in J \mid p_{i}(s, a)(s') > 0\}\), i.e., the subset of environments in which the transition exists. For a path \(\pi \in \textsc {Path}\), we define its corresponding belief \(\mathcal {B}(\pi ) \subseteq I\) recursively as \(\mathcal {B}(s_0) = I\) and \(\mathcal {B}(\pi \cdot a \cdot s') = \textsf{Up}(\mathcal {B}(\pi ), last (\pi ), a, s')\).

The belief in a MEMDP monotonically decreases along a path, i.e., if we know that we are not in a particular environment, this remains true indefinitely.
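The belief update and the recursive belief of a path can be sketched as follows (dictionary-based encoding as an assumption; not the paper's tool). Note how the belief only ever shrinks along a path:

```python
# Sketch of the belief-support update Up(J, s, a, s') and the belief B(pi)
# of a path. Transition functions are dicts (state, action) -> {succ: prob},
# one per environment; this encoding is an illustrative assumption.

def up(p, J, s, a, s2):
    """Environments in J in which the transition s --a--> s' exists."""
    return {i for i in J if p[i][(s, a)].get(s2, 0.0) > 0}

def belief(p, I, path):
    """B(pi) for a path [s0, a0, s1, a1, ..., sn], starting from full I."""
    J = set(I)  # initially, every environment is possible
    for k in range(0, len(path) - 2, 2):
        J = up(p, J, path[k], path[k + 1], path[k + 2])  # monotone: J shrinks
    return J

# Two environments that differ in the successor of (s0, a):
p = {1: {("s0", "a"): {"s1": 1.0}},
     2: {("s0", "a"): {"s2": 1.0}}}
assert belief(p, {1, 2}, ["s0", "a", "s1"]) == {1}  # environment 2 ruled out
```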

We aim to use a model where memoryless policies suffice. To that end, we cast MEMDPs into the exponentially larger belief-observation MDPs [16]^{Footnote 2}.

### Definition 6 (BOMDP)

For a MEMDP \(\mathcal {N}= \langle S, A, \iota _{\text {init}}, \{p_i\}_{i \in I} \rangle \), we define its *belief-observation MDP* (BOMDP) as a POMDP \(\mathcal {G}_{\mathcal {N}}= \langle \langle S', A, \iota _{\text {init}}', p' \rangle , Z, O \rangle \) with states \(S' = S \times I \times \mathcal {P}\left( I\right) \), initial distribution \(\iota _{\text {init}}'(\langle s, j, I \rangle ) = \iota _{\text {init}}(s) \cdot |I|^{-1}\), transition function \(p'(\langle s, j, J \rangle , a)(\langle s', j, J' \rangle ) = p_{j}(s, a)(s')\) with \(J' = \textsf{Up}(J, s, a, s')\), observations \(Z= S \times \mathcal {P}\left( I\right) \), and observation function \(O(\langle s, j, J \rangle ) = \langle s, J \rangle \).

Compared to the union-POMDP, BOMDPs also track the belief by updating it accordingly. We clarify the correspondence between paths of the BOMDP and the MEMDP. For a path \(\pi \) through the MEMDP, we can mimic this path exactly in the MDPs \(\mathcal {N}_j\) for \(j \in \mathcal {B}(\pi )\). As we track \(\mathcal {B}(\pi )\) in the state, we can deduce from the BOMDP state in which environments we can be.
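Since the belief is part of the state, BOMDP successors can be enumerated lazily, without building the exponential state space upfront. A sketch under the same assumed dictionary encoding (names are ours):

```python
# Lazy successor enumeration for BOMDP states (s, j, J) of Definition 6.
# The encoding is an illustrative assumption, not the paper's implementation.

def up(p, J, s, a, s2):
    return {i for i in J if p[i][(s, a)].get(s2, 0.0) > 0}

def bomdp_successors(p, state, a):
    """Distribution over BOMDP successors of (s, j, J) under action a."""
    s, j, J = state
    return {(s2, j, frozenset(up(p, J, s, a, s2))): prob
            for s2, prob in p[j][(s, a)].items() if prob > 0}

p = {1: {("s0", "a"): {"s1": 1.0}},
     2: {("s0", "a"): {"s1": 0.5, "s2": 0.5}}}
succ = bomdp_successors(p, ("s0", 2, frozenset({1, 2})), "a")
# Observing s1 keeps both environments possible; observing s2 rules out 1:
assert succ == {("s1", 2, frozenset({1, 2})): 0.5,
                ("s2", 2, frozenset({2})): 0.5}
```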

### Lemma 2

For a MEMDP \(\mathcal {N}\) and any path \(\langle s_1,j,J_1 \rangle a_1\langle s_2,j,J_2 \rangle \ldots \langle s_n,j,J_n \rangle \) of the BOMDP \(\mathcal {G}_{\mathcal {N}}\) with \(j \in J_{1}\), it holds that \(J_n \ne \emptyset \) and the path \(s_1a_1 \ldots s_n\) exists in MDP \(\mathcal {N}_i\) iff \(i \in J_1 \cap J_n\).

Consequently, the belief of a path can be uniquely determined by the observation of the last state reached, hence the name belief-observation MDPs.

### Lemma 3

For every pair of paths \(\pi , \pi '\) in a BOMDP, we have: \(O( last (\pi )) = O( last (\pi ')) \text { implies } \mathcal {B}(\pi ) = \mathcal {B}(\pi ')\).

For notation, we define \(S_J = \{ \langle s, j, J \rangle \mid j \in J, s \in S \}\), and analogously write \(Z_J = \{ \langle s, J \rangle \mid s \in S \}\). We lift the target states *T* to states in the BOMDP: \(T_{\mathcal {G}_{\mathcal {N}}} = \{ \langle s, j, J \rangle \mid s \in T, J \subseteq I, j \in J\}\) and define target observations \(T_Z= O(T_{\mathcal {G}_{\mathcal {N}}})\).

### Definition 7 (Winning in a BOMDP)

Let \(\mathcal {G}_{\mathcal {N}}\) be a BOMDP with target observations \(T_Z\). An observation-based policy \(\sigma \) is winning from some observation \(z\in Z\), if for all \(s \in O^{-1}(z)\) it holds that \(\Pr _{\mathcal {G}_{\mathcal {N}}}( s \rightarrow O^{-1}(T_Z) \mid \sigma ) = 1\).

Furthermore, a policy \(\sigma \) is *winning* if it is winning for the initial distribution \(\iota _{\text {init}}\). An *observation* \(z\) *is winning* if there exists a winning policy for \(z\). The *winning region* \(\textsf{Win}_{\mathcal {G}_{\mathcal {N}}}^{T}\) is the set of all winning observations.

Almost-sure winning in the BOMDP corresponds to winning in the MEMDP.

### Theorem 1

There exists a winning policy for a MEMDP \(\mathcal {N}\) with target states *T* iff there exists a winning policy in the BOMDP \(\mathcal {G}_{\mathcal {N}}\) with target states \(T_{\mathcal {G}_{\mathcal {N}}}\).

Intuitively, the important aspect is that for almost-sure reachability, observation-based memoryless policies are sufficient [13]. For any such policy, the induced Markov chains on the union-POMDP and the BOMDP are bisimilar [16].

BOMDPs make policy search conceptually easier. First, as memoryless policies suffice for almost-sure reachability, winning regions are independent of fixed policies: For policies \(\sigma \) and \(\sigma '\) that are winning in observation \(z\) and \(z'\), respectively, there must exist a policy \(\hat{\sigma }\) that is winning for both \(z\) and \(z'\). Second, winning regions can be determined in polynomial time in the size of the BOMDP [16].

### 3.3 Fragments of BOMDPs

To avoid storing the exponentially sized BOMDP, we only build fragments: We may select any set of observations as *frontier* observations and make the states with those observations absorbing. We later discuss the selection of frontiers.

### Definition 8 (Sliced BOMDP)

For a BOMDP \(\mathcal {G}_{\mathcal {N}}= \langle \langle S, A, \iota _{\text {init}}, p \rangle , Z, O \rangle \) and a set of *frontier observations* \(F \subseteq Z\), we define a BOMDP \({\mathcal {G}_{\mathcal {N}}}{\mid \!F} = \langle \langle S, A, \iota _{\text {init}}, p' \rangle , Z, O \rangle \) with \(p'(s, a) = \textsf{dirac}(s)\) if \(O(s) \in F\), and \(p'(s, a) = p(s, a)\) otherwise.

We exploit this sliced BOMDP to derive constraints on the set of winning states.
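Slicing can be sketched as redirecting every action of a frontier state to a self-loop (dictionary encoding and names are illustrative assumptions, not the paper's tool):

```python
# Sketch of Definition 8: slicing a BOMDP at frontier observations F makes
# every state with a frontier observation absorbing via a self-loop.

def slice_at_frontier(p, obs, F):
    sliced = {}
    for (s, a), dist in p.items():
        sliced[(s, a)] = {s: 1.0} if obs(s) in F else dist
    return sliced

# Toy model, fully observable for illustration (obs is the identity):
p = {("s0", "a"): {"s1": 1.0}, ("s1", "a"): {"s0": 1.0}}
sliced = slice_at_frontier(p, lambda s: s, {"s1"})
assert sliced[("s1", "a")] == {"s1": 1.0}  # s1 is now absorbing
assert sliced[("s0", "a")] == {"s1": 1.0}  # other transitions unchanged
```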

### Lemma 4

For every BOMDP \(\mathcal {G}_{\mathcal {N}}\) with states *S* and targets *T* and for all frontier observations \(F \subseteq Z\) it holds that: \(\textsf{Win}_{{\mathcal {G}_{\mathcal {N}}}{\mid \!F}}^{T} \subseteq \textsf{Win}_{\mathcal {G}_{\mathcal {N}}}^{T} \subseteq \textsf{Win}_{{\mathcal {G}_{\mathcal {N}}}{\mid \!F}}^{T \cup F}\).

Making (non-target) observations absorbing extends the set of losing observations, while adding target states extends the set of winning observations.

## 4 Computational Complexity

The BOMDP \(\mathcal {G}_{\mathcal {N}}\) above yields an exponential time *and space* algorithm via Theorem 1. We can avoid the exponential memory requirement. This section shows the PSPACE-completeness of deciding whether a winning policy exists.

### Theorem 2

The almost-sure reachability decision problem is PSPACE-complete.

The result follows from Lemmas 10 and 11 below. In Section 4.3, we show that representing the winning policy itself may, however, require exponential space.

### 4.1 Deciding Almost-Sure Winning for MEMDPs in PSPACE

We develop an algorithm with a polynomial memory footprint. The algorithm exploits locality of cyclic behavior in the BOMDP, as formalized by an acyclic *environment graph* and *local BOMDPs* that match the nodes in the environment graph. The algorithm recurses on the environment graph while memorizing results from polynomially many local BOMDPs.

**The graph-structure of BOMDPs.** First, along a path of the MEMDP, we only gain information and are thus able to rule out certain environments [14]. Due to the monotonicity of the update operator, we have for any BOMDP that \(\langle s,j,J \rangle \in \textsf{Reachable}(\{\langle s', j, J' \rangle \})\) implies \(J \subseteq J'\). We define a graph over environment sets that describes how the belief-support can update along a run.

### Definition 9 (Environment graph)

Let \(\mathcal {N}\) be a MEMDP and *p* the transition function of \(\mathcal {G}_{\mathcal {N}}\). The *environment graph* \( GE _\mathcal {N}=( V _\mathcal {N}, E _\mathcal {N})\) for \(\mathcal {N}\) is a directed graph with vertices \( V _\mathcal {N}=\mathcal {P}\left( I\right) \) and edges \( E _\mathcal {N} = \{ (J, J') \mid J \ne J' \text { and } \exists s, s' \in S, a \in A, j \in J :p(\langle s,j,J \rangle , a)(\langle s',j,J' \rangle ) > 0 \}\).

### Example 2

Figure 2 shows the environment graph for the MEMDP in Ex. 1. It consists of the different belief-supports. For example, the transitions from \(\{1,2,3\}\) to \(\{2,3\}\) and to \(\{1\}\) are due to the action \(q_1\) in state \(s_0\), as shown in Fig. 1. \(\blacksquare \)

Paths in the environment graph abstract paths in the BOMDP. Path fragments where the belief-support remains unchanged are summarized into one step, as we do not create edges of the form \(\langle J,J \rangle \). We formalize this idea: Let \(\pi =\langle s_1, j, J_1 \rangle a_1\langle s_2, j, J_2 \rangle \dots \langle s_n, j, J_n \rangle \) be a path in the BOMDP. For any \(J \subseteq I\), we call \(\pi \) a *J-local path* if \(J_i = J\) for all \(i \in [n]\).

### Lemma 5

For a MEMDP \(\mathcal {N}\) with environment graph \( GE _\mathcal {N}\), there is a path \(J_1 \dots J_n\) in \( GE _\mathcal {N}\) iff there is a path \(\pi = \pi _1\dots \pi _n\) in \(\mathcal {G}_{\mathcal {N}}\) such that every \(\pi _i\) is \(J_i\)-local.

The shape of the environment graph is crucial for the algorithm we develop.

### Lemma 6

Let \( GE _\mathcal {N}= ( V _\mathcal {N}, E _\mathcal {N})\) be an environment graph for MEMDP \(\mathcal {N}\). First, \( E _\mathcal {N}(J,J')\) implies \(J' \subsetneq J\). Thus, \( GE _\mathcal {N}\) is acyclic and has maximal path length |*I*|. The maximal outdegree of the graph is \(|S|^2|A|\).

The monotonicity regarding \(J, J'\) follows from the definition of the belief update. The bound on the outdegree is a consequence of Lemma 9 below.

**Local belief-support BOMDPs.** Before we continue, we remark that the (future) dynamics in a BOMDP only depend on the current state and set of environments. More formally, we capture this intuition as follows.

### Lemma 7

Let \(\mathcal {G}_{\mathcal {N}}\) be a BOMDP with states \(S'\). For any state \(\langle s, j, J \rangle \in S'\), let \(\mathcal {N}' = \textsf{ReachFragment}({\mathcal {N}}_{\downarrow J}, \textsf{dirac}(s))\) and \(Y = \{ \langle s, i, J \rangle \mid i \in J\}\). Then:

The key insight is that restricting the MEMDP does not change the transition functions for the environments \(j \in J\). Furthermore, using monotonicity of the update, we only reach BOMDP-states whose behavior is determined by the environments in *J*.

This intuition allows us to analyze the BOMDP locally and lift the results to the complete BOMDP. We define a local BOMDP as the part of a BOMDP starting in any state in \(S_J\). All observations not in \(Z_J\) are made absorbing.

### Definition 10 (Local BOMDP)

Given a MEMDP \(\mathcal {N}\) with BOMDP \(\mathcal {G}_{\mathcal {N}}\) and a set of environments *J*, the *local BOMDP* \(\textsc {Loc}\mathcal {G}({J})\) for environments *J* is the fragment

This definition of a local BOMDP coincides with a fragment of the complete BOMDP. We then mark exactly the winning observations restricted to the environment sets \(J' \subsetneq J\) as winning in the local BOMDP and compute all winning observations in the local BOMDP. These observations are winning in the complete BOMDP. The following concretization of Lemma 4 formalizes this.

### Lemma 8

Consider a MEMDP \(\mathcal {N}\) and a subset of environments *J*.

Furthermore, local BOMDPs are polynomially bounded in the size of the MEMDP.

### Lemma 9

Let \(\mathcal {N}\) be a MEMDP with states *S* and actions *A*. \(\textsc {Loc}\mathcal {G}({J})\) has at most \(\mathcal {O}(|S|^2\cdot |A|\cdot |J|)\) states and \(\mathcal {O}(|S|^{2}\cdot |A|\cdot |J|^2)\) transitions^{Footnote 3}.

**A PSPACE algorithm.** We present Algorithm 1 for the MEMDP **decision problem**, which recurses depth-first over the paths in the environment graph^{Footnote 4}. We first state the correctness and the space complexity of this algorithm.

### Lemma 10

ASWinning in Alg. 1 solves the decision problem in PSPACE.

To prove correctness, we first note that \(\textsc {Search}(\mathcal {N}, J, T)\) computes \(\textsf{Win}_{\mathcal {G}_{\mathcal {N}}}^{T_{\mathcal {G}_{\mathcal {N}}}} \cap Z_J\). We show this by induction over the structure of the environment graph. For all *J* without outgoing edges, the local BOMDP coincides with a BOMDP just for environments *J* (Lemma 7). Otherwise, observe that \(T'\) in line 5 coincides with its definition in Lemma 8 and thus, by the same lemma, we return \(\textsf{Win}_{\mathcal {G}_{\mathcal {N}}}^{T_{\mathcal {G}_{\mathcal {N}}}} \cap Z_J\). To finalize the proof, a winning policy exists in the MEMDP iff the observations of the initial states of the BOMDP are winning (Theorem 1). The algorithm must terminate as it recurses over all paths of a finite acyclic graph, see Lemma 6. Following Lemma 9, the number of frontier states is bounded by \(|S|^2\cdot |A|\). The main body of the algorithm therefore requires polynomial space, and the maximal recursion depth (stack height) is |*I*| (Lemma 6). Together, this yields a space complexity in \(\mathcal {O}(|S|^{2}\cdot |A|\cdot |I|^2)\).
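The recursion can be sketched as follows. This is a high-level illustration only: constructing and model checking the local BOMDPs are abstracted behind two assumed helpers, `frontier_beliefs(J)` (the belief supports \(J' \subsetneq J\) with an edge from \(J\) in the environment graph) and `win_local(J, extra)` (the winning observations of the local BOMDP when the frontier observations in `extra` are additionally marked winning). An optional memo table avoids recomputation at the cost of the polynomial-space guarantee:

```python
# High-level sketch of the recursion behind Algorithm 1; the helpers
# frontier_beliefs and win_local are assumed, not implemented here.

def as_winning(I, init_obs, frontier_beliefs, win_local, memo=None):
    def search(J):
        J = frozenset(J)
        if memo is not None and J in memo:
            return memo[J]
        extra = set()
        for J2 in frontier_beliefs(J):   # recurse depth-first over GE_N
            extra |= search(J2)          # winning observations for J' ⊊ J
        result = win_local(J, extra)
        if memo is not None:
            memo[J] = result
        return result
    # A winning policy exists iff all initial observations are winning.
    return init_obs <= search(I)

# Trivial instance: one environment whose local BOMDP is entirely winning.
assert as_winning({1}, {"z0"},
                  frontier_beliefs=lambda J: [],
                  win_local=lambda J, extra: {"z0"},
                  memo={})
```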

### 4.2 Deciding Almost-Sure Winning for MEMDPs Is PSPACE-hard

One cannot improve upon this PSPACE algorithm, as the problem itself is PSPACE-hard.

### Lemma 11

The MEMDP decision problem is PSPACE-hard.

Hardness holds even for acyclic MEMDPs and uses the following fact.

### Lemma 12

If a winning policy exists for an acyclic MEMDP, there also exists a winning policy that is deterministic.

In particular, for acyclic MEMDPs, almost-sure reachability coincides with avoiding the sink states, which is a safety property. For safety, deterministic policies are sufficient, as randomization only visits additional states, which is never beneficial for safety.

Regarding Lemma 11, we sketch a polynomial-time reduction from the PSPACE-complete TQBF problem [20] to the MEMDP decision problem. Let \(\Psi \) be a QBF formula, \( \Psi = \exists x_{1} \forall y_{1} \exists x_{2} \forall y_{2} \ldots \exists x_{n} \forall y_{n}\big [\Phi \big ] \), with \(\Phi \) a Boolean formula in conjunctive normal form. The problem is to decide whether \(\Psi \) is true.

### Example 3

Consider the QBF formula \(\Psi = \forall x \exists y \big [ (x \vee y) \wedge (\lnot x \vee \lnot y) \big ]\). We construct a MEMDP with an environment for every clause, see Figure 3^{Footnote 5}. The state space consists of three states for each variable \(v \in V\): the state *v* and the states \(v\top \) and \(v\bot \) that encode its assignment. Additionally, we have a dedicated target state *W* and sink state *F*. We consider three actions: The actions *true* (\(\top \)) and *false* (\(\bot \)) semantically describe the assignment to existentially quantified variables. The action *any* (\(\alpha _\otimes \)) is used for all other states. In every environment, the target state is reached iff at least one literal of the corresponding clause is assigned true.

In the example, intuitively, a policy should assign the negation of *x* to *y*. Formally, the policy \(\sigma \), characterized by \( \sigma (\pi \cdot y) = \top \) iff \(x\bot \in \pi \), is winning. \(\blacksquare \)

As a consequence of this construction, we may also deduce the following theorem.

### Theorem 3

Deciding whether a memoryless winning policy exists is NP-complete.

The proof of NP-hardness uses a similar construction for the propositional SAT fragment of QBF, without universal quantifiers. Additionally, the problem for memoryless policies is in NP, because one can nondeterministically guess a (polynomially sized) memoryless policy and verify it in each environment independently.
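The per-environment verification step can be sketched concretely: in the finite Markov chain induced by a memoryless deterministic policy, the target is reached almost surely iff, after making the targets absorbing, the target remains reachable from every reachable state. A sketch (encoding and names are our own assumptions):

```python
# Sketch of the NP verification step: given a guessed memoryless
# deterministic policy sigma (dict state -> action), check almost-sure
# reachability of T in each environment's induced Markov chain.

def reaches(p, sigma, T, start):
    """States reachable from `start` under sigma, with targets absorbing."""
    seen, stack = set(start), list(start)
    while stack:
        s = stack.pop()
        if s in T:
            continue  # treat target states as absorbing
        for s2, prob in p[(s, sigma[s])].items():
            if prob > 0 and s2 not in seen:
                seen.add(s2)
                stack.append(s2)
    return seen

def almost_sure(p, sigma, T, init):
    """T reached with probability one iff T is reachable from every
    reachable state (qualitative characterization in finite chains)."""
    return all(T & reaches(p, sigma, T, {s})
               for s in reaches(p, sigma, T, init))

def winning_in_all(ps, sigma, T, init):
    """sigma is winning iff it reaches T almost surely in every MDP."""
    return all(almost_sure(ps[i], sigma, T, init) for i in ps)

p_good = {("s0", "a"): {"t": 1.0}}
p_bad = {("s0", "a"): {"f": 1.0}, ("f", "a"): {"f": 1.0}}
sigma = {"s0": "a", "f": "a"}
assert almost_sure(p_good, sigma, {"t"}, {"s0"})
assert not winning_in_all({1: p_good, 2: p_bad}, sigma, {"t"}, {"s0"})
```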

### 4.3 Policy Problem

Policies, mapping histories to actions, are generally infinite objects. However, we may extract winning policies from the BOMDP, which is (only) exponential in the MEMDP. Finite state controllers [34] are a suitable and widespread representation of policies that require only a finite amount of memory. Intuitively, the number of memory states reflects the number of equivalence classes of histories that a policy can distinguish. In general, we cannot hope to find smaller policies than those obtained via a BOMDP.

### Theorem 4

There is a family of MEMDPs \(\{ \mathcal {N}^n \}_{n \ge 1}\) where for each *n*, \(\mathcal {N}^n\) has 2*n* environments and \(\mathcal {O}(n)\) states and where every winning policy for \(\mathcal {N}^n\) requires at least \(2^n\) memory states.

We illustrate the witness. Consider a family of MEMDPs \(\{\mathcal {N}^n\}_n\), where \(\mathcal {N}^n\) has 2*n* MDPs, 4*n* states partitioned into two parts, and at most 2*n* outgoing actions per state. We outline the MEMDP family in Figure 4. In the first part, there is only one action per state. The notation is as follows: in state \(s_0\) and MDP \(\mathcal {N}^n_1\), we transition with probability one to state \(a_0\), whereas in \(\mathcal {N}^n_2\) we transition with probability one to state \(b_0\). In every other MDP, we transition with probability one half to either state. In state \(s_1\), we use the analogous construction for environments 3, 4, and all others. A path \(s_0b_0 \ldots \) is thus consistent with every MDP except \(\mathcal {N}^n_1\). The first part ends in state \(s_n\). By construction, there are \(2^n\) paths ending in \(s_n\). Each of them is (in)consistent with a unique set of *n* environments. In the second part, a policy may guess an environment *n* times by selecting one of the actions \(\alpha _i\), \(i \in [2n]\). Only in MDP \(\mathcal {N}^n_i\), action \(\alpha _i\) leads to a target state. In all other MDPs, the transition leads from state \(g_j\) to \(g_{j+1}\). The state \(g_{n+1}\) is absorbing in all MDPs. Importantly, after taking an action \(\alpha _i\) and arriving in \(g_{j+1}\), there is (at most) one more MDP inconsistent with the path.

Every MEMDP \(\mathcal {N}^n\) in this family has a winning policy, which selects \(\sigma (\pi \cdot g_i) = \alpha _{2i-1}\) if \(a_{i} \in \pi \) and \(\sigma (\pi \cdot g_i) = \alpha _{2i}\) otherwise. Furthermore, when arriving in state \(s_n\), the memory state of a finite-memory controller must reflect the precise set of environments consistent with the history. There are \(2^{n}\) such sets. The proof shows that if we store less information, two paths with different sets of consistent environments lead to the same memory state. As we can rule out only *n* environments using the *n* actions in the second part of the MEMDP, we cannot ensure winning in every environment.
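The environment bookkeeping that drives this lower bound, i.e., tracking which environments remain consistent with an observed path, can be sketched as a belief-support update. The dictionary-based transition encoding and all names below are illustrative assumptions, not the paper's data structures:

```python
def update_belief_support(J, s, a, s_next, trans):
    """Keep only the environments that can produce the observed
    transition (s, a, s_next) with positive probability.
    `trans[i]` maps (state, action, successor) to a probability."""
    return frozenset(i for i in J if trans[i].get((s, a, s_next), 0.0) > 0.0)


# First step of the Theorem 4 gadget with 2n = 4 environments:
# environment 1 surely moves s0 -> a0, environment 2 surely moves s0 -> b0,
# every other environment moves to either state with probability 1/2.
trans = {
    1: {("s0", "go", "a0"): 1.0},
    2: {("s0", "go", "b0"): 1.0},
    3: {("s0", "go", "a0"): 0.5, ("s0", "go", "b0"): 0.5},
    4: {("s0", "go", "a0"): 0.5, ("s0", "go", "b0"): 0.5},
}
# Observing b0 rules out exactly environment 1.
assert update_belief_support({1, 2, 3, 4}, "s0", "go", "b0", trans) == {2, 3, 4}
```

Iterating this update over the first part of the gadget produces the \(2^n\) distinct belief supports that any winning controller must distinguish.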

## 5 A Partial Game Exploration Algorithm

In this section, we present an algorithm for the policy problem. We tune the algorithm towards runtime rather than memory complexity, but aim to avoid running out of memory. We use several key ingredients to create a pragmatic variation of Alg. 1, with support for extracting the winning policy.

First, we use an abstraction from BOMDPs to a belief stochastic game (BSG) similar to [45] that reduces the number of states and simplifies the iterative construction^{Footnote 6}. Second, we tailor and generalize ideas from *bounded model checking* [6] to build and model check only a fragment of the BSG, using explicit *partial exploration* approaches as in, e.g., [9, 29, 33, 42]. Third, our exploration does not only continuously extend the fragment, but can also prune the fragment using the model checking results obtained so far. The structure of the BSG, as captured by the environment graph, makes this approach promising and yields some natural heuristics. Fourth, the structure of the winning region allows us to generalize results to unseen states; we thereby operationalize an idea from [26] in a partial-exploration context. Finally, we analyze individual MDPs as an efficient and significant preprocessing step. In the following, we discuss these ingredients.

**Abstraction to Belief Support Games.** We briefly recap stochastic games (SGs). See [17, 38] for more details.

### Definition 11 (SG)

A *stochastic game* is a tuple \(\mathcal {B}= \langle \mathcal {M}, S_{1}, S_{2} \rangle \), where \(\mathcal {M}= \langle S, A, \iota _{\text {init}}, p \rangle \) is an MDP and \((S_{1}, S_{2})\) is a partition of *S*.

\(S_{1}\) are Player 1 states, and \(S_{2}\) are Player 2 states. As common, we also 'partition' (memoryless deterministic) policies into two functions \({\sigma _{1} :S_{1} \rightarrow A}\) and \({\sigma _{2} :S_{2} \rightarrow A}\). A Player 1 policy \(\sigma _{1}\) is winning for state *s* if \(\Pr (T \mid \sigma _1, \sigma _2) = 1\) for all \(\sigma _2\). We (re)use \(\textsf{Win}_{\mathcal {B}_{\mathcal {N}}}^{T}\) to denote the set of states with a winning policy.

We apply a game-based abstraction to group states that have the same observation. Player 1 states capture the observations in the BOMDP, i.e., tuples \(\langle s, J \rangle \) of MEMDP states *s* and subsets *J* of the environments. Player 1 selects an action *a*; the result is the Player 2 state \(\langle \langle s, J \rangle , a \rangle \). Then Player 2 chooses an environment \(j\in J\), and the game mimics the outgoing transition from \(\langle s, j, J \rangle \), i.e., it mimics the transition from *s* in \(\mathcal {N}_j\). Formally:

### Definition 12 (BSG)

Let \(\mathcal {G}_{\mathcal {N}}\) be a BOMDP with \(\mathcal {G}_{\mathcal {N}}= \langle \langle S, A, \iota _{\text {init}}, p \rangle , Z, O \rangle \). A *belief support game* \(\mathcal {B}_{\mathcal {N}}\) for \(\mathcal {G}_{\mathcal {N}}\) is an SG \(\mathcal {B}_{\mathcal {N}}= \langle \langle S', A', \iota _{\text {init}}', p \rangle , S_{1}, S_{2} \rangle \) with \(S' = S_{1} \cup S_{2}\) as usual, Player 1 states \(S_{1} = Z\), Player 2 states \(S_2= Z \times A\), actions \(A' = A \cup I\), initial distribution \(\iota _{\text {init}}'(\langle s, I \rangle ) = \sum _{i \in I} \iota _{\text {init}}(\langle s, i, I \rangle )\), and the (partial) transition function *p* defined separately for Player 1 and Player 2:

### Lemma 13

An (acyclic) MEMDP \(\mathcal {N}\) with target states *T* is winning if(f) there exists a winning policy in the BSG \(\mathcal {B}_{\mathcal {N}}\) with target states \(T_Z\).

Thus, on acyclic MEMDPs, a BSG-based algorithm is sound and complete; on cyclic MEMDPs, however, it may not find a winning policy. The remainder of the algorithm is formulated on the BSG. We use the term *sliced BSG* for the BSG of a sliced BOMDP or, equivalently, for a BSG with some states made absorbing.

**Main algorithm.** We outline Algorithm 2 for the *policy problem*. We track the sets of almost-surely winning and losing observations (states in the BSG). Initially, target states are winning. Furthermore, via a simple preprocessing, we determine some winning and losing states on the individual MDPs.
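For the preprocessing on individual MDPs, one can use the standard double fixpoint for almost-sure reachability: repeatedly restrict to states that reach the target with positive probability using only actions that stay inside the current candidate set. The sketch below, including its `enabled`/`succ` interface, is an illustrative assumption, not the tool's implementation:

```python
def almost_sure_winning(states, enabled, succ, targets):
    """States of a single MDP from which some policy reaches `targets`
    with probability 1. `enabled(s)` yields the actions of s, and
    `succ(s, a)` the successors reached with positive probability."""
    live = set(states)
    while True:
        # only actions whose successors stay inside the candidate set
        allowed = {s: [a for a in enabled(s) if set(succ(s, a)) <= live]
                   for s in live}
        reach = set(targets) & live  # inner least fixpoint:
        frontier = set(reach)        # positive-probability reachability
        while frontier:
            frontier = {s for s in live - reach
                        if any(set(succ(s, a)) & reach for a in allowed[s])}
            reach |= frontier
        if reach == live:
            return reach             # outer greatest fixpoint reached
        live = reach

# s0 --a--> s1 --a--> {s0, s2}; s0 --b--> s3 (sink); s2 is the target.
# Avoiding action b, a policy reaches s2 almost surely from s0 and s1.
actions = {"s0": ["a", "b"], "s1": ["a"], "s2": ["a"], "s3": ["a"]}
table = {("s0", "a"): {"s1"}, ("s0", "b"): {"s3"},
         ("s1", "a"): {"s0", "s2"}, ("s2", "a"): {"s2"}, ("s3", "a"): {"s3"}}
win = almost_sure_winning(actions, actions.get, lambda s, a: table[(s, a)], {"s2"})
assert win == {"s0", "s1", "s2"}
```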

We iterate until the initial state is winning or losing. Our algorithm constructs a sliced BSG and decides *on-the-fly* whether a state should be a frontier state, returning the sliced BSG and the used frontier states. We discuss the implementation below. For the sliced BSG, we compute the winning region twice: once assuming that the frontier states are winning, and once assuming they are losing. This yields an approximation of the winning and losing states, see Lemma 4. From the winning states, we can extract a randomized winning policy [13].

*Soundness.* Assume that \(\mathcal {B}_{\mathcal {N}}\) is indeed a sliced BSG with frontier *F*. Then the following invariant holds: \( W \subseteq \textsf{Win}_{\mathcal {B}_{\mathcal {N}}}^{T} \text { and } L \cap \textsf{Win}_{\mathcal {B}_{\mathcal {N}}}^{T} = \emptyset . \) This invariant exploits that from a sliced BSG we can (implicitly) slice the complete BSG while preserving the winning status of every state, as formalized below. In future iterations, we only explore the implicitly sliced BSG.

### Lemma 14

Given \(W \subseteq \textsf{Win}_{\mathcal {B}_{\mathcal {N}}}^{T_{\mathcal {B}_{\mathcal {N}}}}\) and \(L \subseteq S \setminus \textsf{Win}_{\mathcal {B}_{\mathcal {N}}}^{T_{\mathcal {B}_{\mathcal {N}}}}\): \( \textsf{Win}_{\mathcal {B}_{\mathcal {N}}}^{T_{\mathcal {B}_{\mathcal {N}}}} = \textsf{Win}_{{\mathcal {B}_{\mathcal {N}}}{\mid \!W \cup L}}^{T_{\mathcal {B}_{\mathcal {N}}} \cup W} \)

*Termination* depends on the sliced-game generation. It suffices to ensure that, in the long run, either *W* or *L* grows, as there are only finitely many states. If *W* and *L* remain unchanged for more than some number of iterations, \(W \cup L\) is used as the frontier. Then, the new game suffices to determine whether \(s\in W\) in one shot.
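Schematically, the main loop described above can be sketched as follows. Here `build_slice` and `solve` are stand-ins for the slicing and game-solving routines; this is a sketch of the loop structure, not the tool's code:

```python
def policy_problem(initial, build_slice, solve):
    """W and L are sound under-approximations of the winning and losing
    observations. Each iteration model checks a sliced game twice: once
    with the frontier assumed winning (over-approximating the winning
    region) and once with it assumed losing (under-approximating it)."""
    W, L = set(), set()
    while initial not in W and initial not in L:
        states, frontier = build_slice(W, L)
        win_upper = solve(states, frontier, frontier_wins=True)
        win_lower = solve(states, frontier, frontier_wins=False)
        W |= win_lower                        # surely winning
        L |= (states - frontier) - win_upper  # surely losing
    return initial in W


# Toy stand-ins: a three-state game solved exactly in one pass.
def build_slice(W, L):
    return ({"s0", "t", "dead"}, set())

def solve(states, frontier, frontier_wins):
    return {"s0", "t"}

assert policy_problem("s0", build_slice, solve) is True
assert policy_problem("dead", build_slice, solve) is False
```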

**Generating the sliced BSG.** Algorithm 3 outlines the generation of the sliced BSG. In particular, we explore the implicit BSG from the initial state but make every state that we do not explicitly explore absorbing. In every iteration, we first check whether there are states left to explore in *Q* and whether the number of explored states in *E* is below a threshold \(\texttt {bound}[i]\). Then, we take a state from the priority queue and add it to *E*. We find new reachable states^{Footnote 7} and add them to the queue *Q*.
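The bounded, priority-driven exploration can be sketched as follows. The `priority` function is a placeholder for, e.g., the entropy-based scores evaluated in Section 6, and the names are illustrative, mirroring the outline of Algorithm 3:

```python
import heapq

def explore_slice(initial, successors, priority, bound):
    """Explore the implicit game from `initial`, visiting at most `bound`
    states in ascending `priority` order. Everything reached but not
    explored is returned as the frontier (to be made absorbing)."""
    explored = set()
    seen = {initial}
    queue = [(priority(initial), initial)]
    while queue and len(explored) < bound:
        _, s = heapq.heappop(queue)
        explored.add(s)
        for t in successors(s):
            if t not in seen:  # push each state at most once
                seen.add(t)
                heapq.heappush(queue, (priority(t), t))
    return explored, seen - explored

# A chain 0 -> 1 -> ... -> 5 explored with bound 3: state 3 is the frontier.
explored, frontier = explore_slice(
    0, lambda s: [s + 1] if s < 5 else [], lambda s: s, 3)
assert explored == {0, 1, 2} and frontier == {3}
```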

**Generalizing the winning and losing states.** We aim to determine that a state in the game \(\mathcal {B}_{\mathcal {N}}\) is winning without ever exploring it. First, observe:

### Lemma 15

A winning policy in MEMDP \(\mathcal {N}\) is winning in \({\mathcal {N}}_{\downarrow J}\) for any *J*.

A direct consequence is the following statement for two environment sets \(J_1 \subseteq J_2\): if \(\langle s, J_2 \rangle \in \textsf{Win}_{\mathcal {B}_{\mathcal {N}}}^{T}\), then also \(\langle s, J_1 \rangle \in \textsf{Win}_{\mathcal {B}_{\mathcal {N}}}^{T}\).

Consequently, we can store *W* (and symmetrically, *L*) as follows. For every MEMDP state \(s \in S\), the set \(W_s = \{ J \mid \langle s, J \rangle \in W\}\) is downward closed with respect to the partial order \((2^I, \subseteq )\). This allows for efficient storage: we only have to store the set \(W_s^{\max }\) of pairwise maximal elements, i.e., the antichain of maximal elements of \(W_s\).

To determine whether \(\langle s,J \rangle \) is winning, we check whether \(J \subseteq J'\) for some \(J' \in W_s^{\max }\). Adding *J* to \(W_s^{\max }\) requires removing all \(J' \subseteq J\) and then adding *J*. Note, however, that \(|W_s^{\max }|\) is still exponential in |*I*| in the worst case.
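A minimal sketch of this antichain storage, with illustrative names, is the following:

```python
class Antichain:
    """Store W_s as its pairwise-maximal environment subsets.

    Membership: J is winning iff J is a subset of some stored maximal
    set (the downward closure is kept implicit)."""

    def __init__(self):
        self.max_sets = []  # pairwise incomparable frozensets

    def contains(self, J):
        J = frozenset(J)
        return any(J <= Jp for Jp in self.max_sets)

    def add(self, J):
        J = frozenset(J)
        if self.contains(J):  # already dominated by a maximal element
            return
        # remove all sets subsumed by the new one, then insert it
        self.max_sets = [Jp for Jp in self.max_sets if not Jp <= J]
        self.max_sets.append(J)


ac = Antichain()
ac.add({1, 2})
ac.add({2, 3})
ac.add({2})                     # dominated, not stored
assert ac.contains({2})         # via {1, 2} (or {2, 3})
assert not ac.contains({1, 3})
assert len(ac.max_sets) == 2    # antichain: {1, 2} and {2, 3}
```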

**Selection of heuristics.** The algorithm leaves several degrees of freedom, which we evaluate empirically. (1) The maximal size \(\texttt {bound}[i]\) of a sliced BSG at iteration *i* is critical. If it is too small, the sets *W* and *L* grow only slowly in each iteration. The trade-off is further complicated by the fact that the sets *W* and *L* may generalize to unseen states. (2) For a fixed \(\texttt {bound}[i]\), it is unclear how to prioritize the exploration of states. The PSPACE algorithm suggests that going deep is good, whereas the potential for generalization to unseen states is largest when going broad. (3) Finally, there is overhead in computing both *W* and *L*. If there is a winning policy, we only need to compute *W*; however, computing *L* may allow pruning parts of the state space. A similar observation holds for computing *W* on unsatisfiable instances.

### Remark 1

Algorithm 2 can be mildly tweaked to meet the PSPACE algorithm in Algorithm 1. The priority queue must always include complete (reachable) local BSGs and explore states \(\langle s, J \rangle \) with small *J* first. Furthermore, *W* and *L* require regular pruning, and we cannot extract a policy if we prune *W* to a polynomial size bound. Practically, we may write pruned parts of *W* to disk.

## 6 Experiments

We highlight two aspects: (1) a comparison of our prototype to existing baselines for POMDPs, and (2) an examination of the exploration heuristics. The technical report [41] contains details on the implementation, the benchmarks, and more results.

*Implementation.* We provide a novel *PArtial Game Exploration* (PaGE) prototype, based on Algorithm 2, on top of the probabilistic model checker Storm [22]. We represent MEMDPs using the Prism language with integer constants. Every assignment to these constants induces an explicit MDP. SGs are constructed and solved using existing data structures and graph algorithms.

*Setup.* We create a set of benchmarks inspired by the POMDP and MEMDP literature [12, 21, 26]. We consider a combination of satisfiable and unsatisfiable benchmarks; in the latter case, a winning policy does not exist. We construct POMDPs from MEMDPs as in Definition 5. As baselines, we use the following two existing POMDP algorithms. For almost-sure properties, a *belief-MDP construction* [7] acts similarly to an efficiently engineered variant of our game construction, but is tailored towards more general quantitative properties. A *SAT-based approach* [26] aims to find increasingly larger policies. We evaluate all benchmarks on a system with a 3GHz Intel Core i9-10980XE processor. We use a time limit of 30 minutes and a memory limit of 32 GB.

*Results.* Figure 5 shows the (log-scale) performance comparisons between different configurations^{Footnote 8}. Green circles reflect satisfiable and red crosses unsatisfiable benchmarks. The x-axis shows PaGE in its default configuration. The first plot compares to the belief-MDP construction. The tailored heuristics and representation of the belief support give a significant edge in almost all cases. The few points below the line are due to a higher exploration rate when building the state space. The second plot compares to the SAT-based approach, which is only suitable for finding policies, not for disproving their existence. This approach implicitly searches for a particular class of policies, whose structure is not appropriate for some MEMDPs. The third plot compares PaGE in the default configuration (with negative entropy as priority function) to PaGE using positive entropy. As expected, different priorities have a significant impact on the performance.

Table 1 gives an overview of the satisfiable and unsatisfiable benchmarks. Each part of the table lists the number of environments, states, and actions per state in the MEMDP. For PaGE, we include both the default configuration (negative entropy) and the variation (positive entropy). For both configurations, we provide columns with the time and the maximum size of the BSG constructed. We also include the times for the two baselines. Unsurprisingly, the number of states to be explored is a good predictor for the performance, and the relative performance is as in Fig. 5.

## 7 Conclusion

This paper considers multi-environment MDPs with an arbitrary number of environments and an almost-sure reachability objective. We show novel and tight complexity bounds and use these insights to derive a new algorithm. This algorithm outperforms approaches for POMDPs on a broad set of benchmarks. For future work, we will apply an algorithm directly on the BOMDP [16].

## Data Availability Statement

Supplementary material related to this paper is openly available on Zenodo at: https://doi.org/10.5281/zenodo.7560675

## Notes

- 1.
Hidden-parameter MDPs differ from MEMDPs in that they assume a prior over MDPs. However, for almost-sure properties, this difference is irrelevant.

- 2.
This translation is notationally simpler than going via the union-POMDP.

- 3.
The number of transitions is the number of nonzero entries in *p*.

- 4.
In contrast to depth-first search, we do not memorize nodes we visited earlier.

- 5.
We depict a slightly simplified MEMDP for conciseness.

- 6.
At the time of writing, we were unaware of a polytime algorithm for BOMDPs.

- 7.
In l. 5 we do not rebuild the game \(\mathcal {B}\) from scratch but incrementally construct the data structures. Likewise, reachable states are a direct byproduct of this construction.

- 8.
Every point \(\langle x,y \rangle \) in the graph reflects a benchmark which was solved by the configuration on the x-axis in *x* time and by the configuration on the y-axis in *y* time. Points above the diagonal are thus faster for the configuration on the x-axis.

## References

1. Roman Andriushchenko, Milan Ceska, Sebastian Junges, Joost-Pieter Katoen, and Simon Stupinský. PAYNT: A tool for inductive synthesis of probabilistic programs. In *CAV*, volume 12759 of *LNCS*, pages 856–869. Springer, 2021.
2. Sebastian Arming, Ezio Bartocci, Krishnendu Chatterjee, Joost-Pieter Katoen, and Ana Sokolova. Parameter-independent strategies for pMDPs via POMDPs. In *QEST*, volume 11024 of *LNCS*, pages 53–70. Springer, 2018.
3. Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. Sequential transfer in multi-armed bandit with finite set of models. In *NIPS*, pages 2220–2228, 2013.
4. Christel Baier, Marcus Größer, and Nathalie Bertrand. Probabilistic \(\omega \)-automata. *J. ACM*, 59(1):1:1–1:52, 2012.
5. Christel Baier and Joost-Pieter Katoen. *Principles of Model Checking*. MIT Press, 2008.
6. Armin Biere, Alessandro Cimatti, Edmund M. Clarke, Ofer Strichman, and Yunshan Zhu. Bounded model checking. *Adv. Comput.*, 58:117–148, 2003.
7. Alexander Bork, Sebastian Junges, Joost-Pieter Katoen, and Tim Quatmann. Verification of indefinite-horizon POMDPs. In *ATVA*, volume 12302 of *LNCS*, pages 288–304. Springer, 2020.
8. Alexander Bork, Joost-Pieter Katoen, and Tim Quatmann. Under-approximating expected total rewards in POMDPs. In *TACAS (2)*, volume 13244 of *LNCS*, pages 22–40. Springer, 2022.
9. Tomás Brázdil, Krishnendu Chatterjee, Martin Chmelik, Vojtech Forejt, Jan Kretínský, Marta Z. Kwiatkowska, David Parker, and Mateusz Ujma. Verification of Markov decision processes using learning algorithms. In *ATVA*, volume 8837 of *LNCS*, pages 98–114. Springer, 2014.
10. Peter Buchholz and Dimitri Scheftelowitsch. Computation of weighted sums of rewards for concurrent MDPs. *Math. Methods Oper. Res.*, 89(1):1–42, 2019.
11. Iadine Chades, Josie Carwardine, Tara G. Martin, Samuel Nicol, Régis Sabbadin, and Olivier Buffet. MOMDPs: A solution for modelling adaptive management problems. In *AAAI*. AAAI Press, 2012.
12. Krishnendu Chatterjee, Martin Chmelik, and Jessica Davies. A symbolic SAT-based algorithm for almost-sure reachability with small strategies in POMDPs. In *AAAI*, pages 3225–3232. AAAI Press, 2016.
13. Krishnendu Chatterjee, Martin Chmelik, Raghav Gupta, and Ayush Kanodia. Optimal cost almost-sure reachability in POMDPs. *Artif. Intell.*, 234:26–48, 2016.
14. Krishnendu Chatterjee, Martin Chmelík, Deep Karkhanis, Petr Novotný, and Amélie Royer. Multiple-environment Markov decision processes: Efficient analysis and applications. In *ICAPS*, pages 48–56. AAAI Press, 2020.
15. Krishnendu Chatterjee, Martin Chmelik, and Mathieu Tracol. What is decidable about partially observable Markov decision processes with omega-regular objectives. In *CSL*, volume 23 of *LIPIcs*, pages 165–180. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2013.
16. Krishnendu Chatterjee, Martin Chmelik, and Mathieu Tracol. What is decidable about partially observable Markov decision processes with \(\omega \)-regular objectives. *J. Comput. Syst. Sci.*, 82(5):878–911, 2016.
17. Krishnendu Chatterjee, Marcin Jurdzinski, and Thomas A. Henzinger. Simple stochastic parity games. In *CSL*, volume 2803 of *LNCS*, pages 100–113. Springer, 2003.
18. Philipp Chrszon, Clemens Dubslaff, Sascha Klüppelholz, and Christel Baier. ProFeat: feature-oriented engineering for family-based probabilistic model checking. *Formal Aspects Comput.*, 30(1):45–75, 2018.
19. Luca de Alfaro. The verification of probabilistic systems under memoryless partial-information policies is hard. Technical report, UC Berkeley, 1999. Presented at ProbMiV.
20. M. R. Garey and David S. Johnson. *Computers and Intractability: A Guide to the Theory of NP-Completeness*. W. H. Freeman, 1979.
21. Arnd Hartmanns, Michaela Klauck, David Parker, Tim Quatmann, and Enno Ruijters. The quantitative verification benchmark set. In *TACAS (1)*, volume 11427 of *LNCS*, pages 344–350. Springer, 2019.
22. Christian Hensel, Sebastian Junges, Joost-Pieter Katoen, Tim Quatmann, and Matthias Volk. The probabilistic model checker Storm. *Int. J. Softw. Tools Technol. Transf.*, 24(4):589–610, 2022.
23. Manfred Jaeger, Giorgio Bacci, Giovanni Bacci, Kim Guldstrand Larsen, and Peter Gjøl Jensen. Approximating Euclidean by imprecise Markov decision processes. In *ISoLA (1)*, volume 12476 of *LNCS*, pages 275–289. Springer, 2020.
24. Nils Jansen, Sebastian Junges, and Joost-Pieter Katoen. Parameter synthesis in Markov models: A gentle survey. *CoRR*, abs/2207.06801, 2022.
25. Bengt Jonsson and Kim Guldstrand Larsen. Specification and refinement of probabilistic processes. In *LICS*, pages 266–277. IEEE Computer Society, 1991.
26. Sebastian Junges, Nils Jansen, and Sanjit A. Seshia. Enforcing almost-sure reachability in POMDPs. In *CAV (2)*, volume 12760 of *LNCS*, pages 602–625. Springer, 2021.
27. Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. *Artif. Intell.*, 101(1-2):99–134, 1998.
28. Robert Kirk, Amy Zhang, Edward Grefenstette, and Tim Rocktäschel. A survey of generalisation in deep reinforcement learning. *CoRR*, abs/2111.09794, 2021.
29. Jan Kretínský and Tobias Meggendorfer. Of cores: A partial-exploration framework for Markov decision processes. *Log. Methods Comput. Sci.*, 16(4), 2020.
30. Marta Kwiatkowska, Gethin Norman, and Dave Parker. PRISM 4.0: Verification of probabilistic real-time systems. In *CAV*, volume 6806 of *LNCS*, pages 585–591. Springer, 2011.
31. Michael L. Littman, Anthony R. Cassandra, and Leslie Pack Kaelbling. Learning policies for partially observable environments: Scaling up. In *ICML*, pages 362–370. Morgan Kaufmann, 1995.
32. Omid Madani, Steve Hanks, and Anne Condon. On the undecidability of probabilistic planning and related stochastic optimization problems. *Artif. Intell.*, 147(1-2):5–34, 2003.
33. H. Brendan McMahan, Maxim Likhachev, and Geoffrey J. Gordon. Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees. In *ICML*, volume 119 of *ACM International Conference Proceeding Series*, pages 569–576. ACM, 2005.
34. Nicolas Meuleau, Leonid Peshkin, Kee-Eung Kim, and Leslie Pack Kaelbling. Learning finite-state controllers for partially observable environments. In *UAI*, pages 427–436. Morgan Kaufmann, 1999.
35. Gethin Norman, David Parker, and Xueyi Zou. Verification and control of partially observable probabilistic systems. *Real Time Syst.*, 53(3):354–402, 2017.
36. Jean-François Raskin and Ocan Sankur. Multiple-environment Markov decision processes. In *FSTTCS*, volume 29 of *LIPIcs*, pages 531–543. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2014.
37. John H. Reif. The complexity of two-player games of incomplete information. *J. Comput. Syst. Sci.*, 29(2):274–301, 1984.
38. L. S. Shapley. Stochastic games. *Proceedings of the National Academy of Sciences*, 39(10):1095–1100, 1953.
39. Trey Smith and Reid G. Simmons. Point-based POMDP algorithms: Improved analysis and implementation. In *UAI*, pages 542–547. AUAI Press, 2005.
40. Lauren N. Steimle, David L. Kaufman, and Brian T. Denton. Multi-model Markov decision processes. *IISE Trans.*, 53(10):1124–1139, 2021.
41. Marck van der Vegt, Nils Jansen, and Sebastian Junges. Robust almost-sure reachability in multi-environment MDPs. *CoRR*, abs/2301.11296, 2023.
42. Matthias Volk, Sebastian Junges, and Joost-Pieter Katoen. Fast dynamic fault tree analysis by model checking techniques. *IEEE Trans. Ind. Informatics*, 14(1):370–379, 2018.
43. Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes. *Math. Oper. Res.*, 38(1):153–183, 2013.
44. Tobias Winkler, Sebastian Junges, Guillermo A. Pérez, and Joost-Pieter Katoen. On the complexity of reachability in parametric Markov decision processes. In *CONCUR*, volume 140 of *LIPIcs*, pages 14:1–14:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
45. Leonore Winterer, Sebastian Junges, Ralf Wimmer, Nils Jansen, Ufuk Topcu, Joost-Pieter Katoen, and Bernd Becker. Strategy synthesis for POMDPs in robot planning via game-based abstractions. *IEEE Trans. Autom. Control.*, 66(3):1040–1054, 2021.


## Rights and permissions

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Copyright information

© 2023 The Author(s)

## About this paper

### Cite this paper

van der Vegt, M., Jansen, N., Junges, S. (2023). Robust Almost-Sure Reachability in Multi-Environment MDPs. In: Sankaranarayanan, S., Sharygina, N. (eds) Tools and Algorithms for the Construction and Analysis of Systems. TACAS 2023. Lecture Notes in Computer Science, vol 13993. Springer, Cham. https://doi.org/10.1007/978-3-031-30823-9_26


Print ISBN: 978-3-031-30822-2

Online ISBN: 978-3-031-30823-9
