Simple Strategies in Multi-Objective MDPs

We consider the verification of multiple expected reward objectives at once on Markov decision processes (MDPs). This enables a trade-off analysis among multiple objectives by obtaining a Pareto front. We focus on strategies that are easy to employ and implement, i.e., strategies that are pure (no randomization) and have bounded memory. We show that checking whether a point is achievable by a pure stationary strategy is NP-complete, even for two objectives, and we provide an MILP encoding to solve the corresponding problem. The bounded-memory case is treated by a product construction. Experimental results using Storm and Gurobi show the feasibility of our algorithms.


Introduction
MDPs. Markov decision processes (MDPs) [4,3] are a key model in stochastic decision making. The classical setting involves a system subject to a stochastic model of its environment, and the goal is to synthesize a system controller, represented as a strategy for the MDP, ensuring a given level of expected performance. Tools such as Prism [30] and Storm [16] support MDP model checking.
Multi-objective MDPs. MDPs where the goal is to achieve a combination of objectives (rather than just one) are popular in, e.g., AI [41] and verification [2]. This is driven by applications where controllers have to fulfill multiple, potentially conflicting objectives, requiring a trade-off analysis. This includes multi-dimensional MDPs [14,20,40,13], where weight vectors are aggregated at each step, and MDPs where the specification mixes different views (e.g., average and worst-case performance) of the same weight [11,8]. With multiple objectives, optimal strategies no longer exist in general: instead, Pareto-optimal strategies are considered. The Pareto front, i.e., the set of non-dominated achievable value vectors, is usually non-trivial. Elaborate techniques are needed to explore it efficiently, e.g., [23,24].
Simple strategies. Another stumbling block in multi-objective MDPs is the complexity of strategies: Pareto-optimal strategies typically need both memory and randomization. A simple conjunction of reachability objectives already requires randomization and exponential memory (in the number of reachability sets) [40].
Some complex objectives even need infinite memory, e.g., [11,8]. In controller synthesis, strategies requiring randomization and/or (much) memory may not be practical. Limited-memory strategies are required on devices with limited resources [7]. Randomization is elegant and powerful from a theoretical view, but has practical limitations, e.g., it limits reproducibility, which complicates debugging. Randomized strategies are also often despised for medical applications [33] and product design: all products should have the same design, not a random one. This motivates the analysis of simple strategies, i.e., strategies using no randomization and a limited amount of memory (given as a parameter). While most works study the Pareto front among all strategies, we establish ways to efficiently explore the Pareto front among simple strategies only.
Problem statement. We consider pure (i.e., no randomization) and bounded-memory strategies and study two problems: (a) achievability queries (is it possible to achieve a given value vector?) and (b) approximation of the Pareto front. Considering pure, bounded-memory strategies is natural as randomization can be traded for memory [12]: without randomization, optimal strategies may require arbitrarily large memory (see Ex. 4). We study mixtures of expected (accumulated) reward objectives, covering various studied settings like reachability [20,40], shortest path [39,40,28,9], and total reward objectives [23,24].
Contributions. We first consider the achievability problem for pure stationary (i.e., memoryless) strategies and show that finding optimal strategies for multi-objective MDPs is NP-complete, even for two objectives. This contrasts with the case of general strategies, where the problem is solvable in polynomial time if the number of objectives is fixed [40]. We provide a mixed integer linear program (MILP) encoding. The crux lies in dealing with end components. The MILP is polynomial in the input MDP and the number of objectives.
Inspired by [22], we give an alternative MILP encoding which is better suited for total reward objectives. To approximate the Pareto front under pure stationary strategies, we solve multiple MILP queries. This iteratively divides the solution space into achievable and non-achievable regions. Bounded-memory strategies are treated via a product construction. Our approach works for finite and infinite expected rewards.
Practical evaluation. We successfully compute Pareto fronts for 13 benchmarks using our implementation in Storm, exploiting the MILP solver Gurobi. Despite the hard nature of the problem, our experiments show that Pareto fronts for models with tens of thousands of states can be successfully approximated.
Related work. NP-completeness for discounted rewards under pure strategies was shown in [14]. [19] claims that this generalizes to PCTL objectives, but no proof is given. [42] treats multi-objective bounded MDPs whose transition probabilities are intervals: a set of Pareto-optimal policies is computed using policy iteration, and an efficient heuristic is exploited to compute a set of mutually non-dominated policies that are likely to be Pareto-optimal. Pure stationary Pareto-optimal strategies for discounted rewards are obtained in [44] using value iteration, but the method is restricted to small MDPs where all probabilities are 0 or 1. In [34], Tchebycheff-optimal strategies for discounted rewards are obtained via an LP approach; such strategies minimize the distance to a reference point and are not always pure.

Markov Decision Processes, Strategies, and End Components
Definition 1 (Markov decision process [36]). A Markov decision process (MDP) is a tuple M = ⟨S, Act, P, s_I⟩ with a finite set of states S, initial state s_I ∈ S, finite set of actions Act, and transition function P : S × Act × S → [0, 1] with Σ_{s'∈S} P(s, α, s') ∈ {0, 1} for all s ∈ S and α ∈ Act.
We fix an MDP M = ⟨S, Act, P, s_I⟩. Intuitively, P(s, α, s') is the probability to take a transition from s to s' when choosing action α. An infinite path in M is a sequence π = s_0 α_1 s_1 α_2 ··· ∈ (S × Act)^ω with P(s_i, α_{i+1}, s_{i+1}) > 0 for all i ∈ N. We write π[i] = s_i for the (i+1)-th state visited by π and define the length of π as |π| = ∞. A finite path is a finite prefix π̂ = s_0 α_1 ... α_n s_n of an infinite path π, where last(π̂) = s_n ∈ S, |π̂| = n, and π̂[i] = s_i for i ≤ n. The set of finite (infinite) paths in M is denoted by Paths^M_fin (Paths^M_inf). The enabled actions at a state s ∈ S are given by the set Act(s) = {α ∈ Act | ∃ s' ∈ S : P(s, α, s') > 0}. We assume Act(s) ≠ ∅ for all s. If |Act(s)| = 1 for all s ∈ S, M is called a Markov chain (MC). We write M_s for the MDP obtained by replacing the initial state of M by s ∈ S. For s ∈ S and α ∈ Act, we define the set of successor states succ(s, α) = {s' | P(s, α, s') > 0}. For s' ∈ S, the set of predecessor state-action pairs is given by pre(s') = {⟨s, α⟩ | P(s, α, s') > 0}. For a set E ⊆ S × Act, we define S_E = {s ∈ S | ∃ α : ⟨s, α⟩ ∈ E}, Act_E = {α ∈ Act | ∃ s : ⟨s, α⟩ ∈ E}, and P_E(s, α, s') = [⟨s, α⟩ ∈ E] · [s' ∈ S_E] · P(s, α, s'). We say E is closed for M if ∀ ⟨s, α⟩ ∈ E : α ∈ Act(s) and succ(s, α) ⊆ S_E. The maximal ECs of a Markov chain are also called bottom strongly connected components (BSCCs). A strategy resolves nondeterminism in MDPs:
Definition 4 (Strategy). A (general) strategy for MDP M is a function σ : Paths^M_fin → Dist(Act) with supp(σ(π̂)) ⊆ Act(last(π̂)) for all π̂ ∈ Paths^M_fin.
Let σ be a strategy for M. Intuitively, σ(π̂)(α) is the probability to perform action α after observing history π̂ ∈ Paths^M_fin. A strategy is pure if all histories are mapped to Dirac distributions, i.e., the support is a singleton. A strategy is stationary if its decisions only depend on the current state, i.e., ∀ π̂, π̂' ∈ Paths^M_fin : last(π̂) = last(π̂') implies σ(π̂) = σ(π̂').
We often assume σ : S → Dist(Act) for stationary and σ : S → Act for pure stationary strategies σ. Let Σ^M and Σ^M_PS be the sets of general and pure stationary strategies, respectively. A set of paths Π ⊆ Paths^M_inf is compliant with σ ∈ Σ^M if all π = s_0 α_1 s_1 ··· ∈ Π and all prefixes π̂ of π satisfy σ(π̂)(α_{|π̂|+1}) > 0. The induced Markov chain of M and σ ∈ Σ^M_PS is given by M_σ = ⟨M_{E_σ}, s_I⟩ with E_σ = {⟨s, σ(s)⟩ | s ∈ S}. MDP M and strategy σ ∈ Σ^M induce a probability measure Pr^M_σ on subsets Π ⊆ Paths^M_inf given by a standard cylinder set construction [4,22]. The expected value of a measurable function X : Paths^M_inf → R_∞ under σ is denoted E^M_σ(X). For σ ∈ Σ^M_PS, Pr^M_σ and E^M_σ coincide with the corresponding measures on the MC M_σ.
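For intuition, a pure stationary strategy and its induced Markov chain can be sketched in a few lines of Python. The dict-based encoding of P and the helper name `induced_mc` are our own illustration of Definitions 1 and 4, not the paper's implementation:

```python
from fractions import Fraction

# Toy MDP as nested dicts: P[s][a][t] = transition probability P(s, a, t).
P = {
    "s0": {"alpha": {"s1": Fraction(1, 2), "s2": Fraction(1, 2)},
           "beta":  {"s2": Fraction(1)}},
    "s1": {"alpha": {"s1": Fraction(1)}},
    "s2": {"alpha": {"s2": Fraction(1)}},
}

def induced_mc(P, sigma):
    """Induced Markov chain M_sigma: keep, in each state s, only the
    transitions of the single action sigma(s) chosen by the pure
    stationary strategy sigma: S -> Act."""
    return {s: dict(P[s][sigma[s]]) for s in P}

mc = induced_mc(P, {"s0": "alpha", "s1": "alpha", "s2": "alpha"})
```

Since a pure stationary strategy fixes exactly one action per state, the result has a single (trivial) action everywhere, i.e., it is an MC in the sense of Definition 1.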

Objectives
A reward structure R : S × Act × S → R_≥0 assigns non-negative rewards to transitions. We accumulate rewards on (in)finite paths π = s_0 α_1 s_1 α_2 ... via R(π) = Σ_{i<|π|} R(s_i, α_{i+1}, s_{i+1}). For a set of goal states G ⊆ S, let R♦G(π) = R(π̂), where π̂ is the smallest prefix of π with last(π̂) ∈ G (or π̂ = π if no such prefix exists). Intuitively, R♦G(π) is the reward accumulated on π until a state in G is reached. A (reward) objective has the form E_∼(R♦G) for ∼ ∈ {≥, ≤}. We write M, σ, p |= E_∼(R♦G) iff E^M_σ(R♦G) ∼ p, i.e., for M and σ, the expected accumulated reward until reaching G is at least (or at most) p ∈ R_∞. We call the objective maximizing if ∼ = ≥ and minimizing otherwise. If G = ∅ (i.e., R♦G(π) = R(π) for all paths π), we call the objective a total reward objective. Let the reward structure R_G be given by R_G(s, α, s') = [s' ∈ G]. Then, the paths π ∈ Paths^M_inf with R_G♦G(π) = 1 are exactly the paths that visit a state in G. We use P_∼(♦G) as shorthand for E_∼(R_G♦G) and call such an objective a reachability objective.
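On an induced Markov chain, E(R♦G) is the solution of a linear equation system. Below is a minimal exact-arithmetic sketch (function name and encoding are ours), assuming G is reached almost surely from every state so the expected reward is finite and the system has a unique solution:

```python
from fractions import Fraction

def expected_reward(mc, rew, G):
    """Expected accumulated reward until reaching G in a Markov chain:
    x_s = 0 for s in G, otherwise x_s = sum_t P(s, t) * (R(s, t) + x_t).
    Assumes G is reached almost surely from every state."""
    states = [s for s in mc if s not in G]
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    # Linear system (I - P restricted to S \ G) x = r, solved exactly.
    A = [[Fraction(0)] * n for _ in range(n)]
    b = [Fraction(0)] * n
    for s in states:
        i = idx[s]
        A[i][i] += 1
        for t, p in mc[s].items():
            b[i] += p * rew.get((s, t), Fraction(0))
            if t not in G:
                A[i][idx[t]] -= p
    for col in range(n):  # Gauss-Jordan elimination with fractions
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        d = A[col][col]
        A[col] = [v / d for v in A[col]]
        b[col] /= d
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
                b[r] -= f * b[col]
    res = {g: Fraction(0) for g in G}
    res.update({s: b[idx[s]] for s in states})
    return res

# s0 loops with probability 1/2 (reward 1) and moves to g with 1/2 (reward 1):
# x = 1/2 * (1 + x) + 1/2 * 1, hence x = 2.
mc = {"s0": {"s0": Fraction(1, 2), "g": Fraction(1, 2)}}
rew = {("s0", "s0"): Fraction(1), ("s0", "g"): Fraction(1)}
```

This is exactly the characterization that the MILP encodings of Sect. 3 embed per objective, with the strategy choice left symbolic.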
Each objective ψ_j considers a different reward structure R_j. The MDP M, strategy σ, and point p ∈ (R_∞)^ℓ satisfy a multi-objective query Q = ⟨ψ_1, ..., ψ_ℓ⟩ (written M, σ, p |= Q) iff ∀ j : M, σ, p_j |= ψ_j. Then, we also say σ achieves p and call p achievable. Let Ach^M(Q) (Ach^M_PS(Q)) denote the set of points achieved by a general (pure stationary) strategy. The closure of a set P ⊆ (R_∞)^ℓ with respect to query Q is cl_Q(P) = {p' ∈ (R_∞)^ℓ | ∃ p ∈ P : ∀ j : p_j ∼_j p'_j}. For p, p' ∈ (R_∞)^ℓ, we say that p dominates p' if p' ∈ cl_Q({p}). In this case, M, σ, p |= Q implies M, σ, p' |= Q for any σ ∈ Σ^M. We are interested in the Pareto front, which is the set of non-dominated achievable points.
Figure 1: An MDP and a plot of the pure stationary and general Pareto fronts.
Definition 6 (Pareto front). The (general) Pareto front for M and Q, denoted Pareto^M(Q), is the smallest set P ⊆ (R_∞)^ℓ with cl_Q(P) = Ach^M(Q). In a similar way, we define the pure stationary Pareto front Pareto^M_PS(Q), which only considers points in Ach^M_PS(Q).
Example 1. Let M be the MDP in Fig. 1a and Q = ⟨P_≥(♦G_1), P_≥(♦G_2)⟩. A pure stationary strategy choosing β at s_1 reaches both s_4 ∈ G_1 and s_3 ∈ G_2 with probability 0.7 and thus achieves ⟨0.7, 0.7⟩. Similarly, ⟨0, 1⟩ and ⟨1, 0⟩ are achievable by a pure stationary strategy. Point ⟨1, 0.8⟩ is achievable by a non-stationary pure strategy that chooses α at s_1, γ at the first visit of s_2, and δ in all other cases. Changing this strategy by picking γ only with probability 0.5 achieves ⟨0.5, 0.9⟩.
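The domination and closure notions translate directly into code. A small sketch (helper names are ours, for illustration only), where each objective carries its comparison sense:

```python
def dominates(p, q, senses):
    """p dominates q iff q lies in cl_Q({p}): componentwise p_j >= q_j for
    maximizing objectives ('>=') and p_j <= q_j for minimizing ones ('<=')."""
    return all(pj >= qj if s == '>=' else pj <= qj
               for pj, qj, s in zip(p, q, senses))

def pareto_front(points, senses):
    """Non-dominated points among a finite set of achievable value vectors."""
    return [p for p in points
            if not any(q != p and dominates(q, p, senses) for q in points)]
```

With the three pure stationary points of Example 1 plus an extra dominated point, `pareto_front([(0.7, 0.7), (0, 1), (1, 0), (0.5, 0.5)], ('>=', '>='))` drops only (0.5, 0.5).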

Deciding Achievability
The achievability problem asks whether a given point is achievable.
General Multi-objective Achievability Problem (GMA)
Input: MDP M, multi-objective query Q, point p ∈ (R_∞)^ℓ
Output: Yes iff p ∈ Ach^M(Q)
Pure Stationary Multi-objective Achievability Problem (PSMA)
Input: MDP M, multi-objective query Q, point p ∈ (R_∞)^ℓ
Output: Yes iff p ∈ Ach^M_PS(Q)
For GMA, the point can be achieved by a general strategy that can potentially make use of memory and randomization. As discussed earlier, this class of strategies is not suitable for various applications. In this work, we focus on a variant of the achievability problem that only considers pure stationary strategies. Sect. 5 also addresses pure strategies that can store more information from the history, e.g., whether a goal state set has been reached already.

Complexity Results
GMA is PSPACE-hard (already with only reachability objectives) [40] and solvable in exponential time [20,23]. To the best of our knowledge, a PSPACE upper bound on the complexity of GMA is unknown. This complexity is rooted in the dimension ℓ of the query Q: for fixed ℓ, the algorithms of [20,23] have polynomial runtime. In contrast, PSMA is NP-complete, even if restricted to 2 objectives.
Lemma 1. PSMA with only reachability objectives is NP-hard.
Proof. The result follows by a reduction from the subset sum problem. Given n ∈ N, a ∈ N^n, and z ∈ N, the subset sum problem is to decide the existence of v ∈ {0, 1}^n such that v · a = z. This problem is NP-complete [25]. For a given instance of the subset sum problem, we construct the MDP M = ⟨S, Act, P, s_I⟩ with state space S = {s_I, s_1, ..., s_n, g_1, g_2}, actions Act = {α, Y, N}, and, for all i ∈ {1, ..., n}, P(s_I, α, s_i) = a_i/(1·a) and P(s_i, Y, g_1) = P(s_i, N, g_2) = 1. States g_1 and g_2 are made absorbing, i.e., P(g_1, α, g_1) = P(g_2, α, g_2) = 1.
We claim that the PSMA problem for M, the query Q = ⟨P_≥(♦{g_1}), P_≥(♦{g_2})⟩, and the point ⟨z/(1·a), ((1·a)−z)/(1·a)⟩ answers "yes" iff there is a vector v satisfying the subset sum problem for n, a, and z. Consider the bijection f : Σ^M_PS → {0, 1}^n with f(σ)_i = 1 iff σ(s_i) = Y. Then a pure stationary strategy σ achieves the point iff f(σ) is a solution to the instance of the subset sum problem. Our construction is inspired by similar ideas from [14,40].
Theorem 1. PSMA is NP-complete.
Proof. Containment follows by guessing a pure stationary strategy and evaluating it on the individual objectives. This can be done in polynomial time [4]. Hardness follows by either Lemma 1 or 2.
Proofs of Lemmas 1 and 2 only consider 2-dimensional multi-objective queries. Hence, in contrast to GMA, the hardness of PSMA is not due to the size of the query.
Corollary 1. PSMA with only two objectives is NP-complete.

A Mixed Integer Linear Programming Approach
An MDP M = ⟨S, Act, P, s_I⟩ has exactly |Σ^M_PS| = Π_{s∈S} |Act(s)| many pure stationary strategies. A simple algorithm for PSMA enumerates all σ ∈ Σ^M_PS and checks whether M, σ, p |= Q holds. In practice, however, such a brute-force approach is not feasible. For the MDPs that we consider in our experiments in Sect. 6, the number of pure stationary strategies often exceeds 10^10000. Instead, our approach is to encode an instance for PSMA as an MILP problem.
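For intuition, the count and a naive enumerator can be sketched as follows (illustrative only; this is exactly the approach that does not scale):

```python
from itertools import product
from math import prod

def num_pure_stationary(act):
    """|Sigma_PS| = product over states s of |Act(s)|."""
    return prod(len(actions) for actions in act.values())

def pure_stationary_strategies(act):
    """Enumerate all pure stationary strategies sigma: S -> Act."""
    states = list(act)
    for choice in product(*(act[s] for s in states)):
        yield dict(zip(states, choice))
```

A 3-state MDP with 2, 3, and 1 enabled actions already has 6 strategies; with 10000 states and 10 actions each, the count is 10^10000, which rules out enumeration.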
For an MILP instance as above, each of the n rows of the inequality system Ax ≤ b represents a constraint that is linear over the integer and m real-valued variables given by x. We call the constraints feasible if there is a solution to the inequality system. The task is to decide whether the constraints are feasible and, if so, find a solution that maximizes a linear optimization function c^T x. The optimization function can be omitted if we are only interested in feasibility. MILP is NP-complete [35]. However, tools such as Gurobi [27] and SCIP [26] implement practically efficient algorithms that can solve large instances.
For the rest of this section, let M = ⟨S, Act, P, s_I⟩, Q = ⟨ψ_1, ..., ψ_ℓ⟩ with ψ_j = E_{∼_j}(R_j♦G_j), and p ∈ (R_∞)^ℓ be an instance for PSMA. We provide a translation of the PSMA instance to an instance for MILP that has a feasible solution iff p ∈ Ach^M_PS(Q). The MILP encoding considers integer variables to encode a pure stationary strategy σ ∈ Σ^M_PS. The other variables and constraints encode the expected reward for each objective on the induced MC M_σ.

Unichain MDP and Finite Rewards
Restriction 1 (Unichain MDP). MDP M has exactly one end component.
Restriction 2 (Finite rewards). E^{M_s}_σ(R_j♦G_j) < ∞ for every objective ψ_j, state s, and pure stationary strategy σ.
For simplicity, we first explain our encoding for unichain MDP with finite rewards. Sect. 3.5 lifts Restriction 1 and Sect. 3.6 lifts Restriction 2, with more details given in [17, App. B]. Sect. 3.4 presents an alternative to the encoding of this section, which is smaller but restricted to total reward objectives. Fig. 2 shows the MILP encoding in case Restrictions 1 and 2 hold. For each objective ψ_j, we assume a partition of S into S^j_0, i.e., states whose expected reward for ψ_j is 0 under every strategy (in particular all states in G_j), and S^j_? = S \ S^j_0. These sets can be obtained a priori by analyzing the graph structure of M [4]. Moreover, we consider upper bounds U^j_s ∈ Q for the expected reward at state s ∈ S^j_? such that E^{M_s}_σ(R_j♦G_j) ≤ U^j_s for all σ ∈ Σ^M_PS. We compute such upper bounds using single-objective model checking techniques [4,5]. The MILP encoding applies the characterization of expected rewards for MCs as a linear equation system [4]. In Fig. 2, if ψ_j is maximizing, ± = + and [min] = 0; otherwise, ± = − and [min] = 1.
(Fig. 2, Lines 8 and 9 read x^j_s ≤ Σ_{α∈Act(s)} x^j_{s,α} and ±x^j_{s_I} ∼_j p_j, the latter asserting the value at the initial state.) Lines 1 and 2 encode a strategy σ ∈ Σ^M_PS by considering a binary variable a_{s,α} for each state s and enabled action α such that σ(s)(α) = 1 iff Φ(a_{s,α}) = 1 for a solution Φ. Due to Line 2, exactly one action has to be chosen at each state. Lines 3 to 8 encode for each objective ψ_j the expected rewards obtained for the encoded strategy σ. For every s ∈ S, the variable x^j_s represents a (lower or upper) bound on the expected reward at s. Line 3 sets this value for all s ∈ S^j_0, reflecting the analogous case from Lemma 3. For s ∈ S^j_?, we distinguish maximizing (∼_j = ≥) and minimizing (∼_j = ≤) objectives ψ_j.
For maximizing ψ_j, we have Φ(x^j_s) ≤ E^{M_s}_σ(R_j♦G_j) for every solution Φ. This is achieved by considering a variable x^j_{s,α} for each enabled action α ∈ Act(s). In Line 6, we use the equation system characterization from Lemma 3 to assert that the value of x^j_{s,α} cannot be greater than the expected reward at s, given that the encoded strategy σ selects α. If σ does not select α (i.e., Φ(a_{s,α}) = 0), Line 7 implies Φ(x^j_{s,α}) = 0. Otherwise, this constraint has no effect. Line 8 ensures that every solution satisfies Φ(x^j_s) ≤ Φ(x^j_{s,α}) ≤ E^{M_s}_σ(R_j♦G_j) for the α with Φ(a_{s,α}) = 1. For minimizing ψ_j, we have −Φ(x^j_s) ≥ E^{M_s}_σ(R_j♦G_j) for every solution Φ, i.e., we consider the negated reward values. The encoding is as for maximizing objectives. However, Line 7 yields Φ(x^j_{s,α}) = −U^j_s if α is not selected. Thus, in Line 8 we add U^j_s for each of the (|Act(s)| − 1) non-selected actions. Line 9 and our observations above yield E^M_σ(R_j♦G_j) ≥ Φ(x^j_{s_I}) ≥ p_j for maximizing and E^M_σ(R_j♦G_j) ≤ −Φ(x^j_{s_I}) ≤ p_j for minimizing objectives. Therefore, p is achievable if a solution Φ exists. Conversely, if p is achievable by some σ ∈ Σ^M_PS, a solution Φ with Φ(a_{s,α}) = σ(s)(α) exists.
Theorem 2. For unichain M and finite rewards, the constraints in Fig. 2 are feasible iff p ∈ Ach^M_PS(Q).

Alternative Encoding for Total Rewards
We now consider PSMA instances where all objectives ψ_j = E_{∼_j}(R_j♦G_j) are expected total reward objectives, i.e., G_j = ∅. For such instances, we can employ an encoding from [23] (restated in Lemma 4) for GMA. In fact, we can often translate reachability reward objectives to total reward objectives, e.g., if the set of goal states cannot be left or if all objectives consider the same goal states.
In [23], the lemma is applied to decide achievability of multiple total reward objectives under strategies that are stationary, but not necessarily pure. Intuitively, E^M_{σ_Φ}(R_{s,α}♦S_0) coincides with the expected number of times action α is taken at state s until S_0 is reached. Since this value can be infinite if Pr^M_{σ_Φ}(♦S_0) < 1, a solution Φ can only exist if it induces a strategy that almost surely reaches S_0.
The encoding for unichain MDP with finite rewards and total reward objectives is shown in Fig. 3, where S_0 = ∩_j S^j_0 and S_? = S \ S_0. We consider the constraints in conjunction with Lines 1 and 2 from Fig. 2. Let Φ be a solution and let σ be the strategy encoded by such a solution, i.e., σ(s)(α) = Φ(a_{s,α}).
Lines 10 to 12 reflect the equations of Lemma 4. Since M is unichain and we assume finite rewards, there is just one end component, in which no reward can be collected. Hence, S_0 is almost surely reached. Line 10, i.e., ∀ s ∈ S_?, α ∈ Act(s) : y_{s,α} ∈ [0, V_s · a_{s,α}], ensures that y_{s,α} = 0 whenever the encoded strategy does not select α at s, where V_s is an upper bound on the expected number of visits of s. With Lemma 4, we get that Φ(y_{s,σ(s)}) is the expected number of times state s is visited under strategy σ. Therefore, in Line 13 we sum up for each state s ∈ S_? the expected amount of reward collected at s. This yields Φ(x^j_{s_I}) = E^M_σ(R_j♦G_j). Finally, Line 14 asserts that the resulting values meet the thresholds given by p.
Theorem 3. For unichain M, finite rewards, and total reward objectives, the constraints in Fig. 3 and Lines 1 and 2 of Fig. 2 are feasible iff p ∈ Ach M PS (Q).
The encoding for total reward objectives considers fewer variables compared to the encoding of Sect. 3.3 (cf. Proposition 1). In practice, this often leads to faster solving times as we will see in Sect. 6.
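The semantics captured by the y_{s,α} variables can be illustrated on an induced (here acyclic) Markov chain. A sketch with our own helper names, computing expected visit counts by a topological pass and recovering the expected total reward from them:

```python
from fractions import Fraction

def visit_counts_acyclic(mc, order, s_init):
    """Expected number of visits y_s in an acyclic Markov chain, states given
    in topological order: y_s = [s = s_init] + sum_t P(t, s) * y_t
    (the flow characterization underlying Lemma 4)."""
    y = {s: Fraction(0) for s in order}
    y[s_init] = Fraction(1)
    for s in order:
        for t, p in mc.get(s, {}).items():
            y[t] += p * y[s]
    return y

def total_reward(mc, rew, y):
    """Expected total reward as the sum over transitions of y_s * P(s,t) * R(s,t)."""
    return sum(y[s] * p * rew.get((s, t), 0)
               for s in mc for t, p in mc[s].items())

# s0 -> g (prob 1/2, reward 2), s0 -> s1 (prob 1/2), s1 -> g (prob 1, reward 4):
mc = {"s0": {"g": Fraction(1, 2), "s1": Fraction(1, 2)},
      "s1": {"g": Fraction(1)}}
rew = {("s0", "g"): Fraction(2), ("s1", "g"): Fraction(4)}
y = visit_counts_acyclic(mc, ["s0", "s1", "g"], "s0")
```

Here y["s1"] = 1/2 and the recovered total reward is 1/2·2 + 1/2·4 = 3, matching the direct computation; in the MILP, the same flow equations hold with the action choice left to the a_{s,α} variables.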

Extension to Multichain MDP
We now lift the restriction to unichain MDP, i.e., we consider multichain MDP with finite rewards. We focus on the encoding of Sect. 3.3. For multichain MDP, it can be the case that for some strategy σ the set S^j_0 is not reached with probability 1, i.e., there is a positive probability to stay in the set S^j_? forever. For the induced Markov chain M_σ, this means that there is a reachable BSCC consisting only of states in S^j_?. Since BSCCs of M_σ coincide with end components of M, we need to inspect the ECs of M that only consist of states in S^j_?. To cope with multiple ECs, we consider the constraints from Fig. 2 in conjunction with the constraints from Fig. 4. Let Φ be a solution to these constraints and let σ be the encoded strategy with σ(s)(α) = Φ(a_{s,α}). For each objective ψ_j and state s, a binary variable e^j_s is set to 1 if s lies in a BSCC of the induced MC M_σ. We only need to consider states s ∈ S_E for E ∈ MECs(M_{E^j_?}). Line 15 ensures that the value of x^j_s is set to 0 if s lies in a BSCC of M_σ. Lines 16 to 18 introduce binary variables e^j_{s,α} for each state-action pair in the EC such that any solution Φ satisfies Φ(e^j_{s,α}) = 1 iff Φ(e^j_s) = Φ(a_{s,α}) = 1. Line 17 yields that Φ(e^j_{s,α}) = 1 implies Φ(e^j_{s'}) = 1 for all successors s' of s and the selected action α. Hence, for all s with Φ(e^j_s) = 1 and for all s' reachable from s in M_σ, we have Φ(e^j_{s'}) = 1 and ⟨s', σ(s')⟩ ∈ E. Therefore, we can only set e^j_s to 1 if there is a BSCC E' ⊆ E that either contains s or that is almost surely reached from s without leaving E. As finite rewards are assumed, E cannot contain a transition with positive reward, yielding E^{M_s}_σ(R_j♦G_j) = 0 if Φ(e^j_s) = 1. An assignment that sets all variables e^j_s and e^j_{s,α} to 0 trivially satisfies the constraints in Lines 15 to 18. In Lines 19 to 22, we therefore ensure that if a BSCC E' ⊆ E exists in M_σ, Φ(e^j_s) = 1 holds for at least one s ∈ S_{E'}.
The idea is based on the observation that if a BSCC E' ⊆ E exists, there is a state s ∈ S_{E'} that does not reach the set S \ S_{E'} almost surely. We consider the MDP M'_E, a mild extension of M_E with a fresh initial state s^E_I, a fresh absorbing state s^E_⊥, and fresh actions α_I and ⊥, where P'_E extends P_E such that P'_E(s^E_⊥, ⊥, s^E_⊥) = 1 and for all s ∈ S_E:
- P'_E(s^E_I, α_I, s) = 1/|S_E|, P'_E(s, ⊥, s^E_⊥) = 1, and
- ∀ α ∈ {α ∈ Act(s) | ⟨s, α⟩ ∉ E} : P'_E(s, α, s^E_⊥) = 1.
By Lines 19 to 22 (which apply the characterization of Lemma 4 to M'_E), the BSCC has to contain at least one state s with Φ(e^j_s) = 1. In summary, Lines 19 to 22 imply that every BSCC E' ⊆ E of M_σ contains at least one state s with Φ(e^j_s) = 1. Then, with Lines 16 to 18, we get that Φ(e^j_{s'}) = 1 has to hold for all s' ∈ S_{E'}. In M_σ, the set S^j_0 ∪ {s | Φ(e^j_s) = 1} is therefore reached almost surely, and all the states in this set get assigned value 0. In this case, the solution of the equation system from Lemma 3 becomes unique again.

Extension to Infinite Rewards
Our approach can be modified to allow PSMA instances where infinite expected reward can be collected, i.e., where Restriction 2 does not hold. Infinite reward can be collected if we cycle through an EC of M that contains a transition with positive reward. Such instances are of practical interest as this often corresponds to strategies that do not accomplish a certain goal (e.g., a robot that stands still and therefore requires infinite time to finish its task). We sketch the necessary modifications; more details are in [17, App. B]. Let S_∞ be the set of states where every pure strategy induces infinite reward for at least one minimizing objective. To ensure that the MILP instance has a (real-valued) solution, we consider the sub-MDP of M obtained by removing S_∞.
If infinite reward can be collected in an EC, it should not be considered in Fig. 4. We therefore let E range over maximal ECs that only consist of (a) states in S^j_? and (b) transitions with reward 0. The upper bounds U^j_s for the maximal expected rewards at each state cannot be set to ∞, but for the encoding it suffices to compute values that are sufficiently large. We remark, however, that in practice our approach from [17, App. B] can lead to very large values, yielding numerical instabilities.
For maximizing objectives, we introduce one additional objective which, in a nutshell, checks whether the probability to reach a 0-reward BSCC is below 1. If this is the case, there is a positive probability to reach a BSCC in which infinite reward can be collected.

Computing the Pareto Front
Our next goal is to compute the pure stationary Pareto front Pareto^M_PS(Q) for MDP M and multi-objective query Q. This set can be very large, in particular if the objectives are strongly conflicting with many different trade-offs. In the worst case, every pure stationary strategy induces a point p ∈ Pareto^M_PS(Q) (e.g., for Q = ⟨E_≤(R♦G), E_≥(R♦G)⟩). We therefore aim for an ε-approximation of Pareto^M_PS(Q).
For simplicity, we only consider inputs that satisfy Restriction 2, i.e., all expected rewards are finite. Ideas of Sect. 3.6 can be used for some other inputs. An all-embracing treatment of infinite rewards, in particular for maximizing ψ_j, is subject to future work.
Our approach for PSP_≈ successively divides the solution space into candidate regions. For each region R (initially, R = [0, U_1] × ··· × [0, U_ℓ]), we use the MILP encoding from Sect. 3 with an optimization function to find a point p ∈ R ∩ Pareto^M_PS(Q) (or find out that no such point exists). The region R is divided into (i) an achievable region R_A ⊆ Ach^M_PS(Q), (ii) an unachievable region R_U ⊆ R \ Ach^M_PS(Q), (iii) further candidate regions R_1, ..., R_n that are analyzed subsequently, and (iv) the remaining area R \ (R_A ∪ R_U ∪ R_1 ∪ ··· ∪ R_n), which does not require further analysis as we are only interested in an ε-approximation. The procedure stops as soon as no more candidate regions are found.
Example 3. Fig. 6 sketches the approach for an MDP M and a query Q with two maximizing objectives. We maintain a set of achievable points (light green) and a set of unachievable points (red). Initially, our candidate region corresponds to R_1 = [0, U_1] × [0, U_2], given by the white area in Fig. 6a. We consider the direction vector w_1 which is orthogonal to the line connecting ⟨U_1, 0⟩ and ⟨0, U_2⟩. To find some point p ∈ Pareto^M_PS(Q) ∩ R_1, we solve the MILP resulting from the constraints as in Sect. 3, the constraint ⟨x^1_{s_I}, x^2_{s_I}⟩ ∈ R_1, and the optimization function w_1 · ⟨x^1_{s_I}, x^2_{s_I}⟩. Fig. 6b shows the obtained point p_1 ∈ R_1. Since p_1 is achievable, we know that any point in cl_Q({p_1}) has to be achievable as well. Moreover, the set {p ∈ R_1 | w_1 · p > w_1 · p_1}, indicated by the area above the diagonal line in Fig. 6b, cannot contain an achievable point. The gray areas do not have to be checked in order to obtain an ε-approximation. We continue with R_2, indicated by the white area, and the direction vector w_2, orthogonal to the line connecting ⟨0, U_2⟩ and p_1. As before, we solve an MILP, now yielding the point p_2 in Fig. 6c. We find achievable points cl_Q({p_2}) but no further unachievable points. The next iteration considers candidate region R_3 and direction vector w_1, yielding point p_3 shown in Fig. 6d. The trapezoidal area is added to the unachievable points, whereas cl_Q({p_3}) is achievable. Finally, we check R_4, for which the corresponding MILP instance is infeasible, i.e., R_4 is unachievable.
Figure 6: Example exploration of achievable points.
The ideas sketched above can be lifted to ℓ > 2 objectives. Inspired by [24, Alg. 4], we choose direction vectors that are orthogonal to the convex hull of the achievable points found so far. In fact, for total reward objectives we can apply the approach of [24] to compute the points in Pareto^M_PS(Q) ∩ Pareto^M(Q) first and only perform MILP-solving for the remaining regions. As the distance between two found points p, p' is at least ε, our approach terminates after finding at most Π_j (U_j/ε_j) points. Other strategies for choosing direction vectors are possible and can strongly impact performance.
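One refinement step of this loop can be sketched with a finite point set standing in for the MILP oracle (all names are ours; the real implementation solves the MILP of Sect. 3 instead):

```python
def best_in_region(points, w, region):
    """Return an achievable point inside the box `region` maximizing w . p,
    or None if the region contains no achievable point (MILP infeasible)."""
    inside = [p for p in points
              if all(lo <= x <= hi for x, (lo, hi) in zip(p, region))]
    return max(inside, default=None,
               key=lambda p: sum(wi * xi for wi, xi in zip(w, p)))

def classify(q, p, w):
    """After optimizing direction w yields p (maximizing objectives):
    points dominated by p are achievable; points strictly above the
    hyperplane w . x = w . p are unachievable; the rest stays open."""
    if all(qi <= pi for qi, pi in zip(q, p)):
        return "achievable"
    if sum(wi * qi for wi, qi in zip(w, q)) > sum(wi * pi for wi, pi in zip(w, p)):
        return "unachievable"
    return "candidate"
```

With the achievable points of Example 1 and w = (1, 1), the step finds (0.7, 0.7); afterwards (0.5, 0.5) is classified achievable and (0.9, 0.9) unachievable, while a point such as (0.9, 0.2) stays in a candidate region.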

Bounded Memory
For GMA, it is necessary and sufficient to consider strategies that require memory exponential in the number of objectives [20,24,40] by storing which goal state set has been reached already. In contrast, restricting to pure (but not necessarily stationary) strategies imposes nontrivial memory requirements that do not only depend on the number of objectives, but also on the point that is to be achieved.
Example 4. Let M be the MDP in Fig. 5b and Q = ⟨P_≥(♦G_1), P_≥(♦G_2)⟩. The point p_k = ⟨0.5^k, 1−0.5^k⟩ for k ∈ N is achievable by taking α with probability 0.5^k. p_k is also achievable with the pure strategy σ_k where σ_k(π̂) = α iff |π̂| ≥ k. σ_k uses k memory states. Pure strategies with fewer memory states do not suffice.
We search for pure strategies with bounded memory. For an MDP M and K > 0, let Σ^M_{P,K} denote the set of pure K-memory strategies, i.e., any σ ∈ Σ^M_{P,K} can be represented by a Mealy machine using up to K states (cf. [17, App. D]). For a query Q, let Ach^M_{P,K}(Q) be the set of points achievable by some σ ∈ Σ^M_{P,K}.
Pure Bounded Multi-objective Achievability Problem (PBMA)
Input: MDP M, multi-objective query Q, memory bound K, point p ∈ (R_∞)^ℓ
Output: Yes iff p ∈ Ach^M_{P,K}(Q)
The pure bounded Pareto approximation problem is defined similarly. We reduce a PBMA instance to an instance for PSMA. The idea is to incorporate a memory structure of size K into M and then construct a pure stationary strategy in this product MDP (see, e.g., [29] for a similar construction). The set of strategies can be further refined by considering, e.g., a memory structure that only allows counting or that only remembers visits of goal states. See [17, App. D] for details.
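The reduction can be sketched as a product construction (a simplification with our own naming; here the deterministic memory update observes the target state of a transition):

```python
def product_mdp(P, K, update, s_init, m_init=0):
    """Build the product of MDP P[s][a][t] with a deterministic memory
    structure update(m, s, a, t) over memory states {0, ..., K-1}.
    Pure stationary strategies of the product correspond to pure
    K-memory strategies of the original MDP."""
    prod = {(s, m): {a: {(t, update(m, s, a, t)): p
                         for t, p in P[s][a].items()}
                     for a in P[s]}
            for s in P for m in range(K)}
    return prod, (s_init, m_init)

# A step-counting memory saturating at K-1 = 2, in the spirit of the
# strategies sigma_k of Example 4:
count_steps = lambda m, s, a, t: min(m + 1, 2)
P = {"s": {"alpha": {"s": 1.0}, "beta": {"s": 1.0}}}
prod, init = product_mdp(P, 3, count_steps, "s")
```

A stationary choice in the product may pick different actions at (s, 0) and (s, 2), which is exactly the extra power a K-memory strategy has over a stationary one; memory structures that only count or only track visited goal sets arise by choosing `update` accordingly.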

Evaluation
We implemented our approach for PSP_≈ in the model checker Storm [16], using Gurobi [27] as back end for MILP-solving. The implementation takes an MDP (e.g., in Prism syntax), a multi-objective query, and a precision ε > 0 as input and computes an ε-approximation of the Pareto front. Here, we set ε_j = ε · δ_j, where δ_j is the difference between the maximal and minimal achievable value for objective ψ_j. We also support reward objectives for Markov automata via [38]. The computations within Gurobi might suffer from numerical instabilities. To diminish their impact, we use the exact engine of Storm to confirm for each MILP solution that the encoded strategy achieves the encoded point. However, sub-optimal solutions returned by Gurobi may still yield inaccurate results. We evaluate our approach on 13 multi-objective benchmarks from [24,28,38], each considering one or two parameter instantiations. Application areas range over scheduling (dpm [37], eajs [1], jobs [10], polling [43]), planning (rg [6], rover [28], serv [32], uav [21]), and protocols (mutex [38], str [38], team [15], wlan [31]).
The results for pure stationary strategies are summarized in Table 1. For each benchmark, we denote the number of objectives and whether the alternative encoding from Sect. 3.4 has been applied (*). For each parameter instantiation (Par.), the number of states (|S|), the percentage of states contained in an end component (%E), and the average number of available actions at each state (Act) are given. For each precision ε ∈ {0.01, 0.001}, we then depict the runtime of Storm and the number of points on the computed approximation of the Pareto front. TO denotes that the approach did not terminate within 2 hours; MO denotes insufficient memory (16 GB). All experiments used 8 cores of an Intel® Xeon® Platinum 8160 processor.
Storm is often able to compute pure stationary Pareto fronts, even for models with over 100,000 states (e.g., uav). However, the model structure strongly affects the performance. For example, the second instance of jobs is challenging although it only considers 117 states, a low degree of nondeterminism, and no (non-trivial) end components. Small increments in the model size can increase runtimes significantly (e.g., dpm or uav). If a higher precision is requested, many more points need to be found, which often leads to timeouts. Similarly, for more than 2 objectives, the desired accuracy can often not be achieved within the time limit. The approach can be stopped at any time to report on the current approximation, e.g., after 2 hours Storm found 65 points for Instance 1 of mutex.
For almost all benchmarks, the objectives could be transformed to total reward objectives, making the more efficient encoding from Sect. 3.4 applicable. We plot the runtimes of the two encodings in Fig. 7a. The alternative encoding is superior for almost every benchmark. In fact, the original encoding timed out for many models, as indicated by the horizontal line at the top of the figure.
In Fig. 7b we plot the Pareto front for the first polling instance under general strategies (Gen), pure 2-memory strategies that can change the memory state exactly once (PM 2 ), pure strategies that observe which goal state set G j has been visited already (PM G ), and pure stationary strategies (PS). Adding simple memory structures already leads to noticeable improvements in the quality of strategies. In particular, PM 2 strategies perform quite well, and even outperform PM G strategies (which would be optimal if randomization were allowed).
Data availability. The artifact [18] accompanying this paper contains source code, benchmark files, and replication scripts for our experiments.