Counterexample Explanation by Learning Small Strategies in Markov Decision Processes
Abstract
For deterministic systems, a counterexample to a property can simply be an error trace, whereas counterexamples in probabilistic systems are necessarily more complex. For instance, a set of erroneous traces with a sufficient cumulative probability mass can be used. Since these are too large objects to understand and manipulate, compact representations such as subchains have been considered. In the case of probabilistic systems with nondeterminism, the situation is even more complex. While a subchain for a given strategy (or scheduler, resolving nondeterminism) is a straightforward choice, we take a different approach. Instead, we focus on the strategy itself, and extract the most important decisions it makes, and present its succinct representation.
Keywords
Decision Tree Markov Decision Process Training Sequence Compact Representation Binary Decision Diagram1 Introduction
The standard models for dynamic stochastic systems with both probabilistic and nondeterministic behaviour are Markov decision processes (MDPs) [1, 2, 3]. They are widely used in verification of probabilistic systems [4, 5] in several ways. Firstly, in concurrent probabilistic systems, such as communication protocols, the nondeterminism arises from scheduling [6, 7]. Secondly, in probabilistic systems operating in open environments, such as various stochastic reactive systems, nondeterminism arises from environmental inputs [8, 9]. Thirdly, for underspecified probabilistic systems, a controller is synthesized, resolving the nondeterminism in a way that optimizes some objective, such as energy consumption or time constraints in embedded systems [4, 5].
In analysis of MDPs, the behaviour under all possible strategies (schedulers, controllers, policies) is examined. For example, in the first two cases, the result of the verification process is either a guarantee that a given property holds under all strategies, or a counterexample strategy. In the third case, either a witness strategy guaranteeing a given property is synthesized, or its nonexistence is stated. In all settings, it is desirable that the output strategies should be “small and understandable” apart from correct. Intuitively, it is a strategy with a representation small enough for the human debugger to read and understand where the bug is (in the verification setting), or for the programmer to implement in the device (in the synthesis setting). In this paper, we focus on the verification setting and illustrate our approach mainly on probabilistic protocols. Nonetheless, our results immediately carry over to the synthesis setting.
Obtaining a small and simple strategy may be impossible if the strategy is required to be optimal, i.e., in our setting reaching the error state with the highest possible probability. Therefore, there is a tradeoff between simplicity and optimality of the strategies. However, in order to debug a system, a simple counterexample or a series thereof is more valuable than the most comprehensive, but incomprehensible counterexample. In practice, a simple strategy reaching the error with probability smaller by a factor of \(\varepsilon \), e.g. one per cent, is a more valuable source of information than a huge description of an optimal strategy. Similarly, controllers in embedded devices should strive for optimality, but only as long as they are small enough to fit in the device. In summary, we are interested in finding small and simple closetooptimal (\(\varepsilon \)optimal) strategies.
How can one obtain a small and simple strategy? This seems to require some understanding of the particular system and the bug. How can we do something like that automatically? The approaches have so far been limited to BDD representations of the strategy, or generating subchains representing a subset of paths induced by the strategy. While BDDs provide a succinct representation, they are not well readable and understandable. Further, subchains do not focus on the decisions the strategy makes at all. In contrast, a huge effort has been spent on methods to obtain “understanding” from large sets of data, using machine learning methods. In this paper, we propose to extend their use in verification, namely of reachability properties in MDPs, in several ways. Our first aim of using these methods is to efficiently exploit the structure that is present in the models, written in e.g. PRISM language with variables and commands. This structure gets lost in the traditional numerical analysis of the MDPs generated from the PRISM language description. The second aim is to distil more information from the generated MDPs, namely the importance of each decision. Both lead to an improved understanding of the strategy’s decisions.
Our Approach. We propose three steps to obtain the desired strategies. Each of them has a positive effect on the resulting size.
(1) Obtaining a (Possibly Partially Defined and Liberal) \(\varepsilon \) optimal Strategy. The \(\varepsilon \)optimal strategies produced by standard methods, such as value iteration of PRISM [10], may be too large to compute and overly specific. Firstly, as argued in [11], typically only a small fraction of the system needs to be explored in order to find an \(\varepsilon \)optimal strategy, whereas most states are reached with only a very small probability. Without much loss, the strategy may not be defined there. For example, in the MDP M depicted in Fig. 1, the decision in q (and \(v_i\)’s) is almost irrelevant for the overall probability of reaching t from s. Such a partially defined strategy can be obtained using learning methods [11].
(2) Identifying Important Parts of the Strategy. We define a concept of importance of a state w.r.t. a strategy, corresponding to the probability of visiting the state by the strategy. Observe that only a fraction of states can be reached while following the strategy, and thus have positive importance. On the unreachable states, with zero importance, the definition of the strategy is useless. For instance, in M, both states p and q must have been explored when constructing the strategy in order to find out whether it is better to take action a or b. However, if the resulting strategy is to use b and d, the information what to do in \(u_i\)’s is useless. In addition, we consider \(v_i\)’s to be of zero importance, too, since they are never reached on the way to target.
Furthermore, apart from ignoring states with zero importance, we want to partially ignore decisions that are unlikely to be made (in less important states such as q), and in contrast, stress more the decisions in important states likely to be visited (such as s). Note that this is difficult to achieve in data structures that remember all the stored data exactly, such as BDDs. Of course, we can store decisions in states with importance above a certain threshold. However, we obtain much smaller representations if we allow more variability and reflect the whole quantitative information, as shown in Step 3.
(3) Data Structures for Compact Representation of Strategies. The explicit representation of a strategy by a table of pairs (state, action to play) results in a huge amount of data since the systems often have millions of states. Therefore, a symbolic representation by binary decision diagrams (BDD) looks as a reasonable option. However, there are several drawbacks of using BDDs. Firstly, due to the bitlevel representation of the stateaction pairs, the resulting BDD is not very readable. Secondly, it is often still too large to be understood by human, for instance due to a bad ordering of the variables. Thirdly, it cannot quantitatively reflect the differences in the importance of states.
Therefore, we propose to use decision trees instead, e.g. [15], a structure similar to BDDs, but with nodes labelled by various predicates over the system’s variables. They have several advantages. Firstly, they yield an explanation of the decision, as opposed to e.g. neural networks, and thus provide an explanation how the strategy works. Secondly, sophisticated algorithms for their construction, based on entropy, result in smaller representation than BDD, where a good ordering of variables is known to be notoriously difficult to find [4]. Thirdly, as suggested in Step 2, they allow for less probable remembering of less stressed data if this sufficiently simplifies the tree and decreases its size. Finally, the major drawback of decision trees in machine learning—frequent overfitting of the training data—is not an issue in our setting since the tree is not used for classification of test data, but only of the training data.

We provide a method for obtaining succinct representation of \(\varepsilon \)optimal strategies as decision trees. The method is based on a new concept of importance measure and on wellestablished machine learning techniques.

Experimental data show that even for some systems larger than the available memory, our method yields trees with only several dozens of nodes.

We illustrate the understandability of the representation on several examples from PRISM benchmarks [16], reading off respective bug explanations.
Decision trees have been used in connection with realtime dynamic programming and reinforcement learning [28, 29]. Learning a compact decision tree representation of a policy has been investigated in [30] for the case of body sensor networks, but the paper aims at a completely different application field (a simple model of sensor networks as opposed to generic PRISM models), uses different objectives (discounted rewards), and does not consider the importance of a state that, as we show, may substantially decrease sizes of resulting policies.
Our results are related to the problem of computing minimal/small counterexamples in probabilistic verification. Most papers concentrate on solving this problem for Markov chains and lineartime properties [31, 32, 33, 34], branchingtime properties [35, 36, 37], and in the context of simulation [38]. A couple of tools have been developed for probabilistic counterexample generation, namely DiPro [39] and COMICS [40]. For a detailed survey see [41]. While previous approaches focus on presenting diagnostic paths forming the counterexample, our approach focuses on decisions made by the respective strategy.
Concerning MDPs, [33] uses mixed integer linear programming to compute minimal critical subsystems, i.e. whole subMDPs as opposed to a compact representation of “right" decisions computed by our methods. [42] uses a directed onthefly search to compute sets of most probable diagnostic paths (which somehow resembles our notion of importance), but the paths are encoded explicitly by AND/OR trees as opposed to our use of decision trees. Neither of these papers takes advantage of an internal structure of states and their methods substantially differ from ours. The notion of paths encoded as AND/OR trees has also been studied in [43] to represent probabilistic counterexamples visually as fault trees, and then derive causal (the cause and effect) relationship between events. [44] develops abstractionbased framework for modelchecking MDPs based on games, which allows to trade compactness for precision, but does not give a procedure for constructing (a compact representation of) counterexample strategies. [45, 46] computes a smallest set of guarded commands (of a PRISMlike language) that induce a critical subsystem, but, unlike our methods, does not provide a compact representation of actual decisions needed to reach an erroneous state; moreover, there is not always a command based counterexample.
Counterexamples play a crucial role in CEGAR analysis of MDPs, and have been widely studied, such as, gamebased abstraction refinement [47]; noncompositional CEGAR approach for reachability [48] and safepCTL [49]; compositional CEGAR approach for safepCTL and qualitative logics [38, 50]; and abstractionrefinement for quantitative properties [51, 52]. All of these works only consider a single strategy represented explicitly, whereas our approach considers a succinct representation of a set of strategies, and can accelerate the abstractionrefinement loop.
2 Preliminaries
We use \(\mathbb {N}\), \(\mathbb {Q}\), and \(\mathbb {R}\) to denote the sets of positive integers, rational and real numbers, respectively. The set of all rational probability distributions over a finite set X is denoted by \( Dist (X)\). Further, \(d\in Dist (X)\) is Dirac if \(d(x)=1\) for some \(x\in X\). Given a function \(f:X\rightarrow \mathbb {R}\), we write \({{\mathrm{arg\,max}}}_{x\in X} f(x)=\{x\in X\mid f(x)=\max _{x'\in X} f(x')\}\).
Markov Chains. A Markov chain is a tuple \(M = (L,P,\mu )\) where L is a finite set of locations, \(P:L\rightarrow Dist (L)\) is a probabilistic transition function, and \(\mu \in Dist (L)\) is the initial probability distribution. We denote the respective unique probability measure for M by \(\mathbb P\).
Markov Decision Processes. A Markov decision process (MDP) is a tuple \(G=(S,A, Act ,\delta ,\hat{s})\) where S is a finite set of states, A is a finite set of actions, \( Act : S\rightarrow 2^A\setminus \{\emptyset \}\) assigns to each state s the set \( Act (s)\) of actions enabled in s, \(\delta : S\times A\rightarrow Dist (S)\) is a probabilistic transition function that, given a state and an action, gives a probability distribution over the successor states, and \(\hat{s}\) is the initial state. A run in G is an infinite alternating sequence of states and actions \(\omega =s_1 a_1 s_2 a_2\cdots \) such that for all \(i \ge 1\), we have \(a_i\in Act (s_i)\) and \(\delta (s_i,a_i)(s_{i+1}) > 0\). A path of length k in G is a finite prefix \(w= s_1 a_1\cdots a_{k1} s_k\) of a run in G.
Strategies and Plays. Intuitively, a strategy (or a policy) in an MDP G is a “recipe” to choose actions. Formally, a strategy is a function \(\sigma :S\rightarrow Dist (A)\) that given the current state of a play gives a probability distribution over the enabled actions.^{1} In general, a strategy may randomize, i.e. return nonDirac distributions. A strategy is deterministic if it gives a Dirac distribution for every argument.
Liberal Strategies. A liberal strategy is a function \(\varsigma :S\rightarrow 2^A\) such that for every \(s\in S\) we have that \(\emptyset \not = \varsigma (s)\subseteq Act (s)\). Given a liberal strategy \(\varsigma \) and a state s, an action \(a\in Act (s)\) is good (in s w.r.t. \(\varsigma \)) if \(a\in \varsigma (s)\), and bad otherwise. Abusing notation, we denote by \(\varsigma \) the strategy that to every state s assigns the uniform distribution on \(\varsigma (s)\) (which, in particular, allows us to use \(G^{\varsigma }_s\), \(\mathbb P^{\varsigma }_s\) and apply the notion of \(\varepsilon \)optimality to \(\varsigma \)).
Reachability Objectives. Given a set \(F\subseteq S\) of target states, we denote by \(\Diamond {F}\) the set of all runs that visit a state of F. For a state \(s\in S\), the maximal reachability probability (or simply value) in s, is \( Val (s):=\max _{\sigma } \mathbb P^{\sigma }_s[\Diamond {F}]\). Given \(\epsilon \ge 0\), we say that a strategy \(\sigma \) is \(\varepsilon \) optimal if \(\mathbb P^{\sigma }[\Diamond {F}] \ge Val (\hat{s})\varepsilon \), and we call a 0optimal strategy optimal.^{2} To avoid overly technical notation, we assume that states of F, subject to the reachability objective, are absorbing, i.e. for all \(s\in F,a\in Act (s)\) we have \(\delta (s,a)(s)=1\).
End Components. A nonempty set \(S'\subseteq S\) is an end component (EC) of G if there is \( Act ':S'\rightarrow 2^A\setminus \{\emptyset \}\) such that (1) for all \(s\in S'\) we have \( Act '(s)\subseteq Act (s)\), (2) for all \(s\in S'\), we have \(a\in Act '(s)\) iff \(\delta (s,a)\in Dist (S')\), and (3) for all \(s,t\in S'\) there is a path \(\omega = s_1 a_1\cdots a_{k1} s_k\) such that \(s_1 = s\), \(s_k=t\), and \(s_i\in S',a_i\in Act '(s_i)\) for every i. An end component is a maximal end component (MEC) if it is maximal with respect to the subset ordering. Given an MDP, the set of MECs is denoted by \(\mathsf {MEC}\). Given a MEC, actions of \( Act '(s)\) and \( Act (s)\setminus Act '(s)\) are called internal and external (in state s), respectively.
3 Computing \(\varepsilon \)Optimal Strategies
There are many algorithms for solving quantitative reachability in MDPs, such as the value iteration, the strategy improvement, linear programming based methods etc., see [2]. The main method implemented in PRISM is the value iteration, which successively (under)approximates the value \( Val (s,a)=\sum _{s'\in A} \delta (s,a)(s')\cdot Val (s')\) of every stateaction pair (s, a) by a value V(s, a), and stops when the approximation is good enough. Denoting by \(V(s):=\max _{a\in Act (s)} V(s,a)\), every step of the value iteration improves the approximation V(s, a) by assigning \(V(s,a):=\sum _{s'\in S} \delta (s,a)(s')\cdot V(s')\) (we start with V such that \(V(s)=1\) if \(s\in F\), and \(V(s)=0\) otherwise).
The disadvantage of the standard value iteration (and also most of the above mentioned traditional methods) is that it works with the whole state space of the MDP (or at least with its reachable part). For instance, consider states \(u_i,v_i\) of Fig. 1. The paper [11] adapts methods of bounded realtime dynamic programming (BRTDP, see e.g. [53]) to speed up the computation of the value iteration by improving V(s, a)^{3} only on “important” stateaction pairs identified by simulations.
Even though RTDP methods may substantially reduce the size of an \(\varepsilon \)optimal strategy, its explicit representation is usually large and difficult to understand. Thus we develop succinct representations of strategies, based on decision trees, that will reduce the size even further and also provide a human readable representation. Even though the above methods are capable of yielding deterministic \(\varepsilon \)optimal strategies, that can be immediately fed into machine learning algorithms, we found it advantageous to give the learning algorithm more freedom in the sense that if there are more \(\varepsilon \)optimal strategies, we let the algorithm choose (uniformly). This is especially useful within MECs where many actions have the same value. Therefore, we extract liberal \(\varepsilon \)optimal strategies from the value approximation V, output either by the value iteration or BRTDP.
Computing Liberal \(\varepsilon \) Optimal Strategies. Let us show how to obtain a liberal strategy \(\varsigma \) from the value iteration, or BRTDP. For simplicity, we start with MDP without MECs.
 1.
\(V(s,a)\le Val (s,a)\) for all \(s\in S\) and \(a\in A\)
 2.
\( Val (\hat{s})V(\hat{s})\le \varepsilon \)
 3.
\(V(s,a)\le \sum _{s'\in S} \delta (s,a)(s')\cdot V(s')\) for all \(s\in S\) and \(a\in Act {s}\)
The outputs V of both the value iteration, and BRTDP are valid \(\varepsilon \)underapproximations. We define a liberal strategy \(\varsigma ^V\) by \(\varsigma ^V(s)={{\mathrm{arg\,max}}}_{a\in Act (s)}\, V(s,a)\) for all \(s\in S\).^{4}
Lemma 1
For every \(\varepsilon >0\) and a valid \(\varepsilon \)underapproximation V, \(\varsigma ^V\) is \(\varepsilon \)optimal.^{5}
 4.
Each MEC \(S'\subseteq S\) has at least one exit.

For every \(s\in S\) which is not in any MEC, we put \(\varsigma ^V(s)={{\mathrm{arg\,max}}}_{a\in Act (s)}\, V(s,a)\).
 For every \(s\in S\) which is in a MEC \(S'\),

if s is an exit, then \(\varsigma ^V(s)=\{a\in Act (s)\mid (s,a)\text { is maximalexternal in }S'\}\)

otherwise, \(\varsigma ^V(s)=\{a\in Act (s)\mid a\text { is internal}\}\)

Using these extended definitions, Lemma 1 remains valid. Further, note that \(\varsigma ^V(s)\) is defined even for states with trivial underapproximation \(V(s)=0\), for instance a state s that was never subject to any value iteration improvement. Then the values \(\varsigma (s)\) may not be stored explicitly, but follow implicitly from not storing any V(s), thus assuming \(V(s,\cdot )=0\).
4 Importance of Decisions
Note that once we have computed an \(\varepsilon \)optimal liberal strategy \(\varsigma \), we may, in principle, compute a compact representation of \(\varsigma \) (using e.g. BDDs), and obtain a strategy with possibly smaller representation than above.
Example 1
For the MDP of Fig. 1 with the objective \(\Diamond \{t\}\) and a strategy \(\varsigma \) choosing b, we have \( Imp ^{\varsigma }(s)=1\) and \( Imp ^{\varsigma }(q)=5/995\). Trivially, \( Imp ^\varsigma (t)=1\). For all other states, the importance is zero.
Obviously, decisions made in states of zero importance do not affect \(\mathbb P^{\varsigma }[\Diamond {F}]\) since these states never occur on paths from \(\hat{s}\) to F. However, note that many states of S may be reachable in \(G^{\varsigma }\) with positive but negligibly small probability. Clearly, the value of \(\mathbb P^{\varsigma }[\Diamond {F}]\) depends only marginally on choices made in these states. Formally, let \(\varsigma _{\varDelta }\) be a strategy obtained from \(\varsigma \) by changing each \(\varsigma (s)\) with \( Imp ^\varsigma (s)\le \varDelta \) to an arbitrary subset of \( Act (s)\). We obtain the following obvious property:
Lemma 2
For every liberal strategy \(\varsigma \), we have \(\displaystyle \lim _{\varDelta \rightarrow 0} \mathbb P^{\varsigma _{\varDelta }}[\Diamond {F}]=\mathbb P^{\varsigma }[\Diamond {F}]\).
In fact, every \(\varDelta <\min (\{ Imp ^\varsigma (s)\mid s\in S\}\setminus \{0\})\) satisfies \(\mathbb P^{\varsigma _{\varDelta }}[\Diamond {F}]=\mathbb P^{\varsigma }[\Diamond {F}]\). But often even larger \(\varDelta \) may give \(\mathbb P^{\varsigma _{\varDelta }}[\Diamond {F}]\) sufficiently close to \(\mathbb P^{\varsigma }[\Diamond {F}]\). Such \(\varDelta \) may be found using e.g. trial and error approach.^{7}
Most importantly, we can use the importance of a state to affect the probability that decisions in this state are indeed remembered in the data structure. Data structures with such a feature are used in various learning algorithms. In the next section, we discuss decision trees. Due to this extra variability, which decisions to learn, the resulting decision trees are smaller than BDDs for strictly defined \(\varsigma _\varDelta \).
5 Efficient Representations
Let \(G=(S,A, Act ,\delta ,\hat{s})\) be an MDP. In order to symbolically represent strategies in G, we need to assume that states and actions have some internal structure. Inspired by PRISM language [5], we consider a set \(\mathcal {V}=\{v_1,\ldots ,v_n\}\) of integer variables, each \(v_i\) gets its values from a finite domain \( Dom (v_i)\). We suppose that \(S=\prod _{i=1}^n Dom (v_i)\subseteq \mathbb {Z}^n\), i.e. each state is a vector of integers. Further, we assume that the MDP arises as a product of m modules, each of which can separately perform nonsynchronizing actions as well as synchronously with other modules perform a synchronizing action. Therefore, we suppose \(A\subseteq \bar{A} \times \{0,\ldots ,m\}\), where \(\bar{A}\subseteq \mathbb {N}\) is a finite set and the second component determines the module performing the action (0 stands for synchronizing actions).^{8}
5.1 BDD Representation
 1.
Instead of a string of bits describing all variables, a string of integers (one per variable) can be used. Branching is then done not on the value of each bit, but according to an inequality comparing the variable to a constant. This significantly improves the readability.
 2.
Instead of building the automaton according to a chosen order of bits, we let a heuristic choose the order of the inequalities and the actual constants in the inequalities.
 3.
Instead of representing the language precisely, we allow the heuristic to choose which data to represent and which not. The likelihood that each datum is represented corresponds to its importance, which we provide as another input.
The latter two steps lead to significantly smaller graphs than BDDs. All this can be done in an efficient way using decision trees learning.
5.2 Representation Using Decision Trees
Decision Trees. A decision tree for a domain \(\prod _{i=1}^d X_i\subseteq \mathbb {Z}^d\) is a tuple \(\mathcal {T}=(T,\rho ,\theta )\) where T is a finite rooted binary (ordered) tree with a set of inner nodes N and a set of leaves L, \(\rho \) assigns to every inner node a predicate of the form \([x_i\sim const ]\) where \(i\in \{1,\ldots ,d\}\), \(x_i\in X_i\), \( const \in \mathbb {Z}\), \(\mathord {\sim }\in \{\le ,<,\ge ,>,=\}\), and \(\theta \) assigns to every leaf a value \( good \), or \( bad \).^{9}
Example 2
Consider dimension \(d=1\), domain \(X_1=\{1,\ldots ,7\}\). A tree representing a set \(\{1,2,3,7\}\) is depicted in Fig. 2. To depict the ordered tree clearly, we use unbroken lines for the first child, corresponding to the satisfied predicate, and dashed line for the second one, corresponding to the unsatisfied predicate.
In our setting, we use the domain \(S\times A\) defined by Equation (1) which is of the form \(\prod _{i=1}^{n+2} X_i\) where for each \(1\le i\le n\) we have \(X_i= Dom (v_i)\), \(X_{n+1}=\bar{A}\) and \(X_{n+2}=\{0,1,\ldots ,m\}\). Here the coordinates \( Dom (v_i)\) are considered “unbounded” and, consequently, the respective predicates use inequalities. In contrast, we know the possible values of \(\bar{A}\times \{0,1,\ldots ,m\}\) in advance and they are not too many. Therefore, these coordinates are considered “discrete” and the respective predicates use equality. Examples of such trees are given in Sect. 6 in Figs. 4 and 5. Now a decision tree \(\mathcal {T}\) for this domain determines a liberal strategy \(\varsigma :S\rightarrow 2^A\) by \(a\in \varsigma (s)\) iff \((s,a)\in \mathcal {L}(\mathcal {T})\).
 1.
Start with a single node (root), and assign to it the whole training sequence.
 2.Given a node n with a sequence \(\tau \),

if all training examples in \(\tau \) are positive, set \(\theta (n)= good \) and stop;

if all training examples in \(\tau \) are negative, set \(\theta (n)= bad \) and stop;
 otherwise,

choose a predicate with the “highest gain” (with lowest entropy, see e.g. [15, Sects. 3.4.1, 3.7.2]),

split \(\tau \) into sequences satisfying and not satisfying the predicate, assign them to the first and the second child, respectively,

go to step 2 for each child.


Intuitively, the predicate with the highest gain splits the sequence so that it maximizes the portion of positive data in the satisfying subsequence and the portion of negative data in the nonsatisfying subsequence.
In addition, the final tree can be pruned. This means that some leaves are merged, resulting in a smaller tree at the cost of some imprecision of storing (the language of the tree changes). The pruning phase is quite sophisticated, hence for the sake of simplicity and brevity, we omit the details here. We use the standard C4.5 algorithm and refer to [15, 57]. In Sect. 6, we comment on effects of parameters used in pruning.
Learning a Strategy. Assume that we already have a liberal strategy \(\varsigma :S\rightarrow 2^A\). We show how we learn good and bad stateaction pairs so that the language of the resulting tree is close to the set of good pairs. The training sequence will be composed of stateaction pairs where good pairs are positive examples, and bad pairs are negative ones. Since our aim is to ensure that important states are learnt and not pruned away, we repeat pairs with more important states in the training sequence more frequently.
6 Experiments
In this section, we present the experimental evaluation of the presented methods, which we have implemented within the probabilistic model checker PRISM [5]. All the results presented in this section were obtained on a single Intel(R) Xeon(R) CPU (3.50 GHz) with memory limited to 10 GB.
First, we discuss several alternative options to construct the training data and to learn them in a decision tree. Further, we compare decision trees to other data structures, namely sets and BDDs, with respect to the sizes necessary for storing a strategy. Finally, we illustrate how the decision trees can be used to gain insight into our benchmarks.
6.1 Decision Tree Learning
Generating Training Data. The strategies we work with come from two different sources. Firstly, we consider strategies constructed by PRISM, which can be generated using the explicit or sparse model checking engine. Secondly, we consider strategies constructed by the BRTDP algorithm [11], which are defined on a part of the state space only.
Recall that given a strategy, the training data for the decision trees is constructed from c simulation runs according to the strategy. In our experiments, we found that \(c=10000\) produces good results in all the examples we consider. Note that we stop each simulation as soon as the target or a state with no path to the target state is reached.

The leaf size parameter \(M\in \mathbb {N}\) determines that each leaf node with less than M instances in the training data is merged with its siblings. Therefore, only values smaller than the number of instances per classification class are reasonable, since higher numbers always result in the trivial tree of size 1.

The confidence factor \(C\in (0,0.5)\) is used internally for determining the amount of pruning during decision tree construction. Smaller values incur more pruning and therefore smaller trees.
Detailed information and an empirical study of the parameters for J48 is available in [59].
Comparison of representation sizes for strategies obtained from PRISM and from BRTDP computation. Sizes are presented as the number of states for explicit lists of values, the number of nodes for BDDs, and the number of nodes for decision trees (DT). For DT, we display the tree size obtained from the binary search described above. DT Error reports the relative error of the strategy determined by the decision tree (on the induced Markov chain) compared to the optimal value, obtained by model checking with PRISM.
PRISM  BRTDP  

Example  \(\vert S \vert \)  Value  Explicit  BDD  DT  DT Error  Explicit  BDD  DT  DT Error 
Firewire  481,136  1.000  479,834  4,233  1  0.000 %  766  4,763\(^{1}\)  1  0.000 % 
Investor  35,893  0.958  28,151  783  27  0.886 %  21,931  2,780  35  0.836 % 
Mer_17M  1,773,664  0.200  Memory out  1,887  619  17  0.000 %  
Mer_big\(^{2}\)  Approx. \(10^{13}\)  0.200  Memory out  1,894  646  19  0.692 %  
Zeroconf  89,586  0.009  60,463  409  7  0.106 %  1,630  905  7  0.235 % 
6.2 Results
Effects of various learning variants on the tree size. Smallest trees computed from PRISM or BRTDP and the average time to compute one number are presented.

firewire models the Tree Identify Protocol of the IEEE 1394 High Performance Serial Bus, which is used to transport video and audio signals within a network of multimedia devices. The reachability objective is that one node gets the root and the other one the child role.
investor models a market investor and shares, which change their value probabilistically over time. The reachability objective is that the investor finishes a trade at a time, when his shares are more valuable than some threshold value.
mer is a mutual exclusion protocol, that regulates the access of two users to two different resources. The protocol should prohibit that both resources are accessed simultaneously.
zeroconf is a network protocol which allows users to choose their IP addresses autonomously. The protocol should detected and prohibit IP address conflict.
For every example, Table 1 shows the size of the state space, the value of the optimal strategy, and the sizes of strategies obtained from explicit model checking by PRISM and by BRTDP, for each discussed data structure.
Learning Variants. In order to justify our choice of the importance function \( Imp \), we compare it to several alternatives.
 1.
When constructing the training data, we can use the importance measure \( Imp \), and add states as often as is indicated by its importance (\(\textsc {I}\)), or neglect it and simply add every visited state exactly once (\(\textsc {O}\)).
 2.
Further, states on the simulation are learned conditioned on the fact that the target state is reached (\(\Diamond \)). Another option is to consider all simulations (\(\forall \)).
 3.
Finally, instead of the probability to visit the state (\(\mathbb P\)), one can consider the expected number of visits (\(\mathbb E\)).
In Table 2, we report the sizes of the decision trees obtained for the all learning variants. We conclude that our choice (\(\textsc {I}\Diamond \mathbb P\)) is the most useful one.
6.3 Understanding Decision Trees
We show how the constructed decision trees can help us to gain insight into the essential features of the systems.
zeroconf example. In Fig. 4 we present a decision tree that is a strategy for zeroconf and shows how an unresolved IP address conflict can occur in the protocol. First we present how to read the strategy represented in Fig. 4. Next we show how the strategy can explain the conflict in the protocol. Assume that we are classifying a stateaction pair (s, a), where action a is enabled in state s.
 1.
No matter what the current state s is, the action rec is always classified as \( bad \) according to the root of the tree. Therefore, the action rec should be played with positive probability only if all other available actions in the current state are also classified as \( bad \).
 2.
If action a is different from rec, the right son of the root node is reached. If action a is different from action l>0&b=1&ip_mess=1 > b’=0&z’=0&n1’=min(n1+1,8)&ip_mess’=0 (the whole PRISM command is a single action), then a is classified as \( good \) in state s. Otherwise, the left son is reached.
 3.
In node \(\mathtt{z \le 0}\) the classification of action a (that is the action that labels the parent node) depends on the variable valuation of the current state. If the value of var. z is greater than 0, then a is classified as \( good \) in state s, otherwise it is classified as \( bad \).
firewire example. For firewire, we obtain a trivial tree with a single node, labelled \( good \). Therefore, playing all available actions in each state guarantees reaching the target almost surely. In contrast to other representations, we have automatically obtained the information that the network always reaches the target configuration, regardless of the individual behaviour of its components and their interleaving.
mer example. In the case of mer, there exists a strategy that violates the required property that the two resources are not accessed simultaneously. The decision tree for the mer strategy is depicted in Fig. 5. In order to understand how a state is reached, where both resources are accessed at the same time, it is necessary to determine which user accesses which resource in that state.
 1.
The two tree nodes labelled by 1 explain what resource user 1 should access. The root node labelled by action s1=0&r1=0 > r1’=2 specifies that the request to access resource 2 (variable r1 is set to 2) is classified as \( bad \). The only remaining action for user 1 is to request access to resource 1. This action is classified as \( good \) by the right son of the root node.
 2.
Analogously, the tree nodes labelled by 2 specify that user 2 actions should request access to resource 2 (follows from s2=0&r2=0 > r2’=2). Once resource 2 is requested it should change its internal state s2 to 1 (follows from s2=0&r2=2 > s2’=1). It follows, that in the state violating the property, user 1 has access to resource 1 and user 2 to resource 2.
7 Conclusion
In this work we presented a new approach to represent strategies in MDPs in a succinct and comprehensible way. We exploited machine learning methods to achieve our goals. Interesting directions of future works are to investigate whether other machine learning methods can be integrated with our approach, and to extend our approach from reachability objectives to other objectives (such as longrun average and discountedsum).
Footnotes
 1.
In general, a strategy may be history dependent. However, for objectives considered in this paper, memoryless strategies (depending on the last state visited) are sufficient. Therefore, we only consider memoryless strategies in this paper.
 2.
For every MDP, there is a memoryless deterministic optimal strategy, see e.g. [2].
 3.
Here we use V for the lower approximation denoted by \(V_L\) in [11].
 4.
Furthermore, one could consider liberal strategies playing also \(\varepsilon \)optimal actions. However, our experiments did not prove better performance.
 5.
Intuitively this means that randomizing among good actions of \(\varepsilon \)optimal strategies preserves \(\varepsilon \)optimality in the reachability setting (in contrast to other settings, e.g. with parity objectives).
 6.
Instead of the conditional probability of reaching s, we could consider the conditional expected number of visits of s. We discuss the differences and compare the efficiency together with the case of no conditioning on reaching the target in Sect. 6.
 7.
One may give a theoretical bound on convergence of \(\mathbb P^{\varsigma _{\varDelta }}[\Diamond {F}]\) to \(\mathbb P^{\varsigma }[\Diamond {F}]\) as \(\varDelta \rightarrow 0\), using e.g. Lemma 5.1 of [54]. However, for large MDPs the bound would be impractical.
 8.
On the one hand, PRISM does not allow different modules to have local variables with the same name, hence we do not distinguish which module does a variable belong to. On the other hand, while PRISM declares no names for nonsynchronizing actions, we want to exploit the connection between the corresponding actions of different copies of the same module.
 9.
There exist many variants of decision trees in the literature allowing arbitrary branching, arbitrary values in the leaves, etc., e.g. [15]. For simplicity, we define only a special suitable subclass.
Notes
Acknowledgements
This research was funded in part by Austrian Science Fund (FWF) Grant No P 23499N23, FWF NFN Grant No S11407N23 (RiSE) and Z211N23 (Wittgenstein Award), European Research Council (ERC) Grant No 279307 (Graph Games), ERC Grant No 267989 (QUAREM), the Czech Science Foundation Grant No P202/12/G061, and People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme (FP7/2007–2013) REA Grant No 291734.
References
 1.Howard, R.A.: Dynamic Programming and Markov Processes. The MIT press, New York, London, Cambridge (1960)Google Scholar
 2.Puterman, M.L.: Markov Decision Processes. Wiley, New York (1994)CrossRefzbMATHGoogle Scholar
 3.Filar, J., Vrieze, K.: Competitive Markov Decision Processes. Springer, New York (1997)zbMATHGoogle Scholar
 4.Baier, C., Katoen, J.P.: Principles of Model Checking (Representation and Mind Series). The MIT Press, Cambridge (2008)Google Scholar
 5.Kwiatkowska, M., Norman, G., Parker, D.: PRISM 4.0: verification of probabilistic realtime systems. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS, vol. 6806, pp. 585–591. Springer, Heidelberg (2011) CrossRefGoogle Scholar
 6.Courcoubetis, C., Yannakakis, M.: The complexity of probabilistic verification. J. ACM 42(4), 857–907 (1995)MathSciNetCrossRefzbMATHGoogle Scholar
 7.Vardi, M.: Automatic verification of probabilistic concurrent finite state programs. In: FOCS, pp. 327–338 (1985)Google Scholar
 8.Segala, R.: Modeling and verification of randomized distributed realtime systems. Ph.D thesis, MIT Press (1995). Technical report MIT/LCS/TR676Google Scholar
 9.De Alfaro, L.: Formal verification of probabilistic systems. Ph.D thesis, Stanford University (1997)Google Scholar
 10.Kwiatkowska, M., Parker, D.: Automated verification and strategy synthesis for probabilistic systems. In: Van Hung, D., Ogawa, M. (eds.) ATVA 2013. LNCS, vol. 8172, pp. 5–22. Springer, Heidelberg (2013) CrossRefGoogle Scholar
 11.Brázdil, T., Chatterjee, K., Chmelík, M., Forejt, V., Křetínský, J., Kwiatkowska, M., Parker, D., Ujma, M.: Verification of markov decision processes using learning algorithms. In: Cassez, F., Raskin, J.F. (eds.) ATVA 2014. LNCS, vol. 8837, pp. 98–114. Springer, Heidelberg (2014) Google Scholar
 12.Bernet, J., Janin, D., Walukiewicz, I.: Permissive strategies: from parity games to safety games. ITA 36(3), 261–275 (2002)MathSciNetzbMATHGoogle Scholar
 13.Bouyer, P., Markey, N., Olschewski, J., Ummels, M.: Measuring permissiveness in parity games: meanpayoff parity games revisited. In: Bultan and Hsiung [60] pp. 135–149Google Scholar
 14.Dräger, K., Forejt, V., Kwiatkowska, M., Parker, D., Ujma, M.: Permissive controller synthesis for probabilistic systems. In: Ábrahám, E., Havelund, K. (eds.) TACAS 2014 (ETAPS). LNCS, vol. 8413, pp. 531–546. Springer, Heidelberg (2014) CrossRefGoogle Scholar
 15.Mitchell, T.M.: Machine Learning, 1st edn. McGrawHill Inc., New York (1997)zbMATHGoogle Scholar
 16.Kwiatkowska, M., Norman, G., Parker, D.: The PRISM benchmark suite. In: QEST, pp. 203–204 (2012)Google Scholar
 17.Boutilier, C., Dearden, R., Goldszmidt, M.: Exploiting structure in policy construction. In: IJCAI95, pp. 1104–1111 (1995)Google Scholar
 18.Kearns, M., Koller, D.: Efficient reinforcement learning in factored MDPs. In: IJCAI, pp. 740–747. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1999)Google Scholar
 19.Kushmerick, N., Hanks, S., Weld, D.: An algorithm for probabilistic leastcommitment planning. In: Proceedings of AAAI94, pp. 1073–1078 (1994)Google Scholar
 20.Hoey, J., Staubin, R., Hu, A., Boutilier, C.: Spudd: stochastic planning using decision diagrams. In: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 279–288. Morgan Kaufmann (1999)Google Scholar
 21.Chapman, D., Kaelbling, L.P.: Input generalization in delayed reinforcement learning: an algorithm and performance comparisons. pp. 726–731. Morgan Kaufmann (1991)Google Scholar
 22.Koller, D., Parr, R.: Computing factored value functions for policies in structured MDPs. In: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 1332–1339. Morgan Kaufmann (1999)Google Scholar
 23.Boutilier, C., Dean, T., Hanks, S.: Decisiontheoretic planning: structural assumptions and computational leverage. JAIR 11, 1–94 (1999)MathSciNetzbMATHGoogle Scholar
 24.De Alfaro, L., Kwiatkowska, M., Norman, G., Parker, D., Segala, R.: Symbolic model checking of probabilistic processes using MTBDDS and the kronecker representation. In: Graf, S. (ed.) TACAS 2000. LNCS, vol. 1785, pp. 395–410. Springer, Heidelberg (2000) CrossRefGoogle Scholar
 25.Hermanns, H., Kwiatkowska, M., Norman, G., Parker, D., Siegle, M.: On the use of MTBDDs for performability analysis and verification of stochastic systems. J. Log. Algebraic Program. Spec. Issue Probab. Tech. Des. Anal. Syst. 56(1–2), 23–67 (2003)MathSciNetCrossRefzbMATHGoogle Scholar
 26.Miner, A.S., Parker, D.: Symbolic representations and analysis of large probabilistic systems. In: Baier, C., Haverkort, B.R., Hermanns, H., Katoen, J.P., Siegle, M. (eds.) Validation of Stochastic Systems. LNCS, vol. 2925, pp. 296–338. Springer, Heidelberg (2004) CrossRefGoogle Scholar
 27.Wimmer, R., Braitling, B., Becker, B., Hahn, E.M., Crouzen, P., Hermanns, H., Dhama, A., Theel, O.: Symblicit calculation of longrun averages for concurrent probabilistic systems. In: QEST, pp. 27–36, IEEE Computer Society, Washington, DC, USA (2010)Google Scholar
 28.Boutilier, C., Dearden, R.: Approximating value trees in structured dynamic programming. In: Proceedings of the Thirteenth International Conference on Machine Learning, pp. 54–62 (1996)Google Scholar
 29.Pyeatt, L.D.: Reinforcement learning with decision trees. In: The 21st IASTED International MultiConference on Applied Informatics (AI 2003), Innsbruck, Austria, pp. 26–31, 10–13 Feb 2003Google Scholar
 30.Raghavendra, C.S., Liu, S., Panangadan, A., Talukder, A.: Compact representation of coordinated sampling policies for body sensor networks. In: Proceedings of Workshop on Advances in Communication and Networks (Smart Homes for TeleHealth), pp. 6–10, IEEE (2010)Google Scholar
 31.Han, T., Katoen, J.P., Damman, B.: Counterexample generation in probabilistic model checking. IEEE Trans. Softw. Eng. 35(2), 241–257 (2009)CrossRefGoogle Scholar
 32.Andrés, M.E., D’Argenio, P., Van Rossum, P.: Significant diagnostic counterexamples in probabilistic model checking. In: Chockler, H., Hu, A.J. (eds.) HVC 2008. LNCS, vol. 5394, pp. 129–148. Springer, Heidelberg (2009) CrossRefGoogle Scholar
 33.Wimmer, R., Jansen, N., Ábrahám, E., Katoen, J.P., Becker, B.: Minimal counterexamples for lineartime probabilistic verification. TCS 549, 61–100 (2014)CrossRefzbMATHGoogle Scholar
 34.Jansen, N., Ábrahám, E., Katelaan, J., Wimmer, R., Katoen, J.P., Becker, B.: Hierarchical counterexamples for discretetime markov chains. In: Bultan and Hsiung [60] pp. 443–452Google Scholar
 35.Damman, B., Han, T., Katoen, J.P.: Regular expressions for PCTL counterexamples. In: QEST, pp. 179–188, IEEE Computer Society (2008)Google Scholar
 36.Fecher, H., Huth, M., Piterman, N., Wagner, D.: PCTL model checking of markov chains: truth and falsity as winning strategies in games. Perform. Eval. 67(9), 858–872 (2010)CrossRefGoogle Scholar
 37.Aljazzar, H., Leue, S.: Directed explicit statespace search in the generation of counterexamples for stochastic model checking. IEEE Trans. Softw. Eng. 36(1), 37–60 (2010)CrossRefGoogle Scholar
 38.Komuravelli, A., Păsăreanu, C.S., Clarke, E.M.: Assumeguarantee abstraction refinement for probabilistic systems. In: Madhusudan, P., Seshia, S.A. (eds.) CAV 2012. LNCS, vol. 7358, pp. 310–326. Springer, Heidelberg (2012) CrossRefGoogle Scholar
 39.Aljazzar, H., LeitnerFischer, F., Leue, S., Simeonov, D.: DiPro  a tool for probabilistic counterexample generation. In: Groce, A., Musuvathi, M. (eds.) SPIN Workshops 2011. LNCS, vol. 6823, pp. 183–187. Springer, Heidelberg (2011) CrossRefGoogle Scholar
 40.Jansen, N., Ábráham, E., Volk, M., Wimmer, R., Katoen, J.P., Becker, B.: The COMICS tool  computing minimal counterexamples for DTMCs. In: Chakraborty, S., Mukund, M. (eds.) ATVA 2012. LNCS, vol. 7561, pp. 349–353. Springer, Heidelberg (2012)CrossRefGoogle Scholar
 41.Ábrahám, E., Becker, B., Dehnert, C., Jansen, N., Katoen, J.P., Wimmer, R.: Counterexample generation for discretetime markov models: an introductory survey. In: Bernardo, M., Damiani, F., Hähnle, R., Johnsen, E.B., Schaefer, I. (eds.) SFM 2014. LNCS, vol. 8483, pp. 65–121. Springer, Heidelberg (2014) CrossRefGoogle Scholar
 42.Aljazzar, H., Leue, S.: Generation of counterexamples for model checking of markov decision processes. In: QEST, pp. 197–206, IEEE Computer Society (2009)Google Scholar
 43.LeitnerFischer, F., Leue, S.: Probabilistic fault tree synthesis using causality computation. IJCCBS 4(2), 119–143 (2013)CrossRefGoogle Scholar
 44.Kattenbelt, M., Huth, M.: Verification and refutation of probabilistic specifications via games. In: IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2009, IIT Kanpur, India, pp. 251–262, 15–17 Dec 2009Google Scholar
 45.Wimmer, R., Jansen, N., Vorpahl, A., Ábrahám, E., Katoen, J.P., Becker, B.: Highlevel counterexamples for probabilistic automata. In: Joshi, K., Siegle, M., Stoelinga, M., D’Argenio, P.R. (eds.) QEST 2013. LNCS, vol. 8054, pp. 39–54. Springer, Heidelberg (2013) CrossRefGoogle Scholar
 46.Dehnert, C., Jansen, N., Wimmer, R., Ábrahám, E., Katoen, J.P.: Fast debugging of PRISM models. In: Cassez, F., Raskin, J.F. (eds.) ATVA 2014. LNCS, vol. 8837, pp. 146–162. Springer, Heidelberg (2014) Google Scholar
 47.Kwiatkowska, M.Z., Norman, G., Parker, D.: Gamebased abstraction for Markov decision processes. In: QEST, pp. 157–166 (2006)Google Scholar
 48.Hermanns, H., Wachter, B., Zhang, L.: Probabilistic CEGAR. In: Gupta, A., Malik, S. (eds.) CAV 2008. LNCS, vol. 5123, pp. 162–175. Springer, Heidelberg (2008) CrossRefGoogle Scholar
 49.Chadha, R., Viswanathan, M.: A counterexampleguided abstractionrefinement framework for Markov decision processes. ACM Trans. Comput. Log. 12(1), 1 (2010)MathSciNetCrossRefGoogle Scholar
 50.Chatterjee, K., Chmelík, M., Daca, P.: CEGAR for qualitative analysis of probabilistic systems. In: Biere, A., Bloem, R. (eds.) CAV 2014. LNCS, vol. 8559, pp. 473–490. Springer, Heidelberg (2014) Google Scholar
 51.D’Argenio, P.R., Jeannet, B., Jensen, H.E., Larsen, K.G.: Reachability analysis of probabilistic systems by successive refinements. In: De Luca, L., Gilmore, S. (eds.) PROBMIV 2001, PAPMPROBMIV 2001, and PAPM 2001. LNCS, vol. 2165, pp. 39–56. Springer, Heidelberg (2001) CrossRefGoogle Scholar
 52.D’Argenio, P.R.: Reduction and refinement strategies for probabilistic analysis. In: Hermanns, H., Segala, R. (eds.) PROBMIV 2002, PAPMPROBMIV 2002, and PAPM 2002. LNCS, vol. 2399, pp. 57–76. Springer, Heidelberg (2002) CrossRefGoogle Scholar
 53.McMahan, H.B., Likhachev, M., Gordon, G.J.: Bounded realtime dynamic programming: RTDP with monotone upper bounds and performance guarantees. In: ICML (2005)Google Scholar
 54.Brázdil, T., Kiefer, S., Kučera, A.: Efficient analysis of probabilistic programs with an unbounded counter. J. ACM 61(6), 41:1–41:35 (2014)CrossRefGoogle Scholar
 55.Von Essen, C., Jobstmann, B., Parker, D., Varshneya, R.: Semisymbolic computation of efficient controllers in probabilistic environments. Technical report, Verimag (2012)Google Scholar
 56.Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)Google Scholar
 57.Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993) Google Scholar
 58.Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)CrossRefGoogle Scholar
 59.Drazin, S., Montag, M.: Decision tree analysis using weka. Machine LearningProject II, University of Miami, pp. 1–3 (2012)Google Scholar
 60.Bultan, T., Hsiung, P.A. (eds.): Automated Technology for Verification and Analysis, ATVA 2011. 9th International Symposium, Taipei, Taiwan, October 1114, 2011. Proceedings, vol. 6996, LNCS. Springer, Heidelberg (2011)Google Scholar