
1 Introduction

Markov decision processes (MDP) [7, 30, 32] are a basic model for systems featuring both probabilistic and non-deterministic behaviour. They come in two flavours: discrete-time MDP (often simply MDP) and continuous-time MDP (CTMDP). While the evolution of MDP happens in discrete steps, their natural real-time extension, CTMDP, additionally features random time delays governed by exponential probability distributions. Their application domains range across a wide spectrum, e.g. operations research [10, 16], power management and scheduling [31], networked and distributed systems [19, 22], or communication protocols [28], to name a few. One of the key aspects of such systems is their performance, often formalized as mean payoff (also called long-run average reward), one of the classic and most studied objectives on (CT)MDP [30] with numerous applications [17]. In this context, probabilistic model checking and performance evaluation intersect [5]. While the former takes the verification perspective of worst-case analysis and the latter the perspective of optimization for the best case, they are mathematically dual and thus algorithmically the same.

The range of analysis techniques provided by the literature is very rich, encompassing linear programming, policy iteration, and value iteration. However, these are applicable only in the setting where the (CT)MDP is known (whitebox setting). In order to handle the blackbox setting, where the model is unknown or only partially known, statistical model checking (SMC) [37] relaxes the requirement of hard guarantees on the correctness (claimed precision) of the result. Instead, it uses probably approximately correct (PAC) analysis, which provides essentially a confidence interval on the result: with probability (confidence) at least \(1-\delta \), the result of the analysis is \(\varepsilon \)-close to the true value. This kind of analysis may be applicable to systems whose internal functionality we cannot access, but whose behaviour we can still observe.

In this paper, we provide the first algorithm with PAC bounds on the mean payoff in blackbox MDP. We treat both discrete-time and continuous-time MDP, and the SMC algorithm not only features PAC bounds (returning the result with prescribed precision and confidence), but is also an anytime algorithm (gradually improving the result and, if terminated prematurely, returning the current approximation with its precision and the required confidence).

The difficulty with blackbox models is that we do not know the exact transition probabilities, not even the number of successors for an action from a state. The algorithm thus must simulate the MDP to obtain any information. The visited states can be assembled into a model of the MDP, and statistics can be used to estimate the transition probabilities. The estimates can then be used to compute the mean payoff precisely on the model. The results of [12] and [33] then provide a method for estimating the number of times each state-action pair needs to be visited in an MDP to obtain a PAC bound on the expected mean-payoff value of the original MDP. However, notice that this requires that the topology be learnt perfectly, for which we either need some knowledge of the state space or recent developments in the spirit of [3]. On the one hand, this simple algorithm thus follows in a straightforward way from recent results in the literature (although to the best of our knowledge it has not been presented as such yet). On the other hand, the required number of samples using these bounds is prohibitively large, and therefore giving guarantees with such an analysis is not feasible in practice at all. In fact, the numbers are astronomic already for Markov chains with a handful of states [13]. We discuss further drawbacks of such a naïve solution in Sect. 3. Our main contribution in this paper is a practical algorithm. It takes the most promising actions from every state and uses on-demand value iteration [2], not even requiring an exhaustive exploration of the entire MDP. Using techniques of [3, 13], we can show that the partial model captures enough information. Most importantly, instead of using [12, 33], the PAC bounds are derived directly from the concrete confidence intervals, reflecting the width of each interval and the topology of the model, in the spirit of the practical SMC for reachability [3].

Our contribution can be summarized as follows:

  • We provide the first algorithm with PAC bounds on the mean payoff in blackbox MDP (Sect. 4) and its extension to blackbox CTMDP (Sect. 5).

  • We discuss the drawbacks of a possible more straightforward solution and how to overcome them (in Sect. 3 on the conceptual level, before we dive into the technical algorithms in the subsequent sections).

  • We evaluate the algorithm on the standard benchmarks of MDP and CTMDP and discuss the effect of heuristics, partial knowledge of the model, and variants of the algorithms (Sect. 6).

Related Work. SMC of unbounded-horizon properties of MDPs was first considered in [23, 29] for reachability. [20] gives a model-free algorithm for \(\omega \)-regular properties, which is convergent but provides no bounds on the current error. Several approaches provide SMC for MDPs and unbounded-horizon properties with PAC guarantees. Firstly, the algorithm of [18] requires (1) the mixing time T of the MDP, (2) the ability to restart simulations also in non-initial states, (3) visiting all states sufficiently many times, and thus (4) knowledge of the size of the state space |S|. Secondly, [9], based on delayed Q-learning [34], lifts the assumptions (2) and (3) and instead of (1) requires only (a bound on) the minimum transition probability \(p_{\mathsf {min}}\). Thirdly, [3] additionally lifts the assumption (4), keeping only \(p_{\mathsf {min}}\), as in this paper. In [13], it is argued that while unbounded-horizon properties cannot be analysed without any information on the system, knowledge of (a lower bound on) the minimum transition probability \(p_{\mathsf {min}}\) is a relatively light and realistic assumption in many scenarios, compared to the knowledge of the whole topology. In this paper, we thus adopt this assumption.

In contrast to SMC that uses possibly more (re-started) runs of the system, there are online learning approaches, where the desired behaviour is learnt for the single run. Model-based learning algorithms for mean payoff have been designed both for minimizing regret [4, 36] as well as for PAC online learning [25, 26].

Due to lack of space, the proofs and some more experimental results and discussions appear in [1].

2 Preliminaries

A probability distribution on a finite set X is a mapping \(\rho : X\mapsto [0,1]\), such that \(\sum _{x\in X} \rho (x) = 1\). We denote by \(\mathcal {D}(X)\) the set of probability distributions on X.

Definition 1

(MDP). A Markov decision process is a tuple of the form \(\mathcal {M}= (\mathsf {S}, s_\textsf {init}, \mathsf {Act}, \mathsf {Av}, \mathbb {T}, r)\), where \(\mathsf {S}\) is a finite set of states, \(s_\textsf {init}\in \mathsf {S}\) is the initial state, \(\mathsf {Act}\) is a finite set of actions, \(\mathsf {Av}: \mathsf {S}\rightarrow 2^{\mathsf {Act}}\) assigns to every state a set of available actions, \(\mathbb {T}: \mathsf {S}\times \mathsf {Act}\rightarrow \mathcal {D}(\mathsf {S})\) is a transition function that given a state s and an action \(a\in \mathsf {Av}(s)\) yields a probability distribution over successor states, and \(r: \mathsf {S}\rightarrow \mathbb {R}^{\ge 0}\) is a reward function, assigning rewards to states.

For ease of notation, we write \(\mathbb {T}(s, a, t)\) instead of \(\mathbb {T}(s, a)(t)\). We denote by \(\mathsf {Post}(s,a)\) the set of states that can be reached from s through action a. Formally, \(\mathsf {Post}(s,a) = \{t \,|\, \mathbb {T}(s, a, t) > 0\}\).

The choices of actions are resolved by strategies, generally taking history into account and possibly randomizing. However, for mean payoff it is sufficient to consider positional strategies of the form \(\pi : \mathsf {S}\rightarrow \mathsf {Act}\). The semantics of an MDP with an initial state \(s_\textsf {init}\) is given in terms of each strategy \(\sigma \) inducing a Markov chain \(\mathcal {M}^\sigma _{s_\textsf {init}}\) with the respective probability space and unique probability measure \(\mathbb P^{\mathcal {M}^\sigma _{s_\textsf {init}}}\), and the expected value \(\mathbb E^{\mathcal {M}^\sigma _{s_\textsf {init}}}[F]\) of a random variable F (see e.g. [6]). We drop \(\mathcal {M}^\sigma _{s_\textsf {init}}\) when it is clear from the context.

End Components. An end-component (EC) \(M = (T,A)\), with \(\emptyset \ne T \subseteq \mathsf {S}\) and \(A:T \rightarrow 2^{\mathsf {Act}}\), of an MDP \(\mathcal {M}\) is a sub-MDP of \(\mathcal {M}\) such that: for all \(s \in T\), we have that A(s) is a subset of the actions available from s; for all \(a \in A(s)\), we have \(\mathsf {Post}(s,a) \subseteq T\); and its underlying graph is strongly connected. A maximal end-component (MEC) is an EC that is not included in any other EC. Given an MDP \(\mathcal {M}\), the set of its MECs is denoted by \(\mathsf {MEC}(\mathcal {M})\). For \(\mathsf {MEC}(\mathcal {M}) = \{(T_1, A_1), \dots , (T_n, A_n)\}\), we define \(\mathsf {MEC}_{\mathsf {S}} = \bigcup _{i=1}^n T_i\) as the set of all states contained in some MEC.

Definition 2

(continuous-time MDP (CTMDP)). A continuous-time Markov decision process is a tuple of the form \(\mathcal {M}= (\mathsf {S}, s_\textsf {init}, \mathsf {Act}, \mathsf {Av}, \mathsf {R}, r)\), where \(\mathsf {S}\) is a finite set of states, \(s_\textsf {init}\in \mathsf {S}\) is the initial state, \(\mathsf {Act}\) is a finite set of actions, \(\mathsf {Av}: \mathsf {S}\rightarrow 2^{\mathsf {Act}}\) assigns to every state a set of available actions, \(\mathsf {R}: \mathsf {S}\times \mathsf {Act}\times \mathsf {S}\rightarrow \mathbb {R}_{\ge 0}\) is a transition rate matrix that given a state s and an action \(a\in \mathsf {Av}(s)\) defines the successors t of s on action a as those with \(\mathsf {R}(s,a,t) > 0\), and \(r: \mathsf {S}\rightarrow \mathbb {R}_{\ge 0}\) is a reward rate function, assigning to each state s the reward obtained per unit of time spent in s.

A strategy in a CTMDP decides immediately after entering a state which action is to be chosen from the current state. For a given state \(s \in \mathsf {S}\) and an action \(a \in \mathsf {Av}(s)\), we denote by \(\lambda (s, a) = \sum _{t} \mathsf {R}(s,a,t) > 0\) the exit rate of a in s. The residence time for action a in s is exponentially distributed with mean \(\frac{1}{\lambda (s, a)}\). An equivalent way of looking at a CTMDP is that in state s, we wait for a time which is exponentially distributed with rate \(\lambda (s, a)\), and then with probability \(\varDelta (s,a,t)=\mathsf {R}(s,a,t) / \lambda (s,a)\), we make a transition to state t. The reward accumulated for spending time \(\mathsf{t}\) in s is \(r(s) \cdot \mathsf{t}\).

Uniformization. A uniform CTMDP has a constant exit rate C for all state-action pairs, i.e., \(\lambda (s,a) = C\) for all states \(s \in \mathsf {S}\) and actions \(a \in \mathsf {Av}(\mathsf {s})\). The procedure of converting a non-uniform CTMDP into a uniform one is called uniformization. Consider a non-uniform CTMDP \(\mathcal {M}\). Let \(C \in \mathbb {R}_{\ge 0}\) be such that \(C \geqslant \lambda (s,a)\) for all \(s \in \mathsf {S}\) and \(a \in \mathsf {Act}\). We can obtain a uniform CTMDP \(\mathcal {M}_{C}\) by assigning the new rates

$$\begin{aligned} \mathsf {R}'(s,a,t) = {\left\{ \begin{array}{ll} \mathsf {R}(s,a,t) &{} \text {if } s \ne t \\ \mathsf {R}(s,a,t)+C - \lambda (s,a) &{} \text {if } s=t \end{array}\right. } \end{aligned}$$
(1)

For every action \(a \in \mathsf {Av}(s)\) from each state s in the new CTMDP, we have a self-loop if \(\lambda (s,a) < C\). Due to the constant exit rate, the mean time between any two consecutive transitions is constant.
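For illustration, the following minimal sketch performs this uniformization, assuming the rate matrix is stored as a nested dictionary R[s][a][t] (a purely illustrative representation, not tied to any particular tool):

```python
def exit_rate(R, s, a):
    """lambda(s, a) = sum over t of R(s, a, t)."""
    return sum(R[s][a].values())

def uniformize(R):
    """Return the uniformized rate matrix of Eq. (1), with C chosen as the maximum exit rate."""
    C = max(exit_rate(R, s, a) for s in R for a in R[s])
    R_unif = {}
    for s in R:
        R_unif[s] = {}
        for a in R[s]:
            rates = dict(R[s][a])
            # compensate the smaller exit rate by a self-loop of rate C - lambda(s, a)
            rates[s] = rates.get(s, 0.0) + C - exit_rate(R, s, a)
            R_unif[s][a] = rates
    return R_unif, C
```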

Mean Payoff. In this work, we consider the (maximum) mean payoff (or long-run average reward) of an MDP \(\mathcal {M}\), which intuitively describes the (maximum) average reward per step we expect to see when simulating the MDP for time going to infinity. Formally, let \(S_i,A_i,R_i\) be random variables giving the state visited, action played, and reward obtained in step i, and for CTMDP, \(T_i\) the time spent in the state appearing in step i. For MDP, \(R_i:=r(S_i)\), whereas for CTMDP, \(R_i:=r(S_i)\cdot T_i\); consequently, for a CTMDP and a strategy \(\pi \), we have \(\mathbb E^\pi _s(R_i) = \frac{r(S_i)}{\lambda (S_i, A_i)}\).

Thus given a strategy \(\pi \), the n-step average reward is

$$\begin{aligned} v^\pi _n(s) := \mathbb E^\pi _s \left( \frac{1}{n}\sum _{i=0}^{n-1} R_i \right) = \mathbb E^\pi _s \left( \frac{1}{n}\sum _{i=0}^{n-1} \frac{r(S_i)}{\lambda (S_i, A_i)} \right) , \end{aligned}$$

with the latter equality holding for CTMDP. For both MDP and CTMDP, the mean payoff is then

$$\begin{aligned} v(s) := \max _{\pi }\liminf _{n\rightarrow \infty } v^\pi _n(s), \end{aligned}$$

where the maximum over all strategies can also be without loss of generality restricted to the set of positional strategies \(\varPi ^{\mathsf {MD}}\). A well-known alternative characterization we use in this paper is

$$\begin{aligned} v(s) = \max _{\pi \in \varPi ^{\mathsf {MD}}} \sum _{M \in \mathsf {MEC}(\mathcal {M})} \mathbb P^\pi _s[\Diamond \Box M] \cdot v_M, \end{aligned}$$
(2)

where \(\Diamond \) and \(\Box \) denote the standard LTL operators eventually and always, respectively. Further, \(\Diamond \Box M\) denotes the set of paths that eventually remain forever within M and \(v_M\) is the unique value achievable in the (CT)MDP restricted to the MEC M. Note that \(v_M\) does not depend on the initial state chosen for the restriction.

We consider algorithms that have a limited information about the MDP.

Definition 3

(Blackbox and greybox). An algorithm inputs an MDP or a CTMDP as blackbox if

  • it knows \(s_\textsf {init}\),

  • for a given state, an oracle returns its available actions,

  • given a state s and action a, it can sample a successor t according to \(\mathbb {T}(s,a)\),

  • it knows \(p_{\mathsf {min}}\leqslant \min _{\begin{array}{c} s \in \mathsf {S},a \in \mathsf {Av}(s)\\ t \in \mathsf {Post}(s,a) \end{array}} \mathbb {T}(s,a,t)\), an under-approximation of the minimum transition probability.

When input as greybox, it additionally knows the number \(|{\mathsf {Post}(s,a)}|\) of successors for each state s and action a. Note that the exact probabilities on the transitions in an MDP or the rates in a CTMDP are unknown for both blackbox and greybox learning settings.
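For concreteness, a minimal sketch of such a blackbox interface is given below (the class and method names are illustrative assumptions, not the interface of any existing tool); a greybox oracle would additionally expose \(|{\mathsf {Post}(s,a)}|\):

```python
from typing import Callable, Hashable, List

State = Hashable
Action = Hashable

class BlackboxMDP:
    """Blackbox access to an MDP in the sense of Definition 3 (illustrative sketch)."""

    def __init__(self, s_init: State,
                 avail: Callable[[State], List[Action]],
                 sample: Callable[[State, Action], State],
                 p_min: float):
        self.s_init = s_init      # known initial state
        self._avail = avail       # oracle returning the available actions of a state
        self._sample = sample     # samples a successor t of (s, a) according to T(s, a)
        self.p_min = p_min        # lower bound on the minimum transition probability

    def available(self, s: State) -> List[Action]:
        return list(self._avail(s))

    def step(self, s: State, a: Action) -> State:
        return self._sample(s, a)
```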

3 Overview of Our Approach

Since no solutions are available in the literature and our solution consists of multiple ingredients, we present it in multiple steps to ease the understanding. First, we describe a more naïve solution and pinpoint its drawbacks. Second, we give an overview of a more sophisticated solution, eliminating the drawbacks. Third, we fill in its details in the subsequent sections. Besides, each of the three points is first discussed on discrete-time MDPs and then on continuous-time MDPs. The reason for this is twofold: the separation of concerns simplifies the presentation; and the algorithm for discrete-time MDP is equally important and deserves a standalone description.

3.1 Naïve Solution

We start by suggesting a conceptually simple solution. We can learn mean payoff MP in an MDP \(\mathcal {M}\) as follows:

  (i) Via simulating the MDP \(\mathcal {M}\), we learn a model \({\mathcal {M}}^\prime \) of \(\mathcal {M}\), i.e., we obtain confidence intervals on the transition probabilities of \(\mathcal {M}\) (of some given width \(\varepsilon _{TP}\), called TP-imprecision, and confidence \(1-\delta _{TP}\), where \(\delta _{TP}\) is called TP-inconfidence).

  (ii) We compute the mean payoff \(\widehat{MP}\) on the (imprecise) model \({\mathcal {M}}^\prime \).

  (iii) We compute the MP-imprecision \(\varepsilon _{MP}=|\widehat{MP}-MP|\) of the mean payoff from the TP-imprecision by the “robustness” theorem [8], which quantifies how mean payoff can change when the system is perturbed with a given maximum perturbation. Further, we compute the overall MP-inconfidence \(\delta _{MP}\) from the TP-inconfidence \(\delta _{TP}\); in particular, we can simply accumulate all the uncertainty and set \(\delta _{MP}=|\mathbb {T}|\cdot \delta _{TP}\), where \(|\mathbb {T}|\) is the number of transitions. The result is then probably approximately correct, being \(\varepsilon _{MP}\)-precise with confidence \(1-\delta _{MP}\). (Inversely, from a desired \(\varepsilon _{MP}\) we can also compute a sufficient \(\varepsilon _{TP}\) to be used in the first step.)

Learning the model, i.e. the transition probabilities, can be done by observing the simulation runs and collecting, for each state-action pair (s, a), a statistic of which states occur right after playing a in s. The frequency of each successor t among all these observations then estimates the transition probability \(\mathbb {T}(s,a,t)\). This is the standard task of estimating the generalized Bernoulli variable (a fixed distribution over finitely many options) with confidence intervals. We stop simulating when each transition probability has a precise enough confidence interval (with \(\varepsilon _{TP}\) and \(\delta _{TP}\) yielded by the robustness theorem from the desired overall precision). The drawbacks are (D1: uniform importance) that even transitions with little to no impact on the mean payoff have to be estimated precisely (with \(\varepsilon _{TP}\) and \(\delta _{TP}\)); and (D2: uniform precision required) that, even restricting our attention to “important” transitions, it may take a long time before the last one is estimated precisely (while others are already estimated overly precisely).

Subsequently, using standard algorithms, the mean payoff \(\widehat{MP}\) can be computed precisely by linear programming [30] or precisely enough by value iteration [2]. The respective MP-imprecision can then be estimated by the robustness theorem [8], which, for a given maximum perturbation of the transition probabilities (in our case, \(\varepsilon _{TP}/2\)), yields an upper bound on the respective perturbation of the mean payoff \(\varepsilon _{MP}/2\). The drawbacks are (D3: uniform precision utilized) that more precise confidence intervals for transitions (obtained due to D2) are not utilized, only the maximum imprecision is taken into account; and (D4: a-priori bounds) that the theorem is extremely conservative. Indeed, it reflects neither the topology of the MDP nor how impactful each transition is, and thus provides an a-priori bound, extremely loose compared to the values of mean payoff that can actually be obtained for concrete values within the confidence intervals. This is practically unusable beyond a handful of states, even for Markov chains [13].

For a CTMDP \(\mathcal {M}\), we additionally need to estimate the rates (see below how). Subsequently, we can uniformize the learnt CTMDP \({\mathcal {M}}^\prime \). The mean payoff of the uniformized CTMDP is then equal to the mean payoff of its embedded MDP. Hence, we can proceed as before, but we also have to compute (i) confidence intervals for the rates from finitely many observations, and (ii) the required precision and confidence of these intervals so that the respective induced error on the mean payoff is not too large. Hence all the drawbacks are inherited and, additionally, also apply to the estimates of the rates. Besides, (D5: rates) while imprecisions of rates do not increase the MP-imprecision too much, the bound obtained via uniformization and the robustness theorem is very loose. Indeed, imprecise rates are reflected as imprecise self-loops in the uniformization, which themselves do not have much impact on the mean payoff, but can increase the TP-imprecision and thus hugely increase the MP-imprecision obtained from the robustness theorem.

Finally, note that for both types of MDP, (D6: not anytime) this naïve algorithm is not an anytime algorithm, since it works with pre-computed \(\varepsilon _{TP}\) and \(\delta _{TP}\). Instead, it returns the result with the input precision if given enough time; if not given enough time, it does not return anything (and, if given more time, it does not improve the precision).

3.2 Improved Solution

Now we modify the solution so that the drawbacks are eliminated. The main ideas are (i) to allow for differences in TP-imprecisions (\(\varepsilon _{TP}\) can vary over transitions) and even deliberately ignore less important transitions and instead improve precision for transitions where more information is helpful the most; (ii) rather than using the a-priori robustness theorem, to utilize the precision of each transition to its maximum; and (iii) to give an anytime algorithm that reflects the current confidence intervals and, upon improving them, can efficiently improve the mean-payoff estimate without recomputing it from scratch. There are several ingredients used in our approach.

Firstly, [2] provides an anytime algorithm for approximating the mean payoff in a fully known MDP. The algorithm is a version of value iteration, called on-demand, performing improvements (so-called Bellman updates) of the mean-payoff estimate in each state. Moreover, the algorithm is simulation-based, performing the updates in the visited states, biasing towards states where a more precise estimate is helpful the most (“on demand”). This matches our learning setting well. However, the approach assumes precise knowledge of the transition probabilities and, even more importantly, heavily relies on the knowledge of MECs. Indeed, it decomposes the mean-payoff computation according to Eq. 2 into computing the mean payoff within MECs and optimizing (weighted) reachability of the MECs (with weights being their mean payoffs). When the MECs are unknown, neither of these two steps can be executed.

Secondly, [3] provides an efficient way of learning reachability probabilities (in the greybox and blackbox settings). Unfortunately, since it considers the TP-inconfidence to be the same for all transitions, causing different TP-imprecisions, the use of the robustness theorem in [3] makes the algorithm used there practically unusable in many cases. On a positive note, the work identifies the notion of \(\delta _{TP}\)-sure EC, which reflects how confident we are, based on the simulations so far, that a set of states is an EC. This notion will be crucial also in our algorithm.

Both approaches are based on “bounded value iteration”, which computes at any moment of time both a lower and an upper bound on the value that we are approximating (mean payoff or reachability, respectively). This yields anytime algorithms with known imprecision, the latter—being a learning algorithm on an incompletely known MDP—only with some confidence. Note that the upper bound converges only because ECs are identified and either collapsed (in the former) or deflated [24] (in the latter), meaning their upper bounds are decreased in a particular way to ensure correctness.

Our algorithm on (discrete-time) MDP \(\mathcal {M}\) performs, essentially, the following. It simulates \(\mathcal {M}\) in a similar way as [3]. With each visit of each state, it not only updates the model (includes this transition and improves the estimate of the outgoing transition probabilities), but also updates the estimate of the mean payoff by a Bellman update. Besides, at every moment of time, the current model yields a hypothesis about what the actual MECs of \(\mathcal {M}\) are, together with the respective confidence. While we perform the Bellman updates on all visited states deemed transient, the states deemed to be in MECs are updated separately, as in [2]. However, in contrast to [2], where every MEC is fully known and can thus be collapsed, and in contrast to the “bounded” quotient of [3] (see Appendix A of [1]), we instead introduce a special action stay in each state of each such MEC, which simulates staying in the (not fully known) MEC and obtaining its mean-payoff estimate via reachability:

Definition 4

(stay-augmented MDP). Let \(\mathcal {M}= (\mathsf {S}, s_\textsf {init}, \mathsf {Act}, \mathsf {Av}, \mathbb {T}, r)\) be an MDP and \(l,u: \mathsf {MEC}(\mathcal {M}) \rightarrow [0,1]\) be real functions on MECs. We augment \(\mathcal {M}\) with the action stay to obtain \({\mathcal {M}}^\prime = ({\mathsf {S}}^\prime , s_\textsf {init}, {\mathsf {Act}}^\prime , {\mathsf {Av}}^\prime , {\mathbb {T}}^\prime , {r}^\prime )\), where

  • \({\mathsf {S}}^\prime = \mathsf {S}\uplus \{s_+, s_-, s_?\}\),

  • \({\mathsf {Act}}^\prime = \mathsf {Act}\uplus \{\mathsf{stay}\}\),

  • \({\mathsf {Av}}^\prime (s) = {\left\{ \begin{array}{ll} \mathsf {Av}(s)&{} \text {for }s\in S\setminus \bigcup \mathsf {MEC}(\mathcal {M})\\ \mathsf {Av}(s)\cup \{\mathsf{stay}\}&{} \text {for }s\in \bigcup \mathsf {MEC}(\mathcal {M})\\ \{\mathsf{stay}\}&{} \text {for } s \in \{s_+, s_-, s_?\} \end{array}\right. }\)

  • \({\mathbb {T}}^\prime \) extends \(\mathbb {T}\) by \({\mathbb {T}}^\prime (s, \mathsf{stay}) = \{s_+ \mapsto l(M),\ s_- \mapsto 1 - u(M),\ s_? \mapsto u(M) - l(M)\}\) for \(s \in M\in \mathsf {MEC}(\mathcal {M})\) and by \({\mathbb {T}}^\prime (s,\mathsf{stay},s)=1\) for \(s \in \{s_+, s_-, s_?\}\).

  • \({r}^\prime \) extends \({r}\) by \({r}^\prime (s_+) = {r}^\prime (s_?) = {r}^\prime (s_-) = 0\).

Corollary 1

If \(l,u\) are valid lower and upper bounds on the mean payoff within MECs of \(\mathcal {M}\), then \(\max _\sigma \mathbb P^{M^\sigma }[\Diamond \{s_+\}] \leqslant v(s_\textsf {init})\leqslant \max _\sigma \mathbb P^{M^\sigma }[\Diamond \{s_+,s_?\}]\), where \(\max _\sigma \mathbb P^{M^\sigma }[\Diamond S]\) gives the maximum probability of reaching some state in S over all strategies.

This turns the problem into reachability, and thus allows for deflating (defined for reachability in [3]) and an algorithm combining [3] and [2]. The details are explained in the next section. To summarize, (D1) and (D2) are eliminated by not requiring uniform TP-imprecisions; (D3) and (D4) are eliminated via updating lower and upper bounds (using deflating) instead of using the robustness theorem.
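For illustration, a minimal sketch of the construction of Definition 4 is given below; the transition function is assumed to be stored as a nested dictionary T[s][a][t] and the MECs are given together with their current bounds l, u (illustrative names only):

```python
def augment_with_stay(T, mec_bounds):
    """Add the stay action encoding the current MEC bounds (cf. Definition 4).

    T          : dict  s -> a -> {t: probability}
    mec_bounds : list of (states_of_mec, l, u) with 0 <= l <= u <= 1
    """
    T_aug = {s: dict(actions) for s, actions in T.items()}
    for terminal in ("s+", "s-", "s?"):
        T_aug[terminal] = {"stay": {terminal: 1.0}}   # absorbing terminal states
    for states, l, u in mec_bounds:
        for s in states:
            # staying in the MEC is simulated by moving to s+ with probability l,
            # to s- with probability 1 - u, and to the undecided state s? with u - l
            T_aug[s]["stay"] = {"s+": l, "s-": 1.0 - u, "s?": u - l}
    return T_aug
```

The rewards of the three terminal states are 0; maximizing the probability of reaching \(s_+\) (resp. \(\{s_+, s_?\}\)) then under- (resp. over-) approximates the mean payoff, as stated in Corollary 1.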

Concerning CTMDP, in Sect. 5 we develop a confidence interval computation for the rates. Further, we design an algorithm deriving the MP-imprecision resulting from the rate imprecisions, that acts directly on the CTMDP and not on the embedded MDP of the uniformization. This effectively removes (D5).

4 Algorithm for Discrete-Time MDP

Now that we explained the difficulties of a naïve approach, and the concepts from literature together with novel ideas to overcome them, we describe the actual algorithm for the discrete-time setting. Following a general outline of the algorithm, we give detailed explanations behind the components and provide the statistical guarantees the algorithm gives. Detailed pseudocode of the algorithms for this section is provided in Appendix B of [1].

Overall Algorithm and Details. Our version of on-demand value iteration for mean payoff in a blackbox MDP is outlined in Algorithm 1. Initially, the input MDP \(\mathcal {M}\) is augmented with the terminal states \(\{s_+, s_-, s_?\}\) to obtain the stay-augmented MDP \({\mathcal {M}}^\prime \). We learn a stay-augmented MDP \({\mathcal {M}}^\prime = ({\mathsf {S}}^\prime , s_\textsf {init}, {\mathsf {Act}}^\prime , {\mathsf {Av}}^\prime , {\mathbb {T}}^\prime , {r}^\prime )\) by collecting samples through several simulation runs (Lines 5-8). Over the course of the algorithm, we identify \(\delta _{TP}\)-sure MECs (Line 13) and gradually increase the precision on their respective values (Lines 9-11). As stated earlier, these simulations are biased towards actions that lead to MECs potentially having higher rewards. Values of MECs are encoded using the \(\mathsf {stay}\) action (Line 12) and propagated throughout the model using bounded value iteration (Lines 14-19). In Line 14, we reinitialize the values of the states in the partial model, since new MECs may be identified and existing MECs may change. Finally, we claim that the probability estimates \({\mathbb {T}}^\prime \) are correct with confidence \(1-\delta _{MP}\), and if the bounds on the value are precise enough, we terminate the algorithm. Otherwise, we repeat the overall process with improved bounds (Line 20).

Simulation. The \(\mathsf {SIMULATE}\) function simulates a run over the input blackbox MDP \(\mathcal {M}\) and returns the visited states in order. The simulation of \(\mathcal {M}'\) is executed by simulating \(\mathcal {M}\) together with a random choice if the action stay is taken. Consequently, a simulation starts from \(s_\textsf {init}\) and ends at one of the terminal states \(\{s_+, s_-, s_?\}\). During simulation, we enhance our estimate of \({\mathcal {M}}^\prime \) by visiting new states, exploring new actions and improving our estimate of \({\mathbb {T}}^\prime \) with more samples. When states are visited for the first time, actions are chosen at random, and subsequently, actions with a higher potential reward are chosen. If a simulation is stuck in a loop, we check for the presence of a \(\delta _{TP}\)-sure MEC. If one is found, we add a stay action with \(l,u = 0,1\); otherwise we keep simulating until the required confidence is achieved. After that, we take the action leaving the MEC with the highest upper bound to continue the simulation. We perform several such simulations to build a large enough model before doing value iteration in the next steps.
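A much simplified sketch of one such simulation run is shown below (the full pseudocode is in Appendix B of [1]); the model object and all of its methods, as well as the loop-detection counter, are illustrative placeholders:

```python
def simulate(mdp, model, revisit_threshold):
    """One simulation run on the stay-augmented blackbox MDP; returns the visited states."""
    path, visits = [], {}
    s = mdp.s_init
    while s not in ("s+", "s-", "s?"):
        path.append(s)
        visits[s] = visits.get(s, 0) + 1
        model.explore(s)                              # add s to the partial model if new
        a = model.best_action(s)                      # highest upper bound; random if unseen
        if visits[s] > revisit_threshold:             # the run looks stuck in a loop
            mec = model.find_delta_tp_sure_mec(s)     # check for a delta_TP-sure MEC
            if mec is not None:
                model.add_stay_action(mec, l=0.0, u=1.0)
                a = model.best_leaving_action(mec)    # leave the MEC via the best upper bound
        t = model.sample_stay(s) if a == "stay" else mdp.step(s, a)
        model.record(s, a, t)                         # update the counters #(s, a), #(s, a, t)
        s = t
    return path
```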


Estimating Transition Probabilities. [3] gives an analysis to estimate bounds on transition probabilities for reachability objective in MDPs. For completeness, we briefly restate it here. Given an MP-inconfidence \(\delta _{MP}\), we distribute the inconfidence over all individual transitions as

$$\begin{aligned} \delta _{TP}:= \dfrac{\delta _{MP}\cdot p_{\mathsf {min}}}{|\{(s,a) \mid s \in {\mathsf {S}}^\prime \wedge a \in {\mathsf {Av}}^\prime (s) \}|}, \end{aligned}$$

where \(\frac{1}{p_{\mathsf {min}}}\) gives an upper bound on the maximum number of possible successors for an available action from a state. Hoeffding's inequality gives us a bound on the number of times an action a needs to be sampled from state s, denoted \(\#(s,a)\), to achieve a TP-imprecision \(\varepsilon _{TP}\geqslant \sqrt{\dfrac{\ln \delta _{TP}}{-2\#(s, a)}}\) on \(\mathbb {T}(s,a,t)\), such that

$$\begin{aligned} \widehat{\mathbb {T}}(s,a,t) := \max (0, \dfrac{\#(s, a, t)}{\#(s, a)} - \varepsilon _{TP}) \end{aligned}$$

where \(\#(s, a, t)\) is the number of times t is sampled when action a is chosen in s.

Updating Mean-Payoff Values. Using \(\widehat{\mathbb {T}}(s,a,t)\), we compute estimates of the upper and lower bounds of the values corresponding to every action from a state visited in the partial model constructed so far. We use the following modified Bellman equations [3]:

$$\begin{aligned} \widehat{\mathsf {L}}(s,a)&:= \sum \limits _{t:\#(s,a,t)>0} \widehat{\mathbb {T}}(s,a,t) \cdot \mathsf {L}(t)\\ \widehat{\mathsf {U}}(s,a)&:= \sum \limits _{t:\#(s,a,t)>0} \widehat{\mathbb {T}}(s,a,t) \cdot \mathsf {U}(t) + \Big (1-\sum \limits _{t:\#(s,a,t)>0} \widehat{\mathbb {T}}(s,a,t)\Big ), \end{aligned}$$

where \(\mathsf {L}(t) = \max \limits _{{a} \in \mathsf {Av}(t)} \widehat{\mathsf {L}}(t,a)\) and \(\mathsf {U}(t) = \max \limits _{{a} \in \mathsf {Av}(t)}\widehat{\mathsf {U}}(t,a)\) are the lower and upper bounds on the value v(t) of a state t. When a state s is discovered for the first time during the simulation and is added to the partial model, we initialize \(\mathsf {L}(s)\) and \(\mathsf {U}(s)\) to 0 and 1, respectively. Note that \(\sum \limits _{t:\#(s,a,t) > 0} \widehat{\mathbb {T}}(s,a,t) < 1\). We attribute the remaining probability to unseen successors and assume their value to be 0 (1) to safely under-(over-)approximate the lower (upper) bounds. We call these blackbox Bellman update equations, since they account for the possibility that not all successors of a state-action pair have been visited.
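A small sketch of these computations is given below (the counters for #(s, a) and #(s, a, t) and the dictionaries L, U holding the current bounds of the seen successors are illustrative data structures):

```python
import math

def tp_imprecision(count_sa, delta_tp):
    """Hoeffding width eps_TP for an action sampled #(s, a) = count_sa times."""
    return math.sqrt(math.log(delta_tp) / (-2.0 * count_sa))

def blackbox_update(counts_sat, count_sa, L, U, delta_tp):
    """One blackbox Bellman update for a state-action pair (s, a).

    counts_sat maps each seen successor t to #(s, a, t); L and U map each seen
    successor to its current lower/upper value bound.  Returns (L_hat, U_hat)."""
    eps = tp_imprecision(count_sa, delta_tp)
    T_hat = {t: max(0.0, c / count_sa - eps) for t, c in counts_sat.items()}
    seen_mass = sum(T_hat.values())
    lower = sum(p * L[t] for t, p in T_hat.items())                      # unseen mass -> value 0
    upper = sum(p * U[t] for t, p in T_hat.items()) + (1.0 - seen_mass)  # unseen mass -> value 1
    return lower, upper
```

The greybox update equations used inside \(\delta _{TP}\)-sure MECs below differ only in where the remaining mass \(1-\sum \widehat{\mathbb {T}}\) is sent: to the minimal (resp. maximal) value among the seen successors instead of 0 (resp. 1).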

Estimating Values of End-Components. End-components are identified with an inconfidence of \(\delta _{TP}\). As observed in [13], assuming an action has been sampled n times, the probability of missing a transition for that action is at most \((1-p_{\mathsf {min}})^n\). Thus, for identifying (T, A) as a \(\delta _{TP}\textit{-sure}\) MEC, every action in A that is available from a state \(s \in T\) needs to be sampled at least \(\frac{\ln \delta _{TP}}{\ln (1-p_{\mathsf {min}})}\) times.
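For instance (an illustrative calculation), with \(p_{\mathsf {min}} = 0.05\) and \(\delta _{TP}= 10^{-4}\), every such action needs to be sampled at least \(\ln (10^{-4})/\ln (0.95) \approx 180\) times.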

Once a \(\delta _{TP}\textit{-sure}\) MEC M is identified, we estimate its upper (\(v^u_M\)) and lower (\(v^l_M\)) bounds using value iteration. While running value iteration, we assume, with a small inconfidence, that there are no unseen outgoing transitions. So we use the following modified Bellman update equations inside the MEC, where we under-(over-)approximate the lower (upper) bound to a much lesser degree.

$$\begin{aligned} \widehat{\mathsf {L}}(s,a) := \sum \limits _{t:\#(s, a, t)>0} \widehat{\mathbb {T}}(s, a, t) \cdot \mathsf {L}(t) + \min \limits _{t:\#(s, a, t)>0}\mathsf {L}(t) \cdot (1-\sum \limits _{t:\#(s, a, t)>0} \widehat{\mathbb {T}}(s, a, t)) \end{aligned}$$
$$\begin{aligned} \widehat{\mathsf {U}}(s, a) := \sum \limits _{t:\#(s, a, t)>0} \widehat{\mathbb {T}}(s, a, t) \cdot \mathsf {U}(t) + \max \limits _{t:\#(s, a, t)>0}\mathsf {U}(t) \cdot (1-\sum \limits _{t:\#(s, a, t)>0} \widehat{\mathbb {T}}(s, a, t)) \end{aligned}$$

Following this assumption, we call these greybox (see Definition 3) Bellman update equations. The value iteration algorithm further gives us bounds on \(v^u_M\) and \(v^l_M\). We say that the upper estimate of \(v^u_M\) (\(\widehat{v}^{u}_M\)) and the lower estimate of \(v^l_M\) (\(\widehat{v}^{l}_M\)) are the overall upper and lower bounds of the mean-payoff value of M, respectively. To converge the overall bounds, we need value iteration to return more precise estimates of \(v^l_M\) and \(v^u_M\), and we need to sample the actions inside M many times to reduce the difference between \(v^l_M\) and \(v^u_M\). We call this procedure \(\mathsf {UPDATE\_MEC\_VALUE}\).

Now, some MECs may have very low values or may not be reachable from \(s_\textsf {init}\) with high probability. In such cases, no optimal strategy may visit these MECs, and it might not be efficient to obtain very precise mean-payoff values for every MEC that is identified in an MDP. We follow the on-demand heuristic [2] where we progressively increase the precision on mean-payoff values as an MEC seems more likely to be a part of an optimal strategy. The stay action on MECs helps in guiding simulation towards those MECs that have a higher lower bound of the mean-payoff value. In particular, whenever the simulation ends up in \(s_+\) or \(s_?\), we run \(\mathsf {UPDATE\_MEC\_VALUE}\) with higher precision on the MEC that led to these states. If the simulation ends up in these states through a particular MEC more often, it indicates that the MEC is likely to be a part of an optimal strategy, and it would be worth increasing the precision on its mean-payoff value.

Deflate Operation. Unlike in the computation of mean payoff for whitebox models [3], where an MEC is collapsed following the computation of its value, in blackbox learning we cannot collapse a set of states once it is identified as a \(\delta _{TP}\textit{-sure}\) MEC. This is because collapsing would prevent a proper future analysis of those states, which is undesirable in a blackbox setting. However, this leads to other problems. To illustrate this, we consider an MDP that has only a single MEC M and one outgoing action from every individual state. Recall from Eq. 2 that we compute the mean payoff by reducing it to a reachability problem. Once the mean payoff for the MEC and the probabilities corresponding to the stay action in Line 12 are computed, the upper and lower bounds of all states in the MEC are initialized to 1 and 0, respectively, in order to compute the reachability probability. Now suppose that the sum of the probabilities to \(s_+\) and \(s_?\) is p, so that the upper bound on the mean-payoff value is \(p \cdot r_{\max }\). Clearly, the upper bound on the reachability value of this MDP is p. Now, when we do BVI to calculate this value, from every state in M there would be at least two action choices, one that stays inside the MEC, and one that corresponds to the stay action. Initially, all states, except the terminal states, would have their upper and lower values set to 1 and 0, respectively. Thus, among the two action choices, one would have upper value p, while the other would have upper value 1, and hence the Bellman update sets the upper value of the state to 1. This would go on, the values would not converge, and hence the true mean-payoff value would not be propagated to the initial state of the MDP. To avoid this, we need the deflate operation, which lowers the upper reachability value to that of the best outgoing action, i.e. in this case, the stay action with value p.
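A minimal sketch of this operation is shown below (the maps of upper bounds per action and per state are illustrative data structures):

```python
def deflate(mec_states, actions_of, seen_successors, U_action, U_state):
    """Lower the upper bound of every state of a delta_TP-sure MEC to the best
    upper bound among actions leaving the MEC (the stay action always counts)."""
    def leaves(s, a):
        return a == "stay" or any(t not in mec_states for t in seen_successors(s, a))

    best_exit = max(U_action[(s, a)]
                    for s in mec_states
                    for a in actions_of(s)
                    if leaves(s, a))
    for s in mec_states:
        U_state[s] = min(U_state[s], best_exit)
```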

Statistical Guarantees. The following theorem shows that the mean-payoff value learnt by Algorithm 1 is PAC on an input blackbox MDP.

Theorem 1

Algorithm 1 has the property that when it stops, it returns an interval for the mean-payoff value of the MDP that is PAC for the given MP-inconfidence \(\delta _{MP}\) and the MP-imprecision \(\varepsilon _{MP}\).

Anytime Algorithm. As a direct consequence, we obtain an anytime algorithm from Algorithm 1 by (1) dropping the termination test on Line 20, i.e. replacing it with until false, and (2) upon query (or termination) by the user, we output \((\mathsf {U}(s_\textsf {init})+\mathsf {L}(s_\textsf {init}))/2\) as the estimate and, additionally, we output \((\mathsf {U}(s_\textsf {init})-\mathsf {L}(s_\textsf {init}))/2\) as the current imprecision.

Using Greybox Update Equations During Blackbox Learning. We also consider the variant where we use greybox update equations to estimate the mean-payoff values. However, assuming we keep the TP-imprecision unchanged, the overall TP-inconfidence now has to include the probability of missing some successor of a state s for an action a. Given a number of samples \(\#(s,a)\), the probability that we miss a particular successor is at most \((1-{p_{\mathsf {min}}})^{\#(s, a)}\), and hence the overall TP-inconfidence corresponding to using greybox equations for blackbox learning increases to \(\delta _{TP}+ (1-{p_{\mathsf {min}}})^{\#(s, a)}\).
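For instance (illustrative numbers), with \(p_{\mathsf {min}} = 0.05\) and \(\#(s,a) = 200\) samples, this additional term is \((0.95)^{200} \approx 3.5\cdot 10^{-5}\).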

We note that the use of greybox update equations for estimating the transition probabilities also gives us a PAC guarantee, but with an increased MP-inconfidence resulting from the increased TP-inconfidence.

5 Algorithm for Continuous-Time MDP

In this section, we describe an algorithm to learn blackbox CTMDP models for the mean-payoff objective while respecting the PAC guarantees. As in the case of MDPs, we reduce the mean-payoff problem to a reachability problem. We follow the same overall framework as for MDPs, where we compute the probability to reach the end-components under an optimal strategy, and we compute their respective mean-payoff values. Computing reachability probabilities in a CTMDP is the same as computing reachability probabilities in the underlying embedded MDP. Similar to estimating \(\mathbb {T}(s,a,t)\) in Sect. 4 for MDPs, we estimate \(\varDelta (s,a,t)\) for CTMDPs, and follow the simulation-based procedure in Algorithm 1 to compute reachability probabilities. However, unlike MECs in MDPs, where the mean-payoff value depends solely on the transition probabilities, the mean-payoff value in a CTMDP also depends on the rates \(\lambda (s,a)\) for \(s \in T\) and \(a \in A(s)\) for an MEC \(M=(T,A)\). Thus, to compute the value of an MEC, we also estimate the rates of the state-action pairs. Once we obtain the estimates of the rates, we uniformize the CTMDP to obtain a uniform CTMDP that can be treated as an MDP by disregarding the rates while preserving the mean-payoff value [30]. Detailed pseudocode of the algorithms for this section is provided in Appendix F of [1].

Estimating Rates. Recall that for an action a, the time spent in s is exponentially distributed with parameter \(\lambda (s,a)\), and \(\frac{1}{\lambda (s,a)}\) is the mean of this distribution. During the simulation of a CTMDP, for every state s reached and action a chosen from s, we construct a sequence \(\tau _{s,a}\) of the time differences between the entry into s and the corresponding exit from s when action a is chosen. Then, the average over the sequence \(\tau _{s,a}\) gives us an estimate \(\frac{1}{\widehat{\lambda }(s,a)}\) of \(\frac{1}{\lambda (s,a)}\) (abbreviated to \(\frac{1}{\widehat{\lambda }}\) and \(\frac{1}{\lambda }\) from now on when (s, a) is clear from the context).

Assuming a multiplicative error \(\alpha _R\) on our estimates of \(\frac{1}{\lambda }\), the lemma below uses Chernoff bounds to give the number of samples that need to be collected from an exponential distribution so that the estimated mean \(\frac{1}{\widehat{\lambda }}\) is at most an \(\alpha _R\)-fraction away from the actual mean \(\frac{1}{\lambda }\) with probability at least \(1-\delta _R\), where \(\alpha _R, \delta _R \in (0, 1)\). Further, by Cramér's theorem [15], it follows that this is the tightest possible bound on the number of samples to be collected.

Lemma 1

Let \(X_1, \dots , X_n\) be exponentially distributed i.i.d. random variables with mean \(\frac{1}{\lambda }\). Then we have that

$$ \mathbb P\Big [\Big |\frac{1}{\widehat{\lambda }} - \frac{1}{\lambda }\Big | \geqslant \frac{1}{\lambda } \cdot \alpha _R\Big ] \leqslant \displaystyle {\inf _{-\lambda< t < 0}} \Big (\frac{\lambda }{\lambda +t}\Big )^n \cdot e^{\frac{tn}{\lambda }(1+\alpha _R)} + \displaystyle {\inf _{ t > 0}} \Big (\frac{\lambda }{\lambda + t}\Big )^n \cdot e^{\frac{tn}{\lambda }(1-\alpha _R)}, $$

where \(\frac{1}{n}\sum _{i=1}^n X_i = \frac{1}{\widehat{\lambda }}\).

Assuming the right-hand side of the inequality is at most \(\delta _R\), we have that \(\lambda \in [\widehat{\lambda }(1-\alpha _R), \widehat{\lambda }(1+\alpha _R)]\), or equivalently \(\widehat{\lambda } \in [\frac{\lambda }{1+\alpha _R}, \frac{\lambda }{1-\alpha _R}]\), with probability at least \(1-\delta _R\). Table 1 shows the number of samples required for various values of \(\alpha _R\) and \(\delta _R\).

Table 1. Lookup table for number of samples based on \(\alpha _R\) and \(\delta _{R}\)
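Since the entries of Table 1 are not reproduced here, the following sketch computes such sample counts numerically from the bound of Lemma 1. It normalizes \(\lambda \) to 1 (the bound is invariant under rescaling t by \(\lambda \)); the grid and search ranges for t are illustrative choices and make the evaluated bound, and hence the returned n, slightly conservative:

```python
import math

def lemma1_bound(n, alpha, grid=20000):
    """Numerically evaluate the right-hand side of Lemma 1 for lambda = 1."""
    # first infimum: t ranges over (-1, 0)
    exp1 = min(n * (-math.log(1.0 + t) + t * (1.0 + alpha))
               for t in (-i / (grid + 1.0) for i in range(1, grid + 1)))
    # second infimum: t > 0 (the minimizer lies well below 10 for small alpha_R)
    exp2 = min(n * (-math.log(1.0 + t) + t * (1.0 - alpha))
               for t in (10.0 * i / grid for i in range(1, grid + 1)))
    return math.exp(exp1) + math.exp(exp2)

def samples_needed(alpha_r, delta_r):
    """Smallest n for which the Lemma 1 bound drops to at most delta_R."""
    n = 1
    while lemma1_bound(n, alpha_r) > delta_r:
        n *= 2
    lo, hi = n // 2, n
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if lemma1_bound(mid, alpha_r) <= delta_r:
            hi = mid
        else:
            lo = mid
    return hi
```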

Given a maximum multiplicative error \(\alpha _R\) on the mean of the exponential distributions of the state-action pairs in a CTMDP, we say that the rate \(\lambda \) is known \(\alpha _R\)-precisely if \(\widehat{\lambda } \in [\frac{\lambda }{1+\alpha _R}, \frac{\lambda }{1-\alpha _R}]\). We now quantify the bounds on the estimated mean-payoff value. Let \(\mathcal {M}\) be a CTMDP, \(v_{\mathcal {M}}\) be its actual mean-payoff value, and let \(\widehat{v}_{\mathcal {M}}\) denote its mean-payoff when the rates of the state-action pairs are known \(\alpha _R\)-precisely. Then we have the following.

Lemma 2

Given a CTMDP \(\mathcal {M}\) with rates known \(\alpha _R\)-precisely, with transition probabilities known precisely, and with maximum reward per unit time over all states \(r_{max}\), we have \(v_{\mathcal {M}}(\frac{1-\alpha _R}{1+\alpha _R}) \le \widehat{v}_{\mathcal {M}} \le v_{\mathcal {M}}(\frac{1+\alpha _R}{1-\alpha _R})\) and \(|\widehat{v}_{\mathcal {M}} - v_{\mathcal {M}}| \le r_{max} \frac{2 \alpha _R}{1 - \alpha _R}\).

Estimating Mean-Payoff Values of MECs. Using our bounds on the rates of the transitions, we now compute bounds on the mean-payoff values of MECs in CTMDPs. We first show that the mean payoff is maximized or minimized at the boundaries of the estimates of the rates. Intuitively, to maximise the mean-payoff value, for a state \(s_i\) with a high reward, we would like to maximise the time spent in \(s_i\) or equivalently, minimise the rate \(\lambda (s_i,a)\) for every outgoing action a from \(s_i\). We do the opposite when we want to find a lower bound on the mean-payoff value in the MEC. Consider an MEC M having states \(T=\{s_1, ..., s_m\}\). Assume that \(\lambda _i\) is the rate of an action a from state \(s_i\), such that a positional mean-payoff maximizing strategy \(\sigma \) chooses a from \(s_i\). Then, the expected mean-payoff value of M is given by,

$$\begin{aligned} v_{M} = \frac{\sum \limits _{s_{i} \in T} \frac{r\left( s_{i}\right) \pi _{i}}{\lambda _{i}}}{\sum \limits _{s_{i} \in T} \frac{\pi _{i}}{\lambda _{i}}}, \end{aligned}$$
(3)

where \(\pi _{i}\) denotes the long-run fraction of steps in which \(s_{i}\) is visited (i.e., the stationary distribution of the embedded DTMC) under \(\sigma \).

Now, we have estimates \(\frac{1}{\widehat{\lambda }_i}\) of \(\frac{1}{\lambda _i}\), such that \(\lambda _{i} \in \left[ \widehat{\lambda }_{i} \left( 1-\alpha _R\right) , \widehat{\lambda }_{i} \left( 1+\alpha _R\right) \right] \) with high probability. Let \(\lambda _{i}^{l}=\widehat{\lambda }_{i} \left( 1-\alpha _R\right) \) and \(\lambda _{i}^{u}=\widehat{\lambda }_{i} \left( 1+\alpha _R\right) \).

Proposition 1

In Eq. 3, the maximum and the minimum values of \(v_{M}\) occur at the boundaries of the estimates of \(\lambda _i\) for each \(1 \leqslant i \leqslant m\).

In particular, \(v_{M}\) is maximized when,

$$\begin{aligned} \lambda _{i} = {\left\{ \begin{array}{ll} \lambda _{i}^{l}, &{} \text {if}\ r(s_{i}) \ge v_{M} \\ \lambda _{i}^{u}, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(4)

Once we fix the rates for each of the states in M, we uniformize M to obtain a uniform CTMDP \(M_C\), which is an MEC and can be treated as an MDP for computing its mean-payoff value [30]. Let \(\lambda (s,a)\) be the rate of a state-action pair (s, a) and C the uniformization constant. For a successor t of s under action a such that \(t\ne s\), we have \(\varDelta (s,a,t)=\frac{\#(s,a,t)}{\#(s,a)} \cdot \frac{\lambda (s,a)}{C}\), and \(\varDelta (s,a,s)=1-\sum \limits _{t \ne s}\varDelta (s,a,t)\). Finally, value iteration on \(M_C\) with appropriate confidence widths gives us the lower and the upper estimates of the mean-payoff value of the MEC M.

We now describe an iterative procedure to identify those states of the MEC for which the upper bound on the estimate of the rate is assigned, and those states for which the lower bound is assigned, in order to maximize or minimize the mean-payoff value of the MEC. Assume w.l.o.g. that the states \(s_1, \dots , s_m\) are sorted in decreasing order of their rewards \(r(s_i)\). In iteration j, we set \(\lambda _i=\lambda _i^l\) for \(1 \leqslant i \leqslant j\), we set \(\lambda _i=\lambda _i^u\) for the remaining states, and we recompute \(v_M\). The maximum value of \(v_M\) across all iterations gives the upper bound on \(v_M\). Similarly, we can find the lower bound on \(v_{M}\). Overall, value iteration is done 2|T| times. A sketch of this search for a fixed positional strategy is given below.
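In the sketch, the strategy is fixed, so the MEC together with the strategy forms a CTMC; the stationary distribution \(\pi \) of the embedded DTMC does not depend on the rates and is assumed to be given (e.g., computed once from the estimated embedded transition matrix). Re-optimizing the strategy in each iteration by value iteration on the uniformized MEC, as the algorithm actually does, is omitted here:

```python
import math

def mec_value(rewards, pi, lambdas):
    """Eq. (3): time-weighted mean payoff of a CTMC with embedded stationary
    distribution pi and exit rates lambdas (lists indexed by state)."""
    num = sum(r * p / lam for r, p, lam in zip(rewards, pi, lambdas))
    den = sum(p / lam for p, lam in zip(pi, lambdas))
    return num / den

def mec_upper_bound(rewards, pi, lam_hat, alpha_r):
    """Maximum of Eq. (3) over the boundary assignments of the rate estimates.

    States are processed in decreasing order of reward; in iteration j the first
    j of them get the lower rate bound, the remaining ones the upper bound."""
    order = sorted(range(len(rewards)), key=lambda i: -rewards[i])
    low = [lh * (1.0 - alpha_r) for lh in lam_hat]
    high = [lh * (1.0 + alpha_r) for lh in lam_hat]
    best = -math.inf
    for j in range(len(rewards) + 1):
        lambdas = list(high)
        for i in order[:j]:
            lambdas[i] = low[i]
        best = max(best, mec_value(rewards, pi, lambdas))
    return best
```

The lower bound is obtained symmetrically, assigning the upper rate bound to the highest-reward states first and taking the minimum over the iterations.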

Overall Algorithm. As stated in the beginning of this section, an algorithm for computing the mean payoff in blackbox CTMDP models largely follows the same overall framework as stated in Sect. 4. By sampling the actions, we obtain estimates of the rates and the transition probabilities. The reachability probabilities to the MECs of the CTMDP are estimated using the estimates of the transition probabilities, while the mean-payoff values of MECs are estimated using uniformization as described above. The confidence widths on the transition probabilities in a uniformized MEC are assigned based on the number of samples \(\#(s,a)\) for a state-action pair (s, a).

Statistical Guarantees. Let \(\delta _{TP}\) and \(\delta _R\) be the TP-inconfidence and the inconfidence on individual transition rates, respectively. Further, let \(\delta _{MP1}\) and \(\delta _{MP2}\) be the overall inconfidence on the transition probabilities and on the transition rates, respectively. Then, \(\delta _{TP}:= \dfrac{\delta _{MP1}\cdot p_{\mathsf {min}}}{|\{(s,a) \mid s\in \widehat{S} \wedge a \in \mathsf {Av}(s) \}|}\), and \(\delta _R := \dfrac{\delta _{MP2}}{|\{(s,a) \mid s\in \widehat{S} \wedge a \in \mathsf {Av}(s) \}|}\). Thus, the overall inconfidence on the mean-payoff value is \(\delta _{MP} = \delta _{MP1}+\delta _{MP2}\). Hence, to achieve a given inconfidence on the mean-payoff value, we fix \(\delta _{TP}\) and \(\delta _R\), and adjust the imprecisions \(\varepsilon _{TP}\) and \(\alpha _R\) accordingly.

As in the case of MDPs, our learning algorithm for blackbox CTMDP models is an anytime algorithm that is PAC for the given MP-inconfidence \(\delta _{MP}\).

6 Experimental Results

We implemented our algorithms as an extension of Prism [27] and tested them on 15 MDP benchmarks and 10 CTMDP benchmarks. Several of these benchmarks were selected from the Quantitative Verification Benchmark Set [21]. The results for MDP and CTMDP blackbox learning are shown in Table 2 and Table 3, respectively. Here, we scale the upper and lower bounds to 1 and 0, and show the average values taken over 10 experiments. The experiments were run on a desktop machine with an Intel i5 3.2 GHz quad core processor and 16 GB RAM. The MP-imprecision \(\varepsilon _{MP}\) is set to \(10^{-2}\), the revisitThreshold k is set to 6, the MP-inconfidence \(\delta _{MP}\) is set to 0.1, and n is set to 10000. We further use a timeout of 30 minutes. In the case of a timeout, the reported upper and lower bounds on the mean payoff still correspond to the input MP-inconfidence \(\delta _{MP}\), although the MP-imprecision may not be the desired one.

Blackbox Learning for MDPs. We see in Table 2 that for blackbox learning, 9 out of 15 benchmarks converge well, such that the precision is within 0.1. In fact, for many of these 9 benchmarks, a precision of 0.1 is achieved well before the timeout (TO). In Fig. 1a and Fig. 1b, we show this for zeroconf and pacman. zeroconf has a large transient part and a lot of easily reachable single-state MECs. Since it has a true value of 1, the upper and the lower values converge after exploring only a few MECs. Our algorithm only needed to explore a very small percentage of the states to attain the input precision. cs_nfail has many significant MECs, and the learning algorithm needs to explore each of these MECs, while in sensor there is a relatively large MEC of around 30 states, and the simulation inside this MEC takes a considerable amount of time.

Table 2. Results on MDP benchmarks.

virus consists of a single large MEC of more than 800 states, and its true value is 0. As we simulate the MEC more and more, the TP-imprecision on the transition probabilities decreases, and the upper bound on the mean payoff reduces over time. ij10 contains one MEC with 10 states. The value converges faster and reaches a value of 1 during blackbox learning. This model has a relatively high number of actions (more than 5) for many of its states outside the MEC. This leads to a higher TP-imprecision. Further, due to the conservative nature of the blackbox update equations, the upper and the lower values converge very slowly.

consensus, ij10, ij3, pacman, wlan were used in [3] for learning policies for reachability objectives. The target states in these benchmarks are sink states with self-loops, and we add a reward of 1 to these target states so that the reachability probability becomes the same as the mean payoff. The mean-payoff results we observe are similar to the bounds reported for reachability probability in [3], and our experiments also take similar time as reported in [3].

The blackjack model [35] is similar to the zeroconf model. It has 3829 states and 2116 MECs. It has a large transient part and a lot of single-state MECs. However, unlike zeroconf, all of its MECs have a value of 0. Thus, simulation takes more time, as the TP-imprecision reduces slowly.

Blackbox Learning with Greybox Update Equations. We show the results of these experiments in the right side of Table 2. As observed, convergence is much faster here for all the benchmarks. All our benchmarks converged correctly within a few seconds to a few minutes. Hence, for a small degradation in MP-inconfidence, the use of greybox update equations works well in practice. We show the effect on MP-inconfidence in more detail in Table 8 in Appendix G of [1].

Table 3. Results on CTMDP benchmarks

Blackbox Learning for CTMDPs. In Table 3 we show the results for the CTMDP benchmarks. The number of states in these benchmarks varies from as low as 12 to more than 7000. All the models used here have a lot of small end-components. We observe that the upper and the lower values take more time to converge as the size of the model grows. Figure 1c and Fig. 1d show the convergence of the lower and upper bounds for QueuingSystem and SJS3. As in the case of MDPs, using greybox update equations speeds up the learning process significantly.

Fig. 1. Convergence of lower and upper bounds for blackbox update equations and greybox update equations.

Greybox Learning. Recall from Definition 3 that in greybox learning, for every state-action pair, we know the number of successors of the state for the given action. As expected, their convergence is much faster than that for blackbox learning, but the convergence is comparable to the case where we do blackbox learning with greybox update equations. The details of the greybox learning experiments can be found in Appendix G of [1].

7 Conclusion

We presented the first PAC SMC algorithm for computing mean payoff in unknown MDPs and CTMDPs, where the only information needed is a lower bound on the minimum transition probability, as advocated in [13]. In contrast to a naïve algorithm, which follows in a quite straightforward way from the literature, our algorithm is practically applicable, overcoming the astronomic number of simulation steps required. To this end, in particular, the inconfidence had to be distributed non-uniformly over the transitions and the imprecision then propagated by value iteration with precision guarantees. In the future, we would like to thoroughly analyse how well weakening the PAC bounds can be traded for a yet faster convergence. On the practical side, applying importance sampling and importance splitting could further improve the efficiency.