Global PAC Bounds for Learning Discrete Time Markov Chains

Learning models from observations of a system is a powerful tool with many applications. In this paper, we consider learning Discrete Time Markov Chains (DTMC), with different methods such as frequency estimation or Laplace smoothing. While models learnt with such methods converge asymptotically towards the exact system, a more practical question in the realm of trusted machine learning is how accurate a model learnt with a limited time budget is. Existing approaches provide bounds on how close the model is to the original system in terms of bounds on local (transition) probabilities, which has unclear implications for the global behavior. In this work, we provide global bounds on the error made by such a learning process, in terms of global behaviors formalized using temporal logic. More precisely, we propose a learning process ensuring a bound on the error in the probabilities of these properties. While such a learning process cannot exist for the full LTL logic, we provide one ensuring a bound that is uniform over all the formulas of CTL. Further, given one time-to-failure property, we provide an improved learning algorithm. Interestingly, frequency estimation is sufficient for the latter, while Laplace smoothing is needed to ensure non-trivial uniform bounds for the full CTL logic.


Introduction
Discrete-Time Markov Chains (DTMC) are commonly used in model checking to model the behavior of stochastic systems [3,4,7,26]. A DTMC is described by a set of states and transition probabilities between these states. The main issue with modeling stochastic systems using DTMCs is to obtain the transition probabilities. One appealing approach to overcome this issue is to observe the system and to automatically learn these transition probabilities [8,30], e.g., using frequency estimation or Laplace (or additive) smoothing [12].

All authors have contributed equally.
© The Author(s) 2020. S. K. Lahiri and C. Wang (Eds.): CAV 2020, LNCS 12225, pp. 304-326, 2020. https://doi.org/10.1007/978-3-030-53291-8_17

Frequency estimation works by observing a long run of the system and estimating each individual transition by its empirical frequency. However, in this case, the unseen transitions are estimated as zeros. Once the probability of a transition is set to zero, the probability to reach a state can be tremendously changed, e.g., from 1 to 0 if the probability of this transition in the system is small but non-zero. To overcome this problem, when the set of transitions with non-zero probability is known (but not their probabilities), Laplace smoothing assigns a positive probability to the unseen transitions, i.e., by adding a small quantity both to the numerator and the denominator of the estimate used in frequency estimation. Other smoothing methods exist, such as Good-Turing [15] and Kneser-Ney estimations [7], notably used in natural language processing. Although smoothing introduces estimation biases, all these methods converge asymptotically to the exact transition probabilities.
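To make the contrast between the two estimators concrete, here is a minimal sketch in Python (the function names and toy traces are ours, not the paper's): frequency estimation assigns probability zero to the unseen transition s1 → s2, while Laplace smoothing, given the known support, keeps it positive.

```python
from collections import Counter

def frequency_estimate(traces, states):
    """Frequency estimation: a_hat[i][j] = (times i->j seen) / (times i left).
    Unseen transitions get probability 0."""
    counts, visits = Counter(), Counter()
    for trace in traces:
        for i, j in zip(trace, trace[1:]):
            counts[(i, j)] += 1
            visits[i] += 1
    return {i: {j: counts[(i, j)] / visits[i] for j in states}
            for i in states if visits[i] > 0}

def laplace_estimate(traces, support, alpha=1.0):
    """Laplace (additive) smoothing: requires the known support of each state;
    unseen transitions inside the support keep a positive probability."""
    counts, visits = Counter(), Counter()
    for trace in traces:
        for i, j in zip(trace, trace[1:]):
            counts[(i, j)] += 1
            visits[i] += 1
    return {i: {j: (counts[(i, j)] + alpha) / (visits[i] + alpha * len(succ))
                for j in succ}
            for i, succ in support.items()}

# The (rare) transition s1 -> s2 was never observed in these traces:
traces = [["s1", "s3", "s1"], ["s1", "s3", "s1", "s3"]]
freq = frequency_estimate(traces, ["s1", "s2", "s3"])
lap = laplace_estimate(traces, {"s1": ["s2", "s3"]}, alpha=0.5)
print(freq["s1"]["s2"])  # 0.0: s2 wrongly looks unreachable from s1
print(lap["s1"]["s2"])   # 0.125: positive, as (0 + 0.5) / (3 + 2 * 0.5)
```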
In practice, however, there is often limited budget in observing and learning from the system, and the validity of the learned model is in question. In trusted machine learning, it is thus crucial to measure how the learned model differs from the original system and to provide practical guidelines (e.g., on the number of observations) to guarantee some control of their divergence.
Comparing two Markov processes is a common problem that relies on a notion of divergence. Most existing approaches focus on deviations between the probabilities of local transitions (e.g., [5,10,27]). However, a single deviation in a transition probability between the original system and the learned model may lead to large differences in their global behaviors, even when no transitions are overlooked, as shown in Example 1. For instance, the probability of reaching a certain state may be magnified by paths which go through the same deviated transition many times. It is thus important to use a measure that quantifies the differences over global behaviors, rather than simply checking whether the differences between the individual transition probabilities are low enough.
Technically, the knowledge of a lower bound on the transition probabilities is often assumed [1,14]. While it is a soft assumption in many cases, such as when all transition probabilities are large enough, it is less clear how to obtain such a lower bound in other cases, such as when a very unlikely transition exists (e.g., a very small error probability). We show how to handle this in several cases: learning a Markov chain accurate w.r.t. this error rate, or learning a Markov chain accurate over all its global behaviors, which is possible if we know the underlying structure of the system (e.g., because we designed it, although we do not know the precise transition probabilities which are governed by uncertain forces). For the latter, we define a new concept, namely conditioning of a DTMC.
In this work, we model global behaviors using temporal logics. We consider Linear Temporal Logic (LTL) [24] and Computation Tree Logic (CTL) [11]. Agreeing on all formulas of LTL means that the first-order behaviors of the system and the model are the same, while agreeing on CTL means that the system and the model are bisimilar [2]. Our goal is to provide stopping rules in the learning process of DTMCs that provide Probably Approximately Correct (PAC) bounds on the error in the probabilities of every property in the logic between the model and the system. In Sect. 2, we recall useful notions on DTMCs and PAC-learning. We point out related works in Sect. 3. Our main contributions are as follows:
- In Sect. 4, we show that it is impossible to learn a DTMC accurate for all LTL formulas, by adapting a result from [13].
- We provide in Sect. 6 a learning process bounding the difference in probability uniformly over all CTL properties. To do so, we use Laplace smoothing, and we provide a rationale on choosing the smoothing parameter.
- For the particular case of a time-to-failure property, notably used to compute the mean time between failures of critical systems (see e.g., [25]), we provide tighter bounds in Sect. 5, based on frequency estimation.
In Sect. 4, we formally state the problem and the specification that the learning process must fulfill. We also show our first contribution: the impossibility of learning a DTMC accurate for all LTL formulas. Nevertheless, we prove in Sect. 5 our second contribution: the existence of a global bound for time-to-failure properties, notably used to compute the mean time between failures of critical systems (see e.g., [25]), and provide an improved learning process, based on frequency estimation. In Sect. 6, we present our main contribution: a global bound guaranteeing that the original system and a model learned by Laplace smoothing have similar behaviors for all the formulas in CTL. We show that the error bound that we provide on the probabilities of properties is close to optimal. We evaluate our approach in Sect. 7 and conclude in Sect. 8.

Background
In this section, we introduce the notions and notations used throughout the paper. A stochastic system S is interpreted as a set of interacting components in which the state is determined randomly with respect to a global probability measure described below.

Definition 1 (Discrete-Time Markov Chains). A Discrete-Time Markov
Chain is a triple M = (S, μ, A) where:
- S is a finite set of states;
- μ : S → [0, 1] is an initial probability distribution over S;
- A : S × S → [0, 1] is a transition probability matrix, such that for every s ∈ S, Σ_{s'∈S} A(s, s') = 1.

We denote by m the cardinal of S and by A = (a_ij)_{1≤i,j≤m} = (A(i, j))_{1≤i,j≤m} the probability matrix. Figures 1 and 2 show the graphs of two DTMCs over 3 states {s_1, s_2, s_3} (with μ(s_1) = 1). A run is an infinite sequence ω = s_0 s_1 ⋯ and a path is a finite sequence ω = s_0 ⋯ s_l such that μ(s_0) > 0 and A(s_i, s_{i+1}) > 0 for all i, 0 ≤ i < l. The length |ω| of a path ω is its number of transitions.
The cylinder set of ω, denoted C(ω), consists of all the runs starting by a path ω. A Markov chain M underlies a probability space (Ω, F, P), where Ω is the set of all runs from M, F is the sigma-algebra generated by all the cylinders C(ω), and P is the unique probability measure [32] such that

P(C(s_0 ⋯ s_l)) = μ(s_0) Π_{i=0}^{l−1} A(s_i, s_{i+1}).

For simplicity, we assume a unique initial state s_0 and denote P(ω) = P(C(ω)). Finally, we sometimes use the notation P^A_i to emphasize that the probability distribution is parameterized by the probability matrix A and that the starting state is i.
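The probability of a cylinder set follows directly from this definition: it is the product of μ(s_0) with the traversed transition probabilities. A small illustration (our own hypothetical helper, with dictionaries standing for μ and A):

```python
def path_probability(mu, A, path):
    """P(C(s_0 ... s_l)) = mu(s_0) * prod_{i < l} A(s_i, s_{i+1})."""
    p = mu[path[0]]
    for i, j in zip(path, path[1:]):
        p *= A[i][j]
    return p

# A 3-state DTMC with a unique initial state s1 (mu(s1) = 1):
mu = {"s1": 1.0, "s2": 0.0, "s3": 0.0}
A = {"s1": {"s1": 0.0, "s2": 0.5, "s3": 0.5},
     "s2": {"s1": 0.0, "s2": 1.0, "s3": 0.0},
     "s3": {"s1": 1.0, "s2": 0.0, "s3": 0.0}}
print(path_probability(mu, A, ["s1", "s3", "s1", "s2"]))  # 1 * 0.5 * 1 * 0.5 = 0.25
```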

PAC-Learning for Properties
To analyze the behavior of a system, properties are specified in temporal logic (e.g., LTL or CTL, respectively introduced in [24] and [11]). Given a logic L and a property ϕ of L, decidable in finite time, we denote ω |= ϕ if a path ω satisfies ϕ. Let z : Ω × L → {0, 1} be the function that assigns 1 to a path ω if ω |= ϕ and 0 otherwise. In what follows, we assume that we have a procedure that draws a path ω with respect to P_A and outputs z(ω, ϕ). Further, we denote by γ(A, ϕ) the probability that a path drawn with respect to P_A satisfies ϕ. We omit the property or the matrix in the notation when it is clear from the context. Finally, note that the behavior of z(·, ϕ) can be modeled as a Bernoulli random variable Z_ϕ parameterized by the mean value γ(A, ϕ).
Probably Approximately Correct (PAC) learning [28] is a framework for the mathematical analysis of machine learning. Given ε > 0 and 0 < δ < 1, we say that a property ϕ of L is PAC-learnable if there is an algorithm A such that, given a sample of n paths drawn according to the procedure, with probability at least 1 − δ, A outputs in polynomial time (in 1/ε and 1/δ) an approximation of the average value of Z_ϕ close to its exact value, up to an error less than or equal to ε. Formally, ϕ is PAC-learnable if and only if A outputs an approximation γ̂ such that:

P(|γ(A, ϕ) − γ̂| > ε) ≤ δ.

Moreover, if the above statement for algorithm A is true for every property in L, we say that A is a PAC-learning algorithm for L.

Monte-Carlo Estimation and Algorithm of Chen
Given a sample W of n paths ω_1, ..., ω_n drawn according to P_A until ϕ is satisfied or violated (for ϕ such that, with probability 1, ϕ is eventually satisfied or violated), the crude Monte-Carlo estimator, denoted γ̂_W(A, ϕ), of the mean value of the random variable Z_ϕ is given by the empirical frequency:

γ̂_W(A, ϕ) = (1/n) Σ_{i=1}^n z(ω_i, ϕ).    (1)

The Okamoto inequality [23] (also called the Chernoff bound in the literature) is often used to guarantee that the probability that a Monte-Carlo estimator γ̂_W deviates from the exact value γ by more than ε > 0 is bounded by a predefined confidence parameter δ. However, several sequential algorithms have been recently proposed to guarantee the same confidence and accuracy with fewer samples.
In what follows, we use the Massart bound [22], implemented in the algorithm of Chen [6].

Theorem 1 (Chen bound). If n ≥ (2/ε²) [1/4 − (|γ̂_W − 1/2| − (2/3)ε)²] log(2/δ), then P(|γ(A, ϕ) − γ̂_W(A, ϕ)| > ε) ≤ δ.

To ease the readability, we write n_succ = Σ_{i=1}^n z(ω_i) and H(n, n_succ, ε, δ) = (2/ε²) [1/4 − (|n_succ/n − 1/2| − (2/3)ε)²] log(2/δ). When it is clear from the context, we only write H(n). Then, the algorithm A that stops sampling as soon as n ≥ H(n) and outputs a crude Monte-Carlo estimator for γ(A, ϕ) is a PAC-learning algorithm for ϕ. The condition over n is called the stopping criterion of the algorithm. As far as we know, this algorithm requires fewer samples than the other sequential algorithms (see e.g., [18]). Note that the estimation of a probability close to 1/2 likely requires more samples, since H(n) is maximized at γ̂_W = 1/2.
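The sequential procedure can be sketched as follows (our own Python sketch, using the formula of Theorem 1 as stated above; the sampler and function names are ours):

```python
import math, random

def chen_bound(n, n_succ, eps, delta):
    """H(n, n_succ, eps, delta), the Massart-based bound used by Chen's
    sequential algorithm (formula of Theorem 1)."""
    x = abs(n_succ / n - 0.5) - (2.0 / 3.0) * eps
    return (2.0 / eps ** 2) * (0.25 - x * x) * math.log(2.0 / delta)

def estimate(sample_z, eps, delta):
    """Draw z(omega, phi) in {0, 1} until n >= H(n); returns the crude
    Monte-Carlo estimator and the number of samples used."""
    n = n_succ = 0
    while n < 2 or n < chen_bound(n, n_succ, eps, delta):
        n_succ += sample_z()
        n += 1
    return n_succ / n, n

rng = random.Random(0)
gamma_hat, n = estimate(lambda: rng.random() < 0.3, eps=0.05, delta=0.05)
print(gamma_hat, n)  # estimate near 0.3; n is below the fixed Okamoto bound
```

Since H(n) peaks at γ̂_W = 1/2, the worst-case sample count of this stopping rule stays below the fixed-size Okamoto bound, and it is smaller still when the estimated probability is far from 1/2.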

Related Work
Our work shares similar statistical results (see Sect. 2.3) with Statistical Model Checking (SMC) [32]. However, the context and the outputs are different. SMC is a simulation-based approach that aims to estimate one probability for a given property [9,29], within acceptable margins of error and confidence [17,18,33]. A challenge in SMC is posed by unbounded properties (e.g., fairness), since the sampled executions are finite. Some algorithms have been proposed to handle unbounded properties, but they require knowledge of the minimal transition probability of the system [1,14], which we avoid. While this restriction is light in many contexts, such as when every state and transition appears with a sufficiently high probability, contexts where probabilities are unknown and some are very small seem much harder to handle. In the following, we propose two solutions not requiring this assumption. The first one is the closest to SMC: we learn a Markov chain accurate for a given time-to-failure property, and it does not require knowledge of the Markov chain. The second one is much more ambitious than SMC, as it learns a Markov chain accurate for all its global behaviors, formalized as all properties of a temporal logic; it needs the assumption that the set of transitions is known, but neither their probabilities nor a lower bound on them. This assumption may seem heavy, but it is reasonable for designers of systems, for which (a lower bound on) transition probabilities are not known (e.g., some error rate of components, etc.).
For comparison with SMC, our final output is the (approximated) transition matrix of a DTMC rather than one (approximated) probability of a given property. This learned DTMC can be used for different purposes, e.g., as a component in a bigger model or as a simulation tool. In terms of performance, we will show that we can learn a DTMC w.r.t. a given property with the same number of samples as we need to estimate this property using SMC (see Sect. 5). That is, there is no penalty in estimating a DTMC rather than estimating one probability, and we can scale as well as SMC. In terms of expressivity, we can handle unbounded properties (e.g., fairness properties). Even better, we can learn a DTMC accurate uniformly over a possibly infinite set of properties, e.g., all formulas of CTL. This is something SMC is not designed to achieve.
Other related work can be cited: In [13], the authors investigate several distances for the estimation of the difference between DTMCs. But they do not propose algorithms for learning. In [16], the authors propose to analyze the learned model a posteriori to test whether it has some good properties. If not, then they tweak the model in order to enforce these properties. Also, several PAC-learning algorithms have been proposed for the estimation of stochastic systems [5,10] but these works focus on local transitions instead of global properties.

Problem Statement
In this work, we are interested in learning a DTMC model of a stochastic system S such that the behaviors of the system and the model are similar. We assume that the original system is a DTMC parameterized by a matrix A of transition probabilities. The transition probabilities are unknown, but the set of states of the DTMC is assumed to be known.
Our goal is to provide a learning algorithm A that guarantees an accurate estimation of S with respect to certain global properties. For that, a sampling process is defined as follows. A path (i.e., a sequence of states from s_0) of S is observed, and at steps specified by the sampling process, a reset action is performed, setting S back to its initial state s_0. Then another path is generated. This process generates a set W of paths, called traces, used to learn a matrix Â_W. Formally, we want to provide a learning algorithm that guarantees the following specification:

P(D(A, Â_W) > ε) ≤ δ,    (2)

where ε > 0 and δ > 0 are respectively accuracy and confidence parameters, and D(A, Â_W) is a measure of the divergence between A and Â_W. There exist several ways to specify the divergence between two transition matrices, e.g., the Kullback-Leibler divergence [19] or a distance based on a matrix norm. However, the existing notions remain heuristic because they are based on the difference between the individual probabilistic transitions of the matrix. We argue that what matters in practice is often to quantify the similarity between the global behaviors of the system and the learned model.
In order to specify the behaviors of interest, we use a property ϕ or a set of properties Ψ on the set of states visited. We are interested in the difference between the probabilities of ϕ (i.e., the measure of the set of runs satisfying ϕ) with respect to A and Â_W. We want to ensure that this difference is less than some predefined ε with (high) probability 1 − δ. Hence, we define:

D_ϕ(A, Â_W) = |γ(A, ϕ) − γ(Â_W, ϕ)|  and  D_Ψ(A, Â_W) = sup_{ϕ∈Ψ} |γ(A, ϕ) − γ(Â_W, ϕ)|.

Our problem is to construct an algorithm which takes the following as inputs:
- a confidence δ, 0 < δ < 1,
- an absolute error ε > 0, and
- a property ϕ (or a set of properties Ψ),

and which provides a learning procedure sampling a set W of paths, outputs Â_W, and terminates the sampling procedure while fulfilling Specification (2), with D = D_ϕ (respectively D = D_Ψ). In what follows, we assume that the confidence level δ and absolute error ε are fixed. We first start with a negative result: if Ψ is the set of LTL formulas [2], such a learning process is impossible.

Theorem 2.
Given ε > 0, 0 < δ < 1, and a finite set W of paths randomly drawn with respect to a DTMC A, there is no learning strategy such that, for every LTL formula ϕ:

P(|γ(A, ϕ) − γ(Â_W, ϕ)| > ε) ≤ δ.

Note that, contrary to Theorem 1, the deviation in Theorem 2 is a difference between two exact probabilities (of the original system and of a learned model). The theorem holds as long as Â_W and A are not strictly equal, no matter how Â_W is learned. To prove this theorem, we show that, for any number of observations, we can always define a sequence of LTL properties that violates the specification above. It only exploits a single deviation in one transition. The proof, inspired by a result from [13], is given in the extended version.

Example 1. We show in this example that, in general, one needs some knowledge of the system in order to perform PAC learning: either a positive lower bound on the lowest transition probability, as in [1,14], or the support of the transitions (but no knowledge of their probabilities), as we use in Sect. 6. Further, we show that the latter assumption does not imply the former, as even if no transitions are overlooked, the error in some reachability property can be arbitrarily close to 0.5, even with an arbitrarily small error on the transition probabilities.
Let us consider the DTMCs A, Â, B in Fig. 3, and the formula F s_2 stating that s_2 is eventually reached. The probabilities to satisfy this formula in A, Â, B are respectively P_A(F s_2) = 1/2, P_Â(F s_2) = 1/2 − η/(4τ), and P_B(F s_2) = 0. Assume that A is the real system and that Â and B are DTMCs we learned from A. Obviously, one wants to avoid learning B from A, as the probability of F s_2 is very different in B and in Â (0 instead of 0.5). If one knows that τ > ℓ for some known lower bound ℓ > 0, then one can generate enough samples from s_1 to evaluate τ with an arbitrarily small error η/2 ≪ ℓ on the transition probabilities, with an arbitrarily high confidence, and in particular learn a DTMC similar to Â.
On the other hand, if one knows that there are transitions from s_1 to s_2 and to s_3, then one immediately does not learn DTMC B, but a DTMC similar to DTMC Â (using e.g. Laplace smoothing [12]). While this part is straightforward with this assumption, evaluating τ is much harder when one does not know a priori a lower bound ℓ > 0 such that τ > ℓ. That is very important: while one can make sure that the error η/2 on the transition probabilities is arbitrarily small, if τ is unknown, then it could be the case that τ is as small as η/(2(1 − ε)) > η/2, for a small ε > 0. This gives us P_Â(F s_2) = 1/2 − (1 − ε)/2 = ε/2, which is arbitrarily small, whereas P_A(F s_2) = 0.5, leading to a huge error in the probability to reach s_2. We work around that problem in Sect. 6 by defining and computing the conditioning of DTMC Â. In some particular cases, as the one discussed in the next section, one can avoid that altogether (actually, the conditioning in these cases is perfect (= 1), and it need not be computed explicitly).

Learning for a Time-to-failure Property
In this section, we focus on the property ϕ of reaching a failure state s_F from an initial state s_0 without re-passing by the initial state, which is often used for assessing the failure rate of a system and the mean time between failures (see e.g., [25]). We assume that, with probability 1, the runs eventually re-pass by s_0 or reach s_F. Also, without loss of generality, we assume that there is a unique failure state s_F in A. We denote by γ(A, ϕ) the probability, given DTMC A, of satisfying property ϕ, i.e., the probability of a failure between two visits of s_0.
Assume that the stochastic system S is observed from state s_0. Between two visits of s_0, property ϕ can be monitored. If s_F is observed between two instances of s_0, we say that the path ω = s_0 · ρ · s_F satisfies ϕ, with s_0, s_F ∉ ρ. Otherwise, if s_0 is visited again from s_0, then we say that the path ω = s_0 · ρ · s_0 violates ϕ, with s_0, s_F ∉ ρ. We call traces the paths of the form s_0 · ρ · s_F or s_0 · ρ · s_0, with s_0, s_F ∉ ρ. In the following, we show that it is sufficient to use a frequency estimator to learn a DTMC which provides a good approximation for such a property.
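Monitoring such traces is straightforward to sketch (the simulator and the toy system below are our own illustration, not the paper's):

```python
import random

def sample_trace(A, s0, sF, rng):
    """Observe the system from s0 until it revisits s0 (phi violated)
    or reaches the failure state sF (phi satisfied)."""
    trace, s = [s0], s0
    while True:
        succs, weights = zip(*A[s].items())
        s = rng.choices(succs, weights=weights)[0]
        trace.append(s)
        if s in (s0, sF):
            return trace, s == sF

# Toy system: from s0 go to m, from m fail with probability 0.01.
A = {"s0": {"m": 1.0}, "m": {"s0": 0.99, "sF": 0.01}}
rng = random.Random(1)
results = [sample_trace(A, "s0", "sF", rng) for _ in range(1000)]
gamma_hat = sum(fail for _, fail in results) / len(results)
print(gamma_hat)  # Monte-Carlo estimate of gamma(A, phi), close to 0.01
```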

Frequency Estimation of a DTMC
Given a set W of n traces, we denote by n^W_ij the number of times a transition from state i to state j has occurred, and by n^W_i the number of times a transition has been taken from state i.
The frequency estimator of A is the DTMC Â_W = (â_ij)_{1≤i,j≤m} given by â_ij = n^W_ij / n^W_i whenever n^W_i > 0. In other words, to learn Â_W, it suffices to count the number of times a transition from i to j occurred, and divide by the number of times state i has been observed. The matrix Â_W is trivially a DTMC, except for states i which have not been visited. In this case, one can set â_ij = 1/m for all states j and obtain a DTMC. This has no impact on the behavior of Â_W, as i is not reachable from s_0 in Â_W.
Let Â_W be the matrix learned using the frequency estimator from the set W of traces, and let A be the real probabilistic matrix of the original system S. We show that, in the case of time-to-failure properties, γ(Â_W, ϕ) is equal to the crude Monte-Carlo estimator γ̂_W(A, ϕ) induced by W.
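This equality can be checked numerically. The sketch below (our own illustration; the helper names are ours) learns Â_W by frequency estimation, computes γ(Â_W, ϕ) by fixpoint iteration, and compares it with the crude Monte-Carlo estimate γ̂_W(A, ϕ) on the same traces.

```python
import random

def learn_frequency(traces):
    """Frequency estimator: a_hat[i][j] = n_ij / n_i."""
    counts, visits = {}, {}
    for tr in traces:
        for i, j in zip(tr, tr[1:]):
            counts.setdefault(i, {})
            counts[i][j] = counts[i].get(j, 0) + 1
            visits[i] = visits.get(i, 0) + 1
    return {i: {j: c / visits[i] for j, c in row.items()}
            for i, row in counts.items()}

def prob_fail_before_return(A_hat, s0, sF, iters=10_000):
    """gamma(A_hat, phi): probability of reaching sF before re-passing
    by s0, computed by fixpoint iteration."""
    states = set(A_hat) | {j for row in A_hat.values() for j in row}
    p = {s: 0.0 for s in states}
    for _ in range(iters):
        p = {s: sum(q * (1.0 if j == sF else 0.0 if j == s0 else p[j])
                    for j, q in A_hat.get(s, {}).items())
             for s in states}
    # one step out of s0, then reach sF before coming back to s0
    return sum(q * (1.0 if j == sF else 0.0 if j == s0 else p[j])
               for j, q in A_hat[s0].items())

# True system: s0 -> a; from a: loop with 0.5, fail with 0.1, return with 0.4.
A = {"s0": {"a": 1.0}, "a": {"a": 0.5, "sF": 0.1, "s0": 0.4}}
rng = random.Random(42)
traces = []
for _ in range(500):
    tr, s = ["s0"], "s0"
    while True:
        s = rng.choices(list(A[s]), weights=list(A[s].values()))[0]
        tr.append(s)
        if s in ("s0", "sF"):
            break
    traces.append(tr)
gamma_mc = sum(tr[-1] == "sF" for tr in traces) / len(traces)
gamma_model = prob_fail_before_return(learn_frequency(traces), "s0", "sF")
print(gamma_mc, gamma_model)  # the two values coincide
```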

PAC Bounds for a Time-to-failure Property
We start by stating the main result of this section, bounding the error between γ(A, ϕ) and γ(Â_W, ϕ):

Theorem 3. Given a set W of n traces such that n ≥ H(n), we have:

P(|γ(A, ϕ) − γ(Â_W, ϕ)| > ε) ≤ δ,

where Â_W is the frequency estimator of A.
To prove Theorem 3, we first invoke Theorem 1 to establish: P(|γ(A, ϕ) − γ̂_W(A, ϕ)| > ε) ≤ δ. It remains to show that γ̂_W(A, ϕ) = γ(Â_W, ϕ):

Proposition 1. γ̂_W(A, ϕ) = γ(Â_W, ϕ).

It might be appealing to think that this result can be proved by induction on the size of the traces, mimicking the proof of computation of reachability probabilities by linear programming [2]. This is actually not the case. The remainder of this section is devoted to proving Proposition 1.
We first define q_W(u), the number of occurrences of the sequence u in the traces of W. Note that u can be a state, an individual transition, or even a path. We also use the following definitions in the proof.

Definition 2 (Equivalence).
The proof of Lemma 1 is provided in the extended version. In Lemma 1, (i) ensures that Â_W = Â_W̄ and (ii) ensures the equality between the proportion of runs of W passing by s and satisfying ϕ, denoted γ̂^s_W, and the probability of reaching s_F before s_0 starting from s with respect to Â_W. Formally:

Lemma 2. For all s ∈ S_1, γ̂^s_W = p_s.

Proof. Let S_0 be the set of states s with no path in Â_W from s to s_F without passing through s_0. For all s ∈ S_0, let p_s = 0. Also, let S_1 = S \ S_0. The system of Eq. (8), the standard linear system characterizing the probabilities of reaching s_F before s_0 in Â_W, admits a unique solution. The vector (P^{Â_W}_s(reach s_F before s_0))_{s∈S_1} is trivially a solution of (8). But, since W satisfies the conditions of Lemma 1, we also have that (γ̂^s_W)_{s∈S_1} is a solution of (8), and thus we have the desired equality.
Notice that Lemma 2 does not hold in general for an arbitrary set W. Applying it to s_0, we have: γ̂_W(A, ϕ) = γ̂^{s_0}_W = γ(Â_W, ϕ). That concludes the proof of Proposition 1. It shows that learning can be as efficient as statistical model checking on comparable properties.

Learning for the Full CTL Logic
In this section, we learn a DTMC Â_W such that Â_W and A have similar behaviors over all CTL formulas. This provides a much stronger result than for the time-to-failure property: e.g., properties can involve liveness and fairness, and more importantly they need not be known before the learning. Notice that PCTL [2] cannot be used, since an infinitesimal error on one non-zero probability can change the probability of a PCTL formula from 0 to 1. (State)-CTL is defined as follows:

Definition 3. Let Prop be the set of state names. (State)-CTL is defined by the following grammar:

ϕ ::= ⊤ | p | ¬ϕ | ϕ ∧ ϕ | AXϕ | EXϕ | AFϕ | EFϕ | AGϕ | EGϕ | A(ϕUϕ) | E(ϕUϕ),  with p ∈ Prop.

As we want to compute the probability of paths satisfying a CTL formula, we consider the set Ψ of path-CTL properties, that is, formulas ϕ of the form ϕ = Xϕ_1, ϕ = ϕ_1Uϕ_2, ϕ = Fϕ_1 or ϕ = Gϕ_1, with ϕ_1, ϕ_2 (state)-CTL formulas. For instance, the property considered in the previous section is (¬s_0)Us_F.
In this section, for the sake of simplicity, the finite set W of traces is obtained by observing paths until a state is seen twice on the path. Then the reset action is used and another trace is obtained from another path. That is, a trace ω from W is of the form ω = ρ · s · ρ′ · s, with ρ · s · ρ′ a loop-free path.
As explained in Example 1, some additional knowledge of the system is necessary. In this section, we assume that the support of the transition probabilities is known, i.e., for any state i, we know the set of states j such that a_ij ≠ 0. This assumption is needed both for Theorem 5 and to apply Laplace smoothing.

Learning DTMCs with Laplace Smoothing
Let α > 0. For any state i, let k_i be the number of successors of i, which we know by hypothesis, and let T = Σ_{i∈S} k_i be the number of non-zero transitions. Let W be a set of traces, n^W_ij the number of transitions from state i to state j, and n^W_i = Σ_j n^W_ij. The estimator for W with Laplace smoothing α is the DTMC Â^α_W = (â_ij)_{1≤i,j≤m} given for all i, j by:

â_ij = (n^W_ij + α) / (n^W_i + k_i α) if a_ij ≠ 0, and â_ij = 0 otherwise.

In comparison with the frequency estimator, Laplace smoothing adds, for each state i, a term α to the numerator and k_i times α to the denominator. This preserves the fact that Â^α_W is a Markov chain, and it ensures that â_ij = 0 iff a_ij = 0. In particular, compared with the frequency estimator, it avoids creating zeros in the probability tables.
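A minimal sketch of this estimator (our own function names; the counts and support are a toy example): on the known support, every transition keeps a positive probability and each row still sums to 1.

```python
def laplace_smoothing(counts, support, alpha):
    """Estimator A^alpha_W: on the known support of state i,
    a_hat[i][j] = (n_ij + alpha) / (n_i + k_i * alpha); 0 elsewhere."""
    A_hat = {}
    for i, succ in support.items():
        n_i = sum(counts.get((i, j), 0) for j in succ)
        A_hat[i] = {j: (counts.get((i, j), 0) + alpha) / (n_i + len(succ) * alpha)
                    for j in succ}
    return A_hat

support = {"s1": ["s2", "s3"], "s2": ["s1"], "s3": ["s1", "s3"]}
counts = {("s1", "s3"): 7, ("s2", "s1"): 4, ("s3", "s1"): 3, ("s3", "s3"): 1}
A_hat = laplace_smoothing(counts, support, alpha=0.5)
print(A_hat["s1"])  # the unseen s1 -> s2 gets (0 + 0.5) / (7 + 2 * 0.5) = 0.0625
```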

Conditioning and Probability Bounds
Using Laplace smoothing slightly changes the probability of each transition by an additive offset η. We now explain how this small error η impacts the error on the probability of a CTL property.
Let A be a DTMC, and let A_η be a DTMC such that A_η(i, j) = 0 iff A(i, j) = 0 for all states i, j, and such that Σ_j |A_η(i, j) − A(i, j)| ≤ η for all states i. For all states s ∈ S, let R(s) be the set of states i such that there exists a path from i to s, and let R*(s) = R(s) \ {s}. Since both DTMCs have the same support, R (and also R*) is the same for A and A_η. Given m the number of states, the conditioning of A for s ∈ S and ℓ ≤ m is:

Cond^ℓ_s(A) = min_{i∈R*(s)} P^A_i(F^{≤ℓ} ¬R*(s)),

i.e., the minimal probability, from a state i ∈ R*(s), to move away from R*(s) in at most ℓ steps. Let ℓ_s be the minimal value such that Cond^{ℓ_s}_s(A) > 0. This minimal ℓ_s exists, as Cond^m_s(A) > 0: for all s ∈ S and i ∈ R*(s), there is at least one path reaching s from i (this path leaves R*(s)), and taking a cycle-free path, we obtain a path of length at most m. Thus, the probability P^A_i(F^{≤m} ¬R*(s)) is at least the positive probability of the cylinder defined by this finite path. Formally:

Theorem 4. Denoting ϕ the property of reaching state s in DTMC A, we have:

|γ(A, ϕ) − γ(A_η, ϕ)| ≤ ℓ_s η / Cond^{ℓ_s}_s(A).

Proof. Let v_s be the stochastic vector with v_s(s) = 1, and denote v_0 = v_{s_0}. Let s ∈ S. We assume that s_0 ∈ R*(s) (else γ(A, ϕ) = γ(A_η, ϕ) and the result is trivial). Without loss of generality, we can also assume that A(s, s) = A_η(s, s) = 1 (as we are interested in reaching s at any step). With this assumption, γ(A, ϕ) = lim_{t→∞} v_0 · A^t · v_s and γ(A_η, ϕ) = lim_{t→∞} v_0 · A_η^t · v_s. We bound this error through bounding, by induction on t:

E(t) = max_{i∈R*(s)} |v_i · A^t · v_s − v_i · A_η^t · v_s|.

We then have trivially: |γ(A, ϕ) − γ(A_η, ϕ)| ≤ lim sup_{t→∞} E(t). Note that for i = s, lim_{t→∞} v_i · A^t · v_s = 1 = lim_{t→∞} v_i · A_η^t · v_s, and thus their difference is null.
Let t ∈ N. We let j ∈ R*(s) be such that E(t) = |v_j · A^t · v_s − v_j · A_η^t · v_s|. By the triangular inequality, introducing the term v_j · A^{ℓ_s} · A_η^{t−ℓ_s} · v_s, we have:

E(t) ≤ |v_j · A^{ℓ_s} · (A^{t−ℓ_s} − A_η^{t−ℓ_s}) · v_s| + |v_j · (A^{ℓ_s} − A_η^{ℓ_s}) · A_η^{t−ℓ_s} · v_s|.

We separate the vector (v_j · A^{ℓ_s}) = w_1 + w_2 + w_3 into three sub-stochastic vectors w_1, w_2, w_3: vector w_1 is over {s}, and thus we have w_1 · A_η^{t−ℓ_s} = w_1 = w_1 · A^{t−ℓ_s}, and the term cancels out. Vector w_2 is over states of R*(s), with Σ_{i∈R*(s)} w_2[i] ≤ (1 − Cond^{ℓ_s}_s(A)), and we obtain an inductive term ≤ (1 − Cond^{ℓ_s}_s(A)) E(t − ℓ_s). Last, vector w_3 is over states not in R(s), from which s cannot be reached under either A or A_η, and the term cancels out. We also obtain that |v_j · (A^{ℓ_s} − A_η^{ℓ_s}) · A_η^{t−ℓ_s} · v_s| ≤ ℓ_s η. It yields for all t ∈ N:

E(t) ≤ ℓ_s η + (1 − Cond^{ℓ_s}_s(A)) E(t − ℓ_s),

and hence lim sup_{t→∞} E(t) ≤ ℓ_s η / Cond^{ℓ_s}_s(A), which gives the bound of Theorem 4. We can extend this result from reachability to formulas of the form S_0US_F, where S_0, S_F are subsets of states. This formula means that we reach the set of states S_F through only states in S_0 on the way.
We define R(S_0, S_F) to be the set of states which can reach S_F using only states of S_0, and R*(S_0, S_F) = R(S_0, S_F) \ S_F. For ℓ ∈ N, we let:

Cond^ℓ_{S_0,S_F}(A) = min_{i∈R*(S_0,S_F)} P^A_i(F^{≤ℓ} ¬R*(S_0, S_F)).

Theorem 5. Denoting ϕ the property S_0US_F and ℓ a value such that Cond^ℓ_{S_0,S_F}(A) > 0, we have: |γ(A, ϕ) − γ(A_η, ϕ)| ≤ ℓ η / Cond^ℓ_{S_0,S_F}(A).

We can actually improve this conditioning: we defined it as the probability to reach S_F or S \ R(S, S_F). At the price of a more technical proof, we can obtain a better bound by replacing S_F by the set of states R_1(S_F) that have probability 1 to reach S_F. We let R̄*(S_F) = R(S, S_F) \ R_1(S_F) be the set of states that can reach S_F with probability < 1, and define the refined conditioning as follows:

C̄ond^ℓ_{S_F}(A) = min_{i∈R̄*(S_F)} P^A_i(F^{≤ℓ} ¬R̄*(S_F)).
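The conditioning is computable from the matrix alone: R*(s) is obtained by graph reachability on the support, and the probability of staying inside R*(s) for ℓ steps by iterating the substochastic restriction of A to R*(s). A sketch (our own helper names), checked on a Fig. 3-like chain where, from s_1, the states s_2 and s_3 are each reached with probability τ:

```python
def reach_set(A, s):
    """R(s): states from which s is reachable in the support graph."""
    preds = {i: {j for j in A if A[j].get(i, 0) > 0} for i in A}
    R, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for p in preds[u]:
            if p not in R:
                R.add(p)
                stack.append(p)
    return R

def conditioning(A, s, ell):
    """Cond^ell_s(A): minimal probability, over i in R*(s), of leaving
    R*(s) within ell steps."""
    R_star = reach_set(A, s) - {s}
    stay = {i: 1.0 for i in R_star}   # P_i(first 0 steps stay in R*(s))
    for _ in range(ell):
        stay = {i: sum(A[i].get(j, 0.0) * stay[j] for j in R_star)
                for i in R_star}
    return min((1.0 - stay[i] for i in R_star), default=1.0)

tau = 0.1
A = {"s1": {"s1": 1 - 2 * tau, "s2": tau, "s3": tau},
     "s2": {"s2": 1.0}, "s3": {"s3": 1.0}}
print(conditioning(A, "s2", 1))  # 2 * tau = 0.2
```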

Optimality of the Conditioning
We show now that the bound we provide in Theorem 4 is close to optimal. Consider again the DTMCs A, Â in Fig. 3 from Example 1, and the formula F s_2 stating that s_2 is eventually reached. The probabilities to satisfy this formula in A, Â are respectively P_A(F s_2) = 1/2 and P_Â(F s_2) = 1/2 − η/(4τ). Assume that A is the real system and that Â is the DTMC we learned from A.
As we do not know precisely the transition probabilities in A, we can only compute the conditioning on Â and not on A (it suffices to swap A and A_η in Theorems 4 and 5 to have the same formula using Cond(A_η) = Cond(Â)). We have R(s_2) = {s_1, s_2} and R*(s_2) = R̄*(s_2) = {s_1}. The probability to stay in R*(s_2) after ℓ_{s_2} = 1 step is (1 − 2τ), and thus Cond^1_{s_2}(Â) = 2τ, which yields the bound η/(2τ). Notice that on that example, using ℓ_{s_2} = m = 3, we obtain Cond^3_{s_2}(Â) = 1 − (1 − 2τ)^3 and a similar bound. Our upper bound only has an overhead factor of 2 with respect to the real error η/(4τ), even while the conditioning is particularly bad (small) in this example.
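The numbers of this example are easy to reproduce (our own sketch; the concrete values τ = 0.1 and η = 0.02 are assumed for illustration): the measured error is η/(4τ) = 0.05, while the bound ℓ η / Cond^1_{s_2} evaluates to 0.1.

```python
def prob_reach(A, src, target, iters=5_000):
    """Probability of eventually reaching `target` from `src`
    (fixpoint iteration, with `target` made absorbing)."""
    p = {s: (1.0 if s == target else 0.0) for s in A}
    for _ in range(iters):
        p = {s: (1.0 if s == target else
                 sum(q * p[j] for j, q in A[s].items()))
             for s in A}
    return p[src]

tau, eta = 0.1, 0.02
A     = {"s1": {"s1": 1 - 2 * tau, "s2": tau, "s3": tau},
         "s2": {"s2": 1.0}, "s3": {"s3": 1.0}}
A_eta = {"s1": {"s1": 1 - 2 * tau, "s2": tau - eta / 2, "s3": tau + eta / 2},
         "s2": {"s2": 1.0}, "s3": {"s3": 1.0}}
actual = abs(prob_reach(A, "s1", "s2") - prob_reach(A_eta, "s1", "s2"))
ell, cond = 1, 2 * tau              # ell_{s2} = 1, Cond^1_{s2} = 2 * tau
print(actual, ell * eta / cond)     # real error 0.05 vs. bound 0.1: factor 2
```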

PAC Bounds for Σ_j |Â_W(i, j) − A(i, j)| ≤ η
We use Theorem 1 in order to obtain PAC bounds, applying it to estimate individual transition probabilities rather than the probability of a property.
Let W be a set of traces drawn with respect to A such that every ω ∈ W is of the form ω = ρ · s · ρ′ · s. Recall that for all states i, j of S, n^W_i is the number of transitions originating from i in W and n^W_ij is the number of transitions from i to j in W. Let δ′ = δ/m_stoch, where m_stoch is the number of stochastic states, i.e., states with at least two outgoing transitions.
We want to sample traces until the empirical transition probabilities n^W_ij / n^W_i are close to the exact transition probabilities a_ij, for all i, j ∈ S. For that, we need to determine a stopping criterion over the number of state occurrences (n^W_i)_{1≤i≤m} such that:

P(∃i, j : |n^W_ij / n^W_i − a_ij| > ε) ≤ δ.

First, note that for any observed state i ∈ S, if a_ij = 0 (or a_ij = 1), then with probability 1, n^W_ij / n^W_i = a_ij. Let i ∈ S be a stochastic state. If we observe n^W_i transitions from i such that n^W_i ≥ H(n^W_i, n^W_ij, ε, δ′) for every j, then, according to Theorem 1, P(|n^W_ij / n^W_i − a_ij| > ε) ≤ δ′. Moreover, by the union bound over the m_stoch stochastic states, we have: P(∃i, j : |n^W_ij / n^W_i − a_ij| > ε) ≤ m_stoch δ′ = δ. In other words, the probability that there exists a state i ∈ S such that the deviation between the exact and empirical outgoing transitions from i exceeds ε is bounded by δ as soon as, for each state i ∈ S, n^W_i satisfies the stopping rule of the algorithm of Chen using ε and the corresponding δ′. This gives the hypothesis Σ_j |A_η(i, j) − A(i, j)| ≤ η of Sect. 6.2 for all states i.
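The per-state stopping rule can be sketched as follows (our own function names; H is the Chen/Massart bound of Theorem 1, and deterministic states are exempt as discussed above):

```python
import math

def H(n, n_succ, eps, delta):
    """Chen/Massart bound of Theorem 1, applied to one empirical frequency."""
    x = abs(n_succ / n - 0.5) - (2.0 / 3.0) * eps
    return (2.0 / eps ** 2) * (0.25 - x * x) * math.log(2.0 / delta)

def sampling_done(counts, visits, eps, delta):
    """Stopping rule: every stochastic state i (>= 2 successors) must satisfy
    n_i >= H(n_i, n_ij, eps, delta') for each observed successor j,
    with delta' = delta / m_stoch (union bound over stochastic states)."""
    stoch = [i for i in counts if len(counts[i]) >= 2]
    if not stoch:
        return True
    delta_p = delta / len(stoch)
    return all(visits[i] >= H(visits[i], n_ij, eps, delta_p)
               for i in stoch for n_ij in counts[i].values())

counts = {"s1": {"s2": 480, "s3": 520}, "s2": {"s1": 300}}
visits = {"s1": 1000, "s2": 300}
print(sampling_done(counts, visits, eps=0.05, delta=0.05))  # True
```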

A Matrix Â_W Accurate for all CTL Properties
We now use Laplace smoothing in order to ensure the other hypothesis: A_η(i, j) = 0 iff A(i, j) = 0 for all states i, j. For all i ∈ S, we define the Laplace offset α_i depending on the state i, where k_i is the number of transitions from state i, so that the error introduced by Laplace smoothing is at most one tenth of the statistical error. Let α = (α_i)_{1≤i≤m}. From the sample set W, we output the matrix Â^α_W = (â_ij)_{1≤i,j≤m} with Laplace smoothing α_i for state i, i.e.:

â_ij = (n^W_ij + α_i) / (n^W_i + k_i α_i) if a_ij ≠ 0, and â_ij = 0 otherwise.

It is easy to check that we then have, for all i, j ∈ S, |â_ij − n^W_ij / n^W_i| ≤ ε/10. Using the triangular inequality, |â_ij − a_ij| ≤ |â_ij − n^W_ij / n^W_i| + |n^W_ij / n^W_i − a_ij|, so with probability at least 1 − δ, |â_ij − a_ij| ≤ (11/10) ε for all i, j ∈ S. For all i ∈ S, let H*(n^W_i, ε, δ′) = max_{j∈S} H(n^W_i, n^W_ij, ε, δ′) be the maximal Chen bound over all the transitions from state i.
Since in Theorem 5 the original model and the learned one play symmetric roles, by applying this theorem to $\hat{A}^\alpha_W$ we obtain the following result. Proof. First, $\hat{a}_{ij} = 0$ iff $a_{ij} = 0$, by definition of $\hat{A}^\alpha_W$. Second, $P(\exists i, j \; |\hat{a}_{ij} - a_{ij}| > \frac{11}{10}\varepsilon) \leq \delta$. We can thus apply Theorem 5 to $\hat{A}^\alpha_W$ and $A$ and obtain (9) for any formula $\varphi$ of the form $S_1 \, U \, S_2$. It remains to show that for any formula $\varphi \in \Psi$, we can define $S_1, S_2 \subseteq S$ such that $\varphi$ can be expressed as $S_1 \, U \, S_2$.
The last case is $\varphi = G\varphi_1$, with $\varphi_1$ a CTL formula. Again, we define $S_1$ as the set of states satisfying $\varphi_1$, and $S_2$ as the set of states satisfying the CTL formula $AG\varphi_1$. The probability of the set of paths satisfying $\varphi = G\varphi_1$ is exactly the same as the probability of the set of paths satisfying $S_1 \, U \, S_2$.
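Once a CTL formula is reduced to an until over state sets, the probability $P(s \models S_1 \, U \, S_2)$ can be computed in a DTMC by the standard fixed-point characterization of probabilistic model checking. A minimal sketch by value iteration (a standard technique, not the paper's algorithm):

```python
def until_probabilities(P, S1, S2, iters=2000):
    """P(s |= S1 U S2) for each state s of a DTMC, by value iteration.
    P: dict state -> dict successor -> probability; S1, S2: sets of states.
    Least fixed point of: x_s = 1 if s in S2, 0 if s not in S1 u S2,
    and x_s = sum_t P[s][t] * x_t otherwise (approached from below)."""
    states = list(P)
    x = {s: (1.0 if s in S2 else 0.0) for s in states}
    for _ in range(iters):
        x = {s: (1.0 if s in S2 else
                 0.0 if s not in S1 else
                 sum(p * x[t] for t, p in P[s].items()))
             for s in states}
    return x
```

On the chain s1 → s2, s2 → s1 or s3 (probability 1/2 each), s3 absorbing, with S1 = {s1, s2} and S2 = {s3}, the iteration converges to probability 1 from s1, as expected for an almost-sure reachability.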

Algorithm
We now give more details about the learning process of a Markov chain accurate for every CTL formula. For completeness, we also provide in the extended version a similar algorithm for a time-to-failure property.
A path ω is observed from s_0 until some state is observed twice. Then ω is added to W and the reset operation is performed. We use Laplace smoothing to compute the matrix $\hat{A}^\alpha_W$ from the transition counts in W.
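The sampling loop described above can be sketched as follows, assuming the true transition distributions are available to the simulator (in practice they are provided by the system under observation):

```python
import random

def observe_trace(P, s0, rng=random):
    """Observe one path from s0 until some state occurs twice, then stop
    (the reset operation restarts the next observation).
    P: dict state -> dict successor -> probability."""
    trace, seen = [s0], {s0}
    s = s0
    while True:
        succs, probs = zip(*P[s].items())
        s = rng.choices(succs, weights=probs)[0]
        trace.append(s)
        if s in seen:          # a state observed twice: the trace is complete
            return trace
        seen.add(s)

def transition_counts(W):
    """n_ij: number of i -> j transitions over the whole sample set W."""
    counts = {}
    for trace in W:
        for i, j in zip(trace, trace[1:]):
            counts[(i, j)] = counts.get((i, j), 0) + 1
    return counts
```

The counts returned by `transition_counts` are exactly the $n^W_{ij}$ fed to the Laplace-smoothed estimate.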

Evaluation and Discussion
In this section, we first evaluate Algorithm 1 on 5 systems crafted to exercise the algorithm under different conditions (e.g., rare states). The objective of the evaluation is to give an idea of how many samples suffice for learning accurate DTMC estimations, and to compare learning for all properties of CTL with learning for one time-to-failure property.
Then, we evaluate our algorithm on very large PRISM systems (millions or billions of states). Because of the number of states, we cannot learn a DTMC accurate for all properties of CTL there: it would require visiting every single state a number of times. However, we can learn a DTMC for one specific (unbounded) property. We compare with a hypothesis testing algorithm from [31] which can handle the same unbounded property through a reachability analysis using the topology of the system.

Table 1. Average number of observed events N (and relative standard deviation in parentheses) given ε = 0.1 and δ = 0.05, for a time-to-failure property and for the full CTL logic using the refined conditioning Cond.

Evaluation on Crafted Models
We first describe the 5 systems. Systems 1 and 2 are three-state models described in Fig. 1 and Fig. 2. System 3 (resp. System 5) is a 30-state (resp. 200-state) clique in which every individual transition probability is 1/30 (resp. 1/200). System 4 is a 64-state system modeling failure and repair of 3 types of components (3 components of each type, 9 components in total); see the extended version for a full description of the system, including a PRISM [20] model for readers interested in investigating this system in detail. We tested time-to-failure properties by choosing as failure state s_3 for Systems 1, 2, 3 and 5, and the state where all 9 components fail for System 4.
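For reference, the clique systems are straightforward to reproduce; this sketch covers Systems 3 and 5 only (Systems 1, 2 and 4 are given in the figures and the extended version):

```python
def uniform_clique(m):
    """An m-state clique DTMC where every transition has probability 1/m,
    as in System 3 (m = 30) and System 5 (m = 200)."""
    p = 1.0 / m
    return [[p] * m for _ in range(m)]
```

Every state of such a clique is stochastic with m outgoing transitions, so System 5 alone already contains 40000 stochastic transitions, which is the regime discussed below.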
We also tested Algorithm 1 (for the full CTL logic) using the refined conditioning Cond. We ran our algorithms 100 times for each model, except for full CTL on System 4, which we tested only once since it is very time-consuming. We report our results in Table 1 for ε = 0.1 and δ = 0.05. In particular, we output for each model its number of states and transitions. For each (set of) properties, we provide the average number of observations (i.e., the number of samples times their average length) and the relative standard deviation (in parentheses), that is, the standard deviation divided by the average number of observed events.
The results show that we can learn a DTMC with more than 40000 stochastic transitions such that the DTMC is accurate for all CTL formulas. Notice that for some particular systems such as System 4, many events may need to be observed before Algorithm 1 terminates. The reason is the presence of rare states, such as the state where all 9 components fail, which are observed with an extremely small probability. In order to evaluate the probabilities of CTL properties of the form "if all 9 components fail, then CTL property ϕ is satisfied", this state needs to be explored many times, explaining the high number of events observed before the algorithm terminates. On the other hand, for properties that do not involve the 9 components failing as a premise, such as time-to-failure, one does not need to observe this state even once to conclude that it has an extremely small probability of happening. This suggests that efficient algorithms could be developed for subsets of CTL formulas, e.g., by defining a subset of important events to consider. We believe that Theorems 4 and 5 could be extended to handle such cases. Over different runs, the results stay similar (notice the rather small relative standard deviations).
Comparing results for time-to-failure (or equivalently SMC) and for the full CTL logic is interesting. Excluding System 4, which involves rare states, the number of events that needs to be observed for the full CTL logic is 4.3 to 7 times larger. Surprisingly, the highest difference is obtained on the smallest system, System 1. This is because every run of System 1 generated for time-to-failure is short (s_1 s_2 s_1 and s_1 s_2 s_3). However, in Systems 2, 3 and 5, samples for time-to-failure can be much longer, and the performance for time-to-failure (or equivalently SMC) is not so much better than for learning a DTMC accurate for all CTL properties.
For the systems we tested, the unoptimized Cond was particularly large (more than 20) because for many states s, there was probability 0 to leave R(s), and hence the corresponding conditioning term for s was quite large. These are the cases where the refined Cond is much more efficient, as we can then choose the constant for s to be 1 since the probability to reach s from states in R(s) is 1 (R_1(s) = R(s) and R*(s) = ∅). We used the refined Cond in our algorithm.
Finally, we evaluate experimental confidence by comparing the time-to-failure probabilities in the learned DTMC and in the original system. We repeat our algorithms 1000 times on Systems 1 and 2 (with ε = 0.1 and δ = 0.05). These probabilities differ by less than ε in 999 and 995 runs out of 1000, respectively. Specification (2) is thus largely fulfilled (the specification only requires 950 successes out of 1000 runs), which empirically endorses our approach. Hence, while our PAC bound over-approximates the confidence in the learned system (which is unavoidable), it is not that far from experimental values.
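The experiment in this paragraph can be sketched as follows, where `learn` and `prob_of` are hypothetical stand-ins for the learning procedure and for evaluating the time-to-failure probability on a model:

```python
def empirical_confidence(system, true_prob, learn, prob_of, eps=0.1, runs=1000):
    """Fraction of runs where the learned model's time-to-failure probability
    is within eps of the true one. `learn` and `prob_of` are illustrative
    stand-ins for the learning algorithm and the property evaluator."""
    hits = sum(
        1 for _ in range(runs)
        if abs(prob_of(learn(system)) - true_prob) <= eps
    )
    return hits / runs
```

The PAC guarantee states that this fraction is at least 1 − δ = 0.95; the measured fractions (0.999 and 0.995) exceed it, as reported above.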

Evaluation on Large Models
We also evaluated our algorithm on large PRISM models, ranging from hundreds of thousands to billions of states. With these numbers of states, we cannot use the more ambitious learning over all the properties of CTL, which would require visiting every state a number of times. However, we can use our algorithm for learning a DTMC which is accurate for a given (unbounded) property: it visits only a fraction of the states, which is enough to produce a model accurate for that property, with a well-learned kernel of states and some other states representative of the remainder of the runs. We consider three test cases from PRISM, namely herman, leader and egl, satisfying the property that the sampling stops with a conclusion (yes or no) with probability 1.

Table 2. Results for ε = 0.01 and δ = 0.001 of our algorithm compared with sampling with reachability analysis [31], as reported in [14], page 20. The numbers of samples needed by our method are given by the Massart bound (resp. by the Okamoto-Chernoff bound, in parentheses). TO and MO mean time out (> 15 minutes on an Opteron 6134) and memory out (> 5 GB), respectively.

Our prototype tool used in the previous subsection is implemented in Scilab: it cannot simulate the very large systems of PRISM. Instead, we use PRISM to generate the samples needed for the learning. Hence, we report the usual Okamoto-Chernoff bound on the number of samples, which is what is implemented in PRISM. We also compare with the Massart bound used by the Chen algorithm (see Sect. 2.2), which is implemented in our tool and is more efficient as it takes into account the probability of the property.
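To illustrate why a probability-aware bound needs fewer samples, the sketch below compares the p-independent Okamoto-Chernoff sample size with a variance-aware one; we use Bernstein's inequality as a simple stand-in for the Massart bound (which we do not reproduce here):

```python
import math

def okamoto_chernoff(eps, delta):
    """Sample size independent of p (the bound implemented in PRISM):
    n >= ln(2/delta) / (2 eps^2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def bernstein(p, eps, delta):
    """Variance-aware sample size from Bernstein's inequality:
    P(|p_hat - p| >= eps) <= 2 exp(-n eps^2 / (2 (p(1-p) + eps/3))),
    shown here to illustrate why p-dependent bounds such as Massart's
    need far fewer samples when p is far from 1/2."""
    var = p * (1.0 - p)
    return math.ceil(2.0 * (var + eps / 3.0) * math.log(2.0 / delta) / eps ** 2)
```

With ε = 0.01 and δ = 0.001 as in Table 2, the p-independent bound requires about 38000 samples regardless of the property; for a property of probability around 0.01 the variance-aware bound drops to roughly 2000 samples, while near p = 1/2 (the egl case) there is no gain.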
For each model, we report its parameters, its size (i.e., its number of states), the number of samples needed using the Massart bound (the conservative Okamoto-Chernoff bound is in parentheses), and the average path length. For comparison, we consider a hypothesis testing algorithm from [31] which can also handle unbounded properties. It uses knowledge of the topology to perform reachability analysis and stop the sampling once the property can no longer be satisfied. Hypothesis testing is used to decide with high confidence whether a probability exceeds a threshold or not. This requires fewer samples than SMC algorithms which estimate probabilities, but it is also less precise. We chose to compare with this algorithm because, as in our work, it does not require knowledge of the probabilities, such as a lower bound on the transition probabilities needed by e.g. [14]. We do not report runtimes as they cannot be compared (different platforms, different nature of results, etc.).
There are several conclusions we can draw from the experimental results (shown in Table 2). First, the number of samples from our algorithm (the Chen algorithm implementing the Massart bound) is larger than in the algorithm from [31]. This is because they perform hypothesis testing, which requires fewer samples than even estimating the probability of a property, while we learn a DTMC accurate for this property. For herman and leader, the difference is small (2.5x), because these are cases where the Massart bound is very efficient (80 times better than the Okamoto-Chernoff bound implemented in PRISM). The egl system is the worst case for the Massart bound (the probability of the property is 1/2), where it coincides with Okamoto-Chernoff; the difference with [31] is 40x in that case. Also, as shown by egl, paths in our algorithm can be somewhat longer than in the algorithm from [31], where they can be stopped early by the reachability analysis. However, the differences are never larger than 3x. On the other hand, we learn a model representative of the original system for a given property, while [31] only provides a yes/no answer to hypothesis testing (performing SMC to evaluate the probability of a property with the Massart bound would give exactly the same number of samples as we report for our learning algorithm). Last, the reachability analysis from [31] times out or runs out of memory on some complex systems, which is not the case with our algorithm.

Conclusion
In this paper, we provided theoretical grounds for obtaining global PAC bounds when learning a DTMC: we bound the error made between the behaviors of the model and of the system, formalized using temporal logics. While it is not possible to obtain such a learning framework for LTL properties, we provide one for the whole CTL logic. For subsets of CTL, e.g., for a fixed time-to-failure property, we obtain better bounds, as efficient as statistical model checking. Overall, this work should support the recent trend of establishing trusted machine learning [16].
Our techniques are useful for designers of systems whose probabilities are governed by uncertain forces (e.g., error rates): in this case, it is not easy to obtain a lower bound on the minimal transition probability, but we can assume that the set of transitions is known. Technically, our techniques provide a rationale for setting the constant in Laplace smoothing, otherwise left to an expert.
Some cases remain problematic, such as systems where some states are visited very rarely. Nevertheless, we foresee potential solutions involving rare event simulation [21]. This goes beyond the scope of this work and is left to future work.