1 Introduction

Grammatical inference (GI) (Higuera 2010), also known as grammar induction or grammar learning, is concerned with learning language specifications in the form of grammars or automata from data consisting of strings over some alphabet. Starting with Angluin’s seminal work (Angluin 1987), methods have been developed for learning deterministic, non-deterministic and probabilistic grammars and automata. The learning techniques in GI have been applied in many areas, such as speech recognition, software development, pattern recognition, and computational biology. In this paper we adapt the learning techniques in the GI area to learn models for model checking.

Model checking is a verification technique for determining whether a system model complies with a specification provided in a formal language (Baier and Katoen 2008). In the simplest case, system models are given by finite non-deterministic or probabilistic automata, but model-checking techniques have also been developed for more sophisticated system models, e.g., timed automata (Laroussinie et al. 1995; Bouyer et al. 2008, 2011). Powerful software tools available for model checking include UPPAAL (Behrmann et al. 2011) and PRISM (Kwiatkowska et al. 2011).

Traditionally, models used in model checking are constructed manually, either in the development phase as system designs, or for existing hard- or software systems from known specifications and documentation. This procedure can be both time-consuming and error-prone, especially for systems lacking updated and detailed documentation, such as legacy software, third-party components, and black-box systems. These difficulties are generally considered a hindrance to adopting otherwise powerful model-checking techniques, and have led to an increased interest in methods for data-driven model learning (or specification mining) for formal verification (Ammons et al. 2002; Sen et al. 2004a; Mao et al. 2011, 2012).

In this paper we investigate methods for learning deterministic probabilistic finite automata (DPFA) from data consisting of previously observed system behaviors, i.e., sample executions. The probabilistic models considered in this paper include labeled Markov decision processes (MDPs) and continuous-time labeled Markov chains (CTMCs), where the former model class also covers labeled Markov chains (LMCs) as a special case. Labeled Markov decision processes can be used to model reactive systems, where input actions are chosen non-deterministically and the resulting output for a given input action is determined probabilistically. Nondeterminism can model the free and unpredictable choices of an environment, or the concurrency between components in a system. MDPs, and by extension LMCs, are discrete-time models, where each transition takes a universal discrete time unit. CTMCs, on the other hand, are real-time models, where the time delays between transitions are determined probabilistically. We show how methods for learning DPFA (Carrasco and Oncina 1994, 1999; Higuera 2010) can be adapted for learning the above three model classes, and pose the results within a model checking context. We give consistency results for the learning algorithms, and we analyze both theoretically and experimentally how the convergence of the learned models relates to the convergence of system properties expressed in linear time logics.

We also compare the accuracy of model checking learned models with the accuracy of a statistical model checking approach, where probabilities of query properties are directly estimated from the empirical frequencies in the data. Our results here demonstrate a smoothing effect of model learning which can prevent overfitting, but may in some cases also lead to less accurate results compared to statistical model checking. Our results also indicate a significant advantage of model learning over statistical model checking for the amortized time complexity over multiple queries.

1.1 Related work

Work on learning finite automata models can first be divided into two broad categories: active learning following Angluin’s \(\mathbf {L}^*\) algorithm (Angluin 1987), and passive learning based on a state-merging procedure.

Active learning is based on the assumption that there exists a teacher or an oracle that answers membership and equivalence queries. Originally developed by Angluin (1987) for learning deterministic finite automata, \(\mathbf {L}^*\) has been generalized in many different ways that also include extensions to learning automata models with inputs and outputs, as well as probabilistic automata: in Bollig et al. (2010), \(\mathbf {L}^*\) is exploited to learn communicating finite-state machines by using a given set of positive and negative message sequence charts to answer the membership and equivalence queries. In Niese (2003), \(\mathbf {L}^*\) is adapted to learn deterministic Mealy machines. This work is further extended to learn deterministic I/O automata by placing a transducer between the teacher and the Mealy machine learner (Aarts and Vaandrager 2010). In Grinchtein et al. (2005, 2006), \(\mathbf {L}^*\) is adapted to learn deterministic event-recording automata, a subclass of real-time automata.

To learn probabilistic automata models, modified versions of \(\mathbf {L}^*\) have been proposed in which a membership query now asks for the probability of a given word in the target model (Tzeng 1992; de la Higuera and Oncina 2004; Feng et al. 2011). In Komuravelli et al. (2012), \(\mathbf {L}^*\) combined with a stochastic state-space partitioning algorithm makes it possible to learn nondeterministic labeled probabilistic transition systems from tree samples. Exact oracles for (classical or probabilistic) membership and equivalence queries are usually not available in practice and have to be approximated. For deterministic finite automata this has been implemented using a conformance testing sub-routine (Raffelt and Steffen 2006).

Passive learning methods that only require data consisting of observed system behaviors have been developed for probabilistic automata models (Carrasco and Oncina 1994; Ron et al. 1996). These approaches are based on iteratively merging candidate states. The approaches differ with respect to the strategy according to which candidate states are generated, and the criteria used for deciding whether to merge states. In algorithms following the paradigm of the Alergia algorithm (Carrasco and Oncina 1994), first a maximal, tree-shaped automaton is constructed, which is then iteratively reduced by recursive merge operations. The learning paradigm introduced by Ron et al. (1996), on the other hand, starts with a minimal automaton and successively refines it by expanding existing states with new candidate states. More important than these architectural differences, however, are differences in the criteria used for state merging. The most common approach is to use a statistical test for the equivalence of the distributions defined at the nodes (Carrasco and Oncina 1994; de la Higuera and Thollard 2000). For basic probabilistic automata only tests for the equivalence of binomial distributions are required, for which the use of the Hoeffding test is usually suggested. For timed automata models, this has been extended in Sen et al. (2004a) to also test the equivalence of two exponential distributions defining the delay times at the states. Thollard et al. (2000) provide the minimum divergence inference algorithm to control state merging: two nodes should be merged if the loss in likelihood can be compensated by the reduced complexity of the resulting model. Ron et al. (1998) base the state-merging decision on the existence of a distinguishing string, i.e., a string for which the difference of probability at the two candidate states exceeds a certain threshold. State-merging algorithms have also been extended to learn stochastic transducers (Oncina et al. 1993) and timed automata (Verwer 2010).

In a number of papers the convergence properties of learning algorithms have been studied. Carrasco and Oncina (1994), de la Higuera and Thollard (2000) and Sen et al. (2004a) give learning in the limit results, i.e., the unknown automaton is correctly identified in the limit of large sample sizes. Quantitative bounds on the speed of convergence in the form of PAC learnability results are given in Ron et al. (1996), Clark and Thollard (2004) and Castro and Gavaldà (2008).

The use of grammatical inference techniques for model construction in a verification context has been proposed in several papers (Cobleigh et al. 2003; Giannakopoulou and Păsăreanu 2005; Leucker 2007; Singh et al. 2010; Feng et al. 2011). These papers focus on active learning using variants of \(\mathbf {L}^*\), and only Feng et al. (2011) consider the probabilistic case.

Statistical model checking (SMC) (Sen et al. 2004b; Legay et al. 2010), or approximate model checking (Hérault et al. 2004), has an objective similar to that of model learning for verification. Instead of constructing a model from sample executions, one directly checks the empirical probabilities of properties in the data. Since the sample executions can only be finite strings, this approach is limited with respect to checking probabilities for unbounded properties.

1.2 Contribution and outline

Our work follows the Alergia paradigm and is closely linked to previous work (Carrasco and Oncina 1994; Sen et al. 2004a). We here do not introduce any major algorithmic novelties, but give an integrated account of learning system models that can also represent input/output behaviors and time delays. The novel aspect of this paper is a theoretical and experimental analysis of the feasibility of using the learned model for formal verification of temporal logic properties. We present theoretical results that, based on the convergence properties of Alergia-like algorithms, establish the convergence also of probability estimates for system properties of interest. An extensive empirical evaluation provides insight into the workings of the algorithm and demonstrates the feasibility of the learning approach for verification applications in practice. The evaluation also includes a detailed comparison of the learning approach with statistical model checking, considering both accuracy results and the time and space complexity for performing model checking. Finally, we provide a new detailed proof of the fundamental convergence results. While generally following the lines of argument pioneered in Carrasco and Oncina (1994), de la Higuera and Thollard (2000) and Sen et al. (2004a), our new proof contains the following improvements: it is cast in a very general framework, and accommodates in a uniform manner different classes of automata models, including input/output and timed automata. It is presented in a modular form that clearly identifies separate conditions for the algorithmic structure of the state merging procedure, for the statistical tests used for state-merging decisions, and for the data-generating process. The structure of the proof thereby facilitates the application of the convergence result to new learning scenarios. Since this general convergence analysis is somewhat independent from the rest of this paper, it is placed in a self-contained “Appendix”.

The paper is structured as follows: Sect. 2 presents background material. Section 3 describes the adapted Alergia algorithm for learning system models, and Sect. 4 analyzes the consistency and convergence properties of the learning algorithm. Section 5 provides empirical results on the behavior of the learning algorithm and demonstrates the use of the algorithm in a model checking context. The last section concludes the paper and outlines directions for future research. The “Appendix” contains our general convergence analysis. This paper is an extended version of Mao et al. (2011, 2012). Compared to these earlier conference publications, this paper significantly expands the theoretical analysis of the consistency aspects. It also includes a much more comprehensive experimental evaluation, in which the comparison against statistical model checking is added as a new dimension.

2 Preliminaries

2.1 Strings

We start by introducing the notion of strings that will be used throughout the paper.

  • Given a finite alphabet \(\varSigma \), we use \(\varSigma ^*\) and \(\varSigma ^{\omega }\) to denote the set of all finite and infinite strings over \(\varSigma \), respectively.

  • Given an infinite string \(s=\sigma _0\sigma _1 \ldots \in \varSigma ^{\omega }\) starting with the symbol \(\sigma _0\), \(s[j\ldots ] = \sigma _{j} \sigma _{j+1} \sigma _{j+2} \ldots \) is the suffix of s starting with the \((j+1)\)st symbol \(\sigma _j\), and \(\sigma _0 \sigma _1\ldots \sigma _j \in \varSigma ^{*}\) is a prefix of s.

  • Given an input alphabet \(\varSigma ^{\text {in}} \) and an output alphabet \(\varSigma ^{\text {out}} \), an infinite I/O string is denoted as \(\pi =\sigma _0\alpha _1\sigma _1\ldots \in \varSigma ^{\text {out}} \times (\varSigma ^{\text {in}} \times \varSigma ^{\text {out}})^{\omega }\), and \(\sigma _0 \alpha _1 \sigma _1 \ldots \alpha _n \sigma _n \in \varSigma ^{\text {out}} \times (\varSigma ^{\text {in}} \times \varSigma ^{\text {out}})^{*}\) is the prefix of \(\pi \) with \(2n+1\) alternating I/O symbols.

  • Given a finite string \(s=\sigma _0\sigma _1 \ldots \sigma _n\), we use \(\mathrm {prefix}(s)=\{\sigma _0 \ldots \sigma _j | 0\le j \le n\}\) to denote the set of all prefixes of string s. For a finite I/O string \(\pi =\sigma _0\alpha _1\sigma _1\ldots \alpha _n \sigma _n\), \(\mathrm {prefix}(\pi )=\{\sigma _0 \alpha _1\sigma _1\ldots \alpha _j \sigma _j | 0\le j \le n\}\). Given a set of finite strings \(S\), \(\mathrm {prefix}(S)\) denotes all prefixes of strings in S.

  • A timed string \(\rho =\sigma _0 t_0 \sigma _1 t_1\ldots \) includes the time delay \(t_i \in \mathbb R_{>0}\) between the observation of two consecutive symbols \(\sigma _i\) and \(\sigma _{i+1}\) in the string. Given a timed string \(\rho \), \(\rho [n] = \sigma _n\) is the (\(n+1\))th symbol of \(\rho \), \(\rho [n\ldots ] = \sigma _n t_n \sigma _{n+1} t_{n+1}\ldots \) is the suffix starting from the (\(n+1\))th symbol, \(\rho \langle n\rangle =t_n\) is the time spent between observing the symbols \(\sigma _n\) and \(\sigma _{n+1}\), and \(\rho @t\) is the suffix starting at time \(t\in \mathbb R_{>0}\), i.e., \(\rho @ t = \rho [n\ldots ] \), where n is the smallest index such that \(\sum \nolimits _{i = 0}^n {\rho \left\langle i \right\rangle } \ge t\). The skeleton of \(\rho \), denoted \(\mathbb S(\rho )\), is the string \(\sigma _0\sigma _1 \ldots \in \varSigma ^{\omega }\).
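These operations are mechanical but easy to get wrong by one index, so here is a minimal Python sketch that makes them concrete. The representation of a timed string as a list of (symbol, delay) pairs, and all function names, are our own choices, not notation from the paper:

```python
from itertools import accumulate

# A finite prefix of a timed string rho = s0 t0 s1 t1 ... as (symbol, delay) pairs.
rho = [("A", 1.8), ("B", 0.4), ("A", 2.1)]

def symbol(rho, n):
    """rho[n]: the (n+1)th symbol of rho."""
    return rho[n][0]

def suffix_at(rho, t):
    """rho@t: the suffix rho[n...] for the smallest n with t_0 + ... + t_n >= t."""
    for n, total in enumerate(accumulate(d for _, d in rho)):
        if total >= t:
            return rho[n:]
    raise ValueError("t exceeds the observed prefix")

def skeleton(rho):
    """S(rho): the untimed string of symbols."""
    return [s for s, _ in rho]

print(symbol(rho, 1))                  # 'B'
print(skeleton(rho))                   # ['A', 'B', 'A']
print(skeleton(suffix_at(rho, 2.0)))   # ['B', 'A'], since 1.8 < 2.0 <= 1.8 + 0.4
```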

2.2 Stochastic system models

We begin with the definition of the basic (D)MC model, which quantifies transitions with probabilities. We next extend (D)MCs to DMDPs by introducing input actions, where each input action on a state defines a probability distribution over successor states. In both DMCs and DMDPs, the time spent in each state is given by a universal discrete time unit. We lift this assumption in DCTMCs by modeling the transition times using a probabilistic model.

Definition 1

(MC) A labeled Markov chain (MC) is a tuple \( \mathcal {M}^c =\langle Q, \varSigma ^{\text {out}},\mathbb {I}, \delta , L\rangle \), where

  • Q is a finite set of states,

  • \(\varSigma ^{\text {out}}\) is a finite alphabet,

  • \(\mathbb {I}:Q \rightarrow [0,1]\) is an initial probability distribution over Q such that \(\sum _{q\in Q}\mathbb {I}(q)=1\),

  • \(\delta :Q\times Q\rightarrow [0,1]\) is the transition probability function such that for all \(q\in Q\), \(\sum _{q'\in Q}\delta (q,q')=1\), and

  • \(L: Q\rightarrow \varSigma ^{\text {out}}\) is a labeling function.

Definition 2

(DMC) A labeled Markov chain is deterministic (DMC), if

  • there exists a start state \(q^s\in Q\) with \(\mathbb {I}(q^s)=1\), and

  • for all \(q\in Q\) and \(\sigma \in \varSigma ^{\text {out}} \): there exists at most one \(q'\in Q\) with \(L(q')=\sigma \) for which \(\delta (q,q')>0\).

Since the possible successor states in a DMC are uniquely labeled, we sometimes abuse notation and write \(\delta (q,\sigma )\) for \(\delta (q,q')\) where \(L(q')=\sigma \).

Each state of \( \mathcal {M}^c \) represents a configuration of the system being modeled, and each transition represents the movement from one system configuration to another (quantified by a probability). An (infinite) path in \( \mathcal {M}^c \) is a string of states: \(h = q_0 q_1 \ldots \in Q^{\omega }\) where \(q_i \in Q\) and \(\delta (q_i,q_{i+1}) > 0\), for all \(i \in \mathbb N\). The trace of h, denoted \( trace (h)\), is a sequence of state labels \(s= \sigma _0\sigma _1\ldots \in (\varSigma ^{\text {out}})^{\omega }\), where \(\sigma _i=L(q_i)\) for all \(i \in \mathbb N\). Given a finite path \(h=q_0q_1\ldots q_n\), the cylinder set of h, denoted \( Cyl (q_0 q_1\ldots q_n)\), is defined as the set of infinite paths with the prefix h. The probability of the cylinder set is given by

$$\begin{aligned} P_{ \tiny \mathcal {M}^c }( Cyl (q_0 q_1\ldots q_n)) = \mathbb {I}(q_0) \cdot \prod _{i=1}^{n} \delta (q_{i-1},q_{i}). \end{aligned}$$

For any trace \(s\) in a DMC, there exists at most one path h such that \( trace (h)=s\), hence the definition above readily extends to cylinder sets for strings. If the MC is non-deterministic, there may exist more than one path with trace \(s\) in which case the probability of \( Cyl (s)\) is given by

$$\begin{aligned} P_{ \tiny { \mathcal {M}^c }}( Cyl (s)) = \sum \limits _{h:trace(h)=s}{P_{ \tiny \mathcal {M}^c }( Cyl (h))}. \end{aligned}$$

The probabilities assigned to cylinder sets induce a unique probability distribution on \((\varSigma ^{\text {out}})^{\omega }\) (equipped with the \(\sigma \)-algebra generated by the cylinder sets) (Baier and Katoen 2008). We denote this distribution also with \(P_{ \tiny { \mathcal {M}^c }}\). Moreover, we denote by \(P_{{\tiny \mathcal {M}^c },q}\) the distribution obtained by (re)defining \(q\in Q\) as the unique start state.
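The cylinder-set probability translates directly into code. The sketch below represents \(\mathbb {I}\) and \(\delta \) as dictionaries; the toy chain mirrors the shape of the DMC in Fig. 2a, whose exact transition structure (in particular that \(q_2\) returns to \(q^s\) and \(q_1\) loops on itself) is our assumption:

```python
def cylinder_prob(I, delta, path):
    """P(Cyl(q0 q1 ... qn)) = I(q0) * prod_i delta(q_{i-1}, q_i)."""
    p = I.get(path[0], 0.0)
    for q, q_next in zip(path, path[1:]):
        p *= delta.get((q, q_next), 0.0)
    return p

# Toy DMC: qs (label A), q1 (label A), q2 (label B).
I = {"qs": 1.0}
delta = {("qs", "q1"): 1/3, ("qs", "q2"): 2/3, ("q2", "qs"): 1.0, ("q1", "q1"): 1.0}

print(cylinder_prob(I, delta, ["qs", "q2", "qs", "q2"]))  # 2/3 * 1 * 2/3 = 4/9
```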

Note that our definition of (D)MCs differs from other versions of probabilistic automata, such as those of Rabin (1963) and Segala (1996): we assume states to be labeled, whereas the more common automaton model puts the labels on the transitions. Both types of models are equivalent, but a translation of a transition-labeled automaton to a state-labeled automaton may increase the number of states by a factor of \(\mid \! \varSigma ^{\text {out}} \!\mid \). Despite the increase in model size, we still adopt (D)MCs as system models due to the model checking tools and algorithms already developed for this model class.

The MC is a purely probabilistic model, i.e., in a certain state, the probability of reaching a specific state in the next step is known. Deterministic labeled Markov decision processes (DMDPs) extend DMCs with non-determinism, which can be used to model reactive systems where input actions are chosen non-deterministically and the resulting output for a given input action is determined probabilistically.

Definition 3

(DMDP) A deterministic labeled Markov decision process (DMDP) is a tuple \( \mathcal {M}^p = \langle Q, \varSigma ^{\text {in}}, \varSigma ^{\text {out}}, q^s, \delta , L \rangle \), where

  • \(Q \text { and } L\) are the same as for DMCs, and \(q^s \in Q\) is the start state,

  • \(\varSigma ^{\text {in}}\) is a finite alphabet of input actions,

  • \(\varSigma ^{\text {out}}\) is a finite alphabet of output symbols,

  • the transition probability function is defined as \(\delta :Q \times \varSigma ^{\text {in}} \times Q \rightarrow [0,1]\), such that for all \(q\in Q \) and all \(\alpha \in \varSigma ^{\text {in}} \), \(\sum _{q'\in Q} \delta (q,\alpha ,q')=1\), and

  • for all \(q\in Q \), \(\alpha \in \varSigma ^{\text {in}} \), and \(\sigma \in \varSigma ^{\text {out}} \), there exists at most one \(q'\in Q \) with \(L(q')=\sigma \in \varSigma ^{\text {out}} \) and \(\delta (q,\alpha , q')>0\).

The last condition in the definition above together with the existence of a unique initial state \(q^s\) makes the behavior of the model deterministic conditioned on the (non-deterministically chosen) input actions. Analogously to DMCs, we will sometimes abuse notation and write \(\delta (q,\alpha ,\sigma )\) instead of \(\delta (q,\alpha ,q')\) where \(L(q')=\sigma \). A path in a DMDP \( \mathcal {M}^p \) is an alternating sequence of states \(q_i \in Q\) and input symbols \(\alpha _i\in \varSigma ^{\text {in}} \), denoted as \(q_0 \alpha _1 q_1 \alpha _2 q_2 \ldots \). The trace of a path in a DMDP is defined analogously to the notion of trace in MCs. That is, the trace of a path \(q_0 \alpha _1 q_1 \alpha _2 q_2 \ldots \) is an alternating sequence of input symbols and state labels \(\pi =\sigma _0\alpha _1 \sigma _1\alpha _2\sigma _2\ldots \in \varSigma ^{\text {out}} \times (\varSigma ^{\text {in}} \times \varSigma ^{\text {out}})^{\omega }\), where \(\sigma _i =L(q_i)\). To reason about the probability of a set of paths in the DMDP, a scheduler (also known as an adversary or a strategy) is introduced to resolve the non-deterministic choices on the input actions.

Definition 4

(Scheduler) Let \( \mathcal {M}^p \) be a DMDP and \(Q^+\) be the set of state sequences of non-zero length. A scheduler for \( \mathcal {M}^p \) is a function \(\mathfrak {S}: Q^+\times \varSigma ^{\text {in}} \rightarrow [0,1]\) such that for all \(\varvec{q}=q_0q_1 \ldots q_n\in Q^+\), \(\sum _{\alpha \in \varSigma ^{\text {in}}}\mathfrak {S}(\varvec{q},\alpha )=1\). A scheduler is said to be deterministic if for all \(\varvec{q}\in Q^+\) there exists an \(\alpha \in \varSigma ^{\text {in}} \) for which \(\mathfrak {S}(\varvec{q},\alpha )=1\).

The scheduler specifies a distribution over the input actions based on the path history. It is said to be fair if in any state q all input actions can be chosen with non-zero probability. If a scheduler \(\mathfrak {S}\) only depends on the current state we say that \(\mathfrak {S}\) is memoryless. A DMDP \( \mathcal {M}^p \) together with a scheduler \(\mathfrak {S}\) induces a probability distribution over the cylinder sets of finite path fragments in \( \mathcal {M}^p \). For a cylinder set \( Cyl (q_0 \alpha _1q_1\ldots \alpha _nq_n)\) the probability is defined as

$$\begin{aligned} P_{\tiny { \mathcal {M}^p ,\mathfrak {S}}}( Cyl (q_0 \alpha _1q_1\ldots \alpha _nq_n)) = \mathbb {I}(q_0) \cdot \prod _{i=1}^{n} \mathfrak {S}(q_0\ldots q_{i-1},\alpha _i)\delta (q_{i-1},\alpha _i,q_{i}). \end{aligned}$$

Similarly to DMCs, the probability distribution defined above induces a probability distribution over cylinder sets of I/O strings, and hence a distribution over infinite I/O sequences.
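The corresponding computation for DMDPs interleaves scheduler and transition probabilities. A sketch, with a history-dependent scheduler interface and a toy transition function of our own choosing:

```python
def dmdp_cylinder_prob(I, delta, scheduler, path):
    """P(Cyl(q0 a1 q1 ... an qn)); path alternates states and actions.
    scheduler maps a state history (tuple) to a dict of action probabilities;
    delta maps (state, action, state) to a probability."""
    states, actions = path[0::2], path[1::2]
    p = I.get(states[0], 0.0)
    for i, a in enumerate(actions, start=1):
        p *= scheduler(tuple(states[:i])).get(a, 0.0)
        p *= delta.get((states[i - 1], a, states[i]), 0.0)
    return p

uniform = lambda hist: {"alpha": 0.5, "beta": 0.5}  # a memoryless scheduler
I = {"q0": 1.0}
delta = {("q0", "alpha", "q0"): 0.5, ("q0", "alpha", "q1"): 0.5,
         ("q0", "beta", "q1"): 1.0, ("q1", "beta", "q1"): 1.0}
print(dmdp_cylinder_prob(I, delta, uniform, ["q0", "alpha", "q1", "beta", "q1"]))
# 1.0 * (0.5 * 0.5) * (0.5 * 1.0) = 0.125
```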

Example 1

The graphical model of a three-state DMDP \( \mathcal {M}^p \) is shown in Fig. 1a, where \(\varSigma ^{\text {in}} =\{\alpha ,\beta \}\) and \(\varSigma ^{\text {out}} =\{A, B\}\). From the initial state \(q^s\) (double circled) labeled with symbol A, the actions \(\alpha \) and \(\beta \) are chosen nondeterministically. Consider now the two memoryless schedulers \(\mathfrak {S}_1\) and \(\mathfrak {S}_2\) given by \(\mathfrak {S}_1(q) = \beta \), and \(\mathfrak {S}_2(q) = \alpha \) if \(q = q^s\) and \(\mathfrak {S}_2(q) =\beta \) otherwise. The schedulers induce the DMCs in Fig. 1b, c, where for the string \(s=AAAA\) we have \(P_{\tiny \mathcal {M}^c_{\mathfrak {S}_1}}(AAAA)=1\), and \(P_{\tiny \mathcal {M}^c_{\mathfrak {S}_2}}(AAAA)=4/9\).

Fig. 1

a A DMDP \( \mathcal {M}^p \). b The DMC \(\mathcal {M}^c_{\tiny \mathfrak {S}_1}\) induced by the scheduler \(\mathfrak {S}_1\). c The DMC \(\mathcal {M}^c_{\mathfrak {S}_2}\) induced by the scheduler \(\mathfrak {S}_2\)

Both DMCs and DMDPs are discrete-time models, i.e., each transition takes a universal discrete time unit. The labeled deterministic continuous-time Markov chain (DCTMC) is a timed extension of the DMC, which models the amount of time the system stays in a specific state before making a transition to one of its successor states (Sen et al. 2004a; Chen et al. 2009).

Definition 5

(DCTMC) A deterministic labeled continuous-time Markov chain (DCTMC) is a tuple \( \mathcal {M}^t = \langle Q, \varSigma ^{\text {out}}, q^s, \delta , {R}, L\rangle \), where:

  • \(Q, \varSigma ^{\text {out}}, q^s, \delta , L\) are defined as for DMCs;

  • \({R}: Q \rightarrow \mathbb R_{\ge 0}\) is the exit rate function.

In a DCTMC, the probability of making a transition from state q to one of its successor states \(q'\) within t time units is given by \( \delta (q,q') \cdot \left( {1 - e^{ - {R}(q) \cdot t} } \right) \), where \((1 - e^{ - {R}(q) \cdot t} )\) is the cumulative distribution function of the exponential distribution with rate parameter \({R}(q)\).
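In code this is a one-liner; the values 2/3 and 0.9 below are the ones used in Example 2:

```python
import math

def prob_within(delta_qq, rate_q, t):
    """P(transition from q to q' within t time units) = delta(q,q') * (1 - e^{-R(q) t})."""
    return delta_qq * (1.0 - math.exp(-rate_q * t))

print(prob_within(2/3, 0.9, 1.0))  # ~0.396
```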

Fig. 2

a A DMC \( \mathcal {M}^c \) and b a structurally identical DCTMC \( \mathcal {M}^t \) modeling the amount of time between state transitions

Example 2

Consider the DMC \( \mathcal {M}^c \) and the DCTMC \( \mathcal {M}^t \) shown in Fig. 2. Both models have three states with initial state \(q^s\) (double circled) and \(\varSigma ^{\text {out}} =\{A,B\}\). From \(q^s\), the probabilities of taking its two transitions are 1/3 and 2/3, respectively. Compared with \( \mathcal {M}^c \) in (a), the DCTMC \( \mathcal {M}^t \) in (b) has exit-rates associated with the states, e.g., 0.9 on \(q^s\). In \( \mathcal {M}^t \), the probability of leaving the initial state and moving to state \(q_2\) within t time units is calculated as \(2/3\cdot (1-e^{-0.9\cdot t})\).

A timed path h in a DCTMC is an alternating sequence of states and time stamps \(q_0 {t_0} q_{1} {t_1} q_{2} \ldots \), where \(t_i \in \mathbb R_{>0}\) denotes the amount of time spent in state \(q_i\) before going to \(q_{i+1}\). By adopting the notation for timed strings we let \(h[n]= q_n\) and \(h\langle n\rangle = t_n\).

Let \( Cyl (q_0, I_0, \ldots , q_{k-1}, I_{k-1},q_k)\) denote the cylinder set containing all timed paths h with \(h[i] = q_i\), for \(i\le k\), and \(h \langle i \rangle \in I_i\), for \(i<k\). The probability of \( Cyl (q_0, I_0, \ldots , q_{k-1}, I_{k-1},q_k)\) is then defined inductively as follows (for \(k \ge 1\)) (Baier et al. 2003):

$$\begin{aligned}&P_{\tiny \mathcal {M}^t } ( Cyl (q_0, I_0, \ldots , q_{k-1}, I_{k-1},q_k)) \nonumber \\&\quad = P_{\tiny \mathcal {M}^t } ( Cyl (q_0, I_0, \ldots , q_{k-1})) \cdot \delta (q_{k-1}, q_k) \cdot (e^{ - {R}(q_{k-1})\inf (I_{k-1})} - e^{ - {R}(q_{k-1})\sup (I_{k-1})} ). \end{aligned}$$

Following the definition of cylinder sets for DMCs, we can directly extend the definition above to probability distributions over cylinder sets for timed strings.
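The inductive definition unrolls into a simple product over transition steps. The sketch below anticipates the calculation in Example 4: with \(\delta (q^s,q_2)=2/3\) and \(R(q^s)=0.9\) as in Fig. 2b, the probability of \(Cyl(q^s,[1.5,2.3],q_2)\) comes out at roughly 0.0887.

```python
import math

def timed_cylinder_prob(I, delta, rate, states, intervals):
    """P(Cyl(q0, I0, q1, ..., I_{k-1}, qk)): states = [q0..qk],
    intervals = [I0..I_{k-1}], each interval a pair (lo, hi)."""
    p = I.get(states[0], 0.0)
    for (q, q_next), (lo, hi) in zip(zip(states, states[1:]), intervals):
        p *= delta.get((q, q_next), 0.0)
        p *= math.exp(-rate[q] * lo) - math.exp(-rate[q] * hi)
    return p

I, delta, rate = {"qs": 1.0}, {("qs", "q2"): 2/3}, {"qs": 0.9}
print(timed_cylinder_prob(I, delta, rate, ["qs", "q2"], [(1.5, 2.3)]))  # ~0.0887
```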

2.3 Specification languages

As will be detailed in Sect. 3, the proposed learning algorithms assume that data appears in the form of sequences of linearly ordered observations of the system in question. When learning system models, we therefore only look for models that preserve linear-time properties, which include safety properties (something bad will never happen) and liveness properties (something good will eventually happen).

Linear-time temporal logic (LTL) (Pnueli 1977) is a logical formalism used for specifying system properties from a linear time perspective. The property specified by an LTL formula does not only depend on the current state, but can also relate to future states. The basic ingredients of an LTL formula are atomic propositions (state labels \(\sigma \in \varSigma ^{\text {out}} \)), the Boolean connectors conjunction (\(\wedge \)) and negation (\(\lnot \)), and two basic temporal modalities \(\bigcirc \) (next) and \(\text{ U }\) (until) (Baier and Katoen 2008).

Definition 6

(LTL) Linear-time temporal logic (LTL) over \(\varSigma ^{\text {out}} \) is defined by the following syntax

$$\begin{aligned} \varphi \,{:}{:}{=}\, true \;|\; a \;|\; \varphi _1 \wedge \varphi _2 \;|\; \lnot \varphi \;|\; \bigcirc \varphi \;|\; \varphi _1 \text{ U }\varphi _2, \text { where } a\in \varSigma ^{\text {out}}. \end{aligned}$$

Definition 7

(LTL Semantics) Let \(\varphi \) be an LTL formula over \(\varSigma ^{\text {out}} \). For \( s= \sigma _0\sigma _1 \ldots \in (\varSigma ^{\text {out}})^\omega \), the LTL semantics of \(\varphi \) are as follows:

  • \(s \; \models \; true \)

  • \(s \; \models a \) iff \(a = \sigma _0\)

  • \(s \; \models \; \varphi _1 \wedge \varphi _2\) iff \(s \; \models \; \varphi _1\) and \( s \; \models \; \varphi _2\)

  • \(s \; \models \; \lnot \; \varphi \) iff \( s \nvDash \varphi \)

  • \(s \; \models \; \bigcirc \; \varphi \) iff \( s[1\ldots ]\models \;\varphi \)

  • \(s \; \models \; \varphi _1 \text{ U }\varphi _2 \) iff \( \exists j\ge 0.\; s[j\ldots ]\models \; \varphi _2\) and \(s[i\ldots ] \models \; \varphi _1\), for all \(0 \le i< j\)

For better readability, we also use the derived temporal operators \(\Box \) (always) and \(\lozenge \) (eventually) given by \(\lozenge \varphi =(true \text{ U }\varphi )\) (the model will eventually satisfy property \(\varphi \)) and \(\Box \varphi =\lnot (\lozenge \lnot \varphi )\) (property \(\varphi \) always holds).

Model checking an MC \( \mathcal {M}^c \) wrt. an LTL formula \(\varphi \) means to compute the total probability of the traces in \( \mathcal {M}^c \) which satisfy \(\varphi \), i.e., \(P_{\tiny \mathcal {M}^c }(\{ s \mid s \models \varphi , s \in (\varSigma ^{\text {out}})^\omega \})\).
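Although the semantics is defined over infinite strings, the empirical counterpart used later for statistical model checking evaluates formulas on finite sample executions. The following finite-trace evaluator is our own sketch (formulas encoded as nested tuples); it is exact only for formulas whose truth value is already determined within the observed prefix, which is precisely the limitation of checking unbounded properties on finite data noted in Sect. 1.1:

```python
def holds(s, phi):
    """Evaluate an LTL formula phi on a finite string s (a tuple of symbols),
    treating the string as ending where the observation ends: 'next' and the
    existential part of 'until' fail beyond the end of s."""
    op = phi[0]
    if op == "true":  return True
    if op == "atom":  return len(s) > 0 and s[0] == phi[1]
    if op == "and":   return holds(s, phi[1]) and holds(s, phi[2])
    if op == "not":   return not holds(s, phi[1])
    if op == "next":  return len(s) > 1 and holds(s[1:], phi[1])
    if op == "until":
        return any(holds(s[j:], phi[2]) and
                   all(holds(s[i:], phi[1]) for i in range(j))
                   for j in range(len(s)))
    raise ValueError(op)

def eventually(phi):            # the derived diamond operator
    return ("until", ("true",), phi)

print(holds(("A", "B", "A"), ("until", ("atom", "A"), ("atom", "B"))))  # True (j=1)
print(holds(("A", "A", "B"), eventually(("atom", "B"))))                # True
```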

Example 3

The LTL formula \(A \text{ U }B\) requires that a state q labeled with B will eventually be reached, and that all states visited before q are labeled with A. For the DMC \( \mathcal {M}^c \) in Fig. 2a, only paths starting with \(q^s q_2\) satisfy the LTL formula. Model checking \( \mathcal {M}^c \) wrt. \(A \text{ U }B\) therefore amounts to computing the probability of all paths starting with \(q^s q_2\), i.e., \(P_{\tiny \mathcal {M}^c }(Cyl(q^s q_2)) = 2/3\). The LTL formula \(\lozenge \Box A\), read as eventually forever A, requires that after a certain point only states labeled with A will be visited. Paths starting from \(q_1\) satisfy \(\Box A\), and paths eventually reaching \(q_1\) satisfy \(\lozenge \Box A\). Model checking \( \mathcal {M}^c \) wrt. \(\lozenge \Box A\) can therefore be similarly reduced to the calculation of the probability \(P_{\tiny \mathcal {M}^c }(\mathop \cup \limits _{i \in [0,\infty )} {Cyl(q^s(q_2 q^s)^i q_1)}) = 1/3 + 2/3\cdot 1/3+ (2/3)^2\cdot 1/3 + \cdots = 1\).
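A quick numerical check of these two calculations, assuming (as the computation in the example suggests) that in Fig. 2a the state \(q_2\) returns to \(q^s\) with probability 1 and \(q_1\) loops on itself:

```python
# P(A U B): only paths starting q^s q_2 qualify, so the probability is 2/3.
p_until = 2/3

# P(<>[]A) = sum_i P(Cyl(q^s (q_2 q^s)^i q_1)) = sum_i (2/3)^i * 1/3, a geometric series.
p_eventually_always = sum((2/3)**i * (1/3) for i in range(200))

print(p_until, round(p_eventually_always, 6))  # 0.666... 1.0
```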

The quantitative analysis of a DMDP \( \mathcal {M}^p \) against a specification \(\varphi \) amounts to establishing the lower and upper bounds that can be guaranteed when ranging over all possible schedulers. This corresponds to computing

$$\begin{aligned} P^{\mathrm {max}}_{\tiny \mathcal {M}^p }(\varphi ) = \mathop {\sup }\limits _{\tiny \mathfrak {S}} P_{\tiny \mathcal {M}^p , \mathfrak {S}} (\varphi ) \;\; \text {and} \;\; P^{\mathrm {min}}_{\tiny \mathcal {M}^p } (\varphi ) = \mathop {\inf }\limits _{\tiny \mathfrak {S}} P_{\tiny \mathcal {M}^p , \mathfrak {S}} (\varphi ), \end{aligned}$$

where the infimum and supremum are taken over all possible schedulers for \( \mathcal {M}^p \).

Continuous stochastic logic (CSL) (Baier et al. 2003) is a general branching-time temporal logic proposed for CTMCs that allows for a recursive combination of state and path formulas. However, as discussed in the beginning of the section, we only consider linear time properties of system models and we therefore define a linear sub-class of CSL, called sub-CSL, in which at most one temporal operator is allowed.

Definition 8

(sub-CSL) A sub-CSL formula \(\varphi \) is defined as follows:

$$\begin{aligned} \varphi {:}{:} {=} \varPhi \;|\; \varPhi _1 \text{ U }_I \varPhi _2 \;|\; \lozenge _I \varPhi \;|\; \Box _I \varPhi , \end{aligned}$$

where \(\varPhi \) is a propositional logic formula defined as \(\varPhi {:}{:} {=} true \;|\;a\;|\; \varPhi _1 \wedge \varPhi _2 \;|\; \lnot \varPhi \), with \(a\in \varSigma ^{\text {out}} \), and I is an interval in \(\mathbb Q_{\ge 0}\).

Definition 9

(Semantics for sub-CSL) Let \(\varphi \) be a sub-CSL formula over \(\varSigma ^{\text {out}} \). The semantics of \(\varphi \) over a timed trace \(\rho =\sigma _0 t_0 \sigma _1 t_1 \ldots \) over \(\varSigma ^{\text {out}} \) are as follows:

  • \(\rho \; \models \; \varPhi _1 \text{ U }_I \varPhi _2 \), iff \( \exists t \in I.\; (\rho @ t \models \varPhi _2 \wedge \forall t' < t, \rho @ t' \models \varPhi _1)\)

  • \(\rho \; \models \; \lozenge _I \varPhi \), iff \( \exists t \in I.\; (\rho @ t \models \varPhi )\)

  • \(\rho \; \models \; \Box _I \varPhi \), iff \( \forall t \in I.\; (\rho @ t \models \varPhi )\)

The semantics for the Boolean connectives are defined as for LTL.

Model checking a CTMC \( \mathcal {M}^t \) wrt. a sub-CSL formula \(\varphi \) amounts to computing the probability of the timed traces which satisfy \(\varphi \), i.e., \(P_{\tiny { \mathcal {M}^t }}(\varphi ) = P_{\tiny { \mathcal {M}^t }}(\{ \rho \mid \rho \models \varphi \})\).

Example 4

The sub-CSL formula \(\varphi = A \; \text{ U }_{[1.5,2.3]} \;B\) requires that a state q labeled with B will be reached within the time interval [1.5, 2.3] and that all states visited before q are labeled with A. For instance, the path \(q^s\; 1.8\; q_2\), generated by the DCTMC \( \mathcal {M}^t \) in Fig. 2b, satisfies \(\varphi \). Model checking \( \mathcal {M}^t \) against \(\varphi \) amounts to calculating \(P_{\tiny \mathcal {M}^t }(Cyl(q^s,[1.5,2.3],q_2)) = 2/3 \cdot (e^{-0.9\times 1.5} - e^{-0.9\times 2.3}) \approx 0.0887 \).

3 Learning stochastic models

In what follows we consider methods for automatically learning stochastic system models, as defined in Sect. 2, from data. The proposed algorithms are based on the Alergia algorithm (Carrasco and Oncina 1994; Higuera 2010) and adapted to a verification context. The Alergia algorithm starts with the construction of a frequency prefix tree acceptor (FPTA), which serves as a representation of the data. The basic idea of the learning algorithm is to approximate the generating model by merging together nodes in the FPTA which correspond to the same state in the generating model. Two nodes are merged after they pass a compatibility test based on the statistical information associated with the nodes. Both the compatibility test and the state merge are conducted recursively over all successor nodes.

In this section, we first present the original FPTA for strings, which only contain output symbols, and then extend it to handle I/O strings and timed strings. Afterwards, we discuss the general procedure of the Alergia algorithm. At the end, we customize the compatibility tests and merge operations for learning different types of system models.

3.1 Data representation

An FPTA T represents a set of strings \(S\) over \(\varSigma ^{\text {out}} \) in a tree structure, where each node is labeled by a symbol \(\sigma \in \varSigma ^{\text {out}} \) and each path from the root to a node \(q_s\) corresponds to a string \(s\in \mathrm {prefix}(S)\). Since a string s uniquely identifies a node in T and vice versa, we will sometimes use the symbol \(q_s\) for states and s for strings interchangeably. Each node \(q_s\) is associated with a transition frequency function \(f(q_s,\sigma )\), which encodes the number of strings with prefix \(s\sigma \) in \(S\); we define \(f (s,\cdot ) = \sum \nolimits _{\sigma \in \varSigma ^{\text {out}}} {f (s, \sigma )}\). The successor state of \(q_s\) given \(\sigma \) is denoted \(succ (s,\sigma ) = s\sigma \), and the set of all successor states of \(q_s\) is denoted \(succs (s)\). By normalizing the transition frequency functions \(f (s, \sigma )\) by \(f (s, \cdot )\) we obtain the transition probability functions \(\delta (s, \sigma )\). Figure 3a shows an FPTA constructed from observation sequences generated by the DMC in Fig. 2a. The root of the tree is labeled with the symbol A and associated with the frequencies \(f(A,B)=15\) and \(f(A,A)=7\). The frequency functions indicate that the dataset contains 15 strings with prefix AB and 7 strings with prefix AA, and hence 22 strings with prefix A, i.e., \(f(A,\cdot )=22\).
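The construction of the frequency functions is straightforward. Below is a minimal sketch with nodes identified by prefix tuples; termination counts, which some Alergia variants also track, are omitted:

```python
from collections import defaultdict

def build_fpta(strings):
    """f[node][sigma] counts the strings in S that have prefix node + (sigma,)."""
    f = defaultdict(lambda: defaultdict(int))
    for s in strings:
        for j in range(1, len(s)):
            f[tuple(s[:j])][s[j]] += 1
    return f

# 22 strings starting with A: 15 continue with B, 7 with A (as in Fig. 3a).
data = [("A", "B")] * 15 + [("A", "A")] * 7
f = build_fpta(data)
root = ("A",)
print(dict(f[root]))          # {'B': 15, 'A': 7}
print(sum(f[root].values()))  # f(A, .) = 22
print({s: n / sum(f[root].values()) for s, n in f[root].items()})  # delta(A, .)
```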

Fig. 3

Examples of frequency prefix tree acceptors

The I/O frequency prefix tree acceptor (IOFPTA) is an extension of the FPTA for representing a set of I/O strings \(S_{io}\). In addition to the output symbols \(\sigma \in \varSigma ^{\text {out}} \) attached to the nodes, each edge is labeled with an input action \(\alpha \in \varSigma ^{\text {in}} \). Analogously to FPTAs, a path from the root to a node \(q_{\pi }\) corresponds to an I/O string \(\pi \in \mathrm {prefix}(S_{io})\). A transition frequency function \(f (\pi ,\alpha ,\sigma )\) is associated with the node \(q_{\pi }\), encoding the number of strings with the prefix \(\pi \alpha \sigma \) in \(S_{io}\). As for FPTAs, we let \(f (\pi , \alpha , \cdot ) = \sum \nolimits _{\sigma \in \varSigma ^{\text {out}}} {f (\pi ,\alpha ,\sigma )}\).

By normalizing the transition frequency functions we obtain the transition probability functions \(\delta (\pi ,\alpha ,\sigma )\) for the IOFPTA. Figure 3b shows an IOFPTA constructed from I/O strings obtained from the DMDP in Fig. 1a.

A timed frequency prefix tree acceptor (TFPTA) represents a set of timed strings \(S_t\). A TFPTA is structurally identical to an FPTA and can be obtained from the skeletons of the strings in \(S_t\). Thus, the path from the root to a node \(q_s\) in a TFPTA corresponds to a prefix of the skeleton of a timed string in \(S_t\), i.e., \(s\in \mathrm {prefix}(\mathbb S(S_t))\). The transition frequency function associated with a node \(q_s\) is defined as for FPTAs by considering only the skeletons of the strings in \(S_t\). In addition to the transition frequency function, each node \(q_s\) is also associated with an average empirical exit time \(\hat{t}(s)\) (approximately the inverse of the exit-rate):

$$\begin{aligned} \hat{t}(s)=\frac{1}{f(s, \cdot )} \cdot \sum \nolimits _{\rho \in X} \rho \langle |s|-1 \rangle , \end{aligned}$$

where \(X= \{\rho \mid s\in \mathrm {prefix}(\mathbb S(\rho )) , \rho \in S_t\}\) and |s| is the number of symbols in the string s. Figure 3c illustrates a TFPTA constructed from strings sampled from the DCTMC in Fig. 2. Each node in the tree is associated with an average exit time, i.e., the time spent in the state before observing the next symbol. With the symbol A occurring 22 times as a prefix of a string, we get an average exit time of 1.2 time units for the root node, and the estimate of the exit rate is therefore \(\frac{1}{1.2}\approx 0.83\).
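The exit-time estimate is a plain average over the delays observed at a node. A sketch, again with timed strings represented as (symbol, delay) pairs:

```python
def average_exit_time(timed_strings, s):
    """t_hat(s): the delay following the last symbol of the untimed prefix s,
    averaged over all timed strings whose skeleton extends s by at least one symbol."""
    n = len(s)
    delays = [rho[n - 1][1] for rho in timed_strings
              if len(rho) > n and tuple(sym for sym, _ in rho[:n]) == tuple(s)]
    return sum(delays) / len(delays)

# Three observations exiting the root (prefix A) after 1.0, 1.2 and 1.4 time units.
data = [[("A", 1.0), ("B", 0.5)], [("A", 1.2), ("A", 0.3)], [("A", 1.4), ("B", 0.2)]]
print(average_exit_time(data, ("A",)))  # ~1.2, so the exit-rate estimate is 1/1.2 ~ 0.83
```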

3.2 Alergia

In this section we first sketch the main flow of the Alergia algorithm for learning DMCs as a modified version of the algorithm presented in Carrasco and Oncina (1994) and Higuera (2010). Afterwards we adapt the general learning algorithm to the different stochastic system models considered in this paper.

The Alergia algorithm is initialized by creating two identical FPTAs T and A as representations of the dataset \(S\) (line 2 of Algorithm 1). The FPTA T is kept as a data representation from which relevant statistics are retrieved during the execution of the algorithm. The FPTA A is iteratively transformed by merging nodes that have passed a statistical compatibility test. Observe that an FPTA (with normalized transition functions) is a DMC, and so is any model obtained by iteratively merging nodes. Similar properties hold for IOFPTAs and TFPTAs.

All compatibility tests are based on T to ensure the statistical validity of the tests that are performed. In some accounts of the Alergia algorithm it is suggested to join the samples associated with different nodes of the original FPTA when the nodes are merged (Carrasco and Oncina 1994), and to base subsequent tests on these joined samples. While intuitively beneficial, since more data becomes available for testing, this latter approach invalidates some statistical arguments for the overall consistency of the algorithm: if \(S_1\) and \(S_2\) are two sets of samples that are each drawn by independent sampling from the same distribution, then the union \(S_1\cup S_2\) is no longer a set of independent samples if the union is performed conditional on the fact that \(S_1\) and \(S_2\) have passed a statistical test of compatibility. Since the assumption of independent sampling underlies all statistical tests we are using, such a join therefore makes a theoretical analysis of the resulting procedure very challenging. In order to maintain a strong match between the algorithmic solution and the theoretical analysis we can provide, we generally do not join the associated samples when merging nodes. However, we have also conducted a few experiments comparing the performance of the algorithm with and without joining of the samples. It turned out that the differences in the constructed models and the runtime were only minor (cf. Sect. 5.1).

Following the terminology of Higuera (2010), Algorithm 1 maintains two sets of nodes: RED nodes, which have already been determined as representative nodes and will be included in the final output model, and BLUE nodes, which are scheduled for testing. Initially, RED contains only the initial node \(q^s_A\), while BLUE contains the immediate successor nodes of the initial node. In each iteration of the outer loop, the lexicographically minimal node \(q_b\) in BLUE is chosen. If there exists a node \(q_r\) in RED which is compatible with \(q_b\), then \(q_b\) and its successor nodes are merged with \(q_r\) and the corresponding successor nodes of \(q_r\), respectively (line 10). If \(q_b\) is not compatible with any node in RED, it is added to RED (line 15). At the end of each iteration, BLUE is updated to contain the immediate successor nodes of RED nodes that are not themselves contained in RED (line 17). After all compatible nodes in the tree have been merged, the frequencies in A are normalized (line 18 of Algorithm 1).

In order to adapt the Alergia algorithm to the different model classes presented in Sect. 2, we only need to tailor the compatibility test (line 9) and the merge operator (line 10) to each specific model class. In the following section, the required model-specific compatibility tests and merge operators are presented.

Algorithm 1: the Alergia algorithm
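To make the control flow concrete, here is a compact, self-contained Python sketch of the RED/BLUE loop. It deliberately simplifies Algorithm 1: instead of destructively folding the tree A, it records for every processed node the RED node it was merged into, which suffices to read off the learned states; in line with the discussion above, all statistical tests are performed on the unmodified tree T. The data layout and function names are ours:

```python
import math
from collections import defaultdict

def build_fpta(strings):
    """Nodes are prefix tuples; f[node][sigma] counts strings with prefix node + (sigma,)."""
    f = defaultdict(lambda: defaultdict(int))
    for s in strings:
        for j in range(1, len(s)):
            f[tuple(s[:j])][s[j]] += 1
    return f

def hoeffding_compatible(f1, n1, f2, n2, eps):
    """Accept compatibility of the proportions f1/n1 and f2/n2 (Algorithm 3, below)."""
    if n1 == 0 or n2 == 0:
        return True  # no evidence against compatibility
    bound = (math.sqrt(1/n1) + math.sqrt(1/n2)) * math.sqrt(0.5 * math.log(2/eps))
    return abs(f1/n1 - f2/n2) < bound

def compatible(T, qr, qb, eps):
    """Recursive epsilon-compatibility of nodes qr, qb (Algorithm 2): same label,
    local Hoeffding tests, and pairwise compatibility of same-symbol successors."""
    if qr[-1] != qb[-1]:                        # a node's label is its last symbol
        return False
    cr, cb = T.get(qr, {}), T.get(qb, {})
    nr, nb = sum(cr.values()), sum(cb.values())
    for sigma in set(cr) | set(cb):
        fr, fb = cr.get(sigma, 0), cb.get(sigma, 0)
        if not hoeffding_compatible(fr, nr, fb, nb, eps):
            return False
        if fr and fb and not compatible(T, qr + (sigma,), qb + (sigma,), eps):
            return False
    return True

def alergia_sketch(strings, eps):
    T = build_fpta(strings)
    root = tuple(strings[0][:1])                # assumes a common initial symbol
    red, rep = [root], {root: root}             # rep: tree node -> its representative
    while True:
        blue = sorted(q + (s,) for q in red for s in T.get(q, {})
                      if q + (s,) not in rep)
        if not blue:
            return red, rep
        qb = blue[0]                            # lexicographically minimal BLUE node
        for qr in red:
            if compatible(T, qr, qb, eps):
                rep[qb] = qr                    # merge qb into qr
                break
        else:
            red.append(qb)                      # promote qb to RED
            rep[qb] = qb

strings = [("A", "B", "A")] * 10 + [("A", "A", "B")] * 5 + [("A", "B")] * 5
red, rep = alergia_sketch(strings, eps=0.5)
print(red)  # [('A',), ('A', 'B')]: a two-state model with labels A and B
```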

3.3 Local compatibility test and merge

Formally, two nodes \(q_r\) and \(q_b\) in an FPTA T are said to be \(\epsilon \)-compatible (\(\epsilon >0\)), if the following properties are satisfied:

  1. \(L(q_r)=L(q_b)\),

  2. \(\text {LocalCompatible }(q_r,q_b, \epsilon )\) is TRUE, and

  3. the successor nodes of \(q_r\) and \(q_b\) are pair-wise \(\epsilon \)-compatible.

Algorithm 2: the recursive compatibility test

Algorithm 2 illustrates the compatibility test. Condition (1) requires the two nodes to have the same label (line 1). Condition (2) is model-specific and defines the local compatibility test for \(q_r\) and \(q_b\) (line 4). The last condition requires the compatibility to be recursively satisfied for every pair of successor nodes of \(q_r\) and \(q_b\) (line 10). Note that only pairs of successor nodes reached by the same output symbol (as well as the same input symbol in the IOFPTA case) are tested. For example, \(q_r'\) and \(q_b'\) are tested only if \(q_r' = succ(q_r, \sigma )\) and \(q_b' = succ(q_b, \sigma )\) (in an IOFPTA, \(q_r'\) and \(q_b'\) are determined as \(q_r' = succ(q_r, \alpha , \sigma )\) and \(q_b' = succ(q_b, \alpha , \sigma )\)).

The compatibility test depends on a parameter \(\epsilon \) that controls the severity of the LocalCompatible tests, which are defined so that smaller values of \(\epsilon \) will make LocalCompatible return false less often. In most cases, \(\epsilon \) directly translates to the significance level of a statistical test that is the core of the LocalCompatible test.

In the following sections, we start by specifying the local compatibility test and merge procedure for FPTAs, and afterwards extend the specifications to IOFPTAs and TFPTAs. For FPTAs and IOFPTAs, the local compatibility test depends only on the local transition frequency functions, whereas for TFPTAs we also need to take the estimated exit rates into account. Analogous considerations apply for the merge procedure.

3.3.1 Local compatibility test and merge in FPTAs

Given two nodes \(q_r\) and \(q_b\) in an FPTA, their local compatibility requires that the difference between the next symbol distributions defined at two nodes is bounded. Specifically, we check for local compatibility (Line 4 in Algorithm 2) by employing the Hoeffding test (see Algorithm 3) realized by the call \(\text {Hoeffding }(f(q_r,\sigma ), f(q_r, \cdot ),f(q_b,\sigma ),f(q_b,\cdot ),\epsilon )\), for all \(\sigma \in \varSigma ^{\text {out}} \). Line 4 of Algorithm 3 is a statistical test for the identity of the transition probabilities at the states \(q_r\) and \(q_b\) to their \(\sigma \)-successors (Carrasco and Oncina 1999). The actual statistical level of significance of this test is given by \(2\epsilon \) rather than \(\epsilon \) itself. However, for the asymptotic consistency analysis that we will be concerned with in Sect. 4 and “Consistency of Alergia-style Learning” of Appendix the constant factor 2 is immaterial, and we will a little loosely refer to \(\epsilon \) as the significance level of the Hoeffding compatibility test. Also observe that the feasible range of the \(\epsilon \) parameter is (0, 2]. At \(\epsilon =2\) line 4 will always return false.

Algorithm 3: the Hoeffding test
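This is the test already used inline in the sketch of Algorithm 1 above; shown in isolation with the counts of Fig. 3a, it also illustrates the boundary behavior at \(\epsilon = 2\):

```python
import math

def hoeffding_compatible(f1, n1, f2, n2, eps):
    """Line 4 of Algorithm 3: the proportions f1/n1 and f2/n2 are deemed compatible
    if their difference stays below the Hoeffding bound."""
    if n1 == 0 or n2 == 0:
        return True
    bound = (math.sqrt(1/n1) + math.sqrt(1/n2)) * math.sqrt(0.5 * math.log(2/eps))
    return abs(f1/n1 - f2/n2) < bound

print(hoeffding_compatible(15, 22, 60, 100, 0.05))  # True: 0.68 vs 0.60
print(hoeffding_compatible(15, 22, 10, 100, 0.05))  # False: 0.68 vs 0.10
print(hoeffding_compatible(15, 22, 15, 22, 2.0))    # False: at eps = 2 the bound is 0
```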

If two nodes \(q_r\) and \(q_b\) are compatible, \(q_b\) is merged with \(q_r\). The merge procedure (line 10 of Algorithm 1) follows the same steps as described in Higuera (2010). Firstly, the (unique) transition leading to \(q_b\) from its predecessor node \(q'\) (i.e., the transition with \(f^A(q', q_b)>0\)) is re-directed to \(q_r\) by setting \(f^A(q',q_r) \leftarrow f^A(q', q_b)\) and \(f^A(q', q_b)\leftarrow 0\). Secondly, the successor nodes of \(q_b\) are recursively folded onto the corresponding successor nodes of \(q_r\) and the associated frequencies are updated. The complete merge procedure is illustrated in Fig. 4.
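The recursive folding step can be sketched in a few lines on the frequency-tree representation used earlier (the redirection of the incoming edge, Fig. 4a, is elided; the counts below are hypothetical but reproduce the totals in the caption of Fig. 4):

```python
def fold(A, qr, qb):
    """Fold node qb and its subtree into node qr of the frequency tree A
    (dict: node -> {symbol: count}), adding frequencies and recursing (Fig. 4b, c).
    Nodes are prefix tuples."""
    for sigma, n in list(A.get(qb, {}).items()):
        A.setdefault(qr, {})[sigma] = A[qr].get(sigma, 0) + n
        fold(A, qr + (sigma,), qb + (sigma,))
    A.pop(qb, None)

# qr has frequencies B:15, C:6; qb has B:10, C:5.
A = {("A",): {"B": 15, "C": 6}, ("A", "A"): {"B": 10, "C": 5}}
fold(A, ("A",), ("A", "A"))
print(A[("A",)])  # {'B': 25, 'C': 11}, as in Fig. 4
```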

Fig. 4

Merge node \(q_b\) (shadowed) to node \(q_r\) (shadowed and double circled) in the FPTA. a The transition from \(q'\) to \(q_b\) is redirected to \(q_r\). b Node \(q_b\) and its two outgoing transitions are folded to \(q_r\) and the frequencies are updated: \(f(q_r,B)=f(q_r,B)+f(q_b,B)=25\) and \(f(q_r,C)=f(q_r,C)+f(q_b,C)=11\). c The resulting FPTA obtained after recursively folding the successor nodes of \(q_r\) and \(q_b\)

3.3.2 Local compatibility test and merge in IOFPTAs

In an IOFPTA, the transition frequency function at a node \(q_r\), \(f(q_r, \alpha , q')\), is also conditioned on the input action \(\alpha \). Thus, in order to adapt the local compatibility test to IOFPTAs, we compare the transition probability distributions defined for each input action. Specifically, given two nodes \(q_r\) and \(q_b\), the Hoeffding test, realized by the procedure call \(\text {Hoeffding }(f(q_r, \alpha , \sigma ), f(q_r,\alpha , \cdot ), f(q_b,\alpha , \sigma ), f(q_b,\alpha ,\cdot ), \epsilon )\), is conducted for all \(\sigma \in \varSigma ^{\text {out}} \) and for each input action \(\alpha \in \varSigma ^{\text {in}} \).

The merge procedure for two compatible nodes in IOFPTA is similar to the one in FPTAs. An example is shown in Fig. 5. Observe that the frequencies are aggregated along the different input actions.

Fig. 5

Node \(q_b\) is merged to node \(q_r\) in the IOFPTA. a The transition from \(q'\) to \(q_b\) is redirected to \(q_r\). b Successor nodes of \(q_b\) are locally folded along input actions to \(q_r\). c The IOFPTA resulting from recursively folding the subtrees rooted at \(q_r\) and \(q_b\)

3.3.3 Local compatibility test and merge in TFPTAs

The nodes in a TFPTA are associated with transition frequency functions and exit-rates encoding the local transition times. We therefore define two nodes \(q_r\) and \(q_b\) in a TFPTA to be compatible if the transition probability distributions over their successor nodes as well as their exit-rates are compatible. The compatibility of the transition distributions of two nodes is, as for MCs, tested by the Hoeffding test (Algorithm 3). The compatibility test of the exit rates follows the procedure described in Sen et al. (2004a), which is essentially the F-test originally introduced in Cox (1953). The test is based on the ratio \({{\hat{t}_r }}/{{\hat{t}_b }}\) of the average empirical time delays at \(q_r\) and \(q_b\). The precise test criterion is given in Algorithm 4.

Algorithm 4: the compatibility test for exit rates
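A sketch of the exit-rate test: under equal exit rates, the ratio of the two empirical mean delays follows an F-distribution with \(2n_r\) and \(2n_b\) degrees of freedom, so compatibility can be accepted when the observed ratio lies between the \(\epsilon /2\) and \(1-\epsilon /2\) quantiles. The exact constants and quantile conventions of Algorithm 4 may differ from this rendering:

```python
from scipy import stats

def exit_rates_compatible(t_r, n_r, t_b, n_b, eps):
    """t_r, t_b: empirical mean delays at the two nodes, estimated from n_r and
    n_b observations. Cox's F-test for the equality of two exponential rates."""
    ratio = t_r / t_b
    lo = stats.f.ppf(eps / 2, 2 * n_r, 2 * n_b)
    hi = stats.f.ppf(1 - eps / 2, 2 * n_r, 2 * n_b)
    return lo < ratio < hi

print(exit_rates_compatible(1.2, 22, 1.3, 18, eps=0.05))  # True: similar means
print(exit_rates_compatible(1.2, 22, 4.0, 18, eps=0.05))  # False: rates differ ~3x
```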

4 Consistency and convergence analysis

In this section we investigate theoretical convergence results for Alergia learning. These results consist of two components: first, we establish that in the large sample limit the learning algorithm will correctly identify the structure of a data generating model (modulo equivalence of the states). This component is related to previous convergence results (Carrasco and Oncina 1999; de la Higuera and Thollard 2000; Sen et al. 2004a), and we provide the main technical results in “Consistency of Alergia-style Learning” of Appendix. The second component is to establish that identification of the structure together with convergence of the estimates for the probabilistic parameters of the models (transition probabilities and exit rates) guarantees convergence of the probabilities for model properties expressed in our formal specification languages.

Our analysis, thus, focuses on exact identification in the limit, and thereby differs from probably approximately correct (PAC) learning results, such as presented in Ron et al. (1996) and Clark and Thollard (2004). PAC learning results are stronger than identification in the limit results in that they provide bounds on the sample complexity required to learn a good approximation of the true model. However, a PAC learnability analysis first requires the specification of a suitable metric to measure the quality of approximation. Existing PAC learning approaches for probabilistic automata are based on a semantics for the automata as defining a probability distribution over \(\varSigma ^*\). In that case, the Kullback-Leibler divergence between the distributions defined by the true and the approximate model is a canonical measure of approximation error.

Being interested in the probability of LTL properties, we, on the other hand, have to see automata as defining distributions on \(\varSigma ^{\omega }\). The Kullback-Leibler divergence between the distributions defined on \(\varSigma ^{\omega }\) is not a suitable measure for approximation quality, since it will almost always be infinite (even in the case where the approximate model is structurally identical to the true one, and differs with respect to transition probabilities only by an arbitrarily small \(\epsilon > 0\)). Within the verification literature, various versions of the bisimulation distance are a popular measure for approximate equivalence between system models (Desharnais et al. 1999; Breugel and Worrell 2005). However, it turns out that these metrics suffer from the same problem as the Kullback-Leibler divergence, and fail to measure approximation quality as a smooth function of \(\epsilon \)-errors in the estimates of transition probabilities. These and other candidate measures for approximate equivalence of automata defining distributions on \(\varSigma ^{\omega }\) are investigated in detail in Jaeger et al. (2014). A number of counterexamples and impossibility results derived in Jaeger et al. (2014) indicate that there exist fundamental obstacles to defining measures for approximation error that simultaneously satisfy the two desiderata: (a) to provide a basis on which PAC learnability results could be derived, and (b) to ensure that small approximation errors between models also entail bounds on the differences between the probabilities of LTL properties in the models (a desideratum called “LTL continuity” in Jaeger et al. (2014)).

For the analysis of the identification of the structure, we now begin by formally defining the relevant equivalence relation of states. In the following, for any automaton \({\mathcal {M}}\) and state q of \({\mathcal {M}}\), we denote with \(({\mathcal {M}},q)\) the automaton obtained by (re-)defining q as the start state of \({\mathcal {M}}\).

Definition 10

Let \({\mathcal {M}}\) be a DMC or DCTMC. States \(q,q'\) of \({\mathcal {M}}\) are equivalent, written \(q\sim q'\), if \(P_{\tiny (\mathcal {M},q)}=P_{\tiny (\mathcal {M},q')}\). States \(q,q'\) of a DMDP \({\mathcal {M}}\) are equivalent, if for all schedulers \(\mathfrak {S}\) of \(({\mathcal {M}},q)\) there exists a scheduler \(\mathfrak {S}'\) of \(({\mathcal {M}},q')\), such that \(P_{{\tiny (\mathcal {M},q)},\mathfrak {S}}=P_{{\tiny (\mathcal {M},q')},\mathfrak {S}'}\), and vice-versa.

When \(q\sim q'\), then also \(\delta (q,\alpha ,\sigma )=\delta (q',\alpha ,\sigma )\) for all \((\alpha ,\sigma )\in \varSigma ^{\text {in}} \times \varSigma ^{\text {out}} \), and the corresponding successor states are again equivalent. Therefore, \(\delta \) is also well-defined on \((Q/\sim ) \times \varSigma ^{\text {in}} \times (Q/\sim )\), and we thereby obtain the quotient automaton \({\mathcal {M}}/\sim \) whose states are the \(\sim \)-equivalence classes of \({\mathcal {M}}\).

Next, we formally define the structure of an automaton.

Definition 11

Let \({\mathcal {M}}\) be a DMC, DMDP, or DCTMC. The structure of \({\mathcal {M}}\) is defined as \(\widehat{\mathcal {M}}:=\langle Q, \varSigma ^{\text {in}}, \varSigma ^{\text {out}}, q^s, \hat{\delta }, L\rangle \), where \(\hat{\delta } \subseteq Q\times \varSigma ^{\text {in}} \times Q\) is the transition relation defined by \((q,\alpha ,q')\in \hat{\delta } \Leftrightarrow \delta (q,\alpha ,q')>0\).

For DMCs and DCTMCs the \(\varSigma ^{\text {in}} \) component should be regarded as vacuous in the preceding definition.

The first component of the convergence result will be the identification in the limit of \(\widehat{ {\mathcal {M}}/\sim }\). Before we can state that result, however, we have to consider the question of how training data for the learner is assumed to be generated. Since our automaton models are generative models for infinite sequences, one cannot simply assume that the training data consists of sampled runs of an automaton. All we can observe (and all that Alergia will accept) are finite initial segments of such runs. Thus, in the data-generation process, one has to assume that there is an external process that decides at what point the generation of a sequence is terminated. Furthermore, in the case of DMDP learning, an external scheduler is required to generate inputs. Both these external components must satisfy certain conditions, so that the generated data is rich enough to contain sufficiently many sampled transitions from all states and under all inputs. At the same time, the significance level \(\epsilon \) for Alergia must be chosen so that certain correctness guarantees for the compatibility tests performed during the execution of the algorithm are obtained. The sampling mechanism for finite sequences and the choice of significance levels for the compatibility tests are interrelated. The details of this relationship are elaborated in “Appendix”. For the present section, we only consider the case where data is generated as follows:

  • The length of the observed sequence is randomly determined by a geometric distribution with parameter \(\lambda \). This is equivalent to generating strings with an automaton where at each state the generating process terminates with probability \(\lambda \).

  • Inputs are generated by a scheduler that always chooses inputs uniformly at random.

We refer to this procedure as geometric sampling. It defines a probability distribution \(P^s\) on \((\varSigma ^*)^{\omega }\), where depending on the underlying automaton, \(\varSigma \) is \(\varSigma ^{\text {out}} \) (for DMCs), \(\varSigma ^{\text {in}} \times \varSigma ^{\text {out}} \) (for DMDPs), or \(\varSigma ^{\text {out}} \times \mathbb R_{>0}\) (for DCTMCs).
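Geometric sampling is easy to simulate; the sketch below covers the DMC case (for DMDPs one would additionally draw an input uniformly at random before each transition). The toy chain is the one assumed in the earlier snippets:

```python
import random

def geometric_sample(delta, label, q_start, lam, rng=random):
    """Sample one finite observation string: at each state, terminate with
    probability lam, otherwise move to a successor drawn according to delta."""
    q, out = q_start, [label(q_start)]
    while rng.random() >= lam:
        succs, probs = zip(*delta[q].items())
        q = rng.choices(succs, probs)[0]
        out.append(label(q))
    return out

delta = {"qs": {"q1": 1/3, "q2": 2/3}, "q2": {"qs": 1.0}, "q1": {"q1": 1.0}}
label = {"qs": "A", "q1": "A", "q2": "B"}.get
print([geometric_sample(delta, label, "qs", lam=0.25) for _ in range(3)])
# e.g. [['A', 'B', 'A'], ['A'], ['A', 'B', 'A', 'A', 'A']]
```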

Theorem 1

Let \({\mathcal {M}}\) be a DMC or DMDP. Let \(S\in (\varSigma ^*)^{\omega }\) be generated by geometric sampling. Let \(\epsilon _N=1/N^r\) for some \(r>2\), and let \({\mathcal {M}}_N\) be the model learned by Alergia from the first N strings in S using significance level \(\epsilon _N\) in the compatibility tests. Then

$$\begin{aligned} P^s(\widehat{{\mathcal {M}}_N} = \widehat{ {\mathcal {M}}/\sim }\ \text{ for } \text{ almost } \text{ all } N)=1. \end{aligned}$$
(1)

Let \({\mathcal {M}}\) be a DCTMC, and S as above. There exist values \(\epsilon _N\) with \(1/N\le \epsilon _N\le 1/\sqrt{N}\), such that for \({\mathcal {M}}_N\) the model learned by Alergia from the first N strings in S using significance level \(\epsilon _N\):

$$\begin{aligned} \lim _{N\rightarrow \infty } P^s(\widehat{{\mathcal {M}}_N} = \widehat{ {\mathcal {M}}/\sim })=1. \end{aligned}$$
(2)

(2) also holds when \({\mathcal {M}}\) is a DLMC or DMDP, and \(\epsilon _N=1/N^r\) for some \(r\ge 1\).

The Theorem is a consequence of Theorem 4 and Lemmas 3 and 4 in “Consistency of Alergia-style Learning” of Appendix. The second part of the Theorem does not provide a complete description of the required sequence of significance levels \(\epsilon _N\): the exact \(\epsilon _N\) values (obtained in the proof of Lemma 4) are defined in terms of the expected value of the size of the IOFPTA constructed from a sample of size N, and we can only bound this expected value; we do not have a closed-form expression for it as a function of N.

The reason we obtain somewhat stronger convergence guarantees for DLMCs and DMDPs than for DCTMCs lies in the fact that we have stronger results on the power of the Hoeffding test than on the F-test (cf. “Statistical Tests” of Appendix). It is an open problem whether almost sure convergence also holds for DCTMCs with the currently used F-test, or whether it could be obtained with a different, more powerful test for the compatibility of exponential distributions.

We are now ready to turn to the second component of our consistency analysis: ultimately, we are interested in whether the probabilities of properties expressed in the formal specification languages LTL and sub-CSL computed on the learned models converge to the probabilities defined by the true model. By Theorem 1 we know that the learned model will eventually have the correct structure, and the laws of large numbers also guarantee that the estimates of the transition probability and exit rate parameters will converge to the correct values. This, however, in general will not be enough to guarantee the convergence of the probabilities of complex system properties. As the following two Theorems show, however, we do obtain such a guarantee for properties expressed in LTL and sub-CSL. Since the sub-CSL case here is simpler, we consider it first.

Theorem 2

Let \({\mathcal {M}}\) be a DCTMC, and let \({\mathcal {M}}_N\) be as in Theorem 1. Then for all sub-CSL properties \(\varphi \) and all \(\delta > 0\):

$$\begin{aligned} \lim _{N\rightarrow \infty } P^s( \mid \! P_{{\mathcal {M}}_N}(\varphi ) - P_{{\mathcal {M}}}(\varphi ) \!\mid > \delta ) = 0. \end{aligned}$$

Proof

By Theorem 1 we have that the probability that \({\mathcal {M}}_N\) and \({\mathcal {M}}/\sim \) have different structures is negligible in the limit. Conditional on \({\mathcal {M}}_N\) and \({\mathcal {M}}/\sim \) having the same structure, we also have by the law of large numbers that the parameters of \({\mathcal {M}}_N\) converge to the parameters of \({\mathcal {M}}/\sim \). It is therefore sufficient to show that then also \(P_{{\mathcal {M}}_N}(\varphi )\) converges to \(P_{{\mathcal {M}}/\sim }(\varphi )=P_{{\mathcal {M}}}(\varphi )\).

All properties \(\varphi \) expressible in sub-CSL are finite-horizon in the sense that there exists a fixed time limit t, such that whether a timed trace \(\rho =\sigma _0 t_0 \sigma _1 t_1 \ldots \) satisfies \(\varphi \) only depends on the prefix \(\rho [0:k]\), where k is such that \(t_0+\cdots + t_k>t\). For a purely propositional formula \(\varPhi \) this is \(t=0\), and for a formula containing a temporal operator with subscript I, t is the upper bound \(I^u\) of I. The set of traces satisfying \(\varphi \), therefore, can be represented as a countable disjoint union of sets of paths that are slightly generalized forms of cylinder sets. For example, the set of paths satisfying \(\varPhi _1 \text{ U }_I \varPhi _2\) is the union over all paths of the form \(q_0t_0\ldots q_{k-1}t_{k-1}q_k t_k\) where \(q_0,\ldots ,q_{k-1}\) satisfy \(\varPhi _1\), \(q_k\) satisfies \(\varPhi _2\), and \(t_0+\cdots +t_{k}\in I\). The probabilities of such slightly generalized cylinder sets are a continuous function of the transition probability and exit rate parameters of \({\mathcal {M}}_N\), and therefore the convergence of these parameters guarantees the convergence of the probabilities of the generalized cylinder sets, and thereby the convergence of the probability of \(\varphi \). \(\square \)
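Concretely, writing r(q) for the exit rate of state q, the probability of one generalized cylinder set of the above form factorizes as

$$\begin{aligned} \Big (\prod _{i=0}^{k-1}\delta (q_i,q_{i+1})\Big )\cdot P\Big (\sum _{i=0}^{k} T_i\in I\Big ), \quad T_i\sim \mathrm {Exp}(r(q_i))\ \text{ independent }, \end{aligned}$$

since, given the state sequence, the sojourn times of a CTMC are independent and exponentially distributed; both factors depend continuously on the transition probability and exit rate parameters.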

We now state the corresponding result for LTL and DMDPs, which subsumes the case of LTL and DLMCs.

Theorem 3

Let \({\mathcal {M}}\) be a DMDP, and \({\mathcal {M}}_N\) as in Theorem 1 using significance levels \(\epsilon _N=1/N^r\). If \(r>2\), then for all LTL properties \(\varphi \):

$$\begin{aligned} P^s( \lim _{N\rightarrow \infty }P^{\mathrm {max}}_{{\mathcal {M}}_N}(\varphi ) = P^{\mathrm {max}}_{{\mathcal {M}}}(\varphi ))= P^s( \lim _{N\rightarrow \infty }P^{\mathrm {min}}_{{\mathcal {M}}_N}(\varphi ) = P^{\mathrm {min}}_{{\mathcal {M}}}(\varphi ))=1. \end{aligned}$$
(3)

If \(r\ge 1\), then for all \(\delta >0\):

$$\begin{aligned} \lim _{N\rightarrow \infty } P^s(\mid \! P^{\mathrm {max}}_{{\mathcal {M}}_N}(\varphi ) - P^{\mathrm {max}}_{{\mathcal {M}}}(\varphi ) \!\mid>\delta ) = \lim _{N\rightarrow \infty } P^s(\mid \! P^{\mathrm {min}}_{{\mathcal {M}}_N}(\varphi ) - P^{\mathrm {min}}_{{\mathcal {M}}}(\varphi ) \!\mid >\delta ) =0 \end{aligned}$$
(4)

The following is a slightly generalized version of the proof that was given for DLMCs in Mao et al. (2011).

Proof

Using the automata-theoretic approach to verification (Vardi 1985; Courcoubetis and Yannakakis 1995; Vardi 1999; Baier and Katoen 2008, Section 10.6.4), the probabilities \(P^{\mathrm {max}}_{{\mathcal {M}}_N}(\varphi )\) and \(P^{\mathrm {max}}_{{\mathcal {M}}}(\varphi )\) can be identified with maximum reachability probabilities in the respective products of \({\mathcal {M}}_N\) and \({\mathcal {M}}\) with a Rabin automaton B representing \(\varphi \). The maximum here is with respect to all possible memoryless schedulers on the product MDPs. Since \({\mathcal {M}}\) and \({\mathcal {M}}/\sim \) are equivalent with respect to LTL properties, one can consider the product of \({\mathcal {M}}/\sim \) with B instead, which, by Theorem 1 for the case \(r>2\), will for almost all N have the same structure as the product of \({\mathcal {M}}_N\) with B. Maximum reachability probabilities in the product MDPs are a continuous function of the transition probability parameters on the interior of the parameter space, i.e., for sequences of parameters \(p_N\rightarrow p\) where \(p\ne 0,1\). Since \({\mathcal {M}}_N\) and \({\mathcal {M}}/\sim \) agree on all 0/1-valued parameters, and for all others the parameters of \({\mathcal {M}}_N\) converge to those of \({\mathcal {M}}/\sim \), one also obtains \(P^{\mathrm {max}}_{{\mathcal {M}}_N}(\varphi ) \rightarrow P^{\mathrm {max}}_{{\mathcal {M}}}(\varphi )\). The argument for \(P^{\mathrm {min}}\) is analogous, considering minimum instead of maximum reachability. The proof for the case \(r\ge 1\) is identical, using the weaker convergence guarantee of Theorem 1 for this case. \(\square \)

Theorem 3 makes a strictly stronger statement for the choice of significance levels \(\epsilon _N=1/N^r\) with \(r>2\). However, all statements are purely asymptotic, and these very small \(\epsilon _N\)-values may lead to significantly under-estimating the size of the generating model when learning from a given, limited dataset. In practice, therefore, one may prefer the weaker guarantees obtained for \(\epsilon _N=1/N\) in exchange for a lower risk of learning an over-simplified model.

An important observation is that Theorems 2 and 3 are pointwise for each \(\varphi \), and not uniform over the languages sub-CSL and LTL, respectively. Thus, it is not the case that in the limit we will learn a model that simultaneously approximates the probabilities of all properties \(\varphi \) to within a fixed error bound \(\delta \). In other words, the sample size N required to obtain a good approximation can be different for different \(\varphi \). This is inevitable, since both sub-CSL and LTL contain formulas of unbounded complexity.

To illustrate this point, consider an LMC model \({\mathcal {M}}\) for a sequence of coin tosses: the model has two states labeled H and T, respectively, and transition probabilities of 1/2 between all states. Let \({\mathcal {M}}_N\) be a learned approximation of \({\mathcal {M}}\). The transition probabilities in \({\mathcal {M}}_N\) will deviate slightly from the true values 1/2; for example, assume that in \({\mathcal {M}}_N\) the transitions leading into H have probability \(1/2+\delta \), and the transitions leading into T have probability \(1/2-\delta \). Then one can construct LTL formulas \(\varphi \) such that \(\mid \! P_{\mathcal {M}}(\varphi )-P_{{\mathcal {M}}_N}(\varphi ) \!\mid \) is arbitrarily close to 1. To see this, observe that according to \({\mathcal {M}}\) the relative frequency of the symbol H in long execution traces converges to 1/2, whereas according to \({\mathcal {M}}_N\) it converges to \(1/2+\delta \). For any \(k>0\) we can express with an LTL formula \(\varphi _k\) that the frequency of H in the first k steps is at least \(1/2+\delta /2\), simply by enumerating all sequences of length k that satisfy this condition. Then, as \(k\rightarrow \infty \), \(P_{\mathcal {M}}(\varphi _k)\rightarrow 0\) and \(P_{{\mathcal {M}}_N}(\varphi _k)\rightarrow 1\).
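This divergence is easy to check numerically: ignoring the initial state, under \({\mathcal {M}}\) the number of H symbols among the first k symbols is binomially distributed with success probability 1/2, and under \({\mathcal {M}}_N\) with success probability \(1/2+\delta \). A minimal sketch (in Python with SciPy; the value \(\delta =0.05\) is an assumption chosen purely for illustration):

from math import ceil
from scipy.stats import binom

delta = 0.05                                     # assumed estimation error
for k in (10, 100, 1000, 10000):
    m = ceil(k * (0.5 + delta / 2))              # phi_k: at least m H's in k steps
    p_true = binom.sf(m - 1, k, 0.5)             # P_M(phi_k), tends to 0
    p_learned = binom.sf(m - 1, k, 0.5 + delta)  # P_{M_N}(phi_k), tends to 1
    print(k, p_true, p_learned)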

5 Experiments

In order to validate the proposed algorithm we have conducted two case studies on learning stochastic system models. Since a DLMC can be seen as a DMDP with only a single input action, we only report results for DMDPs and DCTMCs. For each case study, we generated observation sequences (I/O strings and timed strings) from known system models, and compared the generating models and the learned models with respect to relevant system properties expressed by LTL and sub-CSL formulas. All experiments were performed on a standard laptop with a 2.4 GHz CPU.

5.1 Experiments with MDPs

For analyzing the behavior of the learning algorithm with respect to MDPs we consider a modified version of the slot machine model given by Jansen (2002). Our model represents a slot machine with three reels that are marked with two different symbols “bar” and “apple”, as well as a separate initial symbol “blank”. Starting with an initial configuration in which all reels show the “blank” symbol, the player can for a given number r of rounds select and spin one of the reels. A reel that has been spun will randomly display either “bar” or “apple”, where the probability of obtaining a “bar” is 0.7 in the first round and gradually decreases as \(0.7(r-k+1)/r\) for the kth round. The player receives a reward of 10 if the final configuration of the reels shows 3 bars, and a reward of 2 if the final configuration shows 2 bars. Instead of spinning a reel, the player can also choose to push a ‘stop’ button. In that case, with probability 0.5 the game will end, and the player receives the prize corresponding to the current configuration of the reels. With probability 0.5, the player will earn 2 extra rounds. Thus, choosing the ‘stop’ option can be beneficial when the current configuration already gives a reward (but at the risk that it will change into something less favorable when, instead of terminating, the game is extended by 2 rounds), or when, with the remaining available rounds, the current configuration is unlikely to change into a reward configuration (then at the risk that the game ends immediately with the current poor configuration).

This model is formalized as a DMDP whose states are defined by the configuration of the reels, the number of spins already performed sp (up to the maximum of r), and a Boolean end variable indicating whether the game is terminated. The granting of 2 extra spins is (approximately) implemented by decreasing the sp counter by 2, down to a minimum of 0 (an exact implementation would lead to an infinite state space). Input actions are \(spin _i\) (\(i=1,2,3\)) and stop. The output alphabet is \(\varSigma ^{out }= \{blank ,bar ,apple \}^3 \cup \{Pr0,Pr2,Pr10,end \}\). States with \(sp <r\) are labeled with the symbol from \(\{blank ,bar ,apple \}^3\) representing the current reel configuration. When the number of available spins has been exhausted, the next input (regardless of which input is chosen) leads to a state displaying the prize won as one of \(\{Pr0,Pr2,Pr10 \}\). Finally, one additional input leads to a terminal state labeled with end. States labeled with \(\{Pr0,Pr2,Pr10 \}\) have an associated reward of 0, 2, and 10, respectively. We have implemented this DMDP in PRISM (Kwiatkowska et al. 2011), and experimented with two versions of the model, given by \(r=3\) and \(r=5\). These models have 103 (\(r=3\)) and 161 (\(r=5\)) reachable states, respectively.

The model generates traces that with probability 1 are finite, in the sense that after finitely many steps the trace ends in an infinite sequence of end symbols. However, there is no fixed bound on the number of initial non-end symbols. We sample observation sequences from the models using a uniform random selection of input actions at each point. Sampling of one sequence is terminated when the end symbol appears. The length distribution of strings sampled in this manner is dominated by a geometric distribution with parameter \(\lambda =0.25\cdot 0.5=0.125\) (the probability that the random scheduler chooses the stop input and the game terminates on that input). The convergence in probability (2) of Theorem 1 is then also ensured under this sampling regime: the consistency properties of the Hoeffding test in relation to the expected sample string lengths, as described by Definitions 20 and 21 (iii), are unaffected when the length distribution of sampled strings is reduced, and the data support condition of Definition 21 (ii) still holds for all ‘relevant’ states of the IOFPTA, i.e., all states that are not just copies of the unique end state.

In the following, we characterize the size of data sets in terms of the total number N of observation symbols, rather than the number of sequences (as a better measure of the ‘raw size’ of the data). For sufficiently large samples, the ratio between the number of sequences and the number of symbols is very nearly constant, so that letting \(\epsilon _N=c/N\) also satisfies the conditions for obtaining (4) in Theorem 3 when N is the number of symbols. In our experiments we set \(c=10,000\), because that leads to \(\epsilon _{20,000}=0.5\) for our smallest data size \(N=20,000\). Since the use of this \(\epsilon _N\) sequence is only motivated by the theoretical convergence-in-the-limit guarantees, and these guarantees do not provide any optimality guarantees for the limited sample sizes we consider, we also consider the alternative sequence \(\epsilon _N=0.5\) for all N. This also serves the purpose of investigating the robustness of the learning results with respect to the choice of \(\epsilon _N\).

We evaluate the learned models based on how well they approximate properties of the generating model. We consider properties of the form \(P^{\mathrm {max}}(\phi )\) for different LTL formulas \(\phi \), and use the following accuracy measure for the evaluation: when p and \(\bar{p}\) are the probabilities in the true and learned models, respectively, then we use the Kullback-Leibler distance

$$\begin{aligned} KL(p,\bar{p}) = p\log \frac{p}{\bar{p}} + (1-p)\log \frac{1-p}{1-\bar{p}} \end{aligned}$$
(5)

to measure the error of \(\bar{p}\). The error, then, depends on the ratio \({p}/{\bar{p}}\) rather than the difference \(p-\bar{p}\). The inclusion of the term \((1-p)\log \frac{1-p}{1-\bar{p}}\) means that the estimate of \(P^{\mathrm {max}}(\phi )\) is also evaluated as an estimate for the dual \(P^{\mathrm {min}}(\lnot \phi )=1-P^{\mathrm {max}}(\phi )\). \( KL(p,\bar{p})\) is infinite when \(p\ne \bar{p}\in \{0,1\}\), i.e., when the learned value \(\bar{p}\) represents an incorrect assumption of deterministic behavior. On the other hand, \(\bar{p}\ne p\in \{0,1\}\), i.e., incorrectly modeling deterministic behavior as probabilistic, incurs only a finite KL error. This asymmetry is reasonable in many situations: estimating 0/1-values by non-extreme probabilities usually means erring on the safe side, whereas incorrectly inferring 0/1-values can, for example, lead to incorrect assumptions about critical safety properties.
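A direct implementation of this error measure is straightforward (a minimal sketch; the boundary conventions follow the discussion above):

import math

def kl_error(p, p_bar):
    # KL distance of Eq. (5); terms of the form 0*log(0/x) are taken
    # as 0, and the error is infinite when p != p_bar and p_bar is an
    # extreme value 0 or 1.
    def term(a, b):
        if a == 0.0:
            return 0.0
        if b == 0.0:
            return math.inf
        return a * math.log(a / b)
    return term(p, p_bar) + term(1.0 - p, 1.0 - p_bar)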

We compare the models learned by IOalergia with the models given by the initially constructed I/O frequency prefix tree acceptors (with the frequencies normalized to probabilities, so that the IOFPTA is itself a valid DMDP). These initial tree-models are just a somewhat compact representation of the original data, and model checking performed on the trees can be seen as statistical model checking for DMDPs. Based on the tree-model representation of the data, we can use the model checking functionality of the PRISM tool to also perform statistical model checking. However, it turned out that the PRISM model checking algorithms, which are optimized for models specified in a modular, structured way, do not perform well on the tree models, which are given by an unstructured state-level representation. Thus, even though PRISM is known to be able to operate on models with tens of millions of states, we were only able to run PRISM on tree models of up to around 60,000 states.

Fig. 6 Growth of tree and model size. Top \(r=3\), bottom \(r=5\)

Figure 6 shows how, for the \(r=3\) and \(r=5\) models, the number of states in the constructed IOFPTAs and learned models develops as a function of the data size. The plots are in log-log scale, with the number of data symbols (divided by 1000) on the x-axis, and the number of states of the trees and learned models on the left and right y-axes, respectively. The red lines (box symbols) show a linear growth of the IOFPTA in log-log space. These lines have a near-perfect fit with the functions \(550 N^{0.65}\) (\(r=3\)) and \(550 N^{0.8}\) (\(r=5\)). These fits experimentally verify the sub-linear growth of IOFPTAs, which is theoretically obtained from Lemma 2 (Appendix).

When learning with fixed \(\epsilon =0.5\), the learned model sizes also show an approximately linear behavior in log-log space, which translates to a growth of (approximately) the orders \(N^{0.27}\) (\(r=3\)) and \(N^{0.4}\) (\(r=5\)). Learning with \(\epsilon _N=10,000/N\) at first under-estimates the true model size. The models learned for the largest N values are very close in size to the generating model. However, the experimental range for N would need to be extended considerably further to ascertain that we are already seeing the asymptotic convergence to the true model here.

We evaluate the accuracy of the learned model based on a test suite of 61 LTL properties. The complete list of properties is given in “MDP Test Properties” of Appendix. As mentioned above, using PRISM model checking on the IOFPTAs as a surrogate for statistical model checking does not scale to very large tree models. Therefore, the results here are limited to a maximum of \(N=1m\) for \(r=3\), and \(N=320k\) for \(r=5\) (at these tree sizes, a model-checking run for all 61 properties took several hours, vs. a few seconds for model checking the model learned from the IOFPTA).

Table 1 Number of test properties with \(KL(p,\bar{p})=\infty \)

We first consider for how many of the test properties an error \(KL(p,\bar{p})=\infty \) is obtained, i.e., for how many the learned value \(\bar{p}\) is an erroneous deterministic 0/1-value. These numbers are given in Table 1. An entry k/l in this table means that for k test properties the IOFPTA gave an infinite KL-value, and for l properties this was the case for the learned model. A clear picture emerges: the learned model is much less likely to return erroneous deterministic values. This is a natural consequence of the model-smoothing effect resulting from the state-merging process, and illustrates that model learning can alleviate overfitting problems occurring in statistical model checking. The most problematic queries for IOFPTA model checking were the low-probability queries 56–61, where the true probabilities are in the range 0.03–0.002, and IOFPTA model checking returned the value 0. The values obtained from the learned models, on the other hand, approximated the true values rather well, with KL-errors in the range 0.001–0.01.

The smoothing effect in the learned models can also have the less desirable consequence of leading to non-extreme estimates for probabilities that in the generating model are actually 0/1-valued. This was observed for property 16, which for \(r=5\) has max-probability 1 in the generating model. Here IOFPTA model checking returned the correct result, whereas the probabilities in the learned models were in the range 0.95–0.99 even for large data sizes. Similarly, some of the properties that have zero probability in the \(r=3\) model had probabilities in the range 0.01–0.001 in the learned models.

Fig. 7 KL errors

Figure 7 illustrates the KL-errors for all 61 properties for small datasets (\(N=40k\)), and for the largest datasets for which model checking the IOFPTA tree was feasible (\(N=1m\) for \(r=3\), and \(N=320k\) for \(r=5\)). In these plots the x-axes index the test properties, sorted according to increasing values of the KL-errors obtained from the trees. Thus, the indexing differs from the numbering given in Table 5, and the ordering of the properties also differs between the four plots of Fig. 7. The y-axes show the KL-errors in log-scale. Infinite KL-values are represented by the value 10.0, and zero values by \(10^{-6}\).

At the right end of each plot appear the properties that gave \(KL=\infty \) from IOFPTA model-checking. The errors obtained for the same properties from the learned models are in the same range as the errors for other properties. On the left ends of the plots appear properties with actual probability zero, which give zero error from the tree, but nonzero estimates, and hence nonzero errors from the learned models.

Fig. 8 Probabilities for \(P^{\mathrm {max}}(\lnot \lozenge ^{<9}{} end )\) queries

Fig. 9 Probabilities for \(P^{\mathrm {max}}\lozenge Pr10 \) and \(P^{\mathrm {max}}\lozenge Pr2 \) queries. Plots with the larger probability values are for \(P^{\mathrm {max}}\lozenge Pr2 \)

For the \(r=5\) model, the properties appearing at indices 42–49 (\(N=40k\)) and 52–59 (\(N=320k\)) are properties 17–24 of Table 5, which are all of the form \(P^{\mathrm {max}}(\lnot \lozenge ^{<k}{} end )\) for different values of k, i.e., they represent the maximum probability of the game lasting at least k steps. For both the tree and the learned models the estimates for these probabilities were quite inaccurate. The right plot of Fig. 8 shows the actual probability values obtained for the \(P^{\mathrm {max}}(\lnot \lozenge ^{<9}{} end )\) query for \(r=5\). For the data sizes \(N=40k\) and \(N=320k\) depicted in Fig. 7, the estimates are above 0.9 for all trees and models, whereas the true value is 0.5. The left plot in Fig. 8 shows the results for the same query in the \(r=3\) case.

Figure 9 shows the probabilities returned for the queries \(P^{\mathrm {max}}\lozenge Pr10 \) and \(P^{\mathrm {max}}\lozenge Pr2 \). These are queries for which the corresponding KL-errors lie in the middle ranges of the KL-errors seen in Fig. 7.

Fig. 10 Average KL errors

Figure 10 shows the average KL-errors as a function of the data size. The average here is taken over all test properties excluding the properties \(P^{\mathrm {max}}(\lnot \lozenge ^{<k}{} end )\) (whose high values would otherwise mask the development of the KL-errors for the remaining properties). Furthermore, for each data size, only those properties are included for which all models return finite errors.

Fig. 11 Average KL error at \(N=10^6\) as a function of \(\epsilon \)

To obtain a more complete picture of the influence of the \(\epsilon \) parameter, we also vary \(\epsilon \) over the whole feasible range from 0 to 2 for the fixed data size \(N=10^6\). Figure 11 shows the sizes and average KL-errors for the learned models. The different \(\epsilon \)-values we used are listed on the x-axis at equidistant marks. The \(\epsilon \)-values we otherwise used for \(N=10^6\), namely 0.5 and 0.01, both lie in the middle of the range of values considered here. Even at the extreme end \(\epsilon =2\) the learned models are significantly smaller than the original IOFPTAs (which have sizes 47,564 and 134,693 for \(r=3\) and \(r=5\), respectively). This is because even though the Hoeffding test proper will always reject when \(\epsilon =2\), we still obtain positive compatibility results, and hence merges of nodes, due to the base test in line 1 of our Hoeffding compatibility test (Algorithm 3). The minimal model size is 31 nodes, corresponding to exactly one node for each output symbol. This minimal size is reached at \(\epsilon =10^{-60}\) and \(\epsilon =10^{-10}\) for \(r=3\) and \(r=5\), respectively. The average KL errors are shown in Fig. 11 separately for the “hard” test properties \(P^{\mathrm {max}}(\lnot \lozenge ^{<k}{} end )\) and the remaining “easy” properties. Furthermore, to obtain readable plots, the KL-errors for the hard properties have been scaled by a factor of 0.1.
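To make the role of the base test concrete, the following sketch shows an Alergia-style Hoeffding compatibility check. It combines the standard Hoeffding bound with a simplified base case; the exact base test in line 1 of Algorithm 3, which additionally accepts certain very small count combinations, is given in the Appendix:

import math

def hoeffding_compatible(f1, n1, f2, n2, eps):
    # Simplified base test: with no observations at one of the nodes
    # there is no evidence against compatibility.
    if n1 == 0 or n2 == 0:
        return True
    # Standard Alergia Hoeffding bound. At eps = 2 the bound is 0, so
    # the test proper rejects whenever the empirical frequencies differ.
    bound = (math.sqrt(0.5 * math.log(2.0 / eps))
             * (1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2)))
    return abs(f1 / n1 - f2 / n2) < bound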

Figure 11 indicates that better results are obtained when \(\epsilon \) is chosen so large that the size of the learned model is somewhat larger than the size of the true model. This is to be expected, since a model that over-estimates the true number of states can be trace-equivalent to the true model, whereas a model with fewer states than the true model usually cannot. For the ‘easy’ test properties we obtain a fairly clear picture of optimal \(\epsilon \)-values in the range 0.5–1.5, corresponding to models that are between \(1\times \) and \(10\times \) the size of the true model. The picture for the ‘hard’ properties is less clear, and it differs between \(r=3\), where the most accurate models are learned for a range of small \(\epsilon \)-values, and \(r=5\), where the error decreases nearly monotonically as \(\epsilon \) increases. Overall, the results show that IOalergia learning is quite robust with respect to the precise choice of the \(\epsilon \) value.

Summarizing our observations, we can reach a number of conclusions: the differences in the accuracy of estimated probabilities are quite significant for different models of similar size (\(r=3\) with 103 states; \(r=5\) with 161 states), and for different queries \(P^{\mathrm {max}}(\lnot \lozenge ^{<k}{} end )\) and \(P^{\mathrm {max}}\lozenge PrX \) of similar syntactic form and complexity. Thus, neither the size of the true model, nor the complexity of the query alone will be good predictors for the accuracy of max-probability estimates obtained either by statistical model checking, or by model learning. In spite of very different convergence speeds, we observed convergence of the estimated max-probabilities to the true values for all test properties.

When comparing statistical model checking against model learning, no clear winner emerges in terms of the accuracy of estimated probabilities. The main difference lies in a smoothing effect of the learning process that eliminates extreme 0/1 empirical probabilities. This can allow the learned model to successfully generalize from the data, and return accurate estimates for low-probability properties that are not seen in the data, and for which statistical model checking returns zero probabilities. On the other hand, it can also lead to over-generalization, where true probability zero properties are given non-zero values in the learned model. Here it should be emphasized that in our experiments we have not tried to exploit another generalization capability of model learning, which is the ability to generalize from observations of finite initial trace segments to infinite behaviors. Traces in our slot machine model are finite with probability 1, and our data only contained traces of completed runs. This gives ideal conditions for statistical model checking, since empirical probabilities in the data correspond to actual model probabilities.

Comparing the results obtained from models learned with fixed \(\epsilon =0.5\) and with decreasing \(\epsilon _N=10,000/N\), we observe in Figs. 8, 9 and 10 a slight advantage for \(\epsilon =0.5\) at smaller data sizes. This is explained by Fig. 6, which shows that under the \(\epsilon _N=10,000/N\) regime the learned model stays smaller than the true model over the whole range of data sizes, approaching the true size only at the very end. The \(\epsilon =0.5\) models, on the other hand, soon become somewhat larger than the true model. As also indicated by Fig. 11, moderate over-approximations of the true model tend to lead to smaller KL errors.

Fig. 12 Time for model learning (\(r=3\))

In terms of space, model learning obviously leads to very significant savings (Fig. 6). As mentioned above, we cannot make a meaningful comparison of the time complexity of statistical model checking versus model learning, since we are using a very inefficient approach for performing the former. Figure 12 shows the computation time for IOalergia learning for the case \(r=3\) and \(\epsilon =0.5\). The overall time is divided into the construction time for the IOFPTA, and the time for the IOalergia node-merging process. We observe that both times are linear in the data size. For Alergia, the theoretical worst-case complexity is cubic in the size of the IOFPTA, but the linear behavior we observe here is consistent with what is reported as the typical behavior of Alergia in practice. Moreover, we see that the times for the tree construction and the node merging phases of the learning procedure are of the same order of magnitude. Since even a highly optimized statistical model checking procedure will not be much faster than the IOFPTA construction, we can conclude that the time for model learning is of the same order of magnitude as a single run of statistical model checking, with significant savings for the amortized cost of checking multiple properties.

Table 2 Accuracies of pure versus count-aggregating IOalergia (\(r=3\))

As discussed in Sect. 3.2, in our IOalergia implementation we do not aggregate frequency counts when merging nodes, and we always perform the compatibility tests based on the counts in the original IOFPTA. For comparison we also tested a version of the algorithm in which counts are aggregated. The main observation we made was that for a given \(\epsilon \)-value, models learned using aggregated counts were larger than models learned without count aggregation. Thus, aggregating counts leads to more rejections in the compatibility tests. This can be explained by the fact that the Hoeffding test will always accept compatibility when the two counts \(n_1,n_2\) are very small (cf. Algorithm 3), e.g., when both are at most 2, or one is equal to 1 and the other is less than 10. Since the leaves of the IOFPTA (and nodes very close to the leaves) usually have very low counts, most pairs of leaves in the original IOFPTA will be tested as compatible; after aggregating the counts of two or three leaves, however, the counts are often no longer small enough for this to hold. The accuracy of models learned with count aggregation was not higher than the accuracy of models learned without aggregation but with \(\epsilon \)-settings that lead to models of approximately equal size. Table 2 shows some detailed results for the \(r=3\) model learned from data of size \(N=1m\). For the two \(\epsilon \)-values that were also used in the previous experiments for \(N=1m\), the table shows the sizes of the learned models, with and without count aggregation. For comparison, the IOFPTA is also included in the table. The average KL-error shown in the last column is the average error over all 61 test properties (for \(r=3\), \(N=1m\) the errors for the difficult properties 17–24 are not such clear outliers that their inclusion in the average dominates the results). For the IOFPTA the KL-error is averaged over all properties except two for which the error is infinite. The table indicates that the accuracy depends more on the size of the learned model (the best results being obtained when slightly over-estimating the true size) than on whether learning is with or without count aggregation.

5.2 Experiments for CTMCs

For CTMCs, we consider a case study adapted from Haverkort et al. (2000), where two sub-clusters of workstations are connected through a backbone. Each sub-cluster has N workstations, and the data from a workstation is sent to the backbone by a switch connected to the workstation’s sub-cluster. The topology of the system is shown in Fig. 13. Each component in the system can break down, and any broken component can be repaired. The average failure-free running time of the workstations, switches, and backbone is 2, 5, and 10 h, respectively; the average time required for repairing one of these components is 1, 2, and 4 h, respectively. There are two types of Quality of Service (QoS) associated with the system:

  • minimum: at least \(3N/4\) workstations are operational and connected via switches and the backbone,

  • premium: at least N workstations are operational and connected via switches and the backbone.

Note that if the premium requirement is met, then so is the minimum requirement. We specify CTMCs for this system with a varying number of workstations. The summary statistics for the models in terms of the number of states and transitions are listed in Table 3.

Fig. 13 The topology of a workstation cluster (Haverkort et al. 2000)

Table 3 Summary statistics of the CTMC models for the workstation cluster case study

When generating data from the specified models, the observation sequences correspond to timed strings that alternate between observable symbols and time values. Following the sampling procedure outlined in Sect. 4, we generated observation sequences from different system configurations with 4, 8, and 10 workstations in each sub-cluster. The average length of these observation sequences is 50. We also assume that each component is operational initially. For the present case study, the most important property is the amount of time for which the minimum and premium QoS requirements are satisfied. These two properties are expressed by the sub-CSL formulas

$$\begin{aligned} P=? [\lozenge _{\le t}\; !\mathsf {``minimum''}]\qquad P=? [\lozenge _{\le t}\; !\mathsf {``premium''}], \end{aligned}$$

where t is a real number.

For the experimental results reported below, we used significance level \(\epsilon =0.5\) for the compatibility tests employed in the learning algorithms. The choice of a fixed \(\epsilon \)-value is based on the experimental results for the slot machine model (see Sect. 5.1), which showed that the learning algorithms are fairly robust with respect to the particular choice of this value.

As shown in Fig. 14, the two QoS properties above are generally well approximated by the learned models, although (as expected) the quality of the approximations decreases as the complexity of the generating models increases. All models are learned using 40,000 symbols, and all probabilities have been computed using PRISM. For comparison, we have also included the results obtained by directly using the timed frequency prefix tree acceptors (TFPTAs) for performing model checking. As can be seen from the figure, as the prediction horizon increases, the properties are no longer well approximated by the TFPTA-models. Summary information about the models learned for various data sizes and system configurations is given in the first five columns of Table 4; \(|S|\) is the number of symbols in the dataset (\(\times 10^3\)); \(|\mathrm {Seq}|\) is the number of sequences in the dataset; \(|\text {TFPTA}|\) is the number of nodes in the TFPTA; ‘Time’ is the learning time (in seconds), including the time for constructing the TFPTA; and |Q| is the number of states in the learned model.

Fig. 14 The results of checking the properties \(P=? [\lozenge _{\le t}\; !\mathsf {``minimum''}]\) and \(P=? [\lozenge _{\le t}\; !\mathsf {``premium''}]\) in the learned models, timed frequency prefix tree acceptor, and the generating models with \(t \in [0.5,6]\)

Table 4 Experimental results for the workstation cluster
Fig. 15 The quality of learned models measured in terms of randomly generated formulas

In addition to the two properties above, we have measured the quality of the learned models by randomly generating sets of sub-CSL formulas \(\varPhi \) using a stochastic context-free grammar. Each formula is restricted to a maximum length of 20. For the temporal operators we uniformly sampled a time value t from [0, 20] and defined the time intervals as [0, t]. In order to avoid testing on tautologies or other formulas with little discriminative value, we constructed a baseline model B with one state for each symbol in the alphabet and with uniform transition probabilities. For each generated formula \(\varphi \in \varPhi \) we then tested whether the formula was able to discriminate between the learned model A, the generating model M, and the baseline model B. If \(\varphi \) was not able to discriminate between these three models (i.e., \(P_{A}(\varphi )=P_{M}(\varphi )=P_{B}(\varphi )\)), then \(\varphi \) was removed from \(\varPhi \).

We finally evaluated the learned models by comparing the mean absolute difference in probability (calculated using PRISM) over the generated formulas for the models M and A:

$$\begin{aligned} D_A = \frac{1}{|\varPhi |}\sum \nolimits _{\varphi \in \varPhi } {|P_{M } (\varphi ) - P_{A } (\varphi )|}. \end{aligned}$$
(6)

The mean absolute difference between M and B is calculated analogously.
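In code, the formula filtering and the evaluation measure amount to a few lines (a sketch; prob_A, prob_M, and prob_B are assumed to map each formula to the probability computed by PRISM for the learned, generating, and baseline model, respectively):

def discriminative(formulas, prob_A, prob_M, prob_B):
    # Keep only formulas that distinguish at least two of the models.
    return [phi for phi in formulas
            if not (prob_A[phi] == prob_M[phi] == prob_B[phi])]

def mean_abs_difference(formulas, prob_M, prob_X):
    # D_X of Eq. (6): mean absolute probability difference over Phi.
    return sum(abs(prob_M[phi] - prob_X[phi]) for phi in formulas) / len(formulas)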

The results of the experiments are listed in columns \(D_A\) and \(D_A^T\) in Table 4, where column \(D_A^T\) lists the results obtained by performing model checking using the TFPTA-model. For the models with 4, 8, and 10 workstations in each sub-cluster we ended up with 677, 637, and 635 random formulas, respectively, after the elimination of non-discriminative formulas. The results are further illustrated in Fig. 15, where we also see that the difference (measured using the randomly generated formulas) between the learned model and the generating model decreases as the amount of data increases. Each data point is the mean value based on eight experiments with different randomly generated data sets. For comparison, the mean absolute difference between the baseline models and the generating models is 0.424, 0.350, and 0.293 for \(N=4\), \(N=8\), and \(N=10\), respectively.

6 Conclusion

In this paper we have proposed a framework for learning probabilistic system models based on observed system behaviors. Specifically, we have considered system models in the form of deterministic Markov decision processes and continuous-time Markov chains, where the former model class includes standard deterministic Markov chains as a special case. The learning framework is presented within a model checking context and is based on an adapted version of the Alergia algorithm (Carrasco and Oncina 1994) for learning finite probabilistic automata models.

We have shown that in the large sample limit the learning algorithm will correctly identify the model structure as well as the probability parameters of the model. We position the learning results within a model checking context by showing that for the learned models the probabilities of model properties expressed in the formal specification languages LTL and sub-CSL will converge to the probabilities given by the true models.

The learning framework is empirically analyzed based on two use cases covering Markov decision processes and continuous-time Markov chains. The use cases are analyzed with respect to the structure of the learned system models as well as relevant LTL and sub-CSL definable properties. The results show that for both model classes the learning algorithm is able to produce models that provide accurate estimates of the probabilities of the specified LTL and sub-CSL properties. The results have also been compared to the estimates obtained by statistical model checking, with the analysis limited to properties testable by statistical model checking; thus, we do not exploit the generalization capabilities of model learning for reasoning about unbounded system properties. The comparison shows that in terms of LTL-accuracy there is no clear winner between the two approaches; the main differences in the results are caused by the smoothing effect of model learning. On the other hand, in terms of space and time complexity we see a significant difference in favor of model learning. For the sub-CSL properties, both the accuracy and complexity results are significantly better than those obtained by statistical model checking, in particular for sub-CSL properties defined over longer time horizons. These results are further complemented by accuracy estimates for randomly generated sub-CSL formulas, demonstrating that the learned models also provide accurate probability estimates for more general model properties.

The theoretical learning results presented in the paper focus on learning in the limit rather than on probably approximately correct (PAC) learning results. Extending the results to PAC learning would require an error measure for the model classes in question, which, in turn, would entail defining a suitable measure for probability distributions over \(\varSigma ^{\omega }\). Candidate error measures have been investigated by Jaeger et al. (2014) who show that there are fundamental difficulties in defining measures that on the one hand support PAC learnability results and on the other hand satisfy natural continuity properties.

In addition to the results reported in the paper, we have conducted preliminary experiments on learning deterministic MDP approximations based on observations generated by non-deterministic system models. The results showed that the learned (deterministic) models are not sufficiently expressive to capture all relevant non-deterministic system properties. Based on these results, we wish, as part of future work, to consider learning methods for non-deterministic model classes. We expect, however, that such learning methods will differ significantly from the methods proposed in the current paper, as, e.g., the assumption of deterministic system behavior is key to the FPTA-based data representation.

The current paper is a significantly extended version of Mao et al. (2011) and Mao et al. (2012). We have subsequently adapted the results in Mao et al. (2012) to support active learning scenarios, where one guides the interaction with the system under analysis in order to reduce the amount of data required for establishing an accurate system model (Chen and Nielsen 2012). Furthermore, the learning algorithm has also been extended for learning and verifying properties of systems endowed with a relational structure (Mao and Jaeger 2012). Generally, these learning results assume access to multiple observation sequences of the system in question. For systems that are hard (or even impossible) to restart, this requirement will rarely hold. In Chen et al. (2012) we have therefore considered methods for investigating system properties by learning system models based on a single observation sequence.