
1 Introduction

System identification algorithms aim to capture the behavior of a black-box system, often called the system under learning (SUL), in a formal model. Among the system identification approaches, active automata learning (AAL) [5, 23, 24] is a popular methodology to extract finite automata from a black-box. AAL has been successfully applied to learn security-critical protocol implementations [15, 16, 18], legacy code [8, 44], smart cards [13], interfaces of data structures [22], embedded control software [46], and (explainable) neural network policies [51].

Modern AAL methods [25, 48] are available via mature tool sets [11, 26, 36] that implement these methods. They are primarily built around Angluin’s Minimal Adequate Teacher (MAT) framework [5]. In essence, the theoretically elegant MAT framework requires access to two types of queries. First, an output query (OQ) allows the learner to execute a sequence of inputs on the black-box and observe its outputs. Second, an equivalence query (EQ) asks whether a hypothesized Mealy machine is indeed equivalent to the SUL. Implementing the equivalence query confronts practitioners with an impossibility [35]: How do we decide whether a learned model is equivalent to the behavior of a black-box? To overcome this impossibility, practitioners take a more modest stance and only approximate equivalence queries. One approach is to randomly sample from all possible input sequences, which leads to a statistical guarantee, as pioneered in the context of learning by Angluin [6]. Alternatively, based on ideas pioneered in [14, 50], the structure of the hypothesis is used to select a finite set of input sequences to be checked. These sets are test suites and the approach is called conformance testing [12]. This paper considers EQs via test suites; for an overview, see Sec. 2.

Challenge: Finding small test suites. Finite test suites can be obtained using the notion of k-completeness. In short, k-completeness guarantees equivalence under the assumption that the number of states in the SUL is at most k larger than in the hypothesis. Popular k-complete test suites are the classical W-method [14, 50] and variations thereon, such as Wp [19], HSI [32, 41] and Hybrid-ADS [34]; see empirical evaluations in [7]. We call these methods Access-Step-Identify (ASI). These are standard in tools like LearnLib [26] and AALpy [36]. However, k-complete test suites such as the W-method grow with \(|I|^k\), where |I| is the number of input symbols. Consequently, even for small k, these test suites are prohibitively large.

Our approach for smaller test suites. Towards smaller test suites, we adapt ASI methods and make assumptions on the shape of the SUL in relation to the shape of the hypothesis. In particular, we consider several natural assumptions that may occur in real-world systems. For instance, one of these assumptions is that in most states, most inputs either lead to an error-state or are simply discarded. Other assumptions are that certain inputs are used only in the beginning (e.g. in the authentication phase of a protocol), or that the SUL has a component structure where inputs are primarily used together within components. We formalize these assumptions, demonstrate their applicability on industrial benchmarks, and develop a notion of completeness under these assumptions. The resulting test suites are much smaller, as the factor \(|I|^k\) is restricted to \(|I'|^k\), with \(|I'| < |I|\).

Challenge: finding counterexamples as soon as possible. The time to find a counterexample during EQs is the bottleneck in AAL applications [47, 52]. To accelerate this process, it is helpful to constrain the search space of possible counterexamples, allowing for a targeted search. Here, complete test suites are again helpful, even if they cannot be fully executed and can only be approximated through sampling, as implemented for instance in the randomised W-method of LearnLib. Complete test suites then provide a constrained search space that still contains all actual counterexamples. Another relevant aspect for finding counterexamples fast is the order in which tests are chosen: an ordering in which counterexamples (empirically) occur early in the test suite is preferred.

Our approach for finding counterexamples faster. In the context of randomised W-methods, pruning input sequences that are not counterexamples yields a larger probability of sampling a counterexample and thus speeds up the procedure. However, for the smaller test suites described earlier, without domain-specific knowledge, we cannot be certain that they contain (a larger fraction of) counterexamples, as we do not know whether the underlying assumptions are met. Instead, our idea is to combine multiple test suites. We prefer tests from test suites that led to counterexamples in previous invocations of the EQ during the learning process. We operationalize this idea using multi-armed bandits.

Contributions. In summary, this paper introduces three new test suites that are complete under additional assumptions on the SUL (Sec. 4). We combine these test suites via a multi-armed bandit framework to accelerate finding counterexamples in EQs (Sec. 5). The paper demonstrates performance on scalable self-generated benchmarks, standard benchmarks and industry benchmarks (Sec. 6). The proofs of all theorems, the complete benchmark results and additional figures can be found in the appendix of the extended version of this paper [30].

Fig. 1. Interaction between the learner, teacher and SUL in the MAT framework.

2 Overview

We briefly illustrate the interactions in the MAT framework, the W-method, and our approach for generating smaller test suites, using a toy example. Recall that in the MAT framework the learner can pose output queries (OQ) and equivalence queries (EQ). This is depicted in Fig. 1, where EQs are implemented by the teacher. The Mealy machine in Fig. 2a depicts the SUL for a coffee machine with input alphabet \(I = \{\textit{coffee}, \textit{espresso}, \textit{tea}, 1 \}\). Coffee costs 1 euro, espresso costs 2 euros, and tea never gets dispensed. Via a series of queries, we may obtain the hypothesis in Fig. 2b. The hypothesis is easy to refute with an EQ, e.g., via the counterexample \(1 \cdot \textit{coffee}\). After various OQs, we learn the hypothesis in Fig. 2c. A short counterexample that distinguishes the hypothesis \(\mathcal {H}_1\) from the SUL \(\mathcal {S}\) is

$$\begin{aligned} \underbrace{1}_{\text {access}} \,\,\, \cdot \underbrace{1 \cdot \textit{coffee}}_{\text {infix}} \cdot \underbrace{\textit{coffee}}_{\text {distinguish}}. \end{aligned}$$

The counterexample consists of three parts. We first access \(q_1\) and \(t_1\), from which we run an infix that leads to either \(q_1\) or \(t_0\), and then we distinguish both states with coffee. Executing input coffee from \(q_1\) returns output coffee while executing input coffee from \(t_0\) returns output −. The W-method generates test suites that consist of input words of a similar shape. Concretely, test suites are constructed as \(P\cdot I^{\le k+1} \cdot W\), where P ensures access to the states in the hypothesis, \(I^{\le k+1}\) is the set of sequences of at most \(k+1\) arbitrary input symbols, used to step to states in the (larger) SUL, and W contains sequences that help to distinguish states. Test suites constructed in this way tend to contain many input sequences which do not help to refute the hypothesis. In our example, the W-method test suite with \(k=2\) for \(\mathcal {H}_1\) also contains uninformative sequences such as

$$ \underbrace{\epsilon }_{\text {access}} \,\,\, \cdot \underbrace{1 \cdot \textit{espresso} \cdot 1}_{\text {infix}} \cdot \underbrace{ \textit{coffee}}_{\text {distinguish}} \qquad \text {and} \qquad \underbrace{1}_{\text {access}} \,\,\, \cdot \underbrace{\textit{tea} \cdot \textit{espresso}}_{\text {infix}} \cdot \underbrace{\textit{espresso}}_{\text {distinguish}}.$$
Fig. 2. A coffee machine and two hypotheses which can be generated using AAL.

A smaller test suite. In hypothesis \(\mathcal {H}_1\), espresso and tea self-loop in all states. The counterexample to refute this hypothesis only requires the inputs coffee and 1. It is natural that input tea is not necessary to reach new states, as this option is obsolete. This leads us to a test suite for \(\mathcal {H}_1\) that excludes the inputs tea and espresso in the infix. If we generate infixes of length at most 3 (\(k=2\)) with the full alphabet, the test suite contains 112 test cases. If we exclude two inputs, only 12 test cases remain.

A set of smaller test suites. The restricted test suites that aim to exclude obsolete inputs can be refined. These restrictions can be adapted for other typical scenarios. Consider, e.g., network protocols that only perform a three-way handshake in the initial phase. In states where the communication protocol is initialized, these inputs are no longer relevant. Likewise, there are often clusters where the same input symbols are relevant. For instance, if a 10 cent coin is a relevant input in some state of a vending machine, then a 50 cent coin is likely also relevant.

Mixing test suites. Restricting the test suites yields the risk of missing counterexamples. While the test suite may be complete under (natural) additional assumptions, in a black-box setting we have no way to check whether these assumptions hold. We therefore present a methodology where various restricted test suites are combined, using multi-armed bandits to select test suites. During learning, the EQs then increasingly use test suites for which the assumptions hold, without the need for advance knowledge of the SUL.

3 Complete Test Suites

We recall complete test suites and start with preliminaries on Mealy machines.

Definition 3.1

A Mealy machine is a tuple \(\mathcal {M}= (Q, I, O, q_0, \delta , \lambda )\) with finite sets Q, I and O of states, inputs and outputs respectively; the initial state \(q_0 \in Q\), the transition function \(\delta :Q \times I \rightarrow Q\) and the output function \(\lambda :Q \times I \rightarrow O\).

Below, we also use partial Mealy machines; these are defined as above but with \(\delta :Q \times I \rightharpoonup Q\) and \(\lambda :Q \times I \rightharpoonup O\) partial functions with the same domain. For a partial function f we write \(f(x)\mathord {\downarrow }\) if f(x) is defined and \(f(x)\mathord {\uparrow }\) otherwise. The transition and output functions are extended to input words of length \(n \in \mathbb {N}\) in the standard way, as functions \(\delta :Q \times I^n \rightharpoonup Q\) and \(\lambda :Q \times I^n \rightharpoonup O^n\). We abbreviate \(\delta (q_0, w)\) by \(\delta (w)\). Given \(Q' \subseteq Q\) and \(L \subseteq I^*\), we write \(\Delta ^{\mathcal {M}}(Q',L) = \{\delta (q,w) \mid q \in Q', w \in L\}\) for the set of states reached from \(Q'\) via words in L, and we let \(\Delta ^{\mathcal {M}}(L) = \Delta ^{\mathcal {M}}(\{q_0\}, L)\). In particular \(\Delta ^{\mathcal {M}}(I^*)\) is the set of reachable states of \(\mathcal {M}\). We use the superscript \(\mathcal {M}\) to indicate to which Mealy machine we refer, e.g. \(Q^{\mathcal {M}}\) and \(\delta ^{\mathcal {M}}\). We write \(|\mathcal {M}|\) for the number of states in \(\mathcal {M}\). A state \(q \in Q^{\mathcal {M}}\) is a sink if for all \(i \in I\), \(\delta (q,i)=q\). We denote the set of sinks by \(Q_{\text {sink}}\).
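To make the notation concrete, the following is a minimal Python sketch (not taken from the paper; names such as `Mealy`, `run` and `sinks` are our own) of a Mealy machine with the extended transition/output functions and the set of sinks used below:

```python
from dataclasses import dataclass

@dataclass
class Mealy:
    """A complete Mealy machine: delta and lam map (state, input) pairs."""
    states: set    # Q
    inputs: set    # I
    q0: str        # initial state
    delta: dict    # (state, input) -> state
    lam: dict      # (state, input) -> output

    def run(self, word, q=None):
        """Extended delta/lambda: follow `word` and collect outputs."""
        q = self.q0 if q is None else q
        outs = []
        for i in word:
            outs.append(self.lam[(q, i)])
            q = self.delta[(q, i)]
        return q, outs

    def sinks(self):
        """Q_sink: states q with delta(q, i) = q for every input i."""
        return {q for q in self.states
                if all(self.delta[(q, i)] == q for i in self.inputs)}
```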

Definition 3.2

Given a language \(L \subseteq I^*\) and Mealy machines \(\mathcal {H}\) and \(\mathcal {S}\), states \(p \in Q^\mathcal {H}\) and \(q \in Q^\mathcal {S}\) are L-equivalent, written as \(p \sim _L q\), if \(\lambda ^\mathcal {H}(p,w)=\lambda ^\mathcal {S}(q,w)\) for all \(w \in L\). States p, q are equivalent, written \(p \sim q\), if they are \(I^*\)-equivalent. The Mealy machines \(\mathcal {H}\) and \(\mathcal {S}\) are equivalent, written \(\mathcal {H}\sim \mathcal {S}\), if \(q_0^{\mathcal {H}} \sim q_0^{\mathcal {S}}\).

Conformance testing techniques construct from a current hypothesis \(\mathcal {H}\) a suitable test suite \(T \subseteq I^*\), to be executed on the (black-box) SUL \(\mathcal {S}\). If a test case fails, we know the machines are inequivalent. Ideally, we want a test suite that contains a failing test case for every possible inequivalent Mealy machine. This is called a complete test suite. We define completeness in a more generic way than usual to make it easier to add conditions to the set of Mealy machines for which the test suite is complete in subsequent sections.

Definition 3.3

Given a Mealy machine \(\mathcal {H}\) and set of Mealy machines \(\mathcal {C}\), a test suite \(T \subseteq I^*\) is complete for \(\mathcal {H}\) w.r.t. \(\mathcal {C}\) if for all \(\mathcal {S}\in \mathcal {C}\), \(\mathcal {H}\sim _{T} \mathcal {S}\) implies \(\mathcal {H}\sim \mathcal {S}\).

In general, there are no test suites that are complete w.r.t. the (infinite) set \(\mathcal {C}\) containing all (inequivalent) Mealy machines [35]. In practice, we often use k-completeness, where we assume that \(\mathcal {C}\) only contains Mealy machines which have at most k states more than the hypothesis.

Definition 3.4

Let \(\mathcal {H}\) be a Mealy machine. A test suite \(T \subseteq I^*\) is k-complete for \(\mathcal {H}\) if it is complete w.r.t. \(\mathcal {C}^k_{\mathcal {H}} = \{ \mathcal {S}\mid |\mathcal {S}| - |\mathcal {H}|\,\le k \}\).

Conformance testing techniques often build k-complete test suites in a structured manner using a state cover and a characterization set. We give a formal description of a classical k-completeness technique: the W-method [14, 50].

Definition 3.5

An access sequence for \(q \in Q^{\mathcal {H}}\) is a word \(w \in I^*\) such that \(\delta ^{\mathcal {H}}(w)=q\). A language \(P \subseteq I^*\) is a state cover if \(\varepsilon \in P\) and P contains an access sequence for every reachable state, i.e., \(\Delta ^{\mathcal {H}}(P) = \Delta ^{\mathcal {H}}(I^*)\).

Definition 3.6

A characterization set for a Mealy machine \(\mathcal {H}\) is a non-empty language \(W \subseteq I^*\) such that \(p \sim _W q\) implies \(p \sim q\) for all \(p,q \in Q^\mathcal {H}\).

Let P be a minimal state cover and W a characterization set for \(\mathcal {H}\). Then, given \(k \in \mathbb {N}\), the W-method is the test suite \(T = P \cdot I^{\le k + 1} \cdot W\). The state cover P makes sure all states in \(\mathcal {H}\) are reached. The role of the set of infixes in \(I^{\le k + 1}\) is to reach states in \(\mathcal {S}\). The characterization set W checks whether the states reached in \(\mathcal {H}\) and \(\mathcal {S}\) after reading a word from \(P \cdot I^{\le k + 1}\) match. Other ASI methods differ in the computation of the characterization set and the structure of the test suite but are constructed from the same sets P, I, and W.
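As an illustration, assuming a minimal state cover P and characterization set W are already given as sets of input words (tuples), the W-method test suite can be enumerated as in the following sketch; restricting the `inputs` argument of `infixes` to a subalphabet is exactly the modification explored in Sec. 4:

```python
from itertools import product

def infixes(inputs, max_len):
    """All words over `inputs` of length at most `max_len`, including epsilon."""
    for length in range(max_len + 1):
        for word in product(sorted(inputs), repeat=length):
            yield word

def w_method(P, inputs, W, k):
    """The test suite P . I^{<=k+1} . W as a set of input words."""
    return {p + m + w for p in P for m in infixes(inputs, k + 1) for w in W}
```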

In the remainder of this section, we prove that the W-method is k-complete [14, 50]. We recall the proof strategy from [34], based on reachability and bisimulations up-to \(\sim _{L}\), in Appendix A of [30]. With minimal changes, this proof also works for other ASI methods. Here, we summarize the approach in two main steps, which we reuse in Sec. 4 to prove completeness for different test suites under additional conditions. The first step concerns reachability in \(\mathcal {S}\). We assume that \(\mathcal {H}\) is minimal w.r.t. the number of states, which is an invariant of active learning algorithms, our intended application. This assumption is only needed for the bound k to be correct; alternatively, one can bound the number of states of \(\mathcal {S}\) by k plus the number of pairwise inequivalent states in \(\mathcal {H}\).

Lemma 3.7

Let \(\mathcal {H}\) and \(\mathcal {S}\) be Mealy machines with \(|\mathcal {S}| - |\mathcal {H}|\,\le k\) for some integer k, and assume \(\mathcal {H}\) is minimal. Moreover, let P be a state cover for \(\mathcal {H}\) and W a characterization set for \(\mathcal {H}\). Finally, let \(L = P \cdot I^{\le k}\) and \(T = P \cdot W\) and suppose that \(\mathcal {H}\sim _T \mathcal {S}\). Then L is a state cover for \(\mathcal {S}\).

It is in the above lemma that the assumption \(\varepsilon \in P\) is used, to ensure that all states in \(\mathcal {S}\) are reached from a state in \(\Delta ^\mathcal {S}(P)\). The second step extends this to actual equivalence of the two Mealy machines.

Lemma 3.8

Suppose \(L \subseteq I^*\) is a state cover for both \(\mathcal {H}\) and \(\mathcal {S}\). Let W be a characterization set for \(\mathcal {H}\), and \(T = L \cdot I^{\le 1} \cdot W\). If \(\mathcal {H}\sim _T \mathcal {S}\), then \(\mathcal {H}\sim \mathcal {S}\).

Combining the above two lemmas, we recover k-completeness of the W-method.

Corollary 3.9

The W-method is k-complete.

4 Complete Test Suites with Subalphabets

We introduce test suites that are similar to the W-method but have fewer infixes. These test suites are roughly of the form \(T = P \cdot I_{sub}^{\le k + 1} \cdot W\), with different choices for \(I_{sub} \subseteq I\). If the subalphabet gets smaller, the test suite size always decreases. If we choose I for \(I_{sub}\), we recover the original W-method test suite.

In the following subsections, we provide three new functions, called experts, for generating subalphabets. These experts are tailored to perform well for certain Mealy machine structures. For each expert, we provide a parameterized family of Mealy machines for which the expert should work well, and we show they are complete under specific assumptions that strengthen those of k-completeness.

The experts can be embedded in any ASI method. For conciseness, we focus on the W-method. In the definition of expert, the output is a set of subalphabets rather than a single one \(I_{sub}\) as described above; this is used in one of the experts.

Definition 4.1

An expert is a function E which takes as arguments a Mealy machine \(\mathcal {H}\) and a word \(v \in I^*\), and returns a set of subalphabets \(I_1,\ldots ,I_n\).

The embedding in the W-method is as follows.

Definition 4.2

The expert test suite \(\textsf{ETS}\) for \(\mathcal {H}\), expert E and \(k \in \mathbb {N}\) is:

$$\begin{aligned} \textsf{ETS}_{E,k}(\mathcal {H}) = \bigcup _{v \in P} ( v \cdot (\bigcup _{I_{sub} \in E(\mathcal {H},v)} I_{sub}^{\le k-1}) \cdot I^{\le 2} \cdot W) \end{aligned}$$

where P is a minimal state cover for \(\mathcal {H}\) and W a characterization set.

Before introducing the new experts we define the trivial expert.

Definition 4.3

The trivial expert \(E_{\textsf{T}}\) is given by \(E_{\textsf{T}}(\mathcal {H},q) = \{ I^{\mathcal {H}} \}\).

If P is a minimal state cover and W a characterization set, then \(\textsf{ETS}_{E_{\textsf{T}},k}(\mathcal {H})\) is given by \(P \cdot I^{\le k - 1} \cdot I^{\le 2} \cdot W\), which is precisely the W-method test suite.
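A direct transcription of Definition 4.2 could look as follows (a sketch reusing `infixes` from Sec. 3 and treating an expert as a callable that maps a hypothesis and an access sequence to a list of subalphabets):

```python
def ets(h, expert, P, W, k):
    """Expert test suite of Def. 4.2:
       union over v in P of v . (union of I_sub^{<=k-1}) . I^{<=2} . W."""
    suite = set()
    for v in P:
        mids = set()
        for I_sub in expert(h, v):
            mids |= set(infixes(I_sub, k - 1))
        suite |= {v + m + s + w
                  for m in mids for s in infixes(h.inputs, 2) for w in W}
    return suite

def trivial_expert(h, v):
    """E_T: the full input alphabet, so ets(...) recovers the W-method."""
    return [h.inputs]
```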

Fig. 3. \(\textsf{ASML}_{a,b}\) models over inputs and outputs \(\{ x_i \mid 1 \le i \le a \} \cup \{ y_i \mid 1 \le i \le b \}\). Transitions not drawn, including all transitions \(y_i\) with \(1 \le i \le b\), lead to a sink with a unique output.

4.1 Active Inputs Expert

Motivation. Mealy machines with many inputs are challenging, even when most inputs induce no interesting behavior, i.e., when most inputs transition to sinks. This challenge is exemplified by the ASML models which were first described in [52] and partially made available for the 2019 RERS challenge [27]. The ASML models represent components of lithography systems used at ASML. These models feature many inputs that often lead to a sink state. Model m135 in particular has approximately 100 inputs that always transition to the sink state with the same output. The Mealy machines \(\textsf{ASML}_{a,b}\) where \(a, b \in \mathbb {N}\), displayed in Fig. 3, closely resemble m135. The model starts with a spine, then there is a choice between \(a\) branches, and the spine inputs are reused in a different order after the choice. There are \(b\) distinct inputs that always lead to a sink.

The expert. The active inputs expert addresses Mealy machines where there is a significant set of inputs that always lead to the sink state or self-loop. We define the active version of a Mealy machine and then the active inputs expert.

Definition 4.4

An input \(i\in I\) is active in \(q \in Q\), if \(\delta (q,i) \notin Q_{sink} \text { and } \delta (q,i) \ne q\). The active Mealy machine of \(\mathcal {H}= (Q,I,O,q_0,\delta ,\lambda )\) is the partial Mealy machine \(\textsf{active}(\mathcal {H}) = (Q \setminus Q_{sink},I',O,q_0,\delta ',\lambda ')\) such that

$$\begin{aligned} I' &= \{ i \in I \mid \exists q \in Q.~ i \text{ active in } q\},\\ \delta'(q,i) &= \begin{cases} \delta(q,i) & \text{if } i \text{ active in } q,\\ \mathord{\uparrow} & \text{otherwise}, \end{cases} \quad \text{ and } \quad \lambda'(q,i) = \begin{cases} \lambda(q,i) & \text{if } \delta'(q,i)\mathord{\downarrow},\\ \mathord{\uparrow} & \text{otherwise}. \end{cases} \end{aligned}$$

Definition 4.5

The active inputs expert \(E_{\textsf{AI}}\) is given by \(E_{\textsf{AI}}(\mathcal {H},p) = \{I^{\textsf{active}(\mathcal {H})} \}\).

Complexity. The time complexity of \(E_{\textsf{AI}}\) is \(\mathcal {O}(n \cdot |I|)\), where n is the number of states and |I| the number of inputs. This is achieved by first determining the set \(Q_{sink}\) in \(\mathcal {O}(n \cdot |I|)\), and then computing \(\delta '\) and \(I'\) simultaneously in \(\mathcal {O}(n \cdot |I|)\).
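A sketch of this two-pass computation, building on the `Mealy` class from Sec. 3 (the unused access-sequence argument keeps the expert interface of Def. 4.1):

```python
def active_inputs_expert(h, v=None):
    """E_AI: inputs that are active (neither a self-loop nor leading to a sink)
       in at least one state; returned as a single subalphabet."""
    sink = h.sinks()                       # first pass over all transitions
    active = {i for q in h.states if q not in sink
                for i in h.inputs
                if h.delta[(q, i)] not in sink and h.delta[(q, i)] != q}
    return [active]
```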

Completeness. Test suite \(\textsf{ETS}_{E_{\textsf{AI}},k}\) is complete for the set of Mealy machines which 1) have at most k additional states and 2) where all non-sink states can be reached by a word in the state cover followed by at most k active inputs.

Theorem 4.6

Suppose \(\textsf{ETS}_{E_{\textsf{AI}},k}(\mathcal {H})\) uses state cover P. Let \(\mathcal {C} = \{\mathcal {S}\in \mathcal {C}^k_{\mathcal {H}} \mid Q^{\mathcal {S}} \setminus Q_{sink}^{\mathcal {S}} \subseteq \Delta ^\mathcal {S}(P \cdot (I^{\textsf{active}(\mathcal {H})})^{\le k})\} \). Then \(\textsf{ETS}_{E_{\textsf{AI}},k}(\mathcal {H})\) is complete for \(\mathcal {C}\).

The proof follows from Lemma 3.8; the hypotheses make sure that a variant of Lemma 3.7 holds. In particular, if \(\mathcal {H}\) is minimal, the above theorem applies to the restriction of \(\mathcal {C}^k_{\mathcal {H}}\) to those Mealy machines \(\mathcal {S}\) in which all non-sink states are reachable via the sub-alphabet generated by \(E_{\textsf{AI}}\).

The active inputs expert performs well on \(\textsf{ASML}_{a,b}\) once the spine is learned because it will not generate infixes with inputs that always lead to the sink state. In the empirical evaluation performed in Sec. 6, it can be observed that \(E_{\textsf{AI}}\) requires significantly fewer symbols to learn \(\textsf{ASML}_{a,b}\) compared to \(E_{\textsf{T}}\).

4.2 Future Expert

Motivation. Real-world systems often contain an ‘initialization phase’ where inputs like start or login are used that are not used later in the system. Fig. 4 shows the family of Mealy machines \(\textsf{TCP}_{a,b}\) inspired by the TCP models [16]. The models of TCP clients contain two distinct phases: the three-way handshake and the connected part. After the three-way handshake, some inputs are never active again. \(\textsf{TCP}_{a,b}\) has the same two phases. For the last few hypotheses that arise during learning, all inputs will be active. Therefore, \(E_{\textsf{AI}}\) will generate the same \(\textsf{ETS}\) as \(E_{\textsf{T}}\). \(E_{\textsf{AI}}\) is too coarse here because, at different parts of the system, different sets of inputs are active.

Fig. 4. \(\textsf{TCP}_{a,b}\) models over inputs and outputs \(\{ x_i \mid 1 \le i \le a \} \cup \{ y_i \mid 1 \le i \le b \}\). Transitions not shown lead to a sink with a unique output.

The expert. The future expert generates a subalphabet for each state in the hypothesis. This subalphabet contains all inputs that are active from that state onwards, within a given number of steps. Bounding the number of steps can be useful in large models and avoids ending up with the complete alphabet when the Mealy machine is strongly connected.

Definition 4.7

The future expert \(E^l_{\textsf{F}}\) is given, for \(l \in \mathbb {N}\), by \(E^l_{\textsf{F}}(\mathcal {H},v) = \{ I_{v,l} \}\), where \(I_{v,l} = \{ i \mid \exists q \in \Delta ^{\textsf{active}(\mathcal {H})}(v \cdot I^{\le {l-1}}).~ \delta ^{\textsf{active}(\mathcal {H})}(q,i)\mathord {\downarrow } \}\).

Complexity. The time complexity \(\mathcal {O}(n(n + n|I|))\) can be achieved for \(E_{\textsf{F}}\) with a bounded BFS for each state.
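The bounded BFS mentioned above could be implemented as follows (a sketch building on the earlier `Mealy` class; for simplicity the access sequence v is executed in the full machine rather than in \(\textsf{active}(\mathcal {H})\)):

```python
from collections import deque

def future_expert(h, v, l):
    """E_F^l: inputs that are active in some state reachable from delta(v)
       by at most l-1 active steps."""
    sink = h.sinks()
    def active(q, i):
        return h.delta[(q, i)] not in sink and h.delta[(q, i)] != q
    start, _ = h.run(v)
    seen, queue, alphabet = {start}, deque([(start, 0)]), set()
    while queue:
        q, depth = queue.popleft()
        if depth >= l:          # only states at distance <= l-1 contribute
            continue
        for i in h.inputs:
            if active(q, i):
                alphabet.add(i)
                nxt = h.delta[(q, i)]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return [alphabet]
```

In the two-argument expert interface of Def. 4.1, l would be fixed up front, e.g. `expert = lambda h, v: future_expert(h, v, l=3)`.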

Completeness. For \(E^l_{\textsf{F}}\), we have the following completeness result.

Theorem 4.8

Suppose \(\textsf{ETS}_{E^l_{\textsf{F}},k}(\mathcal {H})\) uses state cover P. Let \(\mathcal {C} = \{\mathcal {S}\in \mathcal {C}^k_{\mathcal {H}} \mid Q^\mathcal {S}\setminus Q^\mathcal {S}_{sink} \subseteq \bigcup _{v\in P} \Delta ^\mathcal {S}(v \cdot I_{v, l}^{\le k})\}\). Then \(\textsf{ETS}_{E^l_{\textsf{F}},k}(\mathcal {H})\) is complete for \(\mathcal {C}\).

\(E^l_{\textsf{F}}\) performs well on \(\textsf{TCP}_{a,b}\) once the spine is learned because the subalphabet for states after \(y_a\) does not contain y-symbols, contrary to the subalphabet from \(E_{\textsf{T}}\). Sec. 6 shows that \(E^l_{\textsf{F}}\) often outperforms the trivial expert \(E_{\textsf{T}}\).

4.3 Components Expert

Motivation. In some systems, sets of inputs are often used together. For example, after entering a username you often enter a password as well. It is possible that a set of inputs that are used together occurs at multiple places in the system. Fig. 5 shows Mealy machines \(\textsf{SSH}_{a,b}\), loosely inspired by OpenSSH [18]. The OpenSSH model contains three phases: the key exchange, the authentication, and then the connection phase where re-keying is possible. For the family of Mealy machines \(\textsf{SSH}_{a,b}\), we assume there is a fixed set of possible keys and that the key exchange and re-keying use the same key-specific inputs, i.e., the inputs for the key exchange of key k are the relevant inputs for re-keying with key k.

The expert. The component expert generates subalphabets based on sets of states and is defined as follows.

Fig. 5. \(\textsf{SSH}_{a,b}\) models over inputs \(\{ x_{i,j} \mid 1 \le i \le a, j = 1, 2 \} \cup \{ y_i \mid 1 \le i \le b \} \cup \{y\}\) and the outputs \(\{ x_i \mid 1 \le i \le a \} \cup \{y, y_{fail}\}\). Transitions not shown lead to a sink with a unique output.

Definition 4.9

Let g be a function that takes a Mealy machine \(\mathcal {H}\) and returns a set of subsets of Q, referred to below as components. The component expert \(E^g_{\textsf{C}}\) with parameter g is defined s.t. \(E_C^g(\mathcal {H}, p) = \{ I_X \mid X \in g(\mathcal {H}) \}\) where \(I_X = \{ i \mid \exists q,q' \in X. \delta (q,i)=q' \}\).

Fig. 6. Example with colored communities.

Finding components. Finding a suitable subroutine g to determine components from a hypothesis is a non-trivial task. One relatively easy method for finding components is to compute the strongly connected components (SCCs). However, if the system can be reset at any state, then the complete model is an SCC and the components expert reduces to the trivial expert. Therefore, SCCs are often too strict. Another possibility is to utilize algorithms used in graph theory to decompose graphs into subgraphs. We propose to use Newman’s algorithm for detecting community structure [39] to identify components. The algorithm outputs sets of states with high transition density between states within the group. It starts with singleton communities and then greedily joins communities based on the maximal change in modularity, as long as it is positive. The modularity value \(\text {mod}(c)\) for component c is:

$$\begin{aligned} \text {mod}(c) = \frac{\#\text {edges staying in c}}{\#\text {edges}} - \frac{\#\text {outgoing edges of c} \cdot \#\text {incoming edges of c}}{\#\text {edges}^2} \end{aligned}$$

Example 4.10

We illustrate Newman’s algorithm on Fig. 6. Initially, \(\text {mod}(\{q_1\}) = 0 - \frac{2\cdot 3}{15^2} \approx -0.027, \text {mod}(\{q_3\}) = 0 - \frac{2\cdot 2}{15^2} \approx -0.018\). The difference between the initial modularity and the modularity of \(\{q_1,q_3\}\) (\(\frac{2}{15} - \frac{4\cdot 5}{15^2} \approx 0.044\)) is the highest possible change in modularity. We thus merge communities \(\{ q_1 \}\) and \(\{ q_3 \}\). Likewise, we then merge \(\{q_1, q_3\}\) and \(\{q_2\}\). After several steps we get to the final communities \(\{q_0, q_1, q_2, q_3\}\) and \(\{q_4,q_5,q_6\}\).
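The formula leaves implicit which edges count towards the out- and in-degrees of a component; the numbers in Example 4.10 are reproduced if all edges leaving or entering states of c are counted, including edges internal to c. A small helper under that reading (a sketch; `g` is any object exposing `edges` and `number_of_edges()`, e.g. a networkx DiGraph):

```python
def modularity(g, c):
    """mod(c): fraction of internal edges minus the expected fraction,
       where out/in-degrees of c include edges internal to c."""
    m = g.number_of_edges()
    internal = sum(1 for u, v in g.edges if u in c and v in c)
    out_deg = sum(1 for u, v in g.edges if u in c)
    in_deg = sum(1 for u, v in g.edges if v in c)
    return internal / m - (out_deg * in_deg) / m ** 2
```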

To apply Newman’s algorithm, the subroutine g transforms \(\textsf{active}(\mathcal {H})\) to a directed graph \(G = (Q,E)\) where \(E = \{ (q,q') \mid q,q' \in Q \wedge \exists i \in I. \delta (q,i)=q' \}\) and then applies Newman’s algorithm on G.
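A possible realization of the subroutine g and the component expert using networkx is sketched below (an assumption on the implementation, not the paper's code; the greedy modularity heuristic of Clauset, Newman and Moore is applied here to the undirected version of the transition graph as an approximation of Newman's directed variant):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def newman_components(h):
    """g: communities of non-sink states in the transition graph of active(H)."""
    sink = h.sinks()
    graph = nx.DiGraph()
    graph.add_nodes_from(q for q in h.states if q not in sink)
    for (q, i), q2 in h.delta.items():
        if q not in sink and q2 not in sink and q2 != q:
            graph.add_edge(q, q2)
    return [set(c) for c in greedy_modularity_communities(graph.to_undirected())]

def component_expert(h, v=None):
    """E_C^g: one subalphabet per component, containing its internal inputs."""
    return [{i for (q, i), q2 in h.delta.items() if q in X and q2 in X}
            for X in newman_components(h)]
```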

Complexity. The time complexity of \(E_C^g\) is \(\mathcal {O}(g + n \cdot |I|)\), where g is the complexity of the subroutine. The subterm \(\mathcal {O}(n \cdot |I|)\) originates from the active transformation. With Newman’s algorithm, the total complexity is in \(\mathcal {O}(n(n + n|I|))\) [39].

Completeness. \(\textsf{ETS}_{E^{g}_{\textsf{C}},k}\) is k-complete if all non-sink states in the SUL can be reached from a state p in the hypothesis with at most k inputs from some \(I_X\).

Theorem 4.11

Suppose \(\textsf{ETS}_{E_C^g,k}(\mathcal {H})\) uses state cover P. Let \(\mathcal {C} = \{\mathcal {S}\in \mathcal {C}^k_{\mathcal {H}} \mid Q^{\mathcal {S}} \setminus Q_{sink}^{\mathcal {S}} \subseteq \bigcup _{X \in g(\mathcal {H})} \Delta ^\mathcal {S}(P \cdot I_X^{\le k})\}\). Then \(\textsf{ETS}_{E_C^g,k}(\mathcal {H})\) is complete for \(\mathcal {C}\).

\(E^{\text {Newman}}_{\textsf{C}}\) performs well on \(\textsf{SSH}_{a,b}\) once the key exchange and authentication phase have been learned because the subalphabet mostly contains symbols that belong together and allows discovery of a whole new key exchange component. Ideally, \(\{x_{i,1}, x_{i,2}, x_{i,3}\}\), \(\{z_{i,1}, z_{i,2}, z_{i,3}\}\) for \(1 \le i \le a\) and \(\{y, y_0, ..., y_b\}\) form components for \(\textsf{SSH}_{a,b}\). In our experiments, Newman’s algorithm sometimes finds slightly bigger components.

5 Test Case Prioritization

To establish equivalence, all tests in a complete test suite need to be executed and their order is then irrelevant. However, to find a counterexample, we only need to execute tests until we hit that counterexample. This means that different orderings lead to significant performance differences [7]. In this section, we first describe the state-of-the-art in (ordered) test suites. We then create new, ordered test suites that adaptively combine the \(\textsf{ETS}\)’s from Sec. 4.

5.1 Randomised Test Suites

Test suites are often stored in a tree-like data structure. The straightforward ordering iterates over this tree to process the test cases deterministically. However, a variety of deterministic orderings for P, I, W are all (on average) outperformed by randomised methods that do a better job in diversification [7, Ch. 4]. State-of-the-art randomised test suite generation methods are described in [34, 46] and make use of a geometric distribution to determine the length of the infix. We present a simpler and more generic variation: Given an expert e and a distribution \(\mu \) over natural numbers, the randomised \(\textsf{ETS}\) \(S_{e,\mu }\) is a distribution over words \(v \cdot i\cdot w \in P \cdot I^* \cdot W\) such that:

$$\begin{aligned} S_{e,\mu }(v \cdot i \cdot w) = \frac{\mu (l)}{|\textsf{ETS}_{e,l}|} \qquad \text {for } |i|=l \end{aligned}$$
(1)

Informally, (1) indicates that the probability of sampling a test case with infix length l is the probability of sampling infix length l from distribution \(\mu \) and then uniformly sampling a test case from \(\textsf{ETS}_{e,l}\).
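A sampler in the spirit of Eq. (1) could look as follows (a simplified sketch: instead of sampling uniformly from \(\textsf{ETS}_{e,l}\), it draws the access sequence, an infix over one expert-provided subalphabet, and the distinguishing word independently):

```python
import random

def sample_test(h, expert, P, W, mu_sample):
    """Draw one test case v . m . w with infix length l ~ mu."""
    v = random.choice(sorted(P))
    l = mu_sample()                                   # infix length from mu
    I_sub = random.choice(expert(h, v)) or h.inputs   # fall back on empty subalphabet
    m = tuple(random.choice(sorted(I_sub)) for _ in range(l))
    w = random.choice(sorted(W))
    return v + m + w
```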

For any \(\mu \) with infinite support, the generated test suite is infinite. Thus, randomised ASI methods are test case prioritizations over infinite test suites \(P \cdot I^* \cdot W\). Still, randomised ASI methods often find counterexamples faster than k-complete ASI methods [4, 20]. To ensure k-completeness in randomised ASI methods, we need extra bookkeeping to determine whether the right tests have been executed and we can only guarantee that we execute these tests in the limit.

5.2 Multi-Armed Bandits

We want to use all experts from Sec. 4 to generate test cases. A naive solution is to determine a static distribution that describes how often an expert should be selected for generating a test case. However, it is unclear how such a distribution should be determined. Instead, we use so-called multi-armed bandits to dynamically update the distribution over available experts using information from previous testing rounds. We refer to this algorithm as the Mixture of Experts. The multi-armed bandits problem was first described by Robbins [43] and is a classic reinforcement learning problem. We instantiate the \(\textsf{EXP3}\) algorithm for adversarial multi-armed bandits [9]. Intuitively, our instantiation prioritizes test cases from better-performing experts. We embed the Mixture of Experts algorithm in the MAT framework from Fig. 1 and list the pseudocode in Algorithm 1.

Algorithm 1. Instantiated \(\textsf{EXP3}\) Algorithm for Test Case Generation.

Algorithm 1 is used with randomised ASI-methods. The algorithm is called with a hypothesis \(\mathcal {H}\) and \(\textsf{weights}\). The parameter \(\textsf{weights}\) indicates how good an expert is and is initialized to 1 for each enabled expert. The algorithm uses the set of enabled experts E, constant k, distribution \(\mu \), and exploration parameter \(\gamma \) as global parameters. The exploration parameter determines how often we choose an expert at random. In Algorithm 1, each iteration of the loop represents the generation of one test case. In each iteration, we first update the distribution \(\textsf{probs}\) for each expert \(i \in E\) using Eq. 2.

$$\begin{aligned} \textsf{probs}(i) \leftarrow (1-\gamma ) \cdot \frac{\textsf{weights}(i)}{\Sigma _{j \in E} \textsf{weights}(j)} + \frac{\gamma }{|E|} \end{aligned}$$
(2)

Next, we sample an expert from \(\textsf{probs}\) and sample a test case from \(S_{e,\mu }\). We determine the outputs v and \(v'\) of the test case on \(\mathcal {H}\) and \(\mathcal {S}\), respectively, and update the \(\textsf{weights}\) for the chosen expert e using Eq. 3 if \(v\ne v'\); otherwise the \(\textsf{weights}\) remain the same.

$$\begin{aligned} \textsf{weights}(e) \leftarrow \textsf{weights}(e) \cdot \exp \left( {\frac{\gamma }{\textsf{probs}(e) \cdot |E|}}\right) \end{aligned}$$
(3)

If \(v \ne v'\), then we have found a counterexample and the \(\textsf{weights}\) value for the chosen expert increases significantly. Consequently, the expert is more likely to be chosen to generate test cases in the next rounds. Finally, if \(v \ne v'\), we return the counterexample. Otherwise, we generate a new test case.
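Combining the above, the following is a sketch of the loop of Algorithm 1 (hypothetical names; the real implementation additionally tracks the symbol budget). `sul_run` is assumed to execute a word on the SUL and return its outputs, and `sample_test` is the sampler from Sec. 5.1:

```python
import math
import random

def find_counterexample(h, sul_run, experts, weights, P, W, mu_sample,
                        gamma=0.2, max_tests=100_000):
    """EXP3-style loop: pick an expert, sample a test, reward the expert
       whose test reveals a counterexample."""
    names = list(experts)
    for _ in range(max_tests):
        total = sum(weights[n] for n in names)
        probs = {n: (1 - gamma) * weights[n] / total + gamma / len(names)
                 for n in names}                                     # Eq. (2)
        e = random.choices(names, weights=[probs[n] for n in names])[0]
        test = sample_test(h, experts[e], P, W, mu_sample)
        out_h, out_s = h.run(test)[1], sul_run(test)                 # outputs on H and S
        if out_h != out_s:
            weights[e] *= math.exp(gamma / (probs[e] * len(names)))  # Eq. (3)
            return test                                              # counterexample found
    return None
```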

6 Experimental Evaluation

In this section, we empirically investigate the performance of our implementation of Algorithm 1 in comparison with a state-of-the-art baseline. The source code and all benchmarks are available online [29]. The first three experiments investigate the performance on four benchmark sets with varying complexities:

  • RQ1: How does Algorithm 1 scale on the models from Figs. 3, 4, and 5?

  • RQ2: How does Algorithm 1 compare to the state-of-the-art on industrial benchmarks from the RERS challenge [27]?

  • RQ3: How does Algorithm 1 perform on the standard automata wiki [38] benchmark suite and randomly generated Mealy machines?

In Experiment 3, we additionally consider an alternative non-randomised version of the presented algorithm which is not feasible to apply to the benchmarks of Experiment 2 given the worse performance of non-randomised test suites. Experiment 4 provides an in-depth analysis of runs on two benchmarks from the RERS challenge. Detailed benchmark results can be found in Appendix C of [30].

Experimental Setup We have extended the \(L^{\#}\) learning library [48] with the multi-armed bandits approach described in Sec. 5. We compare our implementation instantiated with different experts. We write \(\textsf{MoE}\)(\(*\)) to refer to our key contribution, using the Mixture of all Experts, i.e., \(\textsf{MoE}\)(\(E_{\textsf{T}}, E_{\textsf{AI}}, E^k_{\textsf{F}}, E^{\text {Newman}}_{\textsf{C}}\)). The exploration parameter \(\gamma \) used in Algorithm 1 is set to 0.2 (determined by grid search) and the number of hypothesis states before we start sampling experts to 5. We evaluate within a MAT framework as in Fig. 1. Our contributions can be paired with any learning algorithm in the MAT framework. We use \(L^{\#}\) [48], as this is a recent learning algorithm. We sample test cases from \(S_{E_{\textsf{T}},\mu }\) as our baseline. More precisely, we use randomised Hybrid-ADS, as formulated in [34, Ch. 1], as conformance testing technique. For both the baseline and Algorithm 1, the \(\mu \) in Eq. (1) is instantiated as follows: Let \(\textsf {geom}\) be the geometric distribution with mean 2, then randomised Hybrid-ADS generates \(S_{e,\mu }\) as in Eq. (1), where \(\mu (x) = \textsf {geom}(x)\) if \(x > 3\), \(\mu (3) = 7/8\), and \(\mu (x) = 0\) otherwise. These hyperparameters are chosen to match [20]. We run Experiments 1 and 3 with 30 seeds, and Experiments 2 and 4 with 50 seeds. In Experiments 1, 3 and 4 we evaluate the performance based on the total number of symbols and resets, which is the sum of the lengths of all test cases plus the number of test cases sent to the SUL. Additional plots based on only the symbols or only the resets can be found in Appendix D of [30].
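For concreteness, the infix-length distribution \(\mu \) described above could be sampled as follows (a sketch consistent with the stated parameters: mass 7/8 on length 3, and the tail of a mean-2 geometric distribution above 3):

```python
import random

def mu_sample():
    """mu(3) = 7/8; for x > 3, mu(x) follows the tail of a mean-2 geometric."""
    if random.random() < 7 / 8:
        return 3
    length = 4
    while random.random() < 0.5:   # extend the infix with probability 1/2
        length += 1
    return length
```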

Experiment 1 We evaluate the performance on the benchmark families \(\textsf{ASML}_{a,b}\), \(\textsf{TCP}_{a,b}\), and \(\textsf{SSH}_{a,b}\), for several choices of a and b. In all models, increasing a leads to a general increase in difficulty, while b adds the number of ‘irrelevant’ inputs. Beyond the baseline and \(\textsf{MoE}\)(\(*\)), we include for each family the associated experts discussed in Sec. 4, to validate that they indeed perform well on these families. Thus, for \(\textsf{ASML}_{a,b}\) we run \(\textsf{MoE}\)(\(E_{\textsf{T}}, E_{\textsf{AI}}\)), for \(\textsf{TCP}_{a,b}\) we run \(\textsf{MoE}\)(\(E_{\textsf{T}}, E^k_{F}\)), and for \(\textsf{SSH}_{a,b}\) we run \(\textsf{MoE}\)(\(E_{\textsf{T}}, E^{\text {Newman}}_{\textsf{C}}\)).

Fig. 7. Results Experiment 1.

Results. Fig. 7 plots the results, distinguishing six cases. Each column reflects another benchmark family. The top row shows the values for the parameterized models with \(a=3\), while the bottom row shows the values for the parameterized models with \(a=5\). In each figure, the x-axis reflects the value of b. The y-axis (log scale) shows the total number of symbols and resets to learn and test a model. The y-axis range differs across subplots.

Discussion. From the plot, we observe that the baseline is outperformed by the other algorithms. Interestingly, the performance of \(\textsf{MoE}\)(\(*\)) and the algorithm belonging to the parameterized model is often comparable. Increasing a leads to an increase in the total number of symbols and resets, which illustrates the scalability of the parameterized models. Increasing the value b has more influence on the baseline than the other algorithms, as expected.

Experiment 2 We compare \(\textsf{MoE}\)(\(*\)) to the baseline on the ASML benchmarks introduced in the RERS challenge [27]. We consider 23 models with 25-289 states and 10-177 inputs. We skip models with fewer than 15 states because \(\textsf{MoE}\) needs time to learn which expert works best. Learning the ASML models frequently does not terminate within a timeout of an hour [52]. Therefore, we set a maximal symbol budget. The SUL rejects new OQs once the budget is depleted.

Fig. 8. Results Experiment 2.

Results. Fig. 8 lists different models sorted by the number of transitions. For each model, we show how often out of 50 seeds an algorithm learns the model within a symbol budget of \(10^8\). We provide a similar figure with half the budget in Appendix D of [30].

Discussion. From the plot, we observe that \(\textsf{MoE}\)(\(*\)) learns the model more often than the baseline. The \(\textsf{MoE}\)(\(*\)) algorithm can learn 12 models with at least \(80\%\) of the seeds while the baseline only learns 3 models with at least \(80\%\) of the seeds. The same pattern can be observed for half the budget.

Experiment 3 We consider the protocol implementations used in [17, 48] (38 models, 15-133 states, 7-22 inputs) and randomly generated models (27 models, 20-60 states, 11-31 inputs). For the standard benchmarks, we perform the experiment with the randomised \(\textsf{ETS}\), as used in the other experiments, and the deterministically ordered \(\textsf{ETS}\) from Sec. 5.1 with \(k=2\).

Fig. 9. Results Experiment 3.

Results. Fig. 9 shows the number of symbols and resets needed to learn and test a model (log-scaled). The y-axis shows \(\textsf{MoE}\)(\(*\)) and the x-axis shows the baseline. The diagonal solid lines correspond to using the same number of symbols and resets, the dotted lines indicate a factor two difference. Points below the diagonal indicate that \(\textsf{MoE}\)(\(*\)) used fewer symbols and resets than the baseline.

Discussion. From Fig. 9a, we observe that \(\textsf{MoE}\)(\(*\)) slightly outperforms the baseline in the k-complete test suite setting. From Fig. 9b, we observe that \(\textsf{MoE}\)(\(*\)) achieves slightly better results than the baseline. The performance is comparable for the randomly generated models (Fig. 9c).

Experiment 4 We analyze runs of \(\textsf{MoE}\)(\(*\)) and the baseline for models m159 and m189 to provide insights on the behavior of the algorithms.

Results. Fig. 10 shows the runs of the first 3 seeds for m159 and m189. Each data point at (x, y) in the subplots represents one hypothesis, with x states, that was learned using a total of y symbols (notice the log scale). The green (or blue) lines correspond to runs with the baseline (or \(\textsf{MoE}\)(\(*\))). The different markers for \(\textsf{MoE}\)(\(*\)) indicate which expert was used to generate the counterexample. The vertical lines extending to \(10^8\) indicate that the algorithm ran out of budget before learning the correct model.

Discussion. In line with Experiment 2, we see that more runs learn the full model when using \(\textsf{MoE}\)(\(*\)). The plots use the number of states as a rough progress measure. Based on this progress measure, we see that the difference is negligible for small hypothesis sizes, but for larger hypotheses, the difference is substantial. For m159, we observe that the baseline runs out of budget before all states have been found, whereas \(\textsf{MoE}\)(\(*\)) is able to learn the correct model within the budget (using the smaller test suites). In m189, we observe a significant divergence in progress. On average, the future expert is used most often to find counterexamples.

Fig. 10. Results Experiment 4 for m159 (left) and m189 (right).

7 Related Work

Test suites. The use of conformance testing [12] is standard in automata learning [47] and goes back to [40]. There are several recent evaluations comparing sample-based conformance testing techniques [4, 7, 20]; these comparisons are orthogonal to the current paper. Another idea is to use mutation testing [3]. Mutation testing performs well on small models (\(<100\) states) but [3] notices that this technique is computationally too expensive for large models.

Increasing the alphabet size. Instead of reducing the alphabet size for a more guided counterexample finding, a common theme is to use abstraction refinement [23, 47] during learning to iteratively refine the alphabet. Bobaru et al. [10] learn models using abstractions of components to later show that a property holds or is violated. Additionally, Vaandrager and Wißman [49] formally describe the relation between high-level state machines and low-level models using abstraction refinement.

Using the automata structure. A recent trend is gray-box automata learning, which assumes partial information on the SUL and aims to exploit this information. In particular, learning algorithms addressing various types of composition (sequential, parallel, product) have been investigated [2, 31, 33, 37]. However, all these techniques adapt the learning algorithm, not the testing algorithm, as in the current paper. Furthermore, while the results in Sec. 4 are similar to a gray-box setting, the idea in Sec. 5 is that this work leads to better performance in the strict black-box setting, as highlighted by the experiments.

Algorithm selection. Machine learning for algorithm selection is an active area of research, see e.g., [1, 28], and has been applied successfully, e.g., in the context of SAT checking [21]. In formal methods, the multi-armed bandit framework has been used, e.g., to prioritize one SMT solver over others [42] or to guide falsification processes for hybrid systems [53]. In automata learning, bandits have recently been applied to select between different oracles for answering output queries [45].

8 Conclusion

In this paper, we introduced smaller test suites for conformance testing that preserve the typical completeness guarantees under natural assumptions on the learned system. The paper demonstrates that a combination of these test suites and a multi-armed bandit formulation significantly accelerates modern active automata learning, even when the assumptions do not hold. Natural extensions include adding additional small test suites, designing variations of the presented experts to, for example, handle parallel components [31, 37], and using a multi-armed bandit to select the essential parameter k. Furthermore, our approach paves the way for using similar assumptions to those made for the completeness of the expert test suites in other aspects of active automata learning.