1 Introduction

In recent years, there has been significant interest in the use of neural models, and in particular recurrent neural networks (RNNs), for learning languages. Like other supervised machine learning techniques, RNNs are trained based on a large set of examples of the target concept.

RNNs can reasonably approximate a variety of languages, and even precisely represent a regular language (Casey 1998). However, in practice they are unlikely to generalise exactly to the concept being trained, and what they actually end up learning is unclear (Omlin&Giles 2000). Indeed, several lines of work attempt to glimpse into the RNN black-box (Zeng et al. 1993; Omlin&Giles 1996; Cechin et al. 2003; Jacobsson 2005; Karpathy et al. 2015; Li et al. 2015; Linzen et al. 2016; Strobelt et al. 2016; Lei et al. 2016; Kádár et al. 2016; Shi et al. 2016; Adi et al. 2016; Murdoch&Szlam 2017; Wang et al. 2017; Arras et al. 2017).

In contrast to the supervised ML paradigm, the exact learning paradigm considers setups that allow learning a target language without approximation. For example, Angluin’s \(\hbox {L}^{*}\) algorithm enables the learning of any regular language, provided a teacher capable of answering membership (request to label example) and equivalence (comparison of proposed language with target language) queries is available (Angluin 1987).

In this work we use exact learning to elicit the concept a trained recurrent neural network has actually captured. This is done by treating the trained RNN as the teacher of the \(\hbox {L}^{*}\) algorithm. To the best of our knowledge, this is the first attempt to use exact learning with queries and counterexamples to extract an automaton from a given RNN.

Recurrent neural networks

Recurrent neural networks (RNNs) are a class of neural networks which are used to process sequences of arbitrary lengths. When operating over sequences of discrete alphabets, the input sequence is fed into the RNN on a symbol-by-symbol basis. For each input symbol the RNN outputs a state vector representing the sequence up to that point, combining the current state vector and input symbol at every step to produce the next one. An RNN is essentially a parameterised mathematical function that takes as input a state vector and an input vector, and produces a new state vector. The RNN is trainable, and, when trained together with a classification component, the training procedure drives the state vectors to provide a representation of the prefix which is informative for the classification task being trained.

Classification

An RNN can be paired with a classification component, a classifier function that takes as input a state vector and returns a binary or multi-class classification decision. The RNN and the classifier are combined by applying the RNN to the sequence, and then the classifier to the final resulting state vector. When the classification component gives a binary classification for each state vector, the combination defines a binary classifier over sequences, which we call an RNN-acceptor. When the component gives a distribution over the possible next tokens, the combination defines a next-token distribution for each input sequence, which we call a Language-Model RNN (LM-RNN).

A trained RNN-acceptor can be seen as a state machine in which the states are high-dimensional vectors: it has an initial state, a well defined transition function between internal states, and a well defined classification for each internal state. A trained LM-RNN is not immediately analogous to a binary state machine, but we will see in this work how it may be interpreted as one, and how, under this interpretation, our method can be used to extract from it as well.

RNNs play a central role in deep learning, and in particular in natural language processing. For a more in-depth overview, see Goodfellow et al. (2016) and Goldberg (2016, 2017).

We now turn to the question of understanding what an RNN has actually learned. We formulate the question around RNN-acceptors, but later (in Sect. 8) show how the solution relates to LM-RNNs.

Motivation

Given an RNN-acceptor R trained over a finite alphabet \(\Sigma\), our goal is to extract a deterministic finite-state automaton (DFA) A that classifies sequences in a manner observably equivalent to R. (Ideally, we would like to obtain a DFA that accepts exactly the same language as the network, but this is a much more difficult task.Footnote 1)

Note

In this work, when understood from context, we use the term RNN to mean RNN-acceptor. Additionally, we use “automata” to refer specifically to deterministic finite automata (DFAs) (as opposed to other automata variants, such as pushdown automata or weighted automata).

Previously existing techniques for DFA extraction from recurrent neural networks are based on creating an a-priori partitioning of the RNN’s state space, and mapping the transitions between the resulting clusters (e.g., Omlin&Giles (1996); Zeng et al. (1993)). In this work however, we approach the question using exact learning.

Exact learning

In the field of exact learning, concepts (sets of instances) can be learned precisely from a minimally adequate teacher—an oracle capable of answering two query types (Goldman&Kearns 1995):

  • membership queries state whether a given instance is in the concept or not, and

  • equivalence queries state whether a given hypothesis (set of instances) is equal to the concept held by the teacher; if not, the teacher must also return an instance on which the hypothesis and the concept disagree (a counterexample).

The \(\hbox {L}^{*}\) algorithm (Angluin 1987) is an exact learning algorithm for learning a DFA from a minimally adequate teacher with knowledge of some regular language L. In this context, the concept is L, the instances are finite sequences (‘words’) over its alphabet, and the hypotheses are presented as automata \({\mathcal {A}}\) defining a regular language \(L_{\mathcal {A}}\). \(\hbox {L}^{*}\) completes when the oracle accepts its latest equivalence query, i.e. when \(L_{\mathcal {A}}=L\).
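To make the teacher’s role concrete, the interface assumed by \(\hbox {L}^{*}\) can be sketched as follows (a minimal Python illustration; the class and method names are ours, not those of any particular library):

```python
from abc import ABC, abstractmethod
from typing import Optional, Sequence

class Teacher(ABC):
    """A minimally adequate teacher, as assumed by L*."""

    @abstractmethod
    def membership(self, word: Sequence[str]) -> bool:
        """Return True iff `word` is in the target concept."""

    @abstractmethod
    def equivalence(self, hypothesis) -> Optional[Sequence[str]]:
        """Return None to accept `hypothesis` (a DFA), or a word on which
        the hypothesis and the target disagree (a counterexample)."""
```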

Our approach

We treat DFA extraction from RNNs as an exact learning problem. We use Angluin’s \(\hbox {L}^{*}\) algorithm to elicit a DFA from any type of trained RNN, using the RNN as a teacher. In doing so, we maintain only a coarse partitioning of the RNN’s state space, refining it only as much as necessary to answer \(\hbox {L}^*\)’s queries.

RNNs as teachers

A trained RNN-acceptor can trivially answer membership queries, by feeding input sequences to the network for classification. Answering equivalence queries, however, is not so easy. The main challenge is that no finite interpretation of the network’s states and transitions is given upfront: the states of an RNN are high-dimensional real-valued vectors, resulting in an infinite state space which cannot be exhaustively enumerated and compared to the hypothesis.

To address this challenge, we use a finite abstraction of the RNN R to answer equivalence queries: we define a finite partitioning of the state space, and create from it an automaton which can be compared to the hypothesis \({\mathcal {A}}\). A unique aspect of this setting compared to previous \(\hbox {L}^{*}\) works is that we only observe an abstraction of the teacher. This means that when there is a disagreement between the teacher and the learner, it may not be that the learner is incorrect and needs to refine its representation, but rather (or also) that our abstraction of the teacher is not precise enough and must be refined. Indeed, at every equivalence query, the current finite abstraction and current proposed automaton \({\mathcal {A}}\) act as two hypotheses for the RNN R’s ground truth, which must at least be equivalent to each other in order to both be equivalent to R. Thus, whenever the two disagree on a sample, we find its true classification in R, obtaining through this either a counterexample to \({\mathcal {A}}\) or a refinement to the abstraction.

Main contributions

The main contributions of this paper are:

  • We present a novel and general framework for extracting automata from trained RNNs, using the RNNs as teachers in an exact learning setting.

  • We implementFootnote 2 the technique and show its ability to extract descriptive automata in settings where previous approaches fail. We demonstrate its effectiveness on modern RNN architectures—multi-layer LSTMs and GRUs.

  • We describe how the technique can be used to learn DFAs from only positive examples, and demonstrate its effectiveness in this setting. To do so we show how to create RNN-acceptors from positive examples only, using a language modeling objective.

  • We apply our technique to RNNs trained to \(100\%\) train and test accuracy on simple languages, and discover in doing so that some RNNs have not generalised to the intended concept. Our method easily reveals and produces adversarial inputs—words misclassified by the trained RNN and not present in the train or test set.

A basic version of this paper was presented at ICML 2018 (Weiss et al. 2018a).

2 Preliminaries

In this paper we use the following notations and terminology.

2.1 Automaton and classification function

A deterministic finite automaton (DFA) A is a tuple \(\langle \Sigma , Q, i, F, \delta \rangle\), in which \(\Sigma\) is the alphabet, Q the set of states, \(F \subseteq Q\) the set of accepting states, \(i \in Q\) the initial state, and \(\delta : Q \times \Sigma \rightarrow Q\) the transition function. For a given automaton we add the notation \(f:Q\rightarrow \{Acc,Rej\}\) as the function giving the classification of each state, i.e. \(f(q)=Acc \iff q\in F\), and the notation \({\hat{\delta }}:Q\times \Sigma ^* \rightarrow Q\) as the recursive application of \(\delta\) to a sequence, i.e.: for every \(q\in Q\), \({\hat{\delta }}(q,\epsilon )=q\), and for every \(w\in \Sigma ^*\) and \(\sigma \in \Sigma\), \({\hat{\delta }}(q,w{\cdot }\sigma )=\delta ({\hat{\delta }}(q,w),\sigma )\). As an abuse of notation, we use \({\hat{\delta }}(w)\) to denote \({\hat{\delta }}(i,w)\).

The classification of a word \(w\in \Sigma ^*\) by a DFA A is defined \(A(w)=f({\hat{\delta }}(w))\), and the regular language defined by A is the set of words it accepts, \(L_A=\{w\in \Sigma ^*\ |\ A(w)=Acc\}\).
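For concreteness, these definitions transcribe directly into code; the following is a minimal Python sketch (class and field names are illustrative):

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Sequence, Tuple

State, Symbol = str, str

@dataclass
class DFA:
    alphabet: FrozenSet[Symbol]
    states: FrozenSet[State]
    initial: State                        # i
    accepting: FrozenSet[State]           # F
    delta: Dict[Tuple[State, Symbol], State]

    def delta_hat(self, q: State, word: Sequence[Symbol]) -> State:
        # recursive application of delta to a sequence
        for sigma in word:
            q = self.delta[(q, sigma)]
        return q

    def classify(self, word: Sequence[Symbol]) -> bool:
        # A(w) = f(delta_hat(i, w)); True stands for Acc
        return self.delta_hat(self.initial, word) in self.accepting
```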

Two automata A and B are equivalent if \(L_A=L_B\), and an automaton \(A=\langle \Sigma ,Q,i,F,\delta \rangle\) is minimal if for every automaton \(A'=\langle \Sigma ,Q',i',F',\delta ' \rangle\) equivalent to A, \(|Q|\le |Q'|\). Two states \(q_1,q_2\in Q\) of an automaton \(A=\langle \Sigma ,Q,i,F,\delta \rangle\) are equivalent if for every \(w\in \Sigma ^*\), \(f({\hat{\delta }}(q_1,w))=f({\hat{\delta }}(q_2,w))\), and an automaton is minimal iff it has no two equivalent states.

For visual clarity, ‘sink reject states’—states \(q\notin F\) for which \(\delta (q,\sigma )=q\) for every \(\sigma\)—are not drawn in images of DFAs in this paper. Thus for example the second DFA in Fig. 1 actually has 3 states, and rejects the sequence “)”.

2.2 Recurrent neural networks

An RNN R is a parameterised function \(g_R(h,x)\) that takes as input a state-vector \(h_t\in {\mathbb {R}}^{d_s}\) and an input vector \(x_{t+1}\in {\mathbb {R}}^{d_i}\) and returns a state-vector \(h_{t+1}\in {\mathbb {R}}^{d_s}\). An RNN can be applied to a sequence \(x_1,...,x_n\) by recursive application of the function \(g_R\) to the vectors \(x_i\), beginning from a given initial state \(h_{0,R}\) associated with the network. When applying an RNN to a sequence over a finite alphabet, each symbol is deterministically mapped to an input vector using either a one-hot encodingFootnote 3 or an embedding matrix; the method presented in this work is agnostic to this choice. For convenience, we refer to input symbols and their corresponding input vectors interchangeably.

We denote the state space of a network R by \(S_R\subseteq {\mathbb {R}}^{d_s}\), and by \(\hat{g_R}:S_R\times \Sigma ^*\rightarrow S_R\) the recursive application of \(g_R\) to a sequence, i.e. for every \(h\in S_R\), \(\hat{g_R}(h,\epsilon )=h\), and for every \(w\in \Sigma ^*\) and \(\sigma \in \Sigma\), \(\hat{g_R}(h,w{\cdot }\sigma )=g_R(\hat{g_R}(h,w),\sigma )\). As an abuse of notation, we also use \(\hat{ g_R}(w)\) to denote \(\hat{ g_R}(h_{0,R},w)\).

2.3 RNN-acceptors

A binary RNN-acceptor is an RNN with an additional function \(f_R: S_R \rightarrow \{Acc,Rej\}\) that receives a state vector \(h_t\) and returns an accept or reject decision. The RNN-acceptor R is the pair of functions \(g_R,f_R\) with associated initial state \(h_{0,R}\). The classification of a word \(w\in \Sigma ^*\) by an RNN-acceptor R is defined \(R(w)=f_R(\hat{g_R}(w))\), and the language defined by R is the set of words it accepts, \(L_R=\{w\in \Sigma ^*\ |\ R(w)=Acc\}\).

A given RNN-acceptor can be interpreted as a deterministic, though possibly infinite, automaton, which, we note, is a more powerful model than the deterministic finite automaton.

We drop the subscript R when it is clear from context.

2.4 Multi-layer RNNs

RNNs are often arranged in layers (“deep RNNs”). In a k-layer configuration, there are k RNN functions \(g_1,...,g_k\), which are applied to an input sequence \(x=x_1,...,x_m\) as follows: x is mapped by \(g_1\) to a sequence of state vectors \(h_{1,1},...,h_{1,m}\), and then each sequence \(h_{i,1},...,h_{i,m}\) is mapped by \(g_{i+1}\) to the sequence \(h_{i+1,1},...,h_{i+1,m}\). For such multi-layer configurations, we take the entire state-vector at time t to be the concatenation of the individual layers’ state vectors: \(h_t = h_{1,t}{\cdot }h_{2,t}{\cdot }...{\cdot }h_{k,t}\). Generally, the classification component of a multi-layered RNN-acceptor or LM-RNN is applied only to the final state of the top layer: \(f_R(h_t)=f'_R(h_{k,t})\) for some \(f'_R\).

2.5 RNN architectures

The parameterised functions \(g_R\) and \(f_R\) can take many forms. The function \(f_R\) can take the form of a linear transformation or a more elaborate classifier. The original form of \(g_R\) is the Elman RNN (Elman 1990), in which \(g_R\) is an affine transform followed by a non-linearity, \(g_R(h,x) = \tanh (W^xx+W^h h + b)\). Here \(W^x\), \(W^h\) and b are the parameters of the function that need to be trained, and have dimensions \(d_{s}\times d_{i}\), \(d_{s} \times d_{s}\), and \(d_s\times 1\) respectively. Other popular forms are the Long Short-Term Memory (LSTM) (Hochreiter&Schmidhuber 1997) and the Gated Recurrent Unit (GRU) (Cho et al. 2014; Chung et al. 2014). These more elaborate functions are based on a differentiable gating mechanism, and have been repeatedly demonstrated to be easier to train than the Elman RNN, and to robustly handle long-range sequential dependencies. We refer the interested readers to textbooks such as Goodfellow et al. (2016); Goldberg (2017) or to the documentation of the PyTorch framework (Paszke et al. 2019) for their exact forms.

Our technique is agnostic to these internal differences, treating the functions \(f_R\) and \(g_R\) as black boxes. In our experiments, we use a linear transformation for \(f_R\), and the popular LSTM and GRU architectures for \(g_R\). For the LSTM, whose transition function is often described as converting a triplet of input-vector, state-vector and memory-vector to a next state-vector and memory-vector, we treat the concatenation of the state-vector and memory-vector as a single state-vector of dimension \(d_s=2h_s\), where \(h_s\) is the hidden size of the cell.
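For illustration, the following sketch shows one way to wrap a trained PyTorch LSTM acceptor as such a black-box pair, concatenating the hidden and memory vectors into a single state vector as described above; the module names (embed, lstm, out) are assumptions of the example, not part of our interface:

```python
import torch

class LSTMAcceptorWrapper:
    """Sketch: expose a trained PyTorch LSTM acceptor as black-box
    functions g_R (transition) and f_R (classification)."""

    def __init__(self, embed, lstm, out):
        self.embed, self.lstm, self.out = embed, lstm, out
        self.layers, self.hidden = lstm.num_layers, lstm.hidden_size
        # initial state h_{0,R}: zeroed hidden and memory vectors, flattened
        self.h0 = torch.zeros(2 * self.layers * self.hidden)

    def _unpack(self, h):
        # recover the (h, c) pair the LSTM expects from the flat state vector
        hc = h.view(2, self.layers, 1, self.hidden)
        return hc[0].contiguous(), hc[1].contiguous()

    def g(self, h, token_id):
        # one transition: (state vector, input symbol) -> next state vector
        x = self.embed(torch.tensor([[token_id]]))      # shape (1, 1, d_i)
        _, (hn, cn) = self.lstm(x, self._unpack(h))
        return torch.cat([hn.flatten(), cn.flatten()])  # d_s = 2 * layers * hidden

    def f(self, h):
        # classify via a linear layer over the top layer's hidden state
        hn, _ = self._unpack(h)
        return bool(self.out(hn[-1, 0]).argmax().item())
```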

2.6 Network abstraction

Given a neural network R with state space S and alphabet \(\Sigma\), and a partitioning function \(p {:} S\rightarrow {\mathbb {N}}\), Omlin and Giles (1996) presented a method for extracting a DFA for which every state is a partition from p, and the state transitions and classifications are defined by a single sample from each partition. Their method can be seen as a simple sheared exploration of the partitions defined by p. The exploration begins from the partition containing the initial state \(p(h_{0,R})\), explores according to the network’s transition function \(g_R\), and shears wherever it reaches an abstract state (partition) that has already been visited. We present it as pseudocode in Algorithm 1.

We denote by \(A_{R,p}\) the DFA extracted by this method from a network R and partitioning p, and denote all its related states and functions by subscript Rp.Footnote 4 Note that the algorithm is guaranteed to extract a deterministic finite automaton (DFA) from any network and finite partitioning.

[Algorithm 1: DFA extraction from a network R and partitioning p, by partition exploration (Omlin&Giles 1996)]
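Our reading of this method as a Python sketch, assuming black-box access to the network’s transition function g, state classifier f and initial state h0, and a partitioning function p:

```python
from collections import deque

def extract_abstraction(g, f, h0, p, alphabet):
    """BFS over the partitions defined by p, shearing at revisited
    partitions. Returns (initial, accepting, delta) of A_{R,p}."""
    q0 = p(h0)
    accepting = {q0} if f(h0) else set()
    delta, visited = {}, {q0}
    frontier = deque([(q0, h0)])        # one concrete sample per partition
    while frontier:
        q, h = frontier.popleft()
        for sigma in alphabet:
            h_next = g(h, sigma)
            q_next = p(h_next)
            delta[(q, sigma)] = q_next
            if q_next not in visited:   # shear: do not re-explore partitions
                visited.add(q_next)
                if f(h_next):           # classification from the first sample
                    accepting.add(q_next)
                frontier.append((q_next, h_next))
    return q0, accepting, delta
```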

2.7 The \(\hbox {L}^{*}\) algorithm

Angluin’s \(\hbox {L}^{*}\) algorithm (1987) is an exact learning algorithm for regular languages. The algorithm learns an unknown regular language L over an alphabet \(\Sigma\) from a teacher T, generating as output a DFA \({\mathcal {A}}\) that accepts L. In our work we implement such a teacher for \(\hbox {L}^{*}\) around a given RNN, and apply \(\hbox {L}^{*}\) to this teacher directly. It therefore suffices here to limit our discussion to the requirements of this interaction.

\(\hbox {L}^{*}\) interacts with a teacher that must answer two types of queries: membership queries, in which the teacher must classify words presented by \(\hbox {L}^*\), and equivalence queries, in which the teacher must accept or reject automata proposed by \(\hbox {L}^{*}\) based on whether or not they correctly represent the target language. If the teacher rejects an automaton \({\mathcal {A}}\), it must also provide a counterexample—a word that \({\mathcal {A}}\) misclassifies with respect to the target language. \(\hbox {L}^{*}\) continues to present queries to the teacher until the teacher accepts a hypothesis \({\mathcal {A}}\), at which point it terminates and returns \({\mathcal {A}}\).

The \(\hbox {L}^{*}\) algorithm is guaranteed to always present a minimal DFA consistent with all membership queries given so far, and we use this fact in our work. Additionally, provided the target language L is regular, \(\hbox {L}^{*}\) is guaranteed to return a minimal DFA for L in time polynomial in \(|Q|\), \(|w|\), and \(|\Sigma |\), where |Q| is the number of states in that DFA, \(\Sigma\) is the input alphabet, and |w| is the length of the longest counterexample given by the teacher (Angluin 1987; Berg et al. 2005).

3 Existing approaches and related work

Soon after the introduction of the RNN (Elman 1990), it was shown that, when learning a regular language, a simple (“Elman-”) RNN is able to cluster its reachable states in a manner that resembles a (not necessarily minimal) DFA for that language (Cleeremans et al. 1989Footnote 5). Since then there has been a lot of research on extracting rules, and in particular finite automata, from RNNs. Partial surveys of these works are presented by Wang et al. (2017) and Jacobsson (2005).

Transition mapping

In their 1996 paper, Omlin and Giles experimented on second-order RNNs, and found that their learned states also tend to cluster in small areas in the network state space. Through this, and an assumption of continuity in the network behavior (i.e., small changes in the current state lead only to small changes in the next state), they concluded that it was safe to cluster like-valued state vectors together as one state, and traverse these clustered states in order to recover a DFA from the RNN.

In particular, given a neural network R with state space S and alphabet \(\Sigma\), and a partitioning function \(p {:} S\rightarrow {\mathbb {N}}\), Omlin and Giles presented a method (Algorithm 1) for extracting a DFA abstraction of the network in which every abstracted state is an entire partition from p, and the transitions between abstracted states and their classifications are obtained by a single sample of the continuous values in each such partition.

In both their own work and more recent research by others (e.g. Wang et al. 2017), this extraction method has been shown to produce DFAs that are reasonably representative of given second-order RNNs—provided the given partitioning captures the differences between the network states well enough.

Quantisation

For networks with bounded output values, Omlin and Giles suggested dividing each dimension of the network state space into \(q\in {\mathbb {N}}\) (referred to as the quantisation level) equal intervals, yielding \(q^{d_s}\) subsets of the output space with \(d_s\) being the length of the state vectors.
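As a concrete sketch, such a quantisation-based partitioning function might look as follows (assuming state values bounded in \((-1,1)\), as with tanh activations; with \(q=2\) this reduces to splitting each dimension into its negative and positive halves):

```python
import numpy as np

def quantisation_partition(q: int, low: float = -1.0, high: float = 1.0):
    """Partitioning p: map each of the d_s dimensions to one of q equal
    intervals, yielding up to q**d_s abstract states."""
    edges = np.linspace(low, high, q + 1)[1:-1]  # q-1 interior cut points
    def p(h: np.ndarray) -> tuple:
        return tuple(np.digitize(h, edges))      # one interval index per dimension
    return p
```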

However, because this technique applies a uniform quantisation over the entire output space, it suffers from inherent state explosion and does not scale to the networks used in practice today: the original paper demonstrates the technique on networks with 8 hidden values, whereas today’s networks can have hundreds to thousands.

Clustering

Other state-partitioning approaches use clustering (Cechin et al. 2003; Wang et al. 2017; Cohen et al. 2017). In these approaches, an unsupervised classifier such as k-means is applied to a large sample set of reachable network states, creating a finite number of clusters. The sample states can be found by various methods, such as a BFS exploration of the network state space to a certain depth, or by recording all state vectors reached by the network when applied to its train set (if available). The partitioning of the state space defined by the clusters is then explored in a similar way to that described by Omlin&Giles (1996). Clustering approaches yield automata that are much smaller than those given by the partitioning method originally proposed by Omlin and Giles, making them more applicable to networks of today’s standards.
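A minimal sketch of such a clustering-based partitioning, using scikit-learn (collecting the sampled states is assumed done beforehand, e.g. by BFS or by replaying the train set):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_partition(sampled_states: np.ndarray, k: int):
    """Partitioning p from k-means over a sample of reachable R-states.
    `sampled_states` has shape (num_samples, d_s)."""
    km = KMeans(n_clusters=k, n_init=10).fit(sampled_states)
    def p(h: np.ndarray) -> int:
        return int(km.predict(h.reshape(1, -1))[0])
    return p
```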

Weaknesses

In both of these approaches the partitioning is set before the extraction begins, with no mechanism for recognising and overcoming overly coarse behaviour. Both approaches thus face the challenge of choosing the best parameter value for extraction, and are generally applied several times with different parameter values, after which the ‘best’ DFA is chosen according to a heuristic (e.g., accuracy against the RNN on the test set). Additionally, both approaches can still produce rather large state spaces, and—as the exploration of the extracted DFA is performed blindly—these states cannot be merged until the extraction is complete and the DFA can be minimised.

Note on architectures

Many of these works use second-order RNNs (Giles et al. 1990), which have been shown to map DFAs better than simple RNNs (Goudreau et al. 1994; Wang et al. 2018). In this work however, we experiment on the popular GRU (Cho et al. 2014; Chung et al. 2014) and LSTM (Hochreiter&Schmidhuber 1997) architectures, as they are more widely used in practice.

3.1 Recent works and future directions

Since the initial publication of this method, several other approaches for extracting DFAs have been suggested, and still other works have begun grappling with more complicated targets such as weighted automata or context free languages.

DFAs

Mayr&Yovine (2018) released an \(\hbox {L}^*\)-based approach for learning DFAs from any neural network architecture, answering equivalence queries by drawing random samples over the input alphabet and checking if they are counterexamples to the proposed automaton. Their work analyses this approach from a PAC learning perspective and applies also to completely black box models, in contrast to our own work and other extraction works listed above (which rely on access to the RNN’s hidden state from different prefixes). In Sect. 7.7, we compare our method to this approach, highlighting the advantage of the abstraction based approach to equivalence queries when the hidden state is available.

Wang&Niepert (2019) propose state-regularised RNNs, a variant of RNNs that is regularised towards transitioning between a finite number of learned internal states. Their work discusses both training these new RNNs and the recovery of DFAs from them once trained, presenting an extraction method tailored to their proposed architecture.

WFAs

Ayache et al. (2018) use spectral learning (Balle et al. 2014) to extract weighted, non-deterministic finite automata (WFAs) from any black box language model, evaluating on RNNs. Okudono et al. (2020) also apply spectral learning for WFA extraction, but this time to whitebox RNNs, using an adaptation of the equivalence query presented in this paper to refine the WFA beyond the initial spectral extraction. In a later work, we adapt \(\hbox {L}^{*}\) to a weighted setting, extracting weighted deterministic finite automata (WDFAs) from any black box language model (Weiss et al. 2019). Finally, more recently, Zhang et al. (2021) expand on the partitioning and then transition-mapping approach of the classical DFA extraction papers (Omlin&Giles 1996) to recover WFAs from RNNs without using exact or spectral learning.

CFGs

With the understanding that some RNN architectures behave more like counter machines (Gers&Schmidhuber 2001; Weiss et al. 2018b; Suzgun et al. 2019), which are more expressive than DFAs, and indeed that an RNN in general might be trained on something more complicated than a regular language, it becomes interesting to consider the extraction of context-free grammars (CFGs) from RNNs.

Recently, Yellin&Weiss (2021) use the DFA-extraction method presented in this paper as the initial step in an algorithm for extracting a subclass of CFGs from trained RNNs,Footnote 6 and Barbot et al. (2021) apply results on visibly pushdown languages and tree automata to extract a different subclass of CFGs, also from trained RNNs. Independently, there exist several works on learning (subclasses of) CFGs from queries, or from examples only, that have not yet been applied for extraction from RNNs (Sakakibara 1992; Yokomori 2003; Tellier 2006; Clark&Eyraud 2007; Clark 2010; D’ulizia et al. 2010; Shibata&Yoshinaka 2016; Clark&Yoshinaka 2016; Yoshinaka 2019).

4 Learning automata from RNNs using L*

In the following sections we show how to build a teacher for the \(\hbox {L}^{*}\) algorithm around a given RNN-acceptor R. The teacher must be able to answer membership and equivalence queries as required by \(\hbox {L}^{*}\).

To implement membership queries we rely on the RNN classifier itself. To determine whether a given word w is in the unknown language \(L_R\), we simply run the RNN on this word, and check whether it accepts or rejects w.

To implement equivalence queries we check the equivalence of the \(\hbox {L}^{*}\) hypothesised automaton \({\mathcal {A}}\) against an abstraction \(A_{R,p}\) of the network, where p is a partitioning over the network’s state space. If we find a disagreement \(w\in \Sigma ^*\) between \({\mathcal {A}}\) and the current abstraction \(A_{R,p}\), we use R to determine whether this is because the \(\hbox {L}^{*}\) hypothesis is incorrect (i.e., \(L_R(w)\ne {\mathcal {A}}(w)\)), or a result of a poor abstraction (i.e., \(L_R(w)\ne A_{R,p}(w)\)). In the former case (\(L_R(w)\ne {\mathcal {A}}(w)\)), we end the equivalence query and return w as a counterexample to \(\hbox {L}^{*}\). Otherwise, we refine p and restart the comparison of \({\mathcal {A}}\) and \(A_{R,p}\). If no such disagreement w is found (i.e., \({\mathcal {A}}\) and \(A_{R,p}\) are equivalent), we accept \(\hbox {L}^*\)’s hypothesis and the extraction ends.

p is maintained between equivalence queries, i.e., the partitioning p at the start of the \((j{+}1)^{\mathrm{th}}\) equivalence query is the same partitioning p from the end of the \(j^{\mathrm{th}}\) equivalence query.
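In outline, a single equivalence query then proceeds as in the following sketch, where find_disagreement and refine are placeholders for the procedures detailed in Sects. 5 and 6, and the RNN is assumed wrapped with a word-level classify method:

```python
def equivalence_query(rnn, hypothesis, p):
    """Sketch of one equivalence query. Returns (counterexample, p),
    with counterexample None if the hypothesis is accepted."""
    while True:
        w = find_disagreement(rnn, hypothesis, p)  # word on which A and A_{R,p} differ
        if w is None:
            return None, p              # A and A_{R,p} equivalent: accept
        if rnn.classify(w) != hypothesis.classify(w):
            return w, p                 # L*'s hypothesis is wrong: counterexample
        p = refine(p, w, rnn)           # else the abstraction is too coarse: refine
```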

In theory, the extraction continues until the automaton proposed by \(\hbox {L}^{*}\) is accepted, i.e., \({\mathcal {A}}\) and \(A_{R,p}\) converge. In practice, for some RNNs this may take a long time and yield a large DFA (>30,000 states). To counter this, we place time or size limits on the interaction, after which the last \(\hbox {L}^{*}\) hypothesis is returned.Footnote 7 We see that these DFAs still generalise well to their respective networks.

The partitioning p has to be coarse enough to facilitate feasible computation of \(A_{R,p}\), but fine enough to capture the interesting observations made by the network. As we have an iterative setting, we can satisfy this by starting with a very coarse initial abstraction and refining it only sparingly, whenever it is proven incorrect.

The equivalence queries are described in Sect. 5, and the partitioning and its refinements in Sect. 6.

Note

Convergence of \(A_{R,p}\) and \({\mathcal {A}}\) does not guarantee that R and \({\mathcal {A}}\) are equivalent. Providing such a guarantee would be an interesting direction for future work.

5 Answering equivalence queries

Given a network R, a partitioning function p over its state space S, and a proposed minimal automaton \({\mathcal {A}}\), we wish to check whether the abstraction of the network \(A_{R,p}\) is equivalent to \({\mathcal {A}}\), preferably while exploring as little of \(A_{R,p}\) as possible. If the two are not equivalent—meaning, necessarily, that at least one is not an accurate representation of the network R—we wish to find and resolve the cause of the inequivalence, either by returning a counterexample to \(\hbox {L}^{*}\) (and so refining \({\mathcal {A}}\)), or by refining the partitioning function p (and so the abstraction \(A_{R,p}\)) in the necessary area. Hence our equivalence query must be able not only to return counterexamples when necessary, but also to specifically identify overly-coarse partitions in the partitioning p.

For clarity, from here onwards we refer to the continuous network states \(h\in S\) as R-states, the abstracted states in \(A_{R,p}\) as A-states, and the states of the \(\hbox {L}^{*}\) DFAs \({\mathcal {A}}\) as L-states.

In this section we describe the details of an equivalence query assuming a given partitioning p and refinement operation refine. We present our initial partitioning \(p_0\) and refine operation in Sect. 6.

5.1 Parallel exploration

The key intuition behind our approach is that \({\mathcal {A}}\) is minimal, and so each state in the DFA \(A_{R,p}\) should—if the two automata are equivalent—be equivalent to exactly one state in the DFA \({\mathcal {A}}\). This is based on the fact that for automata \(A=\langle \Sigma , Q, i, F,\delta \rangle\) and \(A'=\langle \Sigma , Q',i',F',\delta '\rangle\) in which \(A'\) is minimal, A and \(A'\) are equivalent if and only if there exists a mapping \(m:Q\rightarrow Q'\) satisfying \(m(i)=i'\), \(f(q)=f'(m(q))\), and \(m(\delta (q,\sigma ))=\delta '(m(q),\sigma )\) for every \(q\in Q\) and \(\sigma \in \Sigma\).

To check the equivalence of \(A_{R,p}\) and \({\mathcal {A}}\) without necessarily having to fully explore \(A_{R,p}\), we build such a mapping between their states on-the-fly: we associate the states of the two automata with each other during the extraction of \(A_{R,p}\), by traversing \({\mathcal {A}}\) in parallel to the extraction of \(A_{R,p}\) (which proceeds according to Algorithm 1). We update this association for all R-states visited during this extraction, i.e., including those at which the traversal is sheared.Footnote 8 Any inconsistencies (conflicts) in this association are definite indicators of inequivalence between \(A_{R,p}\) and \({\mathcal {A}}\).

5.1.1 Conflict types

We refer to associations in which an accepting A-state is associated with a rejecting L-state or vice versa as abstract classification conflicts. We refer to multiple but disagreeing associations for a single A-state, i.e. situations in which one A-state is associated with two different (minimal) L-states, as partitioning conflicts. (The inverse, in which one minimal L-state is associated with several A-states, is not a problem: \(A_{R,p}\) is not necessarily minimal and so these states may be equivalent.)

Recalling that the underlying goal is to find inconsistencies between the proposed automaton \({\mathcal {A}}\) and the given network R, and that the exploration of \(A_{R,p}\) runs atop an exploration of the actual R-states, we also check at each point during the exploration whether the current R-state \(h\in S_R\) has the same classification as the current L-state reached in the parallel traversal of \({\mathcal {A}}\). As the classification of a newly discovered A-state is determined by the R-state with which it was first mapped, this also covers all abstract classification conflicts. We refer to failures of this test generally as classification conflicts, and check only for them and for partitioning conflicts.

5.2 Conflict resolution and counterexample generation

Classification conflicts are a sign that a path \(w\in \Sigma ^*\) satisfying \(R(w)\ne {\mathcal {A}}(w)\) has been traversed in the exploration of \(A_{R,p}\), and so necessarily that w is a counterexample to the equivalence of \({\mathcal {A}}\) and R. They are resolved by returning the path w as a counterexample to \(\hbox {L}^{*}\), so that it may refine its observations and provide a new automaton. All that is necessary for this is to maintain the current path w throughout the exploration.

Partitioning conflicts are a sign that an A-state \(q\in Q_{R,p}\), that has already been reached with a path \(w_1\) during the exploration of \(A_{R,p}\), has been reached again with a new path \(w_2\) for which the L-state is different from that of \(w_1\). In other words, partitioning conflicts give us two sequences \(w_1,w_2\in \Sigma ^*\) for which \(\hat{\delta _{R,p}}(w_1)=\hat{\delta _{R,p}}(w_2)\) but \({\hat{\delta _{{\mathcal {A}}}}}(w_1)\ne {\hat{\delta _{{\mathcal {A}}}}}(w_2)\). We denote by \(q_1,q_2\in Q_{\mathcal {A}}\) the L-states reached in \({\mathcal {A}}\) by these sequences, \(q_i=\hat{\delta _{\mathcal {A}}}(w_i)\). As \({\mathcal {A}}\) is a minimal automaton, \(q_1\) and \(q_2\) are necessarily inequivalent, meaning there exists a differentiating suffix \(s\in \Sigma ^*\) for which \(f_{{\mathcal {A}}}({\hat{\delta _{{\mathcal {A}}}}}(q_1,s))\ne f_{{\mathcal {A}}}({\hat{\delta _{{\mathcal {A}}}}}(q_2,s))\), i.e. for which \({\mathcal {A}}(w_1{\cdot }s)\ne {\mathcal {A}}(w_2{\cdot }s)\). Meanwhile, as \(\hat{\delta _{R,p}}(w_1)=\hat{\delta _{R,p}}(w_2)\), we also have \(\hat{ \delta _{R,p} }(w_1{\cdot }s)=\hat{ \delta _{R,p} }(w_2{\cdot }s)\), and so \(A_{R,p}(w_1{\cdot }s)=A_{R,p}(w_2{\cdot }s)\).

Clearly in this case \({\mathcal {A}}\) and \(A_{R,p}\) must disagree on the classification of either \(w_1{\cdot }s\) or \(w_2{\cdot }s\), and so at least one of them must be inconsistent with the network R. In order to determine the ‘offending’ automaton, we pass both \(w_1{\cdot }s\) and \(w_2{\cdot }s\) to R for their true classifications. If \({\mathcal {A}}\) is found to be inconsistent with the network, the word on which \({\mathcal {A}}\) and R disagree is returned to \(\hbox {L}^{*}\) as a counterexample.

Else, \(w_1{\cdot }s\) and \(w_2{\cdot }s\) are necessarily classified differently by the network, and \(A_{R,p}\) should not lead \(w_1\) and \(w_2\) to the same A-state. The R-states \(h_1={\hat{g}}(w_1)\) and \(h_2={\hat{g}}(w_2)\) are passed, along with the current partitioning p, to a refinement operation, which refines p such that the two are no longer mapped to the same A-state—preventing a reoccurrence of that particular conflict.

The previous reasoning can be applied to \(w_2\) with all paths \(w_1\) that have reached the conflicted A-state \(q\in Q_{R,p}\) without conflict before \(w_2\) was traversed. As such, the classifications of all the words \(w_1{\cdot }s\) are tested against the network, prioritising returning a counterexample over refining the partitioning.Footnote 9 If eventually it is the partitioning that is refined, then the R-state that triggered the conflict, \(h={\hat{g}}(w_2)\), is split from all R-states \(h_1={\hat{g}}(w_1)\) for \(w_1\) that have already reached q in the exploration, in one single refinement.Footnote 10

Every time the partitioning is refined, the guided exploration starts over, and the process repeats until either a counterexample is returned to \(\hbox {L}^*\), equivalence is reached (exploration completes without a counterexample), or some predetermined limit (such as time or partitioning size) is exceeded. We note that in practice—and very often so with the decision-tree based refinement operation that we present—there are cases in which starting over is equivalent to merely updating the associated A-state p(h) of the R-state h that triggered the refinement and continuing the exploration from there, and we implement our equivalence query to take advantage of this.

In our implementation, whenever we find several potential counterexamples to the proposed DFA, we check them in order of increasing length and return the shortest counterexample we have found.

5.3 Algorithm

[Algorithm 2: abstraction-based equivalence checking (check_equivalence and parallel_explore)]

Pseudocode for this entire equivalence checking procedure (ignoring the preference for shortest counterexamples) is presented in Algorithm 2.Footnote 11 The description here assumes the existence of a refinement operation refine, separating in the partitioning an R-state h from a set of other R-states H; we present such a method in Sect. 6.

The overall iterative process, including the refinements to p, is described in check_equivalence, and the equivalence checking for a specific partitioning p is given in parallel_explore.

parallel_explore attempts to build \(A_{R,p}\) in variables \(Q,F,q_0,\delta\), while also maintaining the associations of these states to R and \({\mathcal {A}}\) as follows (a compact sketch follows the list):

  • Visitors holds for every A-state q the set of all R-states h satisfying \(p(h)=q\) that have been visited during the exploration. This is used for refinements triggered by partitioning conflicts.

  • Path holds for every R-state h the sequence \(w\in \Sigma ^*\) with which h has been visited during the exploration.Footnote 12 This is used for generating potential counterexamples when handling a partitioning conflict.

  • Association holds for every A-state q the L-state \(q'\in Q_{\mathcal {A}}\) visited in the parallel exploration of \({\mathcal {A}}\) the first time that q was visited. If at any point q is visited while the parallel exploration is on a different state \(q''\ne q'\), a partitioning conflict is triggered.
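The following is a compact Python sketch of this exploration, folding the three maps above into one loop. The RNN access methods (initial_state, step, classify_state) are illustrative names for black-box access to \(h_{0,R}\), \(g_R\) and \(f_R\), and for brevity only the first conflicting visitor pair is reported, rather than all paths into the conflicted A-state:

```python
from collections import deque

def parallel_explore(rnn, A, p):
    """Sketch: extract A_{R,p} while traversing A in parallel.
    Returns ('classification', w), ('partitioning', w1, w2), or None."""
    h0 = rnn.initial_state()
    q0 = p(h0)
    visitors = {q0: [h0]}            # Visitors: R-states seen per A-state
    path = {id(h0): []}              # Path: word leading to each R-state
    association = {q0: A.initial}    # Association: A-state -> L-state
    frontier = deque([(h0, A.initial)])
    while frontier:
        h, ql = frontier.popleft()
        for sigma in A.alphabet:
            h2 = rnn.step(h, sigma)               # concrete transition g_R
            w2 = path[id(h)] + [sigma]
            path[id(h2)] = w2
            ql2 = A.delta[(ql, sigma)]            # parallel step in A
            if rnn.classify_state(h2) != (ql2 in A.accepting):
                return ('classification', w2)     # R and A disagree on w2
            q2 = p(h2)
            if q2 not in association:             # new A-state: keep exploring
                association[q2] = ql2
                visitors[q2] = [h2]
                frontier.append((h2, ql2))
            else:                                 # sheared, but still recorded
                visitors[q2].append(h2)
                if association[q2] != ql2:        # partitioning conflict
                    return ('partitioning', path[id(visitors[q2][0])], w2)
    return None   # no conflict found: A and A_{R,p} are equivalent
```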

Note that finding the separating suffix for two inequivalent states \(q_1,q_2\) of a given automaton A can be done by a simple parallel BFS exploration of the states reachable from \(q_1\) and \(q_2\) in A, continuing until two states with opposite classifications are found.
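A sketch of that parallel BFS, in terms of the DFA sketch of Sect. 2.1:

```python
from collections import deque

def separating_suffix(dfa, q1, q2):
    """Parallel BFS from (q1, q2): return a suffix s with
    f(delta_hat(q1, s)) != f(delta_hat(q2, s)), for inequivalent q1, q2."""
    seen = {(q1, q2)}
    queue = deque([(q1, q2, [])])
    while queue:
        a, b, s = queue.popleft()
        if (a in dfa.accepting) != (b in dfa.accepting):
            return s                     # opposite classifications found
        for sigma in dfa.alphabet:
            pair = (dfa.delta[(a, sigma)], dfa.delta[(b, sigma)])
            if pair not in seen:
                seen.add(pair)
                queue.append((*pair, s + [sigma]))
    return None  # unreachable if q1 and q2 are truly inequivalent
```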

6 Abstraction and refinement

Given a partitioning p, an R-state h, and a set of R-states \(H\subseteq S\setminus \{h\}\), we must refine p to obtain a new partitioning \(p'\) satisfying:

  1. for every \(h_1\in H\), \(p'(h)\ne p'(h_1)\), and

  2. for every \(h_1,h_2\in S\), if \(p(h_1)\ne p(h_2)\) then \(p'(h_1)\ne p'(h_2)\).

The first condition separates (in the partitioning) the R-states that caused the partitioning conflict leading to the refinement. The second condition maintains separations made by earlier refinements, i.e., it prevents previously created abstract states from being merged.

We want to generalise the information given by h and H well, so as not to invoke excessive refinements as new R-states are explored. Additionally, we would like to keep the partitioning as small as possible, so that \(A_{R,p}\) can be explored and compared to \({\mathcal {A}}\) in reasonable time at every equivalence query.

To keep the partitioning small, we settle on a decision tree structure, in which each refinement only splits the partition in which the conflict was recognised. Additionally, seeing that in practice our equivalence checking method can overcome imperfect splits between H and h by generating further splits if necessary, we relax the first condition. Specifically, we allow the classifiers splitting between H and h in the conflicted partition to do so imperfectly, provided they separate at least some of H from h.

Our method is unaffected by the dimension of the R-states, and very conservative: each refinement increases the number of A-states by exactly one. Our experiments show that it is fast enough to quickly find counterexamples to proposed DFAs.

6.1 Initial partitioning

In addition to a refinement method, our algorithm needs an initial partitioning \(p_0\) from which to start the first equivalence query. As we wish to keep the abstraction as small as possible, we begin with no state separation at all: \(p_0 : h \mapsto 0\).

6.2 Decision-tree based partitioning, with support vector refinement

Let \(h\in S,H\subset S\) be the R-states with which a refinement was invoked. We know the refinement is only applied to h, H satisfying \(p(h)=p(h')\) for every \(h'\in H\). To keep the partitioning small, we define a gentle refinement operation, in which every call splits only the single partition p(h). This approach avoids state explosion by adding only one A-state per refinement.

Decision Tree It is natural to maintain a partitioning p refined over time in this way as a decision tree, where each internal node tracks some single refinement made to p, and its leaves are the current A-states of the abstraction.

SVM classifiers At every refinement, for the split of p(h), we would like to allocate a region around the R-state h that is large enough to contain other R-states that behave similarly, but separate from neighbouring R-states that do not. We achieve this by fitting an SVM (Boser et al. 1992) classifier with an RBF kernelFootnote 13 to separate h from H (splitting the partition p(h) into exactly two). The max-margin property of the SVM ensures a large space around h, while the Gaussian RBF kernel allows for a non-linear partitioning of the space. We use this classifier to split the A-state p(h), yielding a new partitioning \(p'\) with exactly one more A-state than p.
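A sketch of a single such refinement step with scikit-learn, using the SVC settings we report in Sect. 7 (\(C=10^4\), RBF kernel with gamma \(1/(\text {num features})\), i.e. gamma='auto' in current scikit-learn versions):

```python
import numpy as np
from sklearn.svm import SVC

def svm_split(h: np.ndarray, H: np.ndarray):
    """Fit an RBF-kernel SVM separating the conflicting R-state h from
    the R-states H sharing its partition. The returned predicate becomes
    a new internal node of the decision tree, splitting p(h) into two."""
    X = np.vstack([h.reshape(1, -1), H])
    y = np.array([1] + [0] * len(H))
    clf = SVC(C=1e4, kernel="rbf", gamma="auto").fit(X, y)
    return lambda state: bool(clf.predict(state.reshape(1, -1))[0])
```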

Whenever the SVM successfully separates h from H entirely, this approach satisfies the requirements of refinement operations. Otherwise, the method fails to satisfy condition 1 of the refinement operation. Nevertheless, the SVM classifier will always separate at least one of the R-states \(h'\in H\) from h, and later explorations can invoke further refinements if necessary. In practice we see that this does not hinder the main goal of the abstraction, which is finding counterexamples to equivalence queries.

Unlike mathematically defined partitionings such as the quantisation proposed by Omlin&Giles (1996), our abstraction’s storage is linear in the number of A-states it can map to; and computing an R-state’s associated A-state may be linear in this number as well (e.g. if the decision tree is a chain). Luckily, as this number of A-states also grows very slowly (linearly in the number of refinements), this does not become a problem.

6.3 Practical considerations

As the initial partitioning and the refinement operation are very coarse, our method runs the risk of accepting very small but wrong DFAs early in the extraction.

To counter this, two measures are taken:  

  1. At the beginning of extraction, one accepting and one rejecting sequence are provided to the teacher, and then checked as potential counterexamples at the beginning of every equivalence query.Footnote 14 If these are not available, equivalence queries are instead extended with n random samples for some small n (e.g. \(n=100\)) and range of lengths (e.g. 0-100): whenever \({\mathcal {A}}\) and \(A_{R,p}\) are equivalent, n random samples are generated and checked as potential counterexamples (\({\mathcal {A}}(w)\ne R(w)\)) before \({\mathcal {A}}\) can be accepted.

  2. The first refinement is aggressive, generating a greater (but still manageable) number of A-states than the single-partition splits used for the rest of the extraction.

 The first measure is taken specifically to prevent erroneous termination of the extraction on a single-state automaton, and requires only two samples (if provided) or a short additional time before accepting an equivalence query.

The second measure prevents the extraction from too readily terminating on small DFAs, by creating a (manageably) large \(A_{R,p}\) that will hopefully capture a relatively rich representation of the RNN. Our method for it is presented in Sect. 6.3.1.

6.3.1 Aggressive difference-based refinement

At the first refinement, instead of splitting \(p_0(h)\) to separate h from all or most of H using a single SVM, we split S in its entirety across multiple dimensions chosen according to h and H. Specifically, we calculate the mean \(h_m\) of H, find the d dimensions with the largest gap between h and \(h_m\), and then split S along the middle of that gap for each of the d dimensions.
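As an illustration, this first split might be implemented as in the following sketch (h and H are the R-states the refinement was invoked with, as in Sect. 6):

```python
import numpy as np

def aggressive_initial_split(h: np.ndarray, H: np.ndarray, d: int = 10):
    """Split the whole state space S along the d dimensions in which h
    deviates most from the mean of H, cutting each at the gap's midpoint.
    The returned p maps a state vector to one of up to 2**d partitions."""
    h_m = H.mean(axis=0)
    dims = np.argsort(-np.abs(h - h_m))[:d]   # largest-gap dimensions
    cuts = (h[dims] + h_m[dims]) / 2.0        # midpoint of each gap
    def p(state: np.ndarray) -> tuple:
        return tuple(bool(v) for v in (state[dims] > cuts))
    return p
```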

The resulting partitioning can be comfortably stored in a decision tree of depth d. It is intuitively similar to the quantisation suggested by Omlin and Giles, except that it focuses only on the dimensions with the greatest deviation between the states being split, and splits only the ‘active’ range of values.

The value d may be set by the user, and increased if the extraction is suspected to have converged too soon. We found that \(d=10\) generally provides a strong enough initial partitioning of S, without making the abstraction too large for feasible exploration.

7 Experimental results

We first demonstrate the effectiveness of our method on LSTM- and GRU-acceptorsFootnote 15 trained on the Tomita grammars (1982), which have been used as benchmarks in previous automata-extraction work (Wang et al. 2017), and then on substantially more complicated languages. We show the effectiveness of our refinement-based equivalence query approach over that of plain random sampling and present cases in which our method extracts informative DFAs where other approaches fail. In addition, for some seemingly perfect networks, we find that our method quickly returns counterexamples representing deviations from the target language.

We clarify that when we refer to extraction time for any method, we consider the entire process: from the moment the extraction begins, to the moment a DFA is returned.Footnote 16

Prototype Implementation and Settings

We implemented all methods in Python, using PyTorch (Paszke et al. 2019) and scikit-learn (Pedregosa et al. 2011). For the SVM classifiers, we used the SVC variant, with regularisation factor \(C=10^4\) to encourage perfect splits and otherwise default parameters—in particular, the RBF kernel with gamma value \(1/(\text {num features})\).

All training and extraction was done on Amazon instances of type p3.2xlarge, except for the BP and email classifier RNNs, which were run on p2.xlarge.

7.1 Languages

We consider the Tomita Grammars (7.4.1), and more complicated regular languages defined by small, randomly sampled DFAs (7.4.2). We also consider the language of legal email addresses (defined precisely in 7.9.1), and the language of balanced parentheses (BP): the set of sequences over ()a-z in which the parentheses are balanced, e.g. a(a)ba and ()(()).

7.2 Sample sets and training

Tomita and Random Regular Languages We use train, validation, and test sets of sizes 5000, 1000 and 1000 containing samples of lengths 1-100 (uniformly distributed). To get ‘representative’ sample sets, we define a distribution over each DFA’s state transitions favouring transitions which do not reduce the number of reachable states,Footnote 17 sample from that distribution, and train the RNN to provide correct output for all prefixes of every sample (as opposed to only the full samples).Footnote 18 We train these RNNs with the Adam optimiser, using initial learning rate 0.0003, an exponential learning rate scheduler with gamma 0.9, and dropout 0.1. Each RNN was trained for up to 100 epochs on its train set, or until the validation set had \(100\%\) accuracy for 3 epochs in a row, whichever came sooner.

Balanced Parentheses and Email Addresses We generated positive samples using tailored functions,Footnote 19 and negative samples as a mix of both random sequences and mutations of the positive samples.Footnote 20 Here we train the RNN only on the full samples (as opposed to classifying every prefix). We trained all networks to \(100\%\) accuracy on their train sets, and considered only those that reached \(99.9{+}\%\) accuracy on a test set consisting of up to 1000 uniformly sampled words of each of the lengths \(n\in {1,4,7,...,28}\). The positive to negative sample ratios in the test sets were not controlled. The BP and email train sets were randomly generated during training. The BP train set created \({\approx }44600\) samples, of which \({\approx }60\%\) were positive for each RNN, and reached balanced parentheses up to depth 11. The email addresses train set created 40000 samples.

7.3 Details on our extraction (practical considerations)

We apply the measures discussed in Sect. 6.3 as follows: First, for all networks, we apply our method with aggressive initial refinement depth \(d=10\) (Sect. 6.3.1). Second, we use additional counterexamples:

Additional Counterexamples For the Tomita and random DFA languages, during extraction, we used random samples as additional potential counterexamples. Specifically, whenever an equivalence query was going to accept, we considered an additional 100 potential counterexamples, each generated as follows: first, we choose a length from \(0-10\) (uniformly), and then uniformly sample a sequence of that length over the RNN input alphabet.

For BP and email addresses, during extraction, we provided each RNN with one positive and one negative sample, to be checked as potential counterexamples at each equivalence query. These were chosen as the shortest positive and shortest negative word in the train set of the RNN: for BP, the initial samples were the empty sequence (positive) and ) (negative), and for emails, the initial samples were 0@m.com (positive) and the empty sequence (negative). For BP, these samples are covered anyway by \(\hbox {L}^*\)’s initial membership queries, but for email addresses the positive sample helps ‘kick off’ the extraction, preventing the method from accepting an automaton with a single (rejecting) state.

No further parameter tuning was required to achieve our results.

7.4 Small regular languages

7.4.1 The Tomita grammars

The Tomita grammars (1982) are the following 7 languages over \(\Sigma =\{0,1\}\):

  1. 1*

  2. (10)*

  3. The complement of ((0|1)*0)*1(11)*(0(0|1)*1)*0(00)*(1(0|1)*)*, i.e.: all sequences w which do not contain an odd series of 1s followed later by an odd series of 0s,

  4. All words w not containing 000,

  5. All w for which \(\#_0(w)\) and \(\#_1(w)\) are even (where \(\#_a(w)\) is the number of a’s in w),

  6. All w for which \((\#_0(w)-\#_1(w))\equiv _3 0\), and

  7. 0*1*0*1*.

They are the languages classically used to evaluate DFA extraction from RNNs.

We trained one 1-layer GRU network with hidden size 50 for each Tomita grammar (7 GRUs in total), in the manner described in Sect. 7.2. In training, all but one of the RNNs reached 3 consecutive epochs with \(100\%\) validation set accuracy within 10 epochs, and reached \(100\%\) test set accuracy. The 6th Tomita grammar was harder to train, with the RNN reaching only \(78\%\) validation accuracy after 100 epochs. As our focus is on extraction rather than training, we repeated training on this language, eventually obtaining an RNN with perfect train and validation accuracy for this language as well (this time with initial learning rate 0.0004 and gamma 0.95). We then applied our method to extract from the perfectly trained RNNs.

For each one, our method correctly extracted and accepted the target grammar in under 1 second.

7.4.2 Random small regular languages

Though the Tomita grammars are a popular language set for evaluating DFA extraction from RNNs, they are quite simple: the largest Tomita grammars are still only 5-state DFAs over a 2-letter alphabet. As our method performed so well on these grammars, we expand to more challenging languages.

Table 1 Results for DFAs extracted using our method from 2-layer GRU and LSTM networks with various state sizes, trained on random regular languages of varying sizes and alphabets. Each row in each table represents 10 experiments with the same parameters (network hidden-state size \(d_s\), alphabet size \(|\Sigma |\), and minimal target DFA size \(|Q_T|\)). In each experiment, a random DFA is generated and an RNN is trained on it, after which a DFA is extracted from and compared to the RNN. The column \(|Q_{\mathcal {A}}|\) represents the size of the final returned DFA, \(\#\)c-exs describes how many counterexamples were used during extraction, max |c-ex| describes their maximum length, and RNN Acc. is the accuracy of the trained RNN on its test set. Each column represents the average of the 10 experiments, except for max |c-ex| which gives the overall maximum counterexample length across all RNNs in that row. Each extraction was run with a time limit of 30 seconds, and whenever an extraction timed out the last automaton proposed by \(\hbox {L}^{*}\) was taken as the extracted automaton. For the accuracies on the different lengths, 1000 random words of each length were sampled and evaluated, and for the accuracy on the training set all of the RNN’s training set was evaluated (i.e., comparing DFA against RNN)

We considered randomly-generated minimal DFAs of varying complexity, specifically, DFAs with alphabet size and number of states \((|\Sigma |,|Q|)=(3,5), (5,5)\), and (3,10). For each combination we randomly generated 10 minimal DFAs, making 30 DFAs overall. For each DFA we trained 6 2-layer RNNs: 3 GRUs and 3 LSTMs, each with hidden state sizes \(d_s=50, 100\), and 500, making 180 RNNs overall. The training method is described in Sect. 7.2. We applied our extraction method to each of these RNNs, with a time limit of 30 seconds (after which the last \(\hbox {L}^{*}\) hypothesis is returned) and initial split depth and counterexamples as described in Sect. 7.3. The results of these experiments are shown in Table 1. Each row in the table represents the average of 10 extractions.

Most extractions completed before the time limit, having reached equivalence.Footnote 21 We compared the extracted automata against the networks on their training sets and on 1000 randomly generated word samples for each of the word-lengths 10,50,100 and 1000. In all settings (hidden size, alphabet size, and DFA size) where the RNNs achieved \(100\%\) test set accuracy, our extraction obtained DFAs with perfect accuracy against their RNNs. For two RNNs which reached \(99\%\) accuracy, our extraction achieved \(99\%\) accuracy against the RNNs, and for the two RNNs with less than \(99\%\) accuracy our extraction achieved on average \(\ge 88\%\) accuracy for all evaluation sets.

7.5 Comparison with a-priori quantisation

Table 2 Results for DFAs extracted using a simple partitioning of the RNN state space, in which each state dimension is split into \(q=2\) equal segments (positive and negative). The extractions were applied to the same RNNs as in Table 1, with each row representing 10 experiments as before. \(|Q_{\mathcal {A}}|\) again reports the (average) number of states in the extracted DFAs, though this time it is rounded for clearer presentation. The extractions were run with a time limit of 500 seconds. This time, instead of reporting only the accuracy of the extracted DFAs against their RNNs on different sample sets, we also report their coverage: the fraction of samples for which the DFAs have a classification at all (i.e., do not have missing transitions). The accuracy is computed only on covered sequences, and we report the accuracy as \({-}1\) when all extractions in the row have 0 coverage for that set. For example: \(1.0 {\times } 0.12\) tells us that only \(12\%\) of samples have full transitions in the extracted DFA, but that for those \(12\%\), the DFA accuracy against the RNN is perfect

In their 1996 paper, Omlin and Giles suggested partitioning the network state space by dividing each state dimension into q equal intervals, with q being the quantisation level. We tested this method on each of our small regular language RNNs (Sect. 7.4.2), with \(q=2\) and a time limit of 500 seconds to avoid excessive memory consumption.Footnote 22

In many cases, we found that 500 seconds was not enough time for this method to extract a complete DFA from our RNNs.Footnote 23 To enable some comparison, we allow the method to return incomplete DFAs, i.e. DFAs in which some transitions are missing, and we move from evaluating just the accuracy of a DFA to evaluating both its accuracy and its coverage, with coverage being the fraction of samples for which it has a full transition path.

We provide the results of extracting with this method in Table 2, which uses the exact same RNNs as in Table 1.

The extracted DFAs are very large—with some even having 100,000 states—and yet their coverage of sequences of length 1,000 and even 100 tends to zero as the RNN complexity (state size \(d_s\), or RNN target language complexity) increases. For the covered sequences, the extracted DFA’s accuracy was often very high (over \(99\%\)), suggesting that quantisation—while impractical—is sufficiently expressive to describe a network’s state space. However, it is also possible that the sheer size of the quantisation (\(2^{50}\) for our smallest RNNs, and more for others) simply allowed each explored R-state its own A-state, giving high accuracy just by observation bias (only covered sequences could have their accuracy checked).

This is in contrast to our method, which always returns complete DFAs,Footnote 24 and which consistently extracted accurate DFAs from the same networks in a fraction of the time and memory used by the plain quantisation approach. It does so because our method maintains, from a very early point in the extraction, a complete DFA \({\mathcal {A}}\) that constitutes a constantly improving approximation of the considered RNN.

7.6 Comparison with k-Means clustering

Table 3 Results for DFA extracted using k-means clustering from the same 2-layer GRU and LSTM networks considered in Table 1, i.e., each row represents the average results of 10 experiments as before, and considers the exact same trained RNNs. The extractions did not have a time limit; instead, the number of states sampled was set to 5000 and the k values considered were \(k=1,6,11,...,31\). The accuracies were evaluated on the same sample sets as in Table 1

Next, we implemented a simple k-means clustering and extraction approach and applied it to the same networks from Sect. 7.4.2 with varying k.

Specifically, for each RNN, we sampled \(N=5000\) unique prefixes from its train set, computed the states reached from them in the RNN, and used k-means clustering to partition the state space according to those states for each of \(k=1,6,11,...,31\).Footnote 25 We then mapped the transitions of each partitioning to create 7 potential DFAs, and evaluated each one against the RNN on its 1000-sample test set to choose the best.
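
A sketch of this pipeline for a single value of k is given below; the RNN interface used (`state_after`, `step`, `classify_state`, `alphabet`) is hypothetical, standing in for whatever access an implementation has to the network:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_dfa(rnn, prefixes, k):
    """Cluster the RNN states reached by the sampled prefixes into k
    clusters, then read a DFA off the clusters: a cluster is accepting if
    its sampled states are classified as accepting, and transitions follow
    the RNN's update function. Conflicting transitions (states in one
    cluster moving to different clusters) are resolved arbitrarily here."""
    states = np.stack([rnn.state_after(p) for p in prefixes])
    km = KMeans(n_clusters=k).fit(states)
    labels = km.predict(states)
    initial = int(km.predict(rnn.state_after(()).reshape(1, -1))[0])
    accepting = {int(c) for c, s in zip(labels, states) if rnn.classify_state(s)}
    delta = {}
    for c, s in zip(labels, states):
        for a in rnn.alphabet:
            nxt = rnn.step(s, a)
            delta[(int(c), a)] = int(km.predict(nxt.reshape(1, -1))[0])
    return initial, delta, accepting
```

The 7 candidate DFAs are then obtained by running this for each \(k=1,6,11,...,31\) and keeping the DFA that best agrees with the RNN on the test set.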

k-means has a well defined and ‘reasonably quick’ stopping condition: the number of RNN states visited, and the number of clusters to be created and traversed from them, are given as input to the extraction.Footnote 26 Hence for this extraction we do not use a time limit, allowing the method to extract all of its potential DFAs in full, evaluate them, and return the best DFA. As done for the other methods, we measure for k-means the total time from beginning the extraction until a single final DFA is returned. In particular, this covers the one-time sampling of all 5000 RNN states (generally \({<}10\) seconds), making a k-state DFA from these RNN states by applying k-means clustering to them (taking from \({<}1\) to \({\sim } 50\) seconds for each k, depending on the states and on k), and finally choosing the best DFA by evaluating on the test set (generally \({<}10\) seconds). We note that the bulk of the extraction time is spent in clustering the sampled states into the different numbers of clusters k.

In Table 3 we report the results of these extractions. In particular, we report the time (in seconds) spent on each full extraction, the number of clusters k used for each best DFA, each DFA’s size \(|Q_{\mathcal {A}}|\) after minimisation, and of course each extracted DFA’s accuracy against the same sample sets as before (i.e., as in Table 1).

For the GRU networks trained on smaller DFAs (which reached \(100\%\) test-set accuracy), k-means clustering is as successful as our method, often returning a DFA with perfect or near-perfect accuracy against the target RNN. For the LSTMs and the larger DFAs, however, our method obtains far higher accuracy, often in less time. The difference in success on the LSTMs and GRUs is curious; we leave this question open in this work.

7.7 Comparison with random sampling for counterexample generation

For 3 of the Tomita grammars (specifically, Tomita grammars 3, 4, and 7), the first counterexample returned in our extraction (Sect. 7.4.1) was actually created by the initial random sampling. Moreover, for all of the Tomita grammars, answering all equivalence queries using a random sampler alone (with up to 1,000 samples per query) was successful at extracting the grammars from the RNNs, and this was also true for many of the languages considered in Sect. 7.4.2. Termination is slightly slower than with our method, as many potential counterexamples must be sampled before accepting the \(\hbox {L}^{*}\) hypothesis, but it is still fast enough to make random sampling seem appealing (the method spent \(\approx 10\) seconds on each Tomita grammar). Indeed, Mayr&Yovine (2018) even suggest such a method in their recent work, analysing it from a PAC perspective.

Given this, the question may arise whether there is any merit to the exploration and refinement of abstractions of the network, as opposed to a simple random-sampling approach to counterexample generation for \(\hbox {L}^{*}\) equivalence queries.

In this section we show the advantage of our method for counterexample generation, through the example of balanced parentheses (BP): the language of sequences with correctly balanced parentheses over the alphabet ()a-z. BP is not a regular language, but the attempt to approximate it with DFAs, and in particular the search for counterexamples to proposed DFAs, proves informative. In particular, when sampling tokens uniformly, the probability of randomly generating a sequence with nested and correctly balanced parentheses over the BP alphabet is very low. This prevents the random sampler from finding counterexamples to \(\hbox {L}^*\)’s proposed automata, each of which accepts balanced parentheses to a bounded depth (see examples in Fig. 1), highlighting the advantage of our approach.

We train one GRU and one LSTM network on BP, each with 2 layers and hidden dimension 50. We extract DFAs from these networks using \(\hbox {L}^*\), generating counterexamples once with our method and once with a random counterexample generator. The random counterexample generator works as follows: for each equivalence query, it randomly samples sequences over the input alphabet \(\Sigma\) until a counterexample (sample on which \({\mathcal {A}}\) and the RNN disagree) is found. In particular, for each length \(l=1,2,3,...\) and increasing until a counterexample is found, it generates and compares up to 1000 random samples of length l, with uniform distribution.
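
In code, the random generator might look as follows (the length cap is our own addition, so that the sketch is guaranteed to terminate):

```python
import random

def random_counterexample(alphabet, rnn_accepts, dfa_accepts,
                          per_length=1000, max_length=100):
    """Equivalence query by uniform sampling: for each length l, draw up
    to per_length uniform words over the alphabet and return the first
    disagreement between the RNN and the proposed DFA (None if no
    counterexample is found up to max_length)."""
    for l in range(1, max_length + 1):
        for _ in range(per_length):
            w = tuple(random.choice(alphabet) for _ in range(l))
            if rnn_accepts(w) != dfa_accepts(w):
                return w
    return None
```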

We allowed each method 400 secondsFootnote 27 to extract an automaton from networks trained to \(100\%\) train set accuracy. The accuracy of these extracted automata against the original networks on their training sets is recorded in Table 4, as well as the maximum parentheses nesting depth the \(\hbox {L}^{*}\) proposed automata reached during extraction.

Table 4 Accuracy of extracted automata against their networks, which were trained to 100% training accuracy on the balanced parentheses (BP) language. The comparisons were done on the training sets of the networks. The maximum nesting depth the extracted automata reached while still behaving as BP is recorded (the extraction from the GRU network ultimately returned a more complex automaton than the one extracted from the LSTM network, but this automaton no longer behaved as BP and so we have no reasonable measure for its ‘depth’). The hidden size \(d_s\) and the number of layers in each network are also noted. (For the LSTM network, this is the size of both the memory and the cell vectors, meaning the total hidden size of a single cell in this network is twice as big as the value listed.)
Table 5 Extraction of automata from GRU and LSTM networks trained to 100% accuracy on the training set for the language of balanced parentheses over the 28-letter alphabet a-z(). Each table shows the counterexamples and the counterexample generation times for each of the successive equivalence queries posed by \(\hbox {L}^{*}\) during extraction, for both our method and a brute-force approach. Generally, each successive equivalence query from \(\hbox {L}^{*}\) for either network was an automaton classifying the language of all words with balanced parentheses up to nesting depth n, with increasing n. The exception to this comes after the penultimate counterexample in the extraction from the GRU network, in which a word with unbalanced parentheses was returned as a counterexample to \(\hbox {L}^{*}\) (whose automaton currently rejects it)
Fig. 1

Select automata of increasing size for recognising balanced parentheses over the 28 letter alphabet a-z, (, ), up to nesting depths 1 (flawed), 1 (correct), 2, and 4, respectively. In this and in all following automata figures, the initial state is an octagon, accepting states have a double border, and sink reject states (rejecting states whose transitions all lead back to themselves) are not drawn

Fig. 2

Automaton with vague resemblance to the BP automata of Fig. 1, but no longer representing a language of balanced parentheses up to a certain depth. (Showing how a trained network may be overfitted past a certain sample complexity.)

We list the counterexamples and counterexample generation times for each of the BP network extractions in Table 5. Note the succinctness and the generation speed of the counterexamples generated by our method: excluding two samples at the end of the GRU extraction, they are clear of the ‘neutral’ tokens a-z and of repeating parentheses (e.g., ()()), as these were not necessary to advance the automata learned by \(\hbox {L}^{*}\) (Fig. 1). In contrast, the random sampling method has difficulty finding legally balanced sequences, taking a long time to find counterexamples at all, and including many ‘uninformative’ neutral tokens in its results.

The extracted DFAs themselves were also pleasing: each subsequent DFA proposed by \(\hbox {L}^{*}\) for this language was capable of accepting all words with balanced parentheses of increasing nesting depth, as pushed by the counterexamples provided by our method (Fig. 1). In addition, for the GRU network trained on BP, our extraction method managed to push past the limits of the network’s ‘understanding’—finding the point at which the network begins to overfit to the particularly deeply-nested examples in its training set, and extracting the slightly more complicated automaton seen in Fig. 2.

7.8 Additional variations on our method

We show the necessity of the initial split and counterexamples for our method, examine the effect of running extraction for a longer time (when it has not completed), and support the decision to return the final \(\hbox {L}^{*}\) hypothesis \({\mathcal {A}}\), as opposed to the final abstraction \(A_{R,p}\), whenever the extraction has not reached equivalence in time.

Removing the Initial Split Heuristics

Table 6 Extracting with our method from the same RNNs as in Table 1, but this time without the initial heuristics as described in Sect. 7.3. The extraction time is reduced significantly, along with the accuracy: \(\hbox {L}^*\)’s first hypotheses are frequently very small, and without the aggressive initial state-splitting and random samples, the abstraction is too coarse to find counterexamples

We run the extraction again on the same RNNs as in Table 1, but this time setting the initial split depth to 1 and the number of random samples before accepting a hypothesis to 0. We report the results in Table 6. The average number of counterexamples (“\(\#\)c-exs”) per extraction drops to almost 0 for most settings, meaning the majority of \(\hbox {L}^{*}\)’s initial hypotheses are accepted immediately by the method (without counterexamples). The number of states in the returned automata is often smaller than in the target, and their accuracy drops significantly.

This shows that indeed our method must be coupled with some heuristics to prevent acceptance in the early stages, during which both the abstraction and the \(\hbox {L}^{*}\) hypothesis only reflect the RNN’s classification on very short sequences, and have not yet diverged.

Timing out: Using the Abstraction, and Increasing the Time Limit

Table 7 Extracting with our method from 2-layer GRUs and LSTMs trained imperfectly on DFAs with size \(|\Sigma |=|Q|=10\), varying RNN hidden size (\(d_s\)) and extraction time limit. Each row represents the average of 10 experiments, with average DFA (\(|Q_{\mathcal {A}}|\), \(|Q_{A_{R,p}}|\)) and final partitioning (|p|) sizes rounded for space. We report both the accuracy (against the RNN) of the final \(\hbox {L}^{*}\) hypothesis \({\mathcal {A}}\), and that of the abstraction \(A_{R,p}\) used by the method to find counterexamples to each \({\mathcal {A}}\). We see that the final \(\hbox {L}^{*}\) hypothesis is clearly the superior option when extraction has not terminated. Unfortunately, we also see that the accuracy does not improve much with additional time; this is because hypothesis generation (the time from counterexample to new hypothesis) becomes slower with each iteration

When we increase \(|\Sigma |\) and |Q| of our randomly generated target DFAs to 10, the training routine used in this work is not sufficient for the RNNs with dimensions \(d_s=50\) and \(d_s=100\) to train perfectly, and they reach on average \({<}80\%\) test set accuracy on their target languages. For these RNNs, we observe that our extraction method does not reach equivalence in the provided time. In particular, the \(\hbox {L}^{*}\) hypotheses grow very large, and the extraction often times out while expanding the observation table: the internal table of sequence labels maintained by \(\hbox {L}^{*}\) between equivalence queries (i.e., the majority of the time is spent on refining \({\mathcal {A}}\) after each new counterexample).

In all of our experiments, whenever we run out of time, we return the last \(\hbox {L}^{*}\) hypothesis \({\mathcal {A}}\) as the extracted automaton. In this section, we check how much this hypothesis improves as we increase the time limit, and evaluate the option of returning the last abstraction \(A_{R,p}\) used by our method instead.

Table 7 shows a set of extractions from imperfectly trained RNNs, trained with the same training routine and number of repetitions as before. We generate 10 DFAs, all with \(|Q|=|\Sigma |=10\), and on each DFA train 4 2-layer RNNs: 2 GRUs and 2 LSTMs, with hidden state sizes \(d_s=50\) and \(d_s=100\). We then extract from each RNN with 5 different time limits ranging from 50 to 1000 seconds. This means that overall Table 7 shows results for 10 DFAs, 40 RNNs, and 200 extractions (each row represents 10 extractions).Footnote 28

Alongside the details of the last \(\hbox {L}^{*}\) hypothesis \({\mathcal {A}}\), we also report the size of our final partitioning p (i.e., number of partitions it divides the state space into), the size (after minimisation) of the abstraction \(A_{R,p}\) it defines, and the accuracy of \(A_{R,p}\) against its target RNN.

The results show clearly that the \(\hbox {L}^{*}\) hypothesis is the preferable choice when the extraction does not complete. Effectively, the partitioning p and abstraction \(A_{R,p}\) it defines act as a tool for refining the \(\hbox {L}^{*}\) hypotheses, and not so much the other way around.Footnote 29

The results also show that, for these non-terminating extractions, it is ‘difficult’ to improve beyond the automata reached in the early stages: increasing the extraction time to 100, 200, and even 1000 seconds gives only a small increase in accuracy each time. We also see that the number of counterexamples used per extraction grows very slowly with the increase in time, i.e., more time does not significantly increase the number of hypotheses presented by \(\hbox {L}^{*}\).

Analysing the time spent by the extraction reveals that \(\hbox {L}^{*}\) gets ‘stuck’ refining the large hypotheses it creates, generating many membership queries without reaching new equivalence queries. The average equivalence query time across all experiments is \({<}1.5\)s, whereas the maximum hypothesis refinement time in each experiment grew to over 10, 48, 60, 170 and 314 seconds for each of the time limits respectively.Footnote 30 A more efficient implementation of \(\hbox {L}^{*}\), or possibly an approximation of it, would be an important step towards scaling this method.

7.9 Discussion

7.9.1 Adversarial inputs

Balanced Parentheses Excitingly, the penultimate counterexample returned by our method during the extraction of balanced parentheses (BP) in Sect. 7.7 is an adversarial input: a sequence with unbalanced parentheses that the network accepts (despite its target language containing only sequences with balanced parentheses). This input is found in spite of the network’s seemingly perfect behavior on its set of 44,000+ training samples. Note that the random sampler did not manage to find such samples.

Inspecting the extracted automata indeed reveals an almost-but-not-quite correct DFA for the BP language (Fig. 2). The RNN overfit to random peculiarities in the training data and did not learn the intended language, and our extraction method managed to discover and highlight an example of this ‘incorrect’ behaviour.

Email Addresses For a seemingly perfect LSTM-acceptor trained on the regular expression

[a-z][a-z0-9]*@[a-z0-9]+.(com|net|co.[a-z][a-z])$

(simple email addresses over the 38-letter alphabet \(\{\)a-z, 0-9, @, .\(\}\)) to 100% accuracy on a 40,000 sample train set and a 2,000 sample test set, our method quickly returned the counterexamples seen in Table 8, clearly showing words that the network misclassified (e.g., 25.net). We ran extraction on this network for 400 seconds, and while we could not extract a representative DFA in this time,Footnote 31 our method did show that the network learned a far more elaborate (and incorrect) function than needed. In contrast, given a 400 second overall time limit, the random sampler did not find any counterexample beyond the provided one.

We note that our implementation of k-means clustering and extraction had no success with this network, returning a completely rejecting automaton (representing the empty language), despite trying k values of up to 100 and using all of the network states reached using a train set with a 50:50 ratio between positive and negative samples.

Beyond demonstrating the capabilities of our method, these results also highlight the brittleness in generalisation of trained RNNs, and suggest that evidence based on test-set performance should be interpreted with extreme caution. This echoes the results of Gorman and Sproat (2016), who trained a neural architecture based on a multi-layer LSTM to mimic a finite state transducer (FST) for number normalisation. They showed that the RNN-based network, trained on 22M samples and validated on a 2.2M sample development set to 0% error on both, still made occasional errors (though with error rate \(<0.0001\)) when applied to a 240,000 sample blind test set.

7.9.2 Limitations and discussion

\(\hbox {L}^{*}\) Optimisation One limitation of the method shown in this work is the polynomial time complexity of \(\hbox {L}^*\), which becomes a significant issue as the extracted DFA grows (see Sect. 7.8, Timing out). Applying our method with more efficient variants of \(\hbox {L}^{*}\), such as the TTT algorithm presented by Isberner et al. (2014), may yield better results.

\(\hbox {L}^{*}\) and Noise Whenever applied to an RNN that has failed to generalise properly to its target language, our method soon finds several adversarial inputs, and begins to build very large DFAs. As noted above, due to \(\hbox {L}^*\)’s polynomial complexity and intolerance to noise, this quickly becomes extremely slow.Footnote 32

Of course, by the nature of \(\hbox {L}^*\), any complexity in the final returned automaton is only a result of the inherent complexity of the RNN’s learned behaviour, and so we may say that this result is not necessarily incorrect. Nevertheless, it limits us, and seeking a way to recognise and overcome ‘noise’ in the given network’s behaviour is an interesting avenue for future work.

Adversarial Inputs On the bright side, this same limitation does demonstrate the ease with which our method identifies imperfectly trained networks. These cases are annoyingly frequent: for many RNN-acceptors with 100% train and test accuracy on large test sets, our method was able to find many simple misclassified examples (Sect. 7.9.1).

Table 8 Counterexamples generated during extraction from an LSTM email-address network with \(100\%\) train and test accuracy. Examples of the network deviating from its target language are shown in bold

Note on Heuristics In Sect. 3, we note that existing works consider multiple candidate DFAs, and must then choose the best according to a heuristic. Our method can also be seen as considering multiple DFAs and abstractions, with the equivalence query being the ‘heuristic’ deciding whether to terminate or consider more DFAs/abstractions. We highlight the differences here. First, in our method, the DFAs considered are always minimal (thanks to \(\hbox {L}^*\)), and the abstractions used can be much smaller than in other methods. In particular, the abstractions can be small because they are dynamically refined by the method on an as-needed basis, and so can afford to be very coarse: ‘missed partitions’ are discovered and fixed automatically by the method. Secondly, even when the refinement eventually creates a very large abstraction, the equivalence query is applied ‘on-the-fly’, meaning it can cut off and return counterexamples/refine the abstraction even before \(A_{R,p}\) has been fully mapped.

8 Learning from only positive samples

Thus far, the method presented here can be used to learn a DFA from a set of positive and negative samples: we train an RNN-acceptor to generalise from them, and then extract a DFA from it.

However, we can also use our method to learn a DFA from positive samples only, by training an RNN using a language-modeling objective, and then extracting from an RNN-acceptor interpretation of it. Such RNNs are trained only on positive samples, attempting to model their distribution rather than classify what is or isn’t in the language:

A language-model RNN (LM-RNN) over an alphabet \(\Sigma\) and end-of-sequence symbol \(\$\notin \Sigma\) is an RNN with classification component \(f_R:S_R\rightarrow [0,1]^{\Sigma \cup \{\$\}}\) defining for every RNN-state a distribution over \(\Sigma \cup \{\$\}\). An LM-RNN effectively defines for every sequence \(w\in \Sigma ^*\) and token \(\sigma \in \Sigma \cup \{\$\}\) the probability of sampling \(\sigma\) after seeing w: \(P(\sigma |w)=f_R(\hat{g_R}(w))(\sigma )\).

LM-RNNs can be interpreted as classifiers by taking a threshold t and defining that they accept exactly the set of sequences \(w=w_1w_2...w_n\in \Sigma ^*\) which satisfy: 1. \(P(\$|w)\ge t\), and 2. for every strict prefix \(w'=w_1w_2...w_i\), \(i<n\) of w, \(P(w_{i+1}|w')\ge t\). This interpretation recently appears as locally \(\epsilon\)-truncated support in the work of Hewitt et al. (2020), with \(\epsilon =t\).
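
Spelled out, with a hypothetical `next_token_probs(prefix)` returning the LM-RNN's next-token distribution as a dictionary, the acceptance check reads:

```python
def lm_accepts(next_token_probs, word, t, end="$"):
    """Locally t-truncated acceptance: a word is accepted iff every token
    has probability >= t given its preceding prefix, and the end-of-sequence
    symbol '$' has probability >= t after the full word."""
    for i, token in enumerate(word):
        if next_token_probs(word[:i]).get(token, 0.0) < t:
            return False
    return next_token_probs(word).get(end, 0.0) >= t
```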

LM-RNNs can therefore be adapted for extraction as classifiers by defining each of their states as accepting or rejecting according to the probability they assign to $, and introducing an artificial sink-reject state vFootnote 33 that is entered whenever a sequence transitions through a token with too low probability. Formally:

Making an RNN acceptor Let R be an LM-RNN with reachable state space \(S\subsetneq {\mathbb {R}}^{d_s}\), initial state \(h_{0,R}\in S\), update function \(g_R\), and classification function \(f_R\). Let \(t\in [0,1]\) be a threshold and let \(v\in {\mathbb {R}}^{d_s}\setminus S\) be a vector that cannot be reached in R from any input sequence.Footnote 34 To create an RNN-acceptor \(R'\) from R, we build the components \(h'_{0,R}=h_{0,R}\), \(f'_R(s)=\begin{cases} \mathrm{Acc} & \text{if } f_R(s)(\$)\ge t\\ \mathrm{Rej} & \text{otherwise} \end{cases}\), and \(g'_R(s,\sigma )=\begin{cases} v & \text{if } f_R(s)(\sigma )<t \text{ or } s=v\\ g_R(s,\sigma ) & \text{otherwise} \end{cases}\).

The new RNN-acceptor \(R'\) can now be passed directly to our algorithm for extraction.
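
A minimal sketch of this construction, assuming the LM-RNN exposes an initial state `h0`, update function `g`, and distribution `f` (the names are ours):

```python
class ThresholdAcceptor:
    """Sketch of the RNN-acceptor R' built from an LM-RNN R: states map
    to Acc/Rej by the probability they assign to '$', and any transition
    whose token probability falls below t leads to a sink-reject sentinel
    (playing the role of the unreachable vector v)."""
    SINK = object()  # stands in for v

    def __init__(self, lm, t):
        self.lm, self.t = lm, t  # lm.h0, lm.g(s, a), lm.f(s) assumed
        self.initial = lm.h0

    def step(self, s, a):  # g'_R
        if s is ThresholdAcceptor.SINK or self.lm.f(s).get(a, 0.0) < self.t:
            return ThresholdAcceptor.SINK
        return self.lm.g(s, a)

    def accepts_state(self, s):  # f'_R, with the sink always rejecting
        return s is not ThresholdAcceptor.SINK and \
            self.lm.f(s).get("$", 0.0) >= self.t
```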

When the language is ‘small’—in the sense that uniformly sampled sequences are likely to be rejected—sampling sequences according to the RNN’s distribution is likely to hit a sample that has not yet been considered by \(\hbox {L}^*\). Hence here random sampling according to the RNN’s distribution can be a useful augmentation to the equivalence query—though this can also create overly long counterexamples (Sect. 8.1.3).
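
Such distribution-guided sampling might look as follows (same assumed LM interface as in the sketch above, with `max_len` capping the sample length):

```python
import random

def sample_from_lm(lm, max_len=100, end="$"):
    """Draw one sequence from the LM-RNN's own distribution, token by
    token, stopping at '$' or at max_len."""
    s, word = lm.h0, []
    for _ in range(max_len):
        probs = lm.f(s)  # dict: token -> probability
        token = random.choices(list(probs), weights=probs.values())[0]
        if token == end:
            break
        word.append(token)
        s = lm.g(s, token)
    return tuple(word)
```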

This approach—training an LM-RNN, adapting it as a classifier, and then extracting from it with the method presented in this work—has been recently applied by Yellin&Weiss (2021) to elicit a sequence of DFAs from trained LM-RNNs, as part of a process for learning context free grammars from trained RNNs.

Note Extracting from LM-RNNs requires some hyperparameter tuning, as changing the threshold t changes the set of sequences accepted by \(R'\).

8.1 Proof of concept

We provide a small number of example extractions from LM-RNNs trained on non-regular languages, observing the ability of the method to generate increasingly ‘complex’ DFA approximations of the targets. Further examples can be found in Yellin&Weiss (2021).

8.1.1 \(a^nb^n\)

We train a 2-layer LSTM-based LM-RNN with hidden dimension \(d_s=50\) on positive samples from the language \(a^nb^n=\{a^ib^i\ |\ i\in {\mathbb {N}}\}\).Footnote 35 We then interpret it as an RNN-acceptor as described above, and extract from it using our extraction method, with \(t=0.1\) and a time limit of 400 seconds.

As expected, the extraction generates a series of DFA approximations of the non-regular target language; we present some of these in Fig. 3. The extraction ultimately reached DFAs approximating \(a^nb^n\) up to \(n\le 20\) before timing out, with the majority of the time spent on refining the \(\hbox {L}^{*}\) hypotheses, which took longer as the DFAs grew: the final hypotheses returned by \(\hbox {L}^{*}\) took 46, 54, and 63 seconds each to generate after their ‘prompting’ counterexamples, and the next \(\hbox {L}^{*}\) refinement after them also timed out after 53 seconds (meanwhile, each of the counterexamples took \(<5\) seconds to generate). This result suggests that this method may benefit from applying a more efficient implementation of \(\hbox {L}^*\), such as the TTT algorithm of Isberner et al. (2014).

Fig. 3

Automata approximating the language \(a^nb^n\) up to different lengths, extracted from an RNN trained on only positive examples. The extraction created ‘correct’ approximations up to \(n=20\) before reaching the time limit

8.1.2 Dyck-3

We consider the language Dyck-3 with 3 additional neutral tokens, i.e.: correctly balanced sequences over the alphabet {}()[]abc. For example, {}a(b[])c is in the language, but ([)] and ()) are not.
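
Membership in this language is easy to decide with a stack; the following checker (ours, for illustration) makes the definition concrete:

```python
def in_dyck3(word):
    """Stack-based membership check for Dyck-3 with neutral tokens a, b, c:
    brackets must be matched and correctly nested; neutral tokens are skipped."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for c in word:
        if c in "([{":
            stack.append(c)
        elif c in pairs:
            if not stack or stack.pop() != pairs[c]:
                return False
        elif c not in "abc":
            return False  # token outside the alphabet
    return not stack

# e.g. in_dyck3("{}a(b[])c") -> True; in_dyck3("([)]") -> False;
#      in_dyck3("())") -> False
```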

We use a 2-layer GRU with dimension 50, and train it as a language model on 50000 non-unique samples of lengths 1-100 from Dyck-3 for 20 epochs, reaching a train, test, and validation cross-entropy loss of \({\approx }1.7\). We interpret the GRU as a classifier using rejection threshold \(t=0.01\), and extract from it using our method with a time limit of 400 seconds and initial split depth \(d=10\).Footnote 36

Fig. 4

An automaton approximating the language Dyck-3 with neutral tokens a-c, obtained in 128 seconds as the \(24{}^{\mathrm{th}}\) hypothesis during extraction from a GRU trained on only positive samples from the language. The automaton correctly recognises many (but not all) correct parenthesis nestings up to depth \(n=3\); for example, it accepts the sequence {([])}() but not the sequence ({()}). It rejects the empty sequence; this is an artefact of the RNN’s behavior

Fig. 5

The next hypothesis presented by \(\hbox {L}^{*}\) after receiving the counterexample [([])] to the DFA shown in Fig. 4, while extracting from our LM-GRU trained on Dyck-3. While the previous hypotheses reflected clear (regular) subsets of Dyck-3 with bounded depth, \(\hbox {L}^{*}\) has now found several ‘irregularities’ in the RNN, and encoded them into a new hypothesis which is much larger and more complicated than those before it

The abstraction-based equivalence query provides \(\hbox {L}^{*}\) with counterexamples teaching it new ‘parentheses nestings’ one at a time,Footnote 37 creating in 128 seconds the Dyck-3 approximation \({\mathcal {A}}_{24}\) shown in Fig. 4 (the \(24{}^{\mathrm{th}}\) hypothesis created during the extraction). Each of the counterexamples, including those after \({\mathcal {A}}_{24}\), takes under 3 seconds to find.

After the counterexample [([])] returned for \({\mathcal {A}}_{24}\) however, \(\hbox {L}^{*}\) begins to find irregularities in the GRU’s behavior, and jumps from the 26-state DFA shown in Fig. 4 to the 47-state DFA shown in Fig. 5. The new hypothesis shows us how the GRU has overfitted to the training data. For example, one of the shortest sequences reaching the ‘new’ accepting state 41 is [([a]]), and indeed checking the GRU shows that it accepts this sequence despite its parentheses being incorrectly balanced. Following the transitions for this sequence, the GRU’s ‘first mistake’ appears to be on the neutral tokens of state 9, which instead of following a self-loop now lead to a different state, 22.

Up until \({\mathcal {A}}_{24}\), the \(\hbox {L}^{*}\) refinement time (time from counterexample to next equivalence query) was \(<10\) seconds per hypothesis. The next refinement, creating \({\mathcal {A}}_{25}\), however takes 68 seconds, and from there all remaining refinements take \(15{-}35\) seconds each.

8.1.3 Sampling the LM-RNN for equivalence queries

Fig. 6

The last DFA extracted from the LM-GRU trained on Dyck-3 with neutral tokens a-c, when extracting with \(\hbox {L}^{*}\) for 400 seconds and only using LM-sampling with maximum length 10 for the equivalence queries. It is not a subset of Dyck-3: for example, it accepts the sequence ]]]}. This seems to be an oversight in the extraction: the RNN does not accept this sequence, and an appropriate counterexample would fix this

Long Samples We take the same Dyck-3 RNN as above and again use \(\hbox {L}^{*}\) to extract from it for 400 seconds, but this time with the equivalence query based only on comparison of samples generated from the RNN’s distribution. Specifically, for each equivalence query, we sample sequences up to length 100 indefinitely (as the focus here is finding counterexamples, not reaching equivalence quickly) with tokens chosen according to the GRU’s next-token distribution.

Sampling the GRU is effective for creating well balanced nested parentheses, and the method rejects the initial hypotheses of \(\hbox {L}^{*}\) (in which the parentheses are not yet nested) in under one second. The counterexample has 57 tokens and is:

{c{}{(b[]){()}c}[]()}({{{}c}ccca}cc[]){b}bbb[]abc[]a[c]()

which reaches a maximum nesting depth of 4 and shows multiple parentheses nesting combinations. Unfortunately, a second equivalence query is never made before reaching the time limit. The length of the counterexample slows \(\hbox {L}^{*}\) down (its time complexity is polynomial in, among other things, the length of its counterexamples), and—possibly more significantly—it is possible that this counterexample has exposed \(\hbox {L}^{*}\) to many ‘incorrect’ behaviours in the RNN, forcing it to begin working on a large DFA covering all of them at a very early stage in the extraction.

LM Sampling: Short Samples A second attempt at extraction with RNN-sampled counterexamples,Footnote 38 this time with maximum sample length 10, creates 23 DFAs. The last of these is shown in Fig. 6.

The equivalence queries are fast (the first ten take \({<}1\) second each, and all take \({<}6\) seconds),Footnote 39 though the extraction does not as clearly resemble Dyck-3: the DFAs have irregularities relative to those obtained with the abstraction-based \(\hbox {L}^{*}\) extraction method. We do not know whether this is due to the random sampling missing key counterexamples (such as the [} counterexample in Sect. 8.1.2) or a reflection of unwanted behaviours in the RNN, but initial checks of misclassified sequences in the last DFA of this extraction show that the RNN actually classifies them correctly, suggesting that at least some key counterexamples could help ‘clean’ these DFAs.

9 Conclusions

We present a novel technique for extracting deterministic finite automata from recurrent neural networks, with roots in exact learning. As our method makes no assumptions about the internal configuration of the network, it is easily applicable to any RNN architecture, and we evaluate it on the popular LSTM and GRU models. We also show how to apply it to RNNs trained as language models rather than as acceptors.

In contrast to previous methods, our method is not affected by hidden state size, and successfully extracts representative DFAs from trained RNNs of arbitrary size—provided of course that the language learned by these RNNs can be approximated by DFAs. Our technique works with little to no parameter tuning, and requires very little prior information to get started (the input alphabet, and optionally 2 labeled samples).

By the nature of \(\hbox {L}^{*}\), which always returns the minimal automaton consistent with all of its observations, our method is guaranteed to never extract a DFA more complicated than the language of the RNN being considered. Moreover, the counterexamples returned during our extraction can point us to ‘incorrect’ (with respect to the target language) patterns that the network has learned without our awareness.

Beyond scalability and ease of use, our method obtains reasonable approximations for RNNs even if extraction is cut short: for the poorly trained RNNs (RNNs with \({<}80\%\) accuracy on their own test sets) considered in Table 7, our method obtains \({\ge }77\%\) train set accuracy in each of the extractions. Moreover, for networks that accurately represent small automata, we have shown that our method gets very good results: in these cases our method often obtains small, succinct DFAs, with accuracies of over \(99\%\) against their networks, in seconds or tens of seconds of extraction (Table 1). This is in contrast to existing methods, which require orders of magnitude more time to complete, and often return cumbersome or inaccurate DFAs (Tables 2 and 3).