Synthesizing Context-free Grammars from Recurrent Neural Networks

We present an algorithm for extracting a subclass of the context-free grammars (CFGs) from a trained recurrent neural network (RNN). We develop a new framework, pattern rule sets (PRSs), which describe sequences of deterministic finite automata (DFAs) that approximate a non-regular language. We present an algorithm for recovering the PRS behind a sequence of such automata, and apply it to the sequences of automata extracted from trained RNNs using the L * algorithm. We then show how the PRS may be converted into a CFG, enabling a familiar and useful presentation of the learned language. Extracting the learned language of an RNN is important to facilitate understanding of the RNN and to verify its correctness. Furthermore, the extracted CFG can augment the RNN in classifying correct sentences, as the RNN's predictive accuracy decreases when the recursion depth and distance between matching delimiters of its input sequences increase.


Introduction
Recurrent Neural Networks (RNNs) are a class of neural networks adapted to sequential input, enjoying wide use in a variety of sequence processing tasks. Their internal process is opaque, prompting several works into extracting interpretable rules from them. Existing works focus on the extraction of deterministic or weighted finite automata (DFAs and WFAs) from trained RNNs [18,6,26,3].
However, DFAs are insufficient to fully capture the behavior of RNNs, which are known to be theoretically Turing-complete [20], and for which there exist architecture variants such as LSTMs [14] and features such as stacks [9,23] or attention [4] increasing their practical power. Several recent investigations explore the ability of different RNN architectures to learn Dyck, counter, and other non-regular languages [19,5,28,21], with mixed results.
While the data indicates that RNNs can generalize and achieve high accuracy, they do not learn hierarchical rules, and generalization deteriorates as the length and 'depth' of the input grows [19,5,28]. Sennhauser and Berwick conclude that "what the LSTM has in fact acquired is sequential statistical approximation to this solution" instead of "the 'perfect' rule-based solution" [19]. Similarly, Yu et al. conclude that "the RNNs can not truly model CFGs, even when powered by the attention mechanism" [28]. This is in line with Hewitt et al., who note that a fixed precision RNN can only learn a language of fixed depth strings [13].
Goal of this paper We wish to extract a CFG from a trained RNN. In particular, we wish to find the CFG that not only explains the finite language learnt by the RNN, but generalizes it to strings of unbounded depth and distance.
Our approach Our method builds on the DFA extraction work of Weiss et al. [26], which uses the L * algorithm [2] to learn the DFA of a given RNN. As part of the learning process, L * creates a sequence of hypothesis DFAs approximating the target language. Our main insight is in treating these hypothesis DFAs as coming from a set of underlying rules, that recursively improve each DFA's approximation of the target CFG by increasing the distance and embedded depth of the sequences it can recognize. In this light, synthesizing the target CFG becomes the problem of recovering these rules.
We propose the framework of pattern rule sets (PRSs) for describing such rule applications, and present an algorithm for recovering a PRS from a sequence of DFAs. We also provide a method for converting a PRS to a CFG, and test our method on RNNs trained on several PRS languages. Pattern rule sets are expressive enough to cover several variants of the Dyck languages, which are prototypical context-free languages (CFLs): the Chomsky-Schützenberger representation theorem shows that any CFL can be expressed as a homomorphic image of a Dyck language intersected with a regular language [16].
A significant issue we address is that the extracted DFAs are often inexact, either through inaccuracies in the RNN, or as an artifact of the L * algorithm.
To the best of our knowledge, this is the first work on synthesizing a CFG from a general RNN (though some works extract push-down automata [23,9] from RNNs with an external stack, they do not apply to plain RNNs). The overall steps in our technique are given in Figure 1.
Contributions The main contributions of this paper are:
- Pattern Rule Sets (PRSs), a framework for describing a sequence of DFAs approximating a CFL.
- An algorithm for recovering the PRS generating a sequence of DFAs, which may also be applied to noisy DFAs elicited from an RNN using L * .
- An algorithm converting a PRS to a CFG.
- An implementation of our technique, and an evaluation of its success on recovering various CFLs from trained RNNs.

Deterministic Finite Automata
Definition 1 (Deterministic Finite Automata). A deterministic finite automaton (DFA) over an alphabet Σ is a 5-tuple $\langle \Sigma, q_0, Q, F, \delta \rangle$ such that Q is a finite set of states, $q_0 \in Q$ is the initial state, $F \subseteq Q$ is a set of final (accepting) states and $\delta : Q \times \Sigma \to Q$ is a (possibly partial) transition function.
Unless stated otherwise, we assume each DFA's states are unique to itself, i.e., for any two DFAs A, B (including two instances of the same DFA), $Q_A \cap Q_B = \emptyset$. We define the extended transition function $\hat{\delta} : Q \times \Sigma^* \to Q$ and the language L(A) accepted by A in the typical fashion. We also associate a language with each intermediate state q of A: the set of sequences $w \in \Sigma^*$ that are accepted when A is started from q. The states from which no sequence $w \in \Sigma^*$ is accepted are known as the sink reject states.
Definition 2. The sink reject states of a DFA $A = \langle \Sigma, q_0, Q, F, \delta \rangle$ are the maximal set $Q_R \subseteq Q$ satisfying: $Q_R \cap F = \emptyset$, and for every $q \in Q_R$ and $\sigma \in \Sigma$, either $\delta(q, \sigma) \in Q_R$ or $\delta(q, \sigma)$ is not defined.

Definition 3 (Defined Tokens).
Let $A = \langle \Sigma, q_0, Q, F, \delta \rangle$ be a complete DFA with sink reject states $Q_R$. For every $q \in Q$, its defined tokens are $\mathrm{def}(A, q) \triangleq \{\sigma \in \Sigma \mid \delta(q, \sigma) \notin Q_R\}$. When the DFA A is clear from context, we write def(q).
All definitions for complete DFAs are extended to incomplete DFAs A by considering their completion: an extension of A in which all missing transitions are connected to a (possibly new) sink reject state.
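The definitions above can be made concrete with a small sketch. The class below is only an illustration under our own assumptions: the names `DFA`, `run`, `sink_reject_states` and `defined_tokens` are hypothetical and not taken from the paper's implementation. Missing dictionary entries play the role of undefined transitions, and sink reject states are computed as the states from which no accepting state is reachable, which is the maximal set required by Definition 2.

```python
from collections import deque


class DFA:
    """DFA with a possibly partial transition function, as in Definition 1."""

    def __init__(self, alphabet, q0, states, finals, delta):
        self.alphabet = set(alphabet)
        self.q0 = q0
        self.states = set(states)
        self.finals = set(finals)
        self.delta = dict(delta)  # (state, symbol) to state; missing keys are undefined

    def run(self, word, start=None):
        """Extended transition function; None if an undefined transition is hit."""
        q = self.q0 if start is None else start
        for symbol in word:
            q = self.delta.get((q, symbol))
            if q is None:
                return None
        return q

    def accepts(self, word):
        return self.run(word) in self.finals

    def sink_reject_states(self):
        """States from which no accepting state is reachable (Definition 2)."""
        predecessors = {}
        for (src, _), dst in self.delta.items():
            predecessors.setdefault(dst, set()).add(src)
        live = set(self.finals)
        queue = deque(self.finals)
        while queue:  # backward reachability from the accepting states
            state = queue.popleft()
            for src in predecessors.get(state, ()):
                if src not in live:
                    live.add(src)
                    queue.append(src)
        return self.states - live

    def defined_tokens(self, q):
        """def(A, q): symbols that lead from q to a state that is not a sink reject state."""
        sinks = self.sink_reject_states()
        return {s for s in self.alphabet
                if (q, s) in self.delta and self.delta[(q, s)] not in sinks}


# A recogniser for the single word "ab", with an explicit sink reject state 3.
example = DFA("ab", 0, {0, 1, 2, 3}, {2},
              {(0, "a"): 1, (1, "b"): 2, (0, "b"): 3, (3, "a"): 3, (3, "b"): 3})
```

In this example, state 3 is the only sink reject state, so the transition on 'b' from state 0 does not count as a defined token of state 0.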

Definition 5 (Replacing a State).
For a transition function $\delta : Q \times \Sigma \to Q$, state $q \in Q$, and new state $q_n \notin Q$, we denote by $\delta_{[q \leftarrow q_n]} : Q' \times \Sigma \to Q'$ the transition function over $Q' = (Q \setminus \{q\}) \cup \{q_n\}$ and Σ that is identical to δ except that it redirects all transitions into or out of q to be into or out of $q_n$.
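As a minimal sketch of this operation, assuming a transition function stored as a Python dictionary from (state, symbol) pairs to states (the helper name `replace_state` is ours, not the paper's):

```python
def replace_state(delta, q, q_new):
    """Return a copy of delta in which every transition into or out of q uses q_new instead."""
    def rename(state):
        return q_new if state == q else state

    return {(rename(src), symbol): rename(dst) for (src, symbol), dst in delta.items()}


delta = {(0, "a"): 1, (1, "b"): 2, (1, "a"): 1}
renamed = replace_state(delta, 1, "j")
```

The original dictionary is left untouched, which matches the reading of the replaced function as a new object over the new state set.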

Dyck Languages
A Dyck language of order N is expressed by the grammar $D ::= \varepsilon \mid L_1 D R_1 \mid \dots \mid L_N D R_N \mid D\,D$ with unique symbols $L_1, \dots, L_N, R_1, \dots, R_N$. Common measures of complexity for a Dyck word are its maximum distance (number of characters) between matching delimiters and its embedded depth (number of unclosed delimiters) [19]. We generalize and refer to Regular Expression Dyck (RE-Dyck) languages as languages expressed by the same CFG, except that each $L_i$ and each $R_i$ derives some regular expression.
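The two complexity measures can be computed with a single stack pass. The sketch below is illustrative only: the function name `dyck_stats` is hypothetical, and it assumes that distance counts the characters strictly between a delimiter and its match, which is one possible reading of the definition in [19].

```python
def dyck_stats(word, pairs):
    """Return (max_distance, max_depth) for a balanced word, or None if it is not a Dyck word.

    pairs maps each opening delimiter to its closing delimiter.
    """
    closers = set(pairs.values())
    stack = []  # (opening symbol, position) of the currently unclosed delimiters
    max_distance = 0
    max_depth = 0
    for position, symbol in enumerate(word):
        if symbol in pairs:
            stack.append((symbol, position))
            max_depth = max(max_depth, len(stack))
        elif symbol in closers:
            if not stack or pairs[stack[-1][0]] != symbol:
                return None  # closing delimiter without a matching opener
            _, opened_at = stack.pop()
            # Assumption: distance is the number of characters strictly in between.
            max_distance = max(max_distance, position - opened_at - 1)
        else:
            return None  # symbol outside the delimiter alphabet
    return (max_distance, max_depth) if not stack else None


order2 = {"(": ")", "[": "]"}
stats = dyck_stats("([])()", order2)
```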
We present regular expressions as is standard, for example: L({a|b}·c) = {ac, bc}.

Patterns
Patterns are DFAs with a single exit state q X in place of a set of final states, and with no cycles on their initial or exit states unless q 0 = q X .

Definition 6 (Patterns).
Patterns are always given in minimal incomplete presentation.
We refer to a pattern's initial and exit states as its edge states. All the definitions for DFAs apply to patterns through $A_p$. We denote each pattern p's language $L_p \triangleq L(p)$, and if it is marked by some superscript i, we refer to all of its components with superscript i: $p^i = \langle \Sigma, q_0^i, Q^i, q_X^i, \delta^i \rangle$.

Pattern Composition
We can compose two non-circular patterns $p^1, p^2$ by merging the exit state of $p^1$ with the initial state of $p^2$, creating a new pattern $p^3$ satisfying $L_{p^3} = L_{p^1} \cdot L_{p^2}$.

Definition 7 (Serial Composition).
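Let $p^1, p^2$ be two non-circular patterns. Their serial composite is the pattern $p^1 \bullet p^2$ obtained by replacing the exit state $q_X^1$ of $p^1$ with the initial state $q_0^2$ of $p^2$ (in the sense of Definition 5) and taking the union of the two transition functions. We call $q_0^2$ the join state of this operation.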
If we additionally merge the exit state of p 2 with the initial state of p 1 , we obtain a circular pattern p which we call the circular composition of p 1 and p 2 . This composition satisfies L p = {L p1 ·L p2 } * .
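Both compositions can be sketched directly on transition dictionaries. This is an illustrative reading of the text, not the paper's code: the `Pattern` container and the helper names are hypothetical, state names of the two patterns are assumed to be disjoint, and the merged state keeps the name of the initial state of the second pattern so that it can be returned as the join state.

```python
from dataclasses import dataclass


@dataclass
class Pattern:
    q0: str      # initial state
    qx: str      # exit state
    delta: dict  # (state, symbol) to state, possibly partial


def _rename(delta, old, new):
    def r(state):
        return new if state == old else state
    return {(r(src), sym): r(dst) for (src, sym), dst in delta.items()}


def serial(p1, p2):
    """Merge the exit state of p1 into the initial state of p2; return (pattern, join state)."""
    delta = _rename(p1.delta, p1.qx, p2.q0)
    delta.update(p2.delta)
    return Pattern(p1.q0, p2.qx, delta), p2.q0


def circular(p1, p2):
    """Serial composition followed by merging the exit state of p2 into the initial state of p1."""
    composed, join = serial(p1, p2)
    delta = _rename(composed.delta, p2.qx, p1.q0)
    return Pattern(p1.q0, p1.q0, delta), join


def matches(pattern, word):
    """True if word leads from the initial state of the pattern to its exit state."""
    state = pattern.q0
    for symbol in word:
        state = pattern.delta.get((state, symbol))
        if state is None:
            return False
    return state == pattern.qx


pa = Pattern("a0", "a1", {("a0", "a"): "a1"})
pb = Pattern("b0", "b1", {("b0", "b"): "b1"})
ab, join_ab = serial(pa, pb)
ab_loop, join_loop = circular(pa, pb)
```

Here `ab` matches exactly "ab", while `ab_loop` matches any number of repetitions of "ab", including the empty word.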

Definition 8 (Circular Composition).
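Let $p^1, p^2$ be two non-circular patterns. Their circular composite is the circular pattern $p^1 \bullet_c p^2$ obtained from their serial composite by additionally merging the exit state of $p^2$ with the initial state of $p^1$. We call $q_0^2$ the join state of this operation.
A pattern pair is a pair $\langle P, P_c \rangle$ of pattern sets, such that $P_c \subset P$ and for every $p \in P_c$ there exists exactly one pair $p^1, p^2 \in P$ satisfying $p = p^1 \odot p^2$ for some $\odot \in \{\bullet, \bullet_c\}$. We refer to the patterns $p \in P_c$ as the composite patterns of $\langle P, P_c \rangle$, and to the rest as its base patterns.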
We will often discuss patterns that have been composed into larger DFAs.

Definition 10 (Pattern Instances).
Let $A = \langle \Sigma, q_0^A, Q_A, F, \delta_A \rangle$ be a DFA, $p = \langle \Sigma, q_0, Q, q_X, \delta \rangle$ be a pattern, and $\hat{p} = \langle \Sigma, q_0', Q', q_X', \delta' \rangle$ be a pattern 'inside' A, i.e., $Q' \subseteq Q_A$ and $\delta' \subseteq \delta_A$. We say that $\hat{p}$ is an instance of p in A if $\hat{p}$ is isomorphic to p.
A pattern instance in a DFA A is uniquely determined by its structure and initial state: (p, q). If p is a composite pattern with respect to some pattern pair $\langle P, P_c \rangle$, the join state of its composition within A is also uniquely defined.
Definition 11. For every pattern pair $\langle P, P_c \rangle$, for each composite pattern $p \in P_c$, DFA A, and initial state q of an instance $\hat{p}$ of p in A, join(p, q, A) returns the join state of $\hat{p}$ with respect to its composition in $\langle P, P_c \rangle$.

Pattern Rule Sets
For any infinite sequence S = A 1 , A 2 , ... of DFAs satisfying L(A i ) ⊂ L(A i+1 ), for all i, we define the language of S as the union of the languages of all these DFAs: L(S) = ∪ i L(A i ). Such sequences may be used to express CFLs.
In this work we take a finite sequence A 1 , A 2 , ..., A n of DFAs, and assume it is a (possibly noisy) finite prefix of an infinite sequence of approximations for a language, as above. We attempt to reconstruct the language by guessing how the sequence may continue. To allow such generalization, we must make assumptions about how the sequence is generated. For this we introduce pattern rule sets.
Pattern rule sets (PRSs) create sequences of DFAs with a single accepting state. Each PRS is built around a pattern pair P, P c , and each rule application connects a new pattern instance to the current DFA A i , at the join state of a composite-pattern inserted into A i at some earlier point. To define where a pattern can be connected to A i , we introduce an enabled instance set I.

Definition 12.
An enabled DFA over a pattern pair $\langle P, P_c \rangle$ is a tuple $\langle A, I \rangle$ such that $A = \langle \Sigma, q_0, Q, F, \delta \rangle$ is a DFA and $I \subseteq P_c \times Q$ marks enabled instances of composite patterns in A.
Intuitively, for every enabled DFA $\langle A, I \rangle$ and $(p, q) \in I$, we know: (i) there is an instance of pattern p in A starting at state q, and (ii) this instance is enabled; i.e., we may connect new pattern instances to its join state join(p, q, A).

Definition 13.
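A PRS $\mathbf{P}$ is a tuple $\langle \Sigma, P, P_c, R \rangle$ where $\langle P, P_c \rangle$ is a pattern pair over the alphabet Σ and R is a set of rules. Each rule has one of the following forms, for some $p, p^1, p^2, p^3, p^I \in P$, with $p^1$ and $p^2$ non-circular:
(1) $\bot \to p^I$
(2) $p \to_c (p^1 \odot p^2) \mathbin{\bullet\!=} p^3$, where $p = p^1 \odot p^2$ for some $\odot \in \{\bullet, \bullet_c\}$, and $p^3$ is circular
(3) $p \to_s (p^1 \bullet p^2) \mathbin{\bullet\!=} p^3$, where $p = p^1 \bullet p^2$ and $p^3$ is non-circular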
A PRS derives sequences of enabled DFAs as follows: first, a rule of type (1) creates A 1 , I 1 according to p I . Then, for every A i , I i , each rule may connect a new pattern instance to A i , specifically at a state determined by I i .
Definition 14 (Initial Composition). $D_1 = \langle A_1, I_1 \rangle$ is generated from a rule $\bot \to p^I$ as follows:
Let $D_i = \langle A_i, I_i \rangle$ be the enabled DFA at step i and denote $A_i = \langle \Sigma, q_0, Q, F, \delta \rangle$. Note that for $A_1$, |F| = 1, and for all $A_{i+1}$, F is unchanged (by future definitions).
Rules of type (1) extend A i by grafting a circular pattern to q 0 , and then enabling that pattern if it is composite.

Definition 15 (Rules of type (1)). A rule ⊥ p I with circular p I may extend
Rules of type (2) graft a circular pattern $p^3 = \langle \Sigma, q_0^3, Q^3, q_X^3, \delta^3 \rangle$ onto the join state $q_j$ of an enabled pattern instance $\hat{p}$ in $A_i$, by merging $q_0^3$ with $q_j$. In doing so, they also enable the patterns composing $\hat{p}$, if they are composite.

Example applications of rule (2) are shown in Figures 3(i) and 3(ii).
We also wish to graft a non-circular pattern $p^3$ between $p^1$ and $p^2$, but this time we must avoid connecting the exit state $q_X^3$ to $q_j$ lest we loop over $p^3$ multiple times. We therefore replicate the outgoing transitions of $q_j$ in $p^1 \bullet p^2$ to the inserted state $q_X^3$ so that they may act as the connections back into the DFA.
Definition 17 (Rules of type (3)).
We call C the connecting transitions. An example of this rule application is depicted in Fig. 3(iii), in which a member of C is labeled 'c'.
Multiple applications of rules of type (3) to the same instance $\hat{p}$ will create several equivalent states in the resulting DFAs, as all of their exit states will have the same connecting transitions. These states are merged in a minimized representation, as depicted in Diagram (iv) of Figure 3.
We write A ∈ G(P) if there exists a sequence of enabled DFAs derived from P s.t. A = A i for some A i in this sequence.

Examples
Example 1: Let $p^1$ and $p^2$ be the patterns accepting 'a' and 'b' respectively. Consider the PRS $R_{ab}$ with rules $\bot \to p^1 \bullet p^2$ and $p^1 \bullet p^2 \to_s (p^1 \bullet p^2) \mathbin{\bullet\!=} (p^1 \bullet p^2)$. This PRS creates only one sequence of DFAs. Once the first rule creates the initial DFA, by continuously applying the second rule we obtain the infinite sequence of DFAs $A_1, A_2, \dots$, each satisfying $L(A_i) = \{a^j b^j : 1 \le j \le i\}$. Figure 2(i) presents $A_1$, while $A_2$ and $A_3$ appear in Figure 4(i). We can substitute any non-circular patterns for $p^1$ and $p^2$, creating the language $\{x^i y^i : i > 0\}$ for any non-circular pattern regular expressions x and y.
$R_{Dyck2}$ defines the Dyck language of order 2. Figure 4(ii) shows one of its possible DFA-sequences.
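To see how such a sequence grows, the sketch below builds the first few DFAs of the 'a', 'b' example by repeated grafting. It is our own illustration under stated assumptions, not the paper's construction verbatim: we assume each step grafts a fresh copy of the 'ab' pattern onto the current join state and copies the join state's previous outgoing transitions to the new exit state (the connecting transitions), so that the i-th DFA accepts the words with j copies of 'a' followed by j copies of 'b' for j between 1 and i. The function names are hypothetical.

```python
def anbn_sequence(n):
    """Yield (delta, accepting_state) for the first n DFAs of the 'a', 'b' example."""
    delta = {(0, "a"): 1, (1, "b"): 2}  # the first DFA accepts exactly "ab"
    join, next_state = 1, 3
    yield dict(delta), 2
    for _ in range(n - 1):
        entry, exit_state = next_state, next_state + 1
        next_state += 2
        # Connecting transitions: copy the join state's old outgoing edges to the new exit state.
        connecting = {(exit_state, sym): dst
                      for (src, sym), dst in delta.items() if src == join}
        delta[(join, "a")] = entry        # graft the inserted pattern's 'a' ...
        delta[(entry, "b")] = exit_state  # ... and its 'b'
        delta.update(connecting)
        join = entry  # the newly inserted instance is the one grafted onto next
        yield dict(delta), 2


def accepts(delta, final, word, start=0):
    state = start
    for symbol in word:
        state = delta.get((state, symbol))
        if state is None:
            return False
    return state == final


dfas = list(anbn_sequence(3))
```

Each DFA in `dfas` strictly extends the language of the previous one, which is the monotonicity that the sequence-based definition of L(S) relies on.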

PRS Inference Algorithm
A PRS can generate a sequence of DFAs defining, in the limit, a context-free language. We are now interested in inverting this process: given a sequence of DFAs generated by a PRS $\mathbf{P}$, can we reconstruct $\mathbf{P}$? Coupled with an L * extraction of DFAs from a trained RNN, solving this problem will enable us to extract a PRS from an RNN, provided the extraction follows a PRS (as we often find it does).
We present an algorithm for this problem, and show its correctness. In practice the DFAs we are given are not "perfect"; they contain noise that deviates from the PRS. We therefore augment this algorithm, allowing it to operate smoothly even on imperfect DFA sequences created from RNN extraction.
In the following, for each pattern instance $\hat{p}$ in $A_i$, we denote by p the pattern that it is an instance of. We use similar notation $\hat{p}^1$, $\hat{p}^2$, and $\hat{p}^I$ to refer to specific instances of patterns $p^1$, $p^2$ and $p^I$. Additionally, for each consecutive DFA pair $A_i$ and $A_{i+1}$, we refer by $\hat{p}^3$ to the new pattern instance in $A_{i+1}$.
Main steps of inference algorithm. Given a sequence of DFAs $S = A_1 \cdots A_n$, the algorithm infers $\mathbf{P} = \langle \Sigma, P, P_c, R \rangle$ in the following stages:
1. Discover the initial pattern instance $\hat{p}^I$ in $A_1$. Insert $p^I$ into P and mark $\hat{p}^I$ as enabled. Insert the rule $\bot \to p^I$ into R.
2. For i, 1 ≤ i ≤ n − 1:
(a) Discover the new pattern instance $\hat{p}^3$ in $A_{i+1}$ that extends $A_i$.
(b) If $\hat{p}^3$ starts at the state $q_0$ of $A_{i+1}$, then it is an application of a rule of type (1). Insert $p^3$ into P, mark $\hat{p}^3$ as enabled, and add $\bot \to p^3$ to R.
(c) Otherwise ($\hat{p}^3$ does not start at $q_0$), find the unique enabled pattern instance $\hat{p} = \hat{p}^1 \odot \hat{p}^2$ in $A_i$ s.t. $\hat{p}^3$'s initial state q is the join state of $\hat{p}$. Add $p^1$, $p^2$, and $p^3$ to P, p to $P_c$, and mark $\hat{p}^1$, $\hat{p}^2$, and $\hat{p}^3$ as enabled. If $\hat{p}^3$ is non-circular, add $p \to_s (p^1 \bullet p^2) \mathbin{\bullet\!=} p^3$ to R; otherwise add $p \to_c (p^1 \odot p^2) \mathbin{\bullet\!=} p^3$.
3. Define Σ to be the set of symbols used by the patterns P.
We now elaborate on how we determine the patterns $\hat{p}^I$, $\hat{p}^3$, and $\hat{p}$.
Discovering new patterns $\hat{p}^I$ and $\hat{p}^3$. $A_1$ provides an initial pattern $p^I$. For subsequent DFAs, we need to identify which states in $A_{i+1} = \langle \Sigma, q_0', Q', F', \delta' \rangle$ are 'new' relative to $A_i = \langle \Sigma, q_0, Q, F, \delta \rangle$. From the PRS definitions, we know that there is a subset of states and transitions in $A_{i+1}$ that is isomorphic to $A_i$:

Definition 19. (Existing states and transitions)
For every $q' \in Q'$, we say that $q'$ exists in $A_i$ with parallel state $q \in Q$ iff there exists a sequence $w \in \Sigma^*$ such that $q = \hat{\delta}(q_0, w)$, $q' = \hat{\delta}'(q_0', w)$, and neither is a sink reject state. Additionally, for every $q_1', q_2' \in Q'$ with parallel states $q_1, q_2 \in Q$, we say that $(q_1', \sigma, q_2') \in \delta'$ exists in $A_i$ if $(q_1, \sigma, q_2) \in \delta$. We denote $A_{i+1}$'s existing states and transitions by $Q_E \subseteq Q'$ and $\delta_E \subseteq \delta'$, and the new ones as $Q_N = Q' \setminus Q_E$ and $\delta_N = \delta' \setminus \delta_E$. By construction of PRSs, each state in $A_{i+1}$ has at most one parallel state in $A_i$, which can be found in one simultaneous traversal of the two DFAs.
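The simultaneous traversal can be sketched as a breadth-first search over pairs of states. The code below is an illustration under our own assumptions: both DFAs are given in incomplete presentation as dictionaries from (state, symbol) to state with no explicit sink reject states, and the function name `split_existing_and_new` is hypothetical.

```python
from collections import deque


def split_existing_and_new(delta_old, q0_old, delta_new, q0_new):
    """Traverse the older and the newer DFA in lockstep from their initial states.

    Returns (parallel, new_states, new_transitions), where parallel maps each
    existing state of the newer DFA to its parallel state in the older one.
    """
    parallel = {q0_new: q0_old}
    queue = deque([q0_new])
    while queue:
        q_new = queue.popleft()
        q_old = parallel[q_new]
        for (src, symbol), dst_new in delta_new.items():
            if src != q_new:
                continue
            dst_old = delta_old.get((q_old, symbol))
            if dst_old is not None and dst_new not in parallel:
                parallel[dst_new] = dst_old
                queue.append(dst_new)
    all_new_dfa_states = {q0_new} | {s for (s, _) in delta_new} | set(delta_new.values())
    new_states = all_new_dfa_states - set(parallel)
    new_transitions = {
        (src, symbol, dst) for (src, symbol), dst in delta_new.items()
        if src not in parallel or dst not in parallel
        or delta_old.get((parallel[src], symbol)) != parallel[dst]
    }
    return parallel, new_states, new_transitions


# Two consecutive DFAs accepting up to two and up to three nested 'a' ... 'b' pairs.
a2 = {(0, "a"): 1, (1, "b"): 2, (1, "a"): 3, (3, "b"): 4, (4, "b"): 2}
a3 = dict(a2)
a3.update({(3, "a"): 5, (5, "b"): 6, (6, "b"): 4})
parallel, new_states, new_transitions = split_existing_and_new(a2, 0, a3, 0)
```

In this example the only existing state with outgoing new transitions is state 3, which is therefore the initial state of the new pattern instance.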
The new states and transitions form a new pattern instance $\hat{p}$ in $A_{i+1}$, excluding its initial and possibly its exit state. The initial state of $\hat{p}$ is the existing state $q_s \in Q_E$ that has outgoing new transitions. The exit state $q_X$ of $\hat{p}$ is identified by the Exit State Discovery algorithm:
1. If there exists a $(q, \sigma, q_s) \in \delta_N$, then $\hat{p}$ is circular: $q_X = q_s$ (Fig. 3(i), (ii)).
2. Otherwise, $\hat{p}$ is non-circular. If it is the first (with respect to S) non-circular pattern grafted onto $q_s$, then $q_X$ is the unique new state whose transitions into $A_{i+1}$ are the connecting transitions from Definition 17 (Fig. 3(iii)).
3. If there is no such state, then $\hat{p}$ is not the first non-circular pattern grafted onto $q_s$, and $q_X$ is the unique existing state $q_X \neq q_s$ with new incoming transitions (Fig. 3(iv)).
Finally, the new pattern instance is $\hat{p} = \langle \Sigma, q_s, Q_p, q_X, \delta_p \rangle$, where $Q_p = Q_N \cup \{q_s, q_X\}$ and $\delta_p$ is the restriction of $\delta_N$ to the states of $Q_p$.
Discovering the pattern $\hat{p}$ (step 2c) In [27] we show that no two enabled pattern instances in a DFA can share a join state, that if they share any non-edge states, then one is contained in the other, and finally that a pattern's join state is never one of its edge states. This makes finding $\hat{p}$ straightforward: denoting $q_j$ as the parallel of $\hat{p}^3$'s initial state in $A_i$, we seek the enabled composite pattern instance $(p, q) \in I_i$ for which $\mathrm{join}(p, q, A_i) = q_j$. If none is present, we seek the only enabled instance $(p, q) \in I_i$ that contains $q_j$ as a non-edge state, but is not yet marked as a composite. (Note that if two enabled instances share a non-edge state, then the containing one is already marked as a composite: otherwise we would not have found and enabled the other.)
In [27] we define the concept of a minimal generator and prove the following: Theorem 1. Let A 1 , A 2 , ...A n be a finite sequence of DFAs that has a minimal generator P. Then the PRS Inference Algorithm will discover P.

Deviations from the PRS framework
Given a sequence of DFAs generated by the rules of PRS P, the inference algorithm given above will faithfully infer P. In practice however, we want to apply the algorithm to a sequence of DFAs extracted from a trained RNN using the L * algorithm (as in [26]). Such a sequence may contain noise: artifacts from an imperfectly trained RNN, or from the behavior of L * . The major deviations are incorrect pattern creation, simultaneous rule applications, and slow initiation.
Incorrect pattern creation Whether due to inaccuracies in the RNN classification, or as artifacts of the L * process, incorrect patterns are often inserted into the DFA sequence. Fortunately, these patterns rarely repeat, and so we can discern between them and 'legitimate' patterns using a voting and threshold scheme.
The vote for each discovered pattern $p \in P$ is the number of times it has been inserted as the new pattern between a pair of DFAs $A_i, A_{i+1}$ in S. We set a threshold for the minimum vote a pattern needs to be considered valid, and only build rules around the connection of valid patterns onto the join states of other valid patterns. To do this, we modify the flow of the algorithm: before discovering rules, we first filter invalid patterns by splitting step 2 into two phases.
Phase 1: Mark all the inserted patterns between each pair of DFAs, and compute their votes. Add to P those whose vote reaches the threshold.
Phase 2: Consider each DFA pair $A_i, A_{i+1}$ in order. If the new pattern in $A_{i+1}$ is valid, and its initial state's parallel state in $A_i$ also lies in a valid pattern, then synthesize the rule according to the original algorithm. If a pattern is discovered to be composite, add its composing patterns to P.
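Phase 1 amounts to counting and filtering. The sketch below assumes that each inserted pattern has already been reduced to a hashable canonical signature, so that isomorphic patterns compare equal; how such a signature is computed is not specified here, and the string signatures in the example are purely hypothetical.

```python
from collections import Counter


def valid_patterns(inserted_signatures, threshold=2):
    """Count how often each pattern was inserted and keep those reaching the threshold.

    inserted_signatures holds one canonical signature per new pattern instance
    found between consecutive DFAs of the sequence.
    """
    votes = Counter(inserted_signatures)
    valid = {pattern for pattern, vote in votes.items() if vote >= threshold}
    return votes, valid


# Hypothetical signatures: "ab" recurs, the other two appear once and look like noise.
votes, valid = valid_patterns(["ab", "ab", "cd", "ab", "x(y)"])
```

With a threshold of 2, only the recurring pattern survives, and rule synthesis in Phase 2 would then ignore insertions of the other two.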
As almost every DFA sequence produced by our method has some noise, the voting scheme greatly extended the reach of our algorithm.
Simultaneous rule applications In the theoretical framework, A i+1 differs from A i by applying a single PRS rule, and therefore q s and q X are uniquely defined. L * however does not guarantee such minimal increments between DFAs. In particular, it may apply multiple PRS rules between two subsequent DFAs, extending A i with several patterns. To handle this, we expand the initial and exit state discovery methods given above.
1. Mark the new states and transitions $Q_N$ and $\delta_N$ as before.
2. Identify the set of new pattern instance initial states (pattern heads): the set $H \subseteq Q' \setminus Q_N$ of states in $A_{i+1}$ with outgoing new transitions.
3. For each pattern head $q \in H$, compute the relevant sets $\delta_{N|q} \subseteq \delta_N$ and $Q_{N|q} \subseteq Q_N$ of new transitions and states: the members of $\delta_N$ and $Q_N$ that are reachable from q without passing through any existing transitions.
4. For each $q \in H$, restrict to $Q_{N|q}$ and $\delta_{N|q}$ and compute $q_X$ and p as before.
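Steps 2 and 3 can be sketched as a reachability computation restricted to new transitions. This is an illustration under our own assumptions: new transitions are given as (source, symbol, target) triples, the search stops whenever it reaches an existing state, and the function name `new_patterns_by_head` is hypothetical. The circularity flag mirrors the first case of the exit state discovery (a new transition leading back into the head).

```python
def new_patterns_by_head(existing_states, new_transitions):
    """Group new states and transitions by the pattern head they are reachable from.

    Returns {head: (new_states, new_transitions, is_circular)}.
    """
    heads = {src for (src, _, _) in new_transitions if src in existing_states}
    result = {}
    for head in heads:
        states, transitions = set(), set()
        frontier = [head]
        while frontier:
            current = frontier.pop()
            for (src, symbol, dst) in new_transitions:
                if src == current and (src, symbol, dst) not in transitions:
                    transitions.add((src, symbol, dst))
                    # Assumption: do not continue past existing states.
                    if dst not in existing_states and dst not in states:
                        states.add(dst)
                        frontier.append(dst)
        is_circular = any(dst == head for (_, _, dst) in transitions)
        result[head] = (states, transitions, is_circular)
    return result


existing = {0, 1, 2, 3, 4}
new_t = {(3, "a", 5), (5, "b", 6), (6, "b", 4), (1, "c", 7), (7, "d", 1)}
found = new_patterns_by_head(existing, new_t)
```

In this example two independent patterns are added at once: a non-circular one headed at state 3 and a circular one headed at state 1.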
If A i+1 's new patterns have no overlap and do not create an ambiguity around join states, then they may be handled independently and in arbitrary order. They are used to discover rules and then enabled, as in the original algorithm.
Simultaneous but dependent rule applications, such as inserting a pattern and then grafting another onto its join state, are more difficult to handle, as it is not always possible to determine which pattern was grafted onto which. However, there is a special case which appeared in several of our experiments (examples L13 and L14 of Section 7) for which we developed a technique as follows.
Suppose we discover a rule $r_1 : p^0 \to_s (p^l \bullet p^r) \mathbin{\bullet\!=} p$ and p contains a cycle c around some internal state $q_j$. If later another rule inserts a pattern $p^n$ at the state $q_j$, we understand that p is in fact a composite pattern, with $p = p^1 \bullet p^2$ and join state $q_j$. However, as patterns do not contain cycles at their edge states, c cannot be a part of either $p^1$ or $p^2$. We conclude that the addition of p was in fact a simultaneous application of two rules: one inserting $p'$, where $p'$ is p without the cycle c, and one grafting c onto the join state $q_j$ of $p'$. We update our PRS and our DFAs' enabled pattern instances accordingly. The case when p is circular and $r_1$ is of rule type (2) is handled similarly.
Slow initiation Ideally, $A_1$ directly supplies an initial rule $\bot \to p^I$ to our PRS. In practice, the first few DFAs generated by L * have almost random structure. We solve this by leaving discovery of the initial rules to the end of the algorithm, at which point we have a set of 'valid' patterns that we are sure are part of the PRS. From there we examine the last DFA $A_n$ generated in the sequence, note all the enabled instances $(p^I, q_0)$ at its initial state, and generate a rule $\bot \to p^I$ for each of them. This technique has the weakness that it will not recognise patterns $p^I$ that do not also appear as extending patterns $p^3$ elsewhere in the sequence, unless the threshold for patterns is minimal.

Converting a PRS to a CFG
We present an algorithm to convert a given PRS to a context free grammar (CFG), making the rules extracted by our algorithm more accessible.
A restriction: Let $\mathbf{P} = \langle \Sigma, P, P_c, R \rangle$ be a PRS. For simplicity, we restrict the PRS so that every pattern p can appear on the LHS of rules of type (2) or on the LHS of rules of type (3), but not on the LHS of both types of rules. Similarly, we assume that the RHS patterns $p^I$ of the rules $\bot \to p^I$ are either all circular or all non-circular. This restriction is natural: all of the examples in Sections 4.1 and 7.3 conform to it. Still, in [27] we show how to remove this restriction.
We create a CFG $G = \langle \Sigma, N, S, Prod \rangle$. Σ is the same alphabet as that of $\mathbf{P}$ and we take S as a special start symbol. For every pattern $p \in P$, let $G_p = \langle \Sigma_p, N_p, Z_p, Prod_p \rangle$ be a CFG describing L(p). Let $P_Y \subseteq P_c$ be those composite patterns that appear on the LHS of a rule of type (2). Create the non-terminal $S_C$ and, for each $p \in P_Y$, an additional non-terminal $C_p$. We set $N = \{S, S_C\} \cup \bigcup_{p \in P} N_p \cup \{C_p \mid p \in P_Y\}$.
The productions are created as follows:
- Let $\bot \to p^I$ be a rule in $\mathbf{P}$. If $p^I$ is non-circular, create a production $S ::= Z_{p^I}$. If $p^I$ is circular, create the productions $S ::= S_C$, $S_C ::= S_C S_C$ and $S_C ::= Z_{p^I}$.
- For each rule $p \to_s (p^1 \bullet p^2) \mathbin{\bullet\!=} p^3$ create a production $Z_p ::= Z_{p^1} Z_{p^3} Z_{p^2}$.
- For each rule $p \to_c (p^1 \bullet p^2) \mathbin{\bullet\!=} p^3$ create productions $Z_p ::= Z_{p^1} C_p Z_{p^2}$, $C_p ::= C_p C_p$, and $C_p ::= Z_{p^3}$.
Let $Prod'$ be all the productions defined by the above process. We set $Prod = \left( \bigcup_{p \in P} Prod_p \right) \cup Prod'$.
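The production-building step can be sketched as follows. The encoding is our own and purely illustrative: rules are tuples tagged "init", "s" or "c", patterns are referred to by hypothetical names such as "AB", non-terminals are strings of the form Z_name and C_name, and the grammars for the individual patterns are assumed to be supplied by the caller. The small enumerator is only there to inspect the result on a toy PRS and assumes a grammar without empty productions.

```python
def prs_to_cfg(rules, circular, base_productions):
    """Build the list of productions (lhs, rhs tuple) from PRS rules.

    rules: ("init", pI), ("s", p, p1, p2, p3) or ("c", p, p1, p2, p3).
    circular: set of names of circular patterns.
    base_productions: productions of the per-pattern grammars, supplied by the caller.
    """
    prods = [(lhs, tuple(rhs)) for lhs, rhs in base_productions]

    def add(lhs, *rhs):
        if (lhs, rhs) not in prods:
            prods.append((lhs, rhs))

    for rule in rules:
        if rule[0] == "init":
            p_init = rule[1]
            if p_init in circular:
                add("S", "S_C")
                add("S_C", "S_C", "S_C")
                add("S_C", f"Z_{p_init}")
            else:
                add("S", f"Z_{p_init}")
        elif rule[0] == "s":
            _, p, p1, p2, p3 = rule
            add(f"Z_{p}", f"Z_{p1}", f"Z_{p3}", f"Z_{p2}")
        elif rule[0] == "c":
            _, p, p1, p2, p3 = rule
            add(f"Z_{p}", f"Z_{p1}", f"C_{p}", f"Z_{p2}")
            add(f"C_{p}", f"C_{p}", f"C_{p}")
            add(f"C_{p}", f"Z_{p3}")
    return prods


def language_up_to(prods, start, max_len):
    """Enumerate all words of length at most max_len (grammar without empty productions)."""
    nonterminals = {lhs for lhs, _ in prods}
    seen, frontier, words = {(start,)}, [(start,)], set()
    while frontier:
        form = frontier.pop()
        idx = next((i for i, s in enumerate(form) if s in nonterminals), None)
        if idx is None:
            words.add("".join(form))
            continue
        for lhs, rhs in prods:
            if lhs == form[idx]:
                new = form[:idx] + rhs + form[idx + 1:]
                if max_len >= len(new) and new not in seen:
                    seen.add(new)
                    frontier.append(new)
    return words


# Toy PRS for the 'a', 'b' example: "AB" is the serial composite of "A" and "B".
rules = [("init", "AB"), ("s", "AB", "A", "B", "AB")]
base = [("Z_A", ["a"]), ("Z_B", ["b"]), ("Z_AB", ["Z_A", "Z_B"])]
cfg = prs_to_cfg(rules, circular=set(), base_productions=base)
words = language_up_to(cfg, "S", 6)
```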

Theorem 2. Let G and P be as above. Then L(P) = L(G).
The proof is given in the extended version of this paper [27].
Expressibility Every RE-Dyck language (Section 2.2) can be expressed by a PRS, but the converse is not true; RE-Dyck languages nest delimiters arbitrarily, while PRS grammars may not. For instance, language L12 of Section 7.3 is not a Dyck language. Meanwhile, not every CFL can be expressed by a PRS [27].
Succinctness The construction above does not necessarily yield a minimal CFG G. For a PRS defining the Dyck language of order 2 (which can be expressed by a CFG with 4 productions and 1 non-terminal), our construction yields a CFG with 10 non-terminals and 12 productions. In this case, and often in others, we can recognise and remove the spurious productions from the generated grammar.

Methodology
We test the algorithm on several PRS-expressible context free languages, attempting to extract them from trained RNNs using the process outlined in Figure 1. For each language, we create a probabilistic CFG generating it, train an RNN on samples from this grammar, extract a sequence of DFAs from the RNN, and apply our PRS inference algorithm. Finally, we convert the extracted PRS back to a CFG, and compare it to our target CFG.
In all of our experiments, we use a vote-threshold s.t. patterns with fewer than 2 votes are not used to form any PRS rules (Section 5.1). Using no threshold significantly degraded the results by including too much noise, while higher thresholds often caused us to overlook correct patterns and rules.

Generating a sequence of DFAs
We obtain a sequence of DFAs for a given CFG using only positive samples [11,1] by training a language-model RNN (LM-RNN) on these samples and then extracting DFAs from it with the aid of the L * algorithm [2], as described in [26]. To apply L * we must treat the LM-RNN as a binary classifier. We set an 'acceptance threshold' t and define the RNN's language as the set of sequences s satisfying: 1. the RNN's probability for an end-of-sequence token after s is greater than t, and 2. at no point during s does the RNN pass through a token with probability < t. This is identical to the concept of locally t-truncated support defined in [13].
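The acceptance criterion can be written as a short predicate over a language model. In the sketch below the model interface is a stand-in of our own: `next_token_probs(prefix)` is a hypothetical callable returning a dictionary of next-token probabilities, the end-of-sequence token is named "EOS" for illustration, and `toy_lm` is a hand-written table, not a trained RNN.

```python
def lm_accepts(next_token_probs, sequence, t, eos="EOS"):
    """Binary classification of a sequence under acceptance threshold t.

    Accept iff every token of the sequence has probability at least t given its
    prefix, and the end-of-sequence token has probability greater than t afterwards.
    """
    prefix = []
    for token in sequence:
        if t > next_token_probs(tuple(prefix)).get(token, 0.0):
            return False
        prefix.append(token)
    return next_token_probs(tuple(prefix)).get(eos, 0.0) > t


def toy_lm(prefix):
    """Hand-written stand-in for an LM-RNN that favours 'ab' and 'aabb'."""
    table = {
        (): {"a": 1.0},
        ("a",): {"a": 0.4, "b": 0.6},
        ("a", "b"): {"EOS": 1.0},
        ("a", "a"): {"b": 0.95, "a": 0.05},
        ("a", "a", "b"): {"b": 1.0},
        ("a", "a", "b", "b"): {"EOS": 1.0},
    }
    return table.get(prefix, {})
```

Raising the threshold shrinks the accepted language: with t = 0.1 the toy model accepts both "ab" and "aabb", while with t = 0.5 the second 'a' of "aabb" falls below the threshold and the word is rejected.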
To create the samples for the RNNs, we write a weighted version of the CFG, in which each non-terminal is given a probability over its rules. We then take N samples from the weighted CFG according to its distribution, split them into train and validation sets, and train an RNN on the train set until the validation loss stops improving. In our experiments, we used N = 10,000. For our languages, we used very small 2-layer LSTMs: hidden dimension 10 and input dimension 4.
In some cases, especially when all of the patterns in the rules are several tokens long, the extraction of [26] terminates too soon: neither L * nor the RNN abstraction consider long sequences, and equivalence is reached between the L * hypothesis and the RNN abstraction despite neither being equivalent to the 'true' language of the RNN. In these cases we push the extraction a little further using two methods: first, if the RNN abstraction contains only a single state, we make an arbitrary initial refinement by splitting 10 hidden dimensions, and restart the extraction. If this is also not enough, we sample the RNN according to its distribution, in the hope of finding a counterexample to return to L * . The latter approach is not ideal: sampling the RNN may return very long sequences, effectively increasing the next DFA by many rule applications. We place a time limit of 1,000 seconds (∼ 17 minutes) on the extraction.
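Sampling from a weighted CFG can be sketched with the standard library alone. The grammar encoding, the function names, the fixed seed and the 10% validation fraction below are all assumptions made for illustration; the text does not specify how the samples were split.

```python
import random


def sample_sentence(pcfg, start, rng):
    """Expand the start symbol leftmost-first, choosing each rule by its weight.

    pcfg maps a non-terminal to a list of (probability, right-hand side) pairs;
    any symbol that is not a key of pcfg is treated as a terminal.
    """
    output, stack = [], [start]
    while stack:
        symbol = stack.pop()
        if symbol in pcfg:
            weights = [w for w, _ in pcfg[symbol]]
            options = [rhs for _, rhs in pcfg[symbol]]
            rhs = rng.choices(options, weights=weights)[0]
            stack.extend(reversed(rhs))
        else:
            output.append(symbol)
    return "".join(output)


def make_dataset(pcfg, start, n, seed=0, validation_fraction=0.1):
    """Draw n samples and split them into train and validation lists."""
    rng = random.Random(seed)
    samples = [sample_sentence(pcfg, start, rng) for _ in range(n)]
    split = int(n * (1 - validation_fraction))
    return samples[:split], samples[split:]


# Weighted grammar for the 'a', 'b' example: stop with probability 0.6, nest with 0.4.
pcfg = {"S": [(0.6, ["a", "b"]), (0.4, ["a", "S", "b"])]}
train, validation = make_dataset(pcfg, "S", 200)
```

The rule weights control the depth distribution of the samples: a lower stopping probability yields deeper nestings in the training set.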

Languages
We experiment on 15 PRS-expressible languages $L_1$-$L_{15}$, grouped into 3 classes: 1. Languages of the form $X^n Y^n$, for various regular expressions X and Y. In particular, the languages $L_1$ through $L_6$ are $X_i^n Y_i^n$ for: $(X_1,Y_1)$=(a,b), $(X_2,Y_2)$=(a|b,c|d), $(X_3,Y_3)$=(ab|cd,ef|gh), $(X_4,Y_4)$=(ab,cd), $(X_5,Y_5)$=(abc,def), and $(X_6,Y_6)$=(ab|c,de|f).
Table 1 shows the results. The 2nd column shows the number of DFAs extracted from the RNN. The 3rd and 4th columns present the number of patterns found by the algorithm before and after applying vote-thresholding to remove noise. The 5th column gives the minimum and maximum votes received by the final patterns (we count only patterns introduced as a new pattern $p^3$ in some $A_{i+1}$). The 6th column notes whether the algorithm found a correct CFG, according to our manual inspection. For languages where our algorithm only missed or included 1 or 2 valid/invalid productions, we label it as partially correct.

Results
Alternating Patterns Our algorithm struggled on the languages $L_3$, $L_6$, and $L_{11}$, which contained patterns whose regular expressions had alternations (such as ab|cd in $L_3$, and ab|c in $L_6$ and $L_{11}$). Investigating their DFA sequences uncovered that the L * extraction had 'split' the alternating expressions, adding their parts to the DFAs over multiple iterations. For example, in the sequence generated for $L_3$, ef appeared in $A_7$ without gh alongside it. The next DFA corrected this mistake but the inference algorithm could not piece together these two separate steps into a single rule. It will be valuable to expand the algorithm to these cases.
Simultaneous Applications Originally our algorithm failed to accurately generate $L_{13}$ and $L_{14}$ due to simultaneous rule applications. Using the technique described in Section 5.1, we were able to correctly infer these grammars. However, more work is needed to handle simultaneous rule applications in general.
Additionally, sometimes a very large counterexample was returned to L * , creating a large increase in the DFAs: the 9th iteration of the extraction on $L_3$ introduced almost 30 new states. The algorithm does not manage to infer anything meaningful from these nested, simultaneous applications.
Missing Rules For the Dyck languages $L_7$-$L_9$, the inference algorithm was mostly successful. However, due to the large number of possible delimiter combinations, some patterns and nesting relations did not appear often enough in the DFA sequences. As a result, for $L_8$, some productions were missing in the generated grammar. $L_8$ also created one incorrect production due to noise in the sequence (one erroneous pattern was generated twice, passing the threshold).

RNN Noise
In $L_{15}$, the extracted DFAs for some reason always required that a single character d be included between every pair of delimiters. Our inference algorithm of course maintained this peculiarity. It correctly allowed the optional embedding of "abc" strings. But due to noisy (incorrect) generated DFAs, the patterns generated did not maintain balanced parentheses.

Related work
Training RNNs to recognize Dyck Grammars. Recently there has been a surge of interest in whether RNNs can learn Dyck languages [5,19,21,28]. While these works report very good results on learning the language for sentences of similar distance and depth as the training set, with the exception of [21], they report significantly lower accuracy for out-of-sample sentences.
Among these, Sennhauser and Berwick [19] use LSTMs, and show that in order to keep the error rate within a 5 percent tolerance, the number of hidden units must grow exponentially with the distance or depth of the sequences (though Hewitt et al. [13] find much lower theoretical bounds). They conclude that LSTMs do not learn rules, but rather statistical approximations. Bernardy [5] experimented with various RNN architectures, finding in particular that the LSTM has more difficulty in predicting closing delimiters in the middle of a sentence than at the end. Based on this, he conjectures that the RNN is using a counting mechanism, but has not truly learnt the Dyck language (its CFG). For the simplified task of predicting only the final closing delimiter of a legal sequence, Skachkova, Trost and Klakow [21] find that LSTMs have nearly perfect accuracy across words with large distances and embedded depth.
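For concreteness, the two difficulty measures discussed above can be computed for a given Dyck word as in the following sketch. The exact definitions vary between the cited works; here we assume, for illustration only, that depth is the maximum number of simultaneously open delimiters and that distance is the index difference between an opening delimiter and its matching closing delimiter. The names `PAIRS` and `depth_and_distance` are hypothetical.

```python
# Assumed delimiter pairs for a Dyck language of order 2.
PAIRS = {"(": ")", "[": "]"}


def depth_and_distance(word):
    """Return (max nesting depth, max opener-to-closer distance) of `word`.

    Raises ValueError if `word` is not a balanced Dyck word over PAIRS.
    Characters that are not delimiters are skipped.
    """
    closers = set(PAIRS.values())
    stack = []  # entries are (opening delimiter, index)
    max_depth = 0
    max_dist = 0
    for i, ch in enumerate(word):
        if ch in PAIRS:
            stack.append((ch, i))
            max_depth = max(max_depth, len(stack))
        elif ch in closers:
            if not stack or PAIRS[stack[-1][0]] != ch:
                raise ValueError(f"unmatched {ch!r} at index {i}")
            _, j = stack.pop()
            max_dist = max(max_dist, i - j)
    if stack:
        raise ValueError("unclosed opening delimiter")
    return max_depth, max_dist


print(depth_and_distance("([])()"))
```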
Yu, Vu and Kuhn [28] compare the three works above, and note that the task of predicting only the closing bracket of a balanced Dyck word is not sufficient for checking if an RNN has learnt the language, as it can be computed by only a counter. In their experiments, they present a prefix of a Dyck word and train the RNN to predict the next valid closing bracket. They experiment with an LSTM using 4 different models, and show that the generator-attention model [17] performs the best, and is able to generalize quite well at the tagging task. However, they find that it degrades rapidly with out-of-domain tests. They also conclude that RNNs do not really learn the Dyck language. These experimental results are reinforced by the theoretical work in [13], whose authors remark that no finite precision RNN can learn a Dyck language of unbounded depth, and give precise bounds on the memory required to learn a Dyck language of bounded depth.
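The distinction between a counting mechanism and genuine recognition of a Dyck language with several delimiter types can be made concrete with a small sketch. It is only an illustration, not a model of what any of the cited RNNs compute: `counter_accepts` tracks nesting depth alone, while `stack_accepts` also remembers which delimiter must close next. Both function names and the `PAIRS` table are assumptions.

```python
# Assumed delimiter pairs for a Dyck language of order 2.
PAIRS = {"(": ")", "[": "]"}
CLOSERS = set(PAIRS.values())


def counter_accepts(word):
    """Accept if the depth never goes negative and ends at zero.

    Delimiter types are ignored, so this is a counter, not a Dyck recognizer.
    """
    depth = 0
    for ch in word:
        if ch in PAIRS:
            depth += 1
        elif ch in CLOSERS:
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def stack_accepts(word):
    """Accept exactly the balanced words in which delimiter types match."""
    stack = []  # expected closing delimiters, innermost last
    for ch in word:
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch in CLOSERS:
            if not stack or stack.pop() != ch:
                return False
    return not stack


for w in ["([])", "([)]", "(()"]:
    print(w, counter_accepts(w), stack_accepts(w))
```

The two recognizers agree on `([])` and `(()`, but only the counter accepts the mismatched word `([)]`, so a test that a counter can pass does not by itself show that the language has been learnt.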
Despite these findings, our algorithm nevertheless extracts a CFG from a trained RNN, discovering rules based on DFAs synthesized from the RNN using the algorithm in [26]. Because we can use a short sequence of DFAs to extract the rules, and because the first DFAs in the sequence describe Dyck words with increasing but limited distance and depth, we are often able to extract the CFG perfectly even when the RNN does not generalize well. Moreover, we show that our approach works with more complex types of delimiters, and on Dyck languages with expressions between delimiters.
Extracting DFAs from RNNs. There have been many approaches to extracting higher-level representations from a neural network (NN), both to facilitate comprehension and to verify correctness. One of the oldest approaches is to extract rules from an NN [24,12]. In particular, several works attempt to extract FSAs from RNNs [18,15,25]. We base our work on [26]. Its ability to generate sequences of DFAs providing increasingly better approximations of the CFL is critical to our method.
There has been less research on extracting a CFG from an RNN. One exception is [23], whose authors develop a Neural Network Pushdown Automata (NNPDA) framework, a hybrid system augmenting an RNN with external stack memory. They show how to extract a push-down automaton from an NNPDA. However, their technique relies on the PDA-like structure of the inspected architecture. In contrast, we extract CFGs from RNNs without stack augmentation.
Learning CFGs from samples. There is a wide body of work on learning CFGs from samples. An overview is given in [10] and a survey of work for grammatical inference applied to software engineering tasks can be found in [22].
Clark et al. study algorithms for learning CFLs given only positive examples [11]. In [7], Clark and Eyraud show how one can learn a subclass of CFLs called CF substitutable languages. There are many languages that can be expressed by a PRS but are not substitutable, such as x^n b^n. However, there are also substitutable languages that cannot be expressed by a PRS (wxw^R; see [27]). In [8], Clark, Eyraud and Habrard present Contextual Binary Feature Grammars. However, this class does not include Dyck languages of arbitrary order. None of these techniques deals with noise in the data, which is essential to learning a language from an RNN.

Future Directions
Currently, for each experiment, we train the RNN on that language and then apply the PRS inference algorithm on a single DFA sequence generated from that RNN. Perhaps the most substantial improvement we can make is to extend our technique to learn from multiple DFA sequences. We can train multiple RNNs and generate DFA sequences for each one. We can then run the PRS inference algorithm on each of these sequences, and generate a CFG based upon rules that are found in a significant number of the runs. This would require care to guarantee that the final rules form a cohesive CFG. It would also address the issue that not all rules are expressed in a single DFA sequence, and that some grammars may have rules that are executed only once per word of the language.
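One possible form of the proposed aggregation step is sketched below, under several assumptions: each run of the inference algorithm is represented as a set of production strings, a rule is kept when it appears in at least a fraction `min_fraction` of the runs, and the names `vote_on_rules` and `min_fraction` are hypothetical. The sketch deliberately does not check that the surviving rules form a cohesive CFG, which, as noted above, would require additional care.

```python
from collections import Counter


def vote_on_rules(runs, min_fraction=0.5):
    """Return the rules that appear in at least `min_fraction` of the runs.

    `runs` is a list of collections of rules (here, production strings),
    one collection per inference run. Each run votes at most once per rule.
    """
    counts = Counter(rule for run in runs for rule in set(run))
    needed = min_fraction * len(runs)
    return sorted(rule for rule, c in counts.items() if c >= needed)


# Toy data: three hypothetical runs over a Dyck language of order 2.
runs = [
    {"S -> ( S )", "S -> [ S ]"},
    {"S -> ( S )", "S -> ( S ]"},                 # contains a noisy rule
    {"S -> ( S )", "S -> [ S ]", "S -> S S"},     # contains a rarely seen rule
]

print(vote_on_rules(runs, min_fraction=0.5))
```

In this toy example the noisy rule is voted out, but so is the rule `S -> S S`, which only one run discovered; the choice of `min_fraction` therefore trades robustness to noise against coverage of rarely expressed rules.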
Our work generates CFGs for generalized Dyck languages, but it is possible to generalize PRSs to express a greater range of languages. Work will then be needed to extend the PRS inference algorithm.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.