Analysis of recurrent neural networks via property-directed verification of surrogate models

This paper presents a property-directed approach to verifying recurrent neural networks (RNNs). To this end, we learn a deterministic finite automaton as a surrogate model from a given RNN using active automata learning. This model may then be analyzed using model checking as a verification technique. The term property-directed reflects the idea that our procedure is guided and controlled by the given property rather than performing the two steps separately. We show that this not only allows us to discover small counterexamples fast, but also to generalize them by pumping toward faulty flows hinting at the underlying error in the RNN. We also show that our method can be efficiently used for adversarial robustness certification of RNNs.


Introduction
Rather than programming manually, it seems charming to simply provide examples of the intended input-output behavior of a given function and derive the implementation of the function using algorithmic means. That is the promise of machine learning, in which often some form of classification problem is addressed by adjusting the parameters of some (deep) neural network until it fits the sample set appropriately.
While machine learning has been shown to provide reasonable solutions in many cases, it may be expected that this approach also comes with a number of deficiencies. Starting with the question of whether the examples are characteristic, it is unclear to which extent the learning algorithm considers the right aspects of the examples, whether the resulting system really realizes or closely approximates the right function, and whether it meets privacy standards. As such, sophisticated verification techniques for the learned artifacts seem extremely important.
In verification, the goal is to show that an implementation meets its specification. A huge number of verification algorithms have been developed over the past 50 years, mostly for program verification, as so-called formal methods. However, it has been noted [31] that formal specifications are often not available when machine learning is used. In fact, the given set of examples, the training set, can be considered as (an approximation of) the specification. That said, many verification procedures can be considered as analysis algorithms parameterized by a formal specification. For example, while originally model checking [6] answers the question whether a system S satisfies its specification φ, one can consider the specification φ as a query (of some query language) and the model checking procedure applied on S as a generic analysis routine.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
As such, it seems promising to apply the enormous contributions in program verification also to the analysis of neural networks. To do so, two general approaches seem possible. First, one could try to adapt the procedures developed in formal methods to analyze the artifacts encountered in machine learning. Second, one may translate the artifacts found in machine learning, e.g., the neural network, into formal models well studied in program verification. In this paper, which is an extended version of [28], we follow the latter approach. More precisely, we consider recurrent neural networks as the object of study and model checking as the verification technique.
Recurrent neural networks (RNNs) are a state-of-the-art tool to represent and learn sequence-based models. They have applications in time-series prediction, sentiment analysis, and many more. In particular, they are increasingly used in safety-critical applications and act, for example, as controllers in cyber-physical systems [3]. Thus, there is a growing need for formal verification. However, research in this domain is only at the beginning. While model checking has been successfully used in practice and reached a certain level of industrial acceptance [25], a transfer to machine-learning algorithms has yet to take place. We will apply it on machine-learning artifacts rather than on the algorithm.
An emerging research stream aims at extracting state-based surrogate models from RNNs, such as finite automata [5,34,36,39,40,47], and, in general, we follow this approach in this paper as well. Finite automata turned out to be useful for understanding and analyzing all kinds of systems using testing or model checking. In other words, such models are also beneficial as an explanation of the underlying RNN.
A popular approach for extracting an automaton model from a given RNN is active automata learning, based on Angluin's pioneering L* algorithm [4]. The general idea is to ask so-called membership queries to the underlying system (here the RNN) and equivalence queries asking whether the learned system is the right or a good enough approximation of the system to learn. Angluin's L* has been improved in several ways, especially regarding when to ask queries and how to process and store the information obtained by the queries, starting from [42] and [26], and resulting in [23], in which especially the space consumption is optimized. For further developments in automata learning using L*, we refer the reader to the work by Vaandrager [45], and for hints on choosing a learning algorithm for maximal efficiency, we refer to [1]. While our approach does not exploit all discussed optimizations to L*, it is rather easy to incorporate them to improve performance.
The challenging step in L* is the check whether the learned automaton is a good enough approximation of the RNN. A common technique follows statistical testing techniques and answers this question by comparing the two artifacts based on a random set of words. The work by Mayr and Yovine [36] uses probably approximately correct (PAC) learning [46]. In this paper, we provide an approach based on Hoeffding's inequality bound [20], which is also used in statistical model checking [30]. For sampling, we use several approaches, one being a mixture of A* and plain sampling as described in [7].
In the field of formal verification, it has proven to be beneficial to run the extraction and verification process simultaneously. Moreover, the state space of RNNs tends to be prohibitively large, or even infinite, and so do incremental abstractions thereof. Motivated by these facts, we propose an intertwined approach to verifying RNNs, where, in an incremental fashion, grammatical inference and model checking go hand-in-hand. Our approach is inspired by black-box checking [41], which exploits the property to be verified during the verification process. Our procedure can be used to find misclassified examples or to verify a system that the given RNN controls, and we call the approach property-directed verification.
Property-directed verification. Let us give a glimpse of our method. We consider an RNN R as a binary classifier of finite sequences over a finite alphabet Σ. In other words, R represents the set of strings that are classified as positive. We denote this set by L(R) and call it the language of R. Note that L(R) ⊆ Σ*. We would like to know whether R is compatible with a given specification A, written R ⊨ A. Here, we assume that A is given as a (deterministic) finite automaton. Finite automata are algorithmically feasible while still having reasonable expressive power: many abstract specification languages such as temporal logics or regular expressions can be compiled into finite automata [18].
But what does R ⊨ A actually mean? In fact, there are various options. If A provides a complete characterization of the sequences that are to be classified as positive, then ⊨ refers to language equivalence, i.e., L(R) = L(A). Note that this would imply that L(R) is supposed to be a regular language, which may rarely be the case in practice. Therefore, we will focus on checking inclusion L(R) ⊆ L(A), which is more versatile as we explain next.
Suppose N is a finite automaton representing a negative specification, i.e., R must classify words in L(N) as negative at any cost. In other words, R must not produce false positives. This amounts to checking that $L(R) \subseteq L(\overline{N})$, where $\overline{N}$ is the "complement automaton" of N. For instance, assume that R is supposed to recognize valid XML documents over a finite predefined set of tags. Seen as a set of strings, this is not a regular language. However, we can still check whether L(R) only contains words where every opening tag <tag-name> is eventually followed by a closing tag </tag-name> (while the number of opening and the number of closing tags may differ). As negative specification, we can then take an automaton N accepting the complement of this regular set of strings, i.e., the words in which some opening tag is never followed by its closing tag. For example, <book><author></author><author></book> ∈ L(N), since the second occurrence of <author> is not followed by some </author> anymore. On the other hand, we have <book><author><author></author></book> ∉ L(N) because <book> and <author> are always eventually followed by their closing counterpart.
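The tag condition of the XML example can be made concrete with a small membership test for L(N). The following is only an illustrative sketch: the function name `in_negative_spec` and the regular expression used to tokenize tags are assumptions of this example, not part of the authors' implementation.

```python
import re

def in_negative_spec(document: str) -> bool:
    """Return True iff some opening tag is never followed by its closing tag.

    This mirrors the negative specification N of the XML example: a word is
    in L(N) when the "eventually closed" condition is violated.
    """
    # Hypothetical tokenization: tags look like <name> or </name>.
    tags = re.findall(r"</?[A-Za-z][\w-]*>", document)
    for i, tag in enumerate(tags):
        if tag.startswith("</"):
            continue  # closing tags impose no obligation
        closing = "</" + tag[1:]
        if closing not in tags[i + 1:]:
            return True  # this opening tag is never closed later on
    return False

print(in_negative_spec("<book><author></author><author></book>"))
print(in_negative_spec("<book><author><author></author></book>"))
```

The check only looks at the order of tags and ignores nesting and counting, which is what keeps the condition regular.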
Symmetrically, suppose P is a finite automaton representing a positive specification so that we can find false negative classifications: If P represents the words that R must classify as positive, we would like to know whether L(P) ⊆ L(R). Our procedure can be run using the complement of P as specification and inverting the outputs of R, i.e., we check, equivalently, $L(\overline{R}) \subseteq L(\overline{P})$, where $\overline{R}$ denotes R with inverted outputs and $\overline{P}$ the complement automaton of P.
An important instance of this setting is adversarial robustness certification, which measures a neural network's resilience against adversarial examples. Given a (regular) set of words L classified as positive by the given RNN, the RNN is robust wrt. L if slight modifications in a word from L do not alter the RNN's judgment. This notion actually relies on a distance function. Then, P is the set of words whose distance to a word in L is bounded by a predefined threshold, which is regular for several popular distances such as the Hamming or Levenshtein distance. Similarly, we can also check whether the neighborhood of a regular set of words preserves a negative classification.
In all these cases, we are faced with the question of whether the language of an RNN R is contained in the (regular) language of a finite automaton A. Our approach to this problem relies on black-box checking [41], which has been designed as a combination of model checking and testing in order to verify finite-state systems and is based on Angluin's L* learning algorithm [4]. L* produces a sequence of hypothesis automata based on queries to R. Every such hypothesis H may already share some structural properties with R. So, instead of checking conformance of H with R, it is worthwhile to first check L(H) ⊆ L(A) using classical model-checking algorithms. If the answer is affirmative, we apply statistical model checking to check L(R) ⊆ L(H) to confirm the result. Otherwise, a counterexample is exploited to refine H, starting a new cycle in L*. Just like in black-box checking, our experimental results suggest that the process of interweaving automata learning and model checking is beneficial in the verification of RNNs and offers advantages over more obvious approaches such as (pure) statistical model checking or running automata extraction and model checking in sequence. A further key advantage of our approach is that, unlike in statistical model checking, we often find a family of counterexamples, in terms of loops in the hypothesis automaton, which testify to conceptual problems of the given RNN.
Note that, though we only cover the case of binary classifiers, our framework is in principle applicable to multiple labels using one-vs-all classification.
Related Work. Mayr and Yovine describe an adaptation of the PAC variant of Angluin's L* algorithm that can be applied to neural networks [36]. As L* is not guaranteed to terminate when facing non-regular languages, the authors impose a bound on the number of states of the hypotheses and on the length of the words for membership queries. In [34,37], Mayr et al. propose on-the-fly property checking where one learns an automaton approximating the intersection of the RNN language and the complement of the property to be verified. Like the RNN, the property is considered as a black box; only decidability of the word problem is required. Therefore, the approach is suitable for non-regular specifications.
Weiss et al. introduce a different technique to extract finite automata from RNNs [47].It also relies on Angluin's L* but, moreover, uses an orthogonal abstraction of the given RNN to perform equivalence checks between them.
The paper [3] studies formal verification of systems where an RNN-based agent interacts with a linearly definable environment. The verification procedure proceeds by a reduction to feed-forward neural networks (FFNNs). It is complete and fully automatic. This is at the expense of the expressive power of the specification language, which is restricted to properties that only depend on bounded prefixes of the system's executions. In our approach, we do not restrict the kind of regular property to verify. The work [24] also reduces the verification of RNNs to FFNN verification. To do so, the authors calculate inductive invariants, thereby avoiding a blowup in the network size. The effectiveness of their approach is demonstrated on audio signal systems. Like in [3], a time interval is imposed in which a given property is verified.
For adversarial robustness certification, Ryou et al. [43] compute a convex relaxation of the nonlinear operations found in the recurrent cells for certifying the robustness of RNNs. The authors show the effectiveness of their approach in speech recognition. Besides, MARBLE [16] builds a probabilistic model to quantify the robustness of RNNs. However, these approaches are white-box based and demand the full structure and information of neural networks. Instead, our approach is based on learning with black-box checking.
Elboher et al. present a counterexample-guided verification framework whose workflow shares similarities with our property-guided verification [17]. However, their approach addresses FFNNs rather than RNNs. For recent progress in the area of safety and robustness verification of deep neural networks, see [29].
Outline. In Sect. 2, we recall basic notions such as RNNs and finite automata. Section 3 describes two basic algorithms for the verification of RNNs, before we present property-directed verification in Sect. 4. How to handle adversarial robustness certification is discussed in Sect. 5. The experimental evaluation and a thorough discussion can be found in Sect. 6. This paper extends [28] by a more comprehensive introduction and overview to verification of neural networks, by more elaborate explanations, full proofs of all theorems and lemmas, and by using an A*-based heuristic for equivalence checks as well as an enriched evaluation.

Preliminaries
In this section, we provide definitions of basic concepts such as languages, recurrent neural networks, finite automata, and Angluin's L* algorithm.

Words and Languages.
Let Σ be an alphabet, i.e., a nonempty finite set, whose elements are called letters. A (finite) word w over Σ is a sequence $a_1 \dots a_n$ of letters $a_i \in \Sigma$. The length of w is defined as |w| = n. The unique word of length 0 is called the empty word and denoted by λ. We let Σ* refer to the set of all words over Σ.
The symmetric difference of $L_1$ and $L_2$ is defined as $L_1 \oplus L_2 = (L_1 \setminus L_2) \cup (L_2 \setminus L_1)$.
Probability Distributions. In order to sample words over Σ, we assume a probability distribution $(p_a)_{a \in \Sigma}$ on Σ (by default, we pick the uniform distribution) and a "termination" probability p ∈ (0, 1]. Together, they determine a natural probability distribution on Σ* given, for $w = a_1 \dots a_n$, by $\Pr(w) = p_{a_1} \cdots p_{a_n} \cdot (1 - p)^n \cdot p$.
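The sampling process behind this distribution can be sketched as follows: before each letter, the sampler stops with the termination probability and otherwise draws the next letter from the letter distribution. This is a minimal sketch under that reading; the names `sample_word` and `word_probability` are hypothetical, and words are represented as Python strings over single-character letters.

```python
import random

def sample_word(alphabet, p_stop, letter_probs=None, rng=None):
    """Draw one word over `alphabet`: stop with probability p_stop, else add a letter."""
    rng = rng or random.Random()
    letters = list(alphabet)
    # None means the uniform distribution on the alphabet (the default in the text).
    weights = None if letter_probs is None else [letter_probs[a] for a in letters]
    word = []
    while rng.random() >= p_stop:  # continue with probability 1 - p_stop
        word.append(rng.choices(letters, weights=weights, k=1)[0])
    return "".join(word)

def word_probability(word, alphabet, p_stop, letter_probs=None):
    """Probability of `word` under the sampling process above."""
    prob = p_stop  # final decision to terminate
    for a in word:
        p_a = 1 / len(alphabet) if letter_probs is None else letter_probs[a]
        prob *= (1 - p_stop) * p_a  # continue, then pick letter a
    return prob

print(sample_word("ab", p_stop=0.2, rng=random.Random(0)))
print(word_probability("ab", "ab", p_stop=0.5))
```

A larger termination probability shifts the mass toward short words, so the expected word length is (1 − p)/p.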
Finite Automata and Recurrent Neural Networks.We employ two kinds of language acceptors: finite automata and recurrent neural networks.
Recurrent neural networks (RNNs) are a generic term for artificial neural networks that process sequential data. They are particularly suitable for classifying sequences of varying length, which is essential in domains such as natural language processing (NLP) or time-series prediction. For the purposes of this paper, we follow recent literature on extracting surrogate models from RNNs [8, 35–37, 48] and make two assumptions on RNNs: 1. We assume that the inputs to an RNN are taken from a finite set of symbols. While usually the symbols are vectors in one-hot encoding, we abstract away from such implementation details and simply rely on a finite alphabet Σ. 2. We assume that the RNN is a binary (or a one-vs-all) classifier.
One typical application of RNNs with such assumptions is sentiment analysis [33], where the task is to predict whether a text (e.g., a movie review) expresses a positive or negative opinion.
The above assumptions, mathematically speaking, render an RNN R an effective function R : Σ* → {0, 1} with a language defined as L(R) = {w ∈ Σ* | R(w) = 1}. There are several ways to effectively represent R.
Among the most popular architectures are (simple) Elman RNNs, long short-term memory (LSTM) [19], and GRUs [13].Their expressive power depends on the exact architecture, but generally goes beyond the power of finite automata, i.e., the class of regular languages.
A deterministic finite automaton (DFA) over Σ is a tuple $A = (Q, \delta, q_0, F)$ where Q is a finite set of states, $q_0 \in Q$ is the initial state, F ⊆ Q is the set of final states, and δ : Q × Σ → Q is the transition function. We assume familiarity with basic automata theory and leave it at mentioning that the language L(A) of A is defined as the set of words from Σ* that δ guides into a final state when starting in $q_0$. For the complement DFA $\overline{A} = (Q, \delta, q_0, Q \setminus F)$, we get $L(\overline{A}) = \Sigma^* \setminus L(A)$. It is well known that high-level specifications such as LTL formulas over finite words [18] or regular expressions can be compiled into corresponding DFAs.
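The DFA definition and the complement construction translate almost literally into code. The class below is an illustrative sketch; the class name `DFA`, its field names, and the example automaton are assumptions of this example.

```python
from dataclasses import dataclass

@dataclass
class DFA:
    states: set
    alphabet: set
    delta: dict      # maps (state, letter) to the successor state
    initial: object
    finals: set

    def accepts(self, word) -> bool:
        """Follow delta from the initial state and test whether we end in a final state."""
        q = self.initial
        for a in word:
            q = self.delta[(q, a)]
        return q in self.finals

    def complement(self) -> "DFA":
        """Same transition structure, with final and non-final states swapped."""
        return DFA(self.states, self.alphabet, self.delta,
                   self.initial, self.states - self.finals)

# Example: words over {a, b} with an even number of a's.
even_a = DFA(
    states={0, 1},
    alphabet={"a", "b"},
    delta={(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1},
    initial=0,
    finals={0},
)
print(even_a.accepts("abba"), even_a.complement().accepts("abba"))
```

Complementation by swapping final states relies on δ being total, as required by the definition.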
We sometimes use RNNs and DFAs synonymously for their respective languages. For example, we say that R is ε-approximately correct wrt. A if L(R) is ε-approximately correct wrt. L(A).
Angluin's Algorithm. Angluin introduced L*, a classical instance of a learning algorithm in the presence of a minimally adequate teacher (MAT) [4]. We do not detail the algorithm here but only define the interfaces that we need to embed L* into our framework. Given any regular language L ⊆ Σ*, the algorithm L* eventually outputs the unique minimal DFA H such that L(H) = L. The crux is that, while Σ is given, L is a priori unknown and can only be accessed through membership queries (MQ) and equivalence queries (EQ):
(MQ) $w \overset{?}{\in} L$ for a given word w ∈ Σ*. Thus, the answer is either yes or no.

(EQ) $L(H) \overset{?}{=} L$ for a given DFA H. Again, the answer is either yes or no. If the answer is no, one also gets a counterexample word from the symmetric difference L(H) ⊕ L.
Essentially, L* asks MQs until it considers that it has a consistent dataset to come up with a hypothesis DFA H, which then undergoes an EQ. If the latter succeeds, then the algorithm stops. Otherwise, the counterexample and possibly more membership queries are used to refine the hypothesis. The algorithm provides the following guarantee: If MQs and EQs are answered according to a given regular language L ⊆ Σ*, then the algorithm eventually outputs, after polynomially many steps, the unique minimal DFA H such that L(H) = L.

Verification approaches
Before we present (in Sect. 4) our method of verifying RNNs, we here describe two simple approaches. The experiments will later compare all three algorithms wrt. their performance.

Statistical model checking (SMC).
One obvious approach for checking whether the RNN under test R satisfies a given specification A, i.e., to check whether L(R) ⊆ L(A), is by a form of random testing. The idea is to generate a finite test suite T ⊂ Σ* and to check, for each w ∈ T, whether w ∈ L(R) implies w ∈ L(A). If not, each such w is a counterexample. On the other hand, if none of the words turns out to be a counterexample, the property holds on R with a certain error probability. The algorithm is sketched as Algorithm 1.
Note that the test suite is sampled according to a probability distribution on Σ*. Recall that our choice depends on two parameters: a probability distribution on Σ and a "termination" probability, both of which are described in Sect. 2. Taking $n = \log(2/\varepsilon)/(2\gamma^2)$ random samples where m of them are counterexamples, by Hoeffding's inequality bound [20] we get that

Algorithm 1: SMC
Therefore, if Algorithm 1 terminates without finding any counterexamples, we get that R is ε-approximately correct wrt. A with probability at least 1 − γ.
While the approach works in principle, it has several drawbacks for its practical application. The test suite may have to be very large, and it may take a long time either to find a counterexample or to establish correctness.
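A minimal sketch of the SMC loop, assuming the sample size $n = \log(2/\varepsilon)/(2\gamma^2)$ stated above (rounded up) and treating the RNN, the specification, and the word sampler as black-box callables. The names `smc_sample_size`, `smc`, `rnn_accepts`, `spec_accepts`, and `draw_word` are hypothetical and do not come from the authors' code.

```python
import math

def smc_sample_size(eps: float, gamma: float) -> int:
    """Number of samples n = log(2/eps) / (2 * gamma^2), rounded up."""
    return math.ceil(math.log(2 / eps) / (2 * gamma ** 2))

def smc(rnn_accepts, spec_accepts, draw_word, eps, gamma):
    """Random testing of L(R) ⊆ L(A).

    Returns a word accepted by the RNN but rejected by the specification,
    or None if no sampled word is a counterexample.
    """
    for _ in range(smc_sample_size(eps, gamma)):
        w = draw_word()
        if rnn_accepts(w) and not spec_accepts(w):
            return w  # w is in L(R) but not in L(A)
    return None

# Toy usage: the "RNN" accepts everything, the specification rejects "a".
print(smc(lambda w: True, lambda w: w != "a", lambda: "a", eps=0.01, gamma=0.1))
print(smc_sample_size(0.01, 0.1))
```

Because the loop stops at the first violating word, the cost in the failing case depends on how much probability mass the sampler puts on L(R) \ L(A).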
Moreover, the correctness result and the algorithm assume that the words to be tested are chosen according to a random distribution that somehow also has to take into account the RNN as well as the property automaton.
It has been reported that this method does not work well in practice [47] and our experiments support these findings.
Automaton Abstraction and Model Checking (AAMC). As model checking is mainly working for finite-state systems, a straightforward idea would be to (a) approximate the RNN R by a finite automaton $A_R$ such that $L(R) \approx L(A_R)$ and (b) to check whether $L(A_R) \subseteq L(A)$ using model checking. The algorithmic schema is depicted in Algorithm 2.
Here, we can instantiate Approximation() by the DFA-extraction algorithms from [36] or [47]. In fact, for approximating an RNN by a finite-state system, several approaches have been studied in the literature, which can be roughly divided into two approaches: (a) abstraction and (b) automata learning. In the first approach, the state space of the RNN is mapped to equivalence classes according to certain predicates. The second approach uses automata-learning techniques such as Angluin's L*. The approach [47] is an intertwined version combining both ideas.
Therefore, there are different instances of AAMC, varying in the approximation approach.Note that, for verification as language inclusion, as considered here, it actually suffices to learn an over-approximation A R such that L(R) ⊆ L(A R ).
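Step (b), the inclusion check between two DFAs, can be sketched as a breadth-first search over the product automaton that looks for a reachable pair of states where the first automaton accepts and the second rejects. This is a standard construction shown for illustration only; the function name and the tuple encoding `(delta, initial, finals)` of a DFA are assumptions of this example.

```python
from collections import deque

def inclusion_counterexample(alphabet, h, a):
    """Return a shortest word in L(h) \\ L(a), or None if L(h) ⊆ L(a).

    Both automata are given as (delta, initial, finals) with a total
    transition dictionary delta[(state, letter)].
    """
    (dh, qh, fh), (da, qa, fa) = h, a
    start = (qh, qa)
    queue = deque([(start, "")])
    seen = {start}
    while queue:
        (p, q), word = queue.popleft()
        if p in fh and q not in fa:
            return word  # accepted by h, rejected by a
        for x in sorted(alphabet):
            nxt = (dh[(p, x)], da[(q, x)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + x))
    return None

# h accepts every word over {a, b}; a accepts words with an even number of a's.
all_words = ({(0, "a"): 0, (0, "b"): 0}, 0, {0})
even_a = ({(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
print(inclusion_counterexample({"a", "b"}, all_words, even_a))
print(inclusion_counterexample({"a", "b"}, even_a, all_words))
```

Breadth-first order yields a counterexample of minimal length, which is convenient when the word is subsequently replayed on the RNN.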
While the approach seems promising at first glance, its correctness has two glitches. First, the result "Property satisfied" depends on the quality of the approximation. Second, any returned counterexample w may be spurious: w is a counterexample with respect to $A_R$ satisfying A but may not be a counterexample for R satisfying A. If w ∈ L(R), then it is indeed a counterexample, but if not, it is spurious, which indicates that the approximation needs to be refined. If the automaton is obtained using abstraction techniques (such as predicate abstraction) that guarantee over-approximations, well-known principles like CEGAR [14] may be used to refine it. In the automata-learning setting, w may be used as a counterexample for the learning algorithm to improve the approximation. Repeating the latter idea suggests an interplay between automata learning and verification, and this is the idea that we follow in the next section. However, rather than starting from some approximation with a certain quality that is later refined according to the RNN and the property, we perform a direct, property-directed approach.

Property-directed verification of RNNs
We are now ready to present our algorithm for property-directed verification (PDV). The underlying idea is to replace the EQ in Angluin's L* algorithm with a combination of classical model checking and statistical model checking. This approach, which we call property-directed verification of RNNs, is outlined as Algorithm 3 and works as follows.
After initialization of L* and the corresponding data structure, L* automatically generates and asks MQs to the given RNN R until it comes up with a first hypothesis DFA H (Line 3). In particular, the language L(H) is consistent with the MQs asked so far.
At an early stage of the algorithm, H is generally small. However, it already shares some characteristics with R. So it is worth checking, using standard automata algorithms, whether there is no mismatch yet between H and A, i.e., whether L(H) ⊆ L(A) holds (Line 4). Otherwise (Line 10), a counterexample word w ∈ L(H) \ L(A) is already a candidate for being a misclassified input for R. If indeed w ∈ L(R), then w is mistakenly considered positive by R so that R violates the specification A. The algorithm then outputs "Counterexample w" (Line 13). If, on the other hand, R happens to agree with A on a negative classification of w, then there is a mismatch between R and the hypothesis H (Line 14). In that case, w is fed back to L* to refine H.
Now, let us consider the case that L(H) ⊆ L(A) holds (Line 5). If, in addition, we can establish L(R) ⊆ L(H), we conclude that L(R) ⊆ L(A) and output "Property satisfied" (Line 8). This inclusion test (Line 6) relies on statistical model checking using given parameters ε, γ > 0 (cf. Algorithm 1). If the test passes, we have some statistical guarantee of correctness of R (cf. Theorem 1). Otherwise, we obtain a word w ∈ L(R) \ L(H) witnessing a discrepancy between R and H that will be exploited to refine H (Line 9).
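The case distinction performed for a single hypothesis H can be sketched as a small decision function. It deliberately leaves out the L* learner itself and takes the results of the two checks as inputs; the names `pdv_step`, `h_minus_a_witness`, `rnn_accepts`, and `find_r_minus_h` are hypothetical and only serve this illustration.

```python
def pdv_step(h_minus_a_witness, rnn_accepts, find_r_minus_h):
    """Decide what to do with the current hypothesis H.

    h_minus_a_witness: a word in L(H) \\ L(A) from model checking, or None
                       if L(H) ⊆ L(A) holds.
    rnn_accepts:       membership oracle for the RNN R.
    find_r_minus_h:    statistical check of L(R) ⊆ L(H); returns a word in
                       L(R) \\ L(H) or None if the test passes.
    """
    if h_minus_a_witness is not None:
        w = h_minus_a_witness
        if rnn_accepts(w):
            return ("counterexample", w)  # w in L(R) \ L(A): real violation
        return ("refine", w)              # H accepts w but R does not
    w = find_r_minus_h()
    if w is None:
        return ("satisfied", None)        # statistical guarantee only
    return ("refine", w)                  # R accepts w but H does not

print(pdv_step("ab", lambda w: True, lambda: None))
print(pdv_step("ab", lambda w: False, lambda: None))
print(pdv_step(None, lambda w: True, lambda: None))
```

In the two "refine" outcomes, the returned word would be handed back to the learner as the answer to its pending equivalence query.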
Overall, in the event that the algorithm terminates, we have the following theorem that assures the soundness of a returned counterexample and provides the statistical guarantees on the property satisfaction, depending on the result of the algorithm:
Theorem 2 (Correctness of PDV) Suppose Algorithm 3 terminates, using SMC for inclusion checking with parameters ε and γ. If it outputs "Counterexample w", then w is mistakenly classified by R as positive. If it outputs "Property satisfied", then R is ε-approximately correct wrt. A with probability at least 1 − γ.
Proof Suppose the algorithm outputs "Counterexample w" in Line 13. Due to Lines 11 and 12, we have w ∈ L(R) \ L(A). Thus, w is a counterexample.
Although we cannot hope that Algorithm 3 will always terminate, we demonstrate empirically that it is an effective way for the verification of RNNs.


Adversarial robustness certification
Our method can especially be used for adversarial robustness certification, which is parameterized by a distance function dist : Σ* × Σ* → [0, ∞] satisfying, for all words $w_1, w_2, w_3 \in \Sigma^*$, the usual axioms of a distance, including (2) $\mathit{dist}(w_1, w_2) = \mathit{dist}(w_2, w_1)$ and (3) $\mathit{dist}(w_1, w_3) \le \mathit{dist}(w_1, w_2) + \mathit{dist}(w_2, w_3)$. Popular distance functions are Hamming distance and Levenshtein distance. The Hamming distance between $w_1, w_2 \in \Sigma^*$ is the number of positions in which $w_1$ differs from $w_2$, provided $|w_1| = |w_2|$ (otherwise, the distance is ∞). The Levenshtein distance (edit distance) between $w_1$ and $w_2$ is the minimal number of operations among substitution, insertion, and deletion that are required to transform $w_1$ into $w_2$. For L ⊆ Σ* and r ∈ ℕ, we let $N_r(L) = \{w \in \Sigma^* \mid \mathit{dist}(w, w') \le r \text{ for some } w' \in L\}$ be the r-neighborhood of L. If L is regular and dist is the Hamming or Levenshtein distance, then $N_r(L)$ is regular (for efficient constructions of Levenshtein automata when L is a singleton, see [44]).
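For a finite language L, both distances and membership in the r-neighborhood can be computed directly. The sketch below uses the textbook dynamic program for the Levenshtein distance; the function names are chosen for this example, and the direct membership test is only meant for small finite L, not as a replacement for the automaton constructions cited above.

```python
def hamming(w1: str, w2: str) -> float:
    """Number of differing positions, or infinity for words of different length."""
    if len(w1) != len(w2):
        return float("inf")
    return sum(a != b for a, b in zip(w1, w2))

def levenshtein(w1: str, w2: str) -> int:
    """Minimal number of substitutions, insertions, and deletions turning w1 into w2."""
    prev = list(range(len(w2) + 1))  # distances from the empty prefix of w1
    for i, a in enumerate(w1, start=1):
        cur = [i]
        for j, b in enumerate(w2, start=1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or keep
        prev = cur
    return prev[-1]

def in_neighborhood(w: str, language, r: int, dist) -> bool:
    """Test membership of w in the r-neighborhood of a finite language."""
    return any(dist(w, v) <= r for v in language)

print(hamming("abc", "abd"), levenshtein("kitten", "sitting"))
print(in_neighborhood("abd", {"abc"}, 1, hamming))
```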
Let R be an RNN, L ⊆ Σ* be a regular language such that L ⊆ L(R), r ∈ ℕ, and 0 < ε < 1. We call R ε-adversarially robust (wrt. L and r) if $\Pr(N_r(L) \setminus L(R)) < \varepsilon$. Accordingly, every word from $N_r(L) \setminus L(R)$ is an adversarial example. Thus, checking adversarial robustness amounts to checking the inclusion $L(\overline{R}) \subseteq \overline{N_r(L)}$ through one of the above-mentioned algorithms, where $\overline{R}$ is R with inverted outputs and $\overline{N_r(L)}$ is the complement of the neighborhood.
Note that, even when L is a finite set, N r (L) can be too large for exhaustive exploration so that PDV, in combination with SMC, is particularly promising, as we demonstrate in our experimental evaluation.
From the definitions and Theorem 2, we get:
Lemma 1 Suppose Algorithm 3, for input $\overline{R}$ and a DFA A recognizing $\overline{N_r(L)}$, terminates, using SMC for inclusion checking with parameters ε and γ. If it outputs "Counterexample w," then w is an adversarial example. Otherwise, R is ε-adversarially robust (wrt. L and r) with probability at least 1 − γ.
Similarly, we can handle the case where the words of L are classified as negative by R and this classification is to be preserved in the neighborhood of L. Overall, this case amounts to checking $L(R) \subseteq \overline{N_r(L)}$.

Experimental evaluation
We now present an experimental evaluation of the three algorithms SMC, AAMC, and PDV, and provide a comparison of their performance on LSTM networks [19] (a variant of RNNs using LSTM units). The algorithms have been implemented in Python 3.6 using PyTorch 19.09 and the NumPy library. The experiments on adversarial robustness certification were run on a MacBook Pro 13 running macOS. The other experiments were run on an NVIDIA DGX-2 running Ubuntu.
Optimization For Equivalence Queries. In [36], the authors implement AAMC but with an optimization that was originally shown in [4]. This optimization concerns the number of samples required for checking the equivalence between the hypothesis and the taught language. This number depends on ε, γ, and the number n of previous equivalence queries, and is calculated by $\frac{1}{\varepsilon}\left(\log\frac{1}{\gamma} + \log(2)\,(n + 1)\right)$. We adopt this optimization in AAMC and PDV as well (Algorithm 2 in Line 1 and Algorithm 3 in Line 6).
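The sample count of this optimization is easy to tabulate. The helper below is a sketch that assumes natural logarithms and rounds up to an integer; the function name `eq_sample_size` is hypothetical.

```python
import math

def eq_sample_size(eps: float, gamma: float, n_prev: int) -> int:
    """Samples for the next equivalence check: (1/eps) * (log(1/gamma) + log(2) * (n_prev + 1))."""
    return math.ceil((math.log(1 / gamma) + math.log(2) * (n_prev + 1)) / eps)

# The required number of samples grows linearly with the number of earlier EQs.
for n_prev in range(4):
    print(n_prev, eq_sample_size(eps=0.01, gamma=0.01, n_prev=n_prev))
```

Since the count grows only linearly in the number of previous equivalence queries, later checks become more thorough at a moderate additional cost.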

Evaluation on randomly generated DFAs
Synthetic Benchmarks. To compare the algorithms, we implemented the following procedure, which generates a random DFA $A_{\mathrm{rand}}$, an RNN R that learned $L(A_{\mathrm{rand}})$, and a finite set of specification DFAs: (1) choose a random DFA $A_{\mathrm{rand}} = (Q, \delta, q_0, F)$, with |Q| ≤ 30, over an alphabet Σ with |Σ| = 5; (2) randomly sample words from Σ* as described in Sect. 2 in order to create a training set and a test set; (3) train an RNN R with hidden dimension 20|Q| and 1 + |Q|/10 layers; if the accuracy of R on the training set is larger than 95%, continue, otherwise restart the procedure; (4) choose randomly up to five sets $F_i \subseteq Q \setminus F$ to define specification DFAs $A_i = (Q, \delta, q_0, F \cup F_i)$. Using this procedure, we created 30 DFAs/RNNs and 138 specifications.
Table 1 summarizes the executions of the three algorithms on our 138 random instances. The columns of the table are as follows: (1) Avg time is the average running time in seconds, where all algorithms were timed out after 10 min; (2) Avg len is the average length of the found counterexamples (if one was found); (3) #Mistakes is the number of random instances for which a mistake was found; (4) Avg MQs is the average number of membership queries asked to the RNN.
Note that not only is PDV faster and finds more errors than AAMC, the average number of states of the final DFA is also much smaller: 26 states with PDV and 319 with AAMC. Furthermore, it asked more than 10 times fewer MQs to the RNN. Compared to SMC, PDV is 4.5 times faster and the average length of the counterexamples it found is 10 times smaller, although it discovered slightly fewer mistakes.
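Steps (1) and (4) of the benchmark procedure can be sketched as follows. The text does not specify how the random DFA or the sets $F_i$ are drawn, so the choices below (uniform successors, each state final with probability 1/2, uniformly chosen subset sizes) are assumptions of this example, as are the function names.

```python
import random

def random_dfa(max_states, alphabet, rng):
    """Step (1): a random total DFA with at most max_states states (initial state 0)."""
    n = rng.randint(1, max_states)
    states = list(range(n))
    delta = {(q, a): rng.choice(states) for q in states for a in alphabet}
    finals = {q for q in states if rng.random() < 0.5}
    return states, delta, 0, finals

def specification_finals(states, finals, rng, max_specs=5):
    """Step (4): final-state sets F ∪ F_i with F_i drawn from the non-final states."""
    non_final = [q for q in states if q not in finals]
    specs = []
    for _ in range(rng.randint(1, max_specs)):
        extra = set(rng.sample(non_final, rng.randint(0, len(non_final))))
        specs.append(finals | extra)
    return specs

rng = random.Random(0)
states, delta, q0, finals = random_dfa(30, "abcde", rng)
specs = specification_finals(states, finals, rng)
print(len(states), len(specs))
```

Because every specification only adds final states, $L(A_{\mathrm{rand}}) \subseteq L(A_i)$ holds by construction, so any counterexample found for the trained RNN points to a word on which the RNN deviates from $A_{\mathrm{rand}}$.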

Comparing equivalence queries
The PDV algorithm heavily depends on the procedure for checking the language inclusion L(R) ⊆ L(H) between the hypothesized DFA H and the RNN model R. Checking whether L(R) is included in L(H), however, is generally computationally infeasible, and thus, we resort to statistical model-checking that ensures PAC guarantees.
In statistical model-checking, one of the crucial steps is the technique used for random sampling of words from Σ * .Thus, to determine how random sampling affects statistical model-checking, we investigate three different natural sampling techniques.We discuss them below.

Random: The first technique is to randomly sample words based on the natural probability distribution on Σ* introduced in Sect. 2.

DFA-based: The second technique exploits the hypothesis DFA H for random generation of words. To this end, we rely on the work by Bernardi and Giménez [11], who provide a linear algorithm for sampling words from DFAs. We build on top of their algorithm to generate words both accepted and rejected by H. As a heuristic, in our implementation, we incorporate modifications to reduce the chances of sampling the same word multiple times.

RNN-based:
The third technique exploits the RNN R for the random sampling of words. To this end, we rely on a technique similar to the one used by Barbot et al. [9]. The technique, in essence, is an A* exploration of the rooted directed tree of all words in Σ*, where each vertex is a word w ∈ Σ* and its children are the words wa for a ∈ Σ. The exploration is guided by a scoring function f : Σ* → ℝ that indicates how likely a word is to be accepted by the RNN. For our experiments, we define the scoring function to be as follows: where val_R(w) is a value assigned by an RNN R to a word w for determining its acceptance. Precisely, the RNN R accepts w if and only if val_R(w) > 0.5. The scoring function f, defined above, prefers words w for which val_R(w) is close to 0.5, since they can lead to words that can be accepted.
To compare the performances, we ran PDV with each of the sampling techniques on the synthetic benchmarks introduced in Sect. 6.1. Table 2 summarizes the comparison results of the sampling techniques. We compare them based on the average runtime of inclusion checks, the number of mistakes found, and the number of membership queries (MQs) required. The comparison was run on a machine with an Intel Core i7 processor (up to 1.80 GHz) and 24 GB of RAM. The timeout for each run was set to 300 s.
From Table 2, we observe that the DFA-based sampling technique performs best in terms of the runtime, the number of mistakes identified, and the number of MQs required. The Random technique, on the other hand, spends more resources to find mistakes since it samples words simply based on a probability distribution. The RNN-based technique performs worst in our experiments because the function f, as we defined it, does not direct the search toward appropriate words that could be potential mistakes. A better choice of the function f, and consequently, a better understanding of the RNN R, can improve this sampling technique.
In summary, we conclude that the choice of random sampling technique for inclusion checks in PDV can greatly affect the search for mistakes in an RNN.
Faulty Flows. One of the advantages of extracting DFAs in order to detect mistakes in a given RNN is the possibility to find not only one mistake but a "faulty flow." For example, Fig. 1 shows one hypothesis DFA extracted with PDV, based on which we found a mistake in the corresponding RNN. The counterexample we found was abcee. One can see that the word abce is a loop in the DFA. Hence, we can suspect that this could be a "faulty flow." Checking the words w_n = (abce)^n e for n ∈ {1, . . ., 100}, we observed that every such word w_n was in the RNN language but not in the specification.
To automate the reasoning above, we did the following: Given an RNN R, a specification A, the extracted DFA H, and the counterexample w: (1) build the cross product DFA H × A; (2) for every prefix w_1 of the counterexample w = w_1 w_2, denote by s_{w_1} the state to which the prefix w_1 leads in H × A; for any loop starting from s_{w_1}, check whether w_n = w_1^n w_2 is a counterexample for n ∈ {1, . . ., 100}; (3) if w_n is a counterexample for more than 20 values of n, declare a "faulty flow." Using this procedure, we managed to find faulty flows in 81/109 of the counterexamples that were found by PDV.
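The pumping step can be sketched as follows. This is a simplified illustration, not our implementation: it omits the construction of H × A and the loop detection, and simply pumps every nonempty prefix w_1 of the counterexample, as in the abcee example. The callable `is_counterexample` is a hypothetical oracle that returns True if a word is accepted by the RNN but violates the specification.

```python
from typing import Callable, Iterator, Optional, Tuple


def splits(word: str) -> Iterator[Tuple[str, str]]:
    """Yield all decompositions word = w1 + w2 with a nonempty prefix w1."""
    for i in range(1, len(word) + 1):
        yield word[:i], word[i:]


def find_faulty_flow(
    word: str,
    is_counterexample: Callable[[str], bool],  # hypothetical oracle: in L(R) but not in L(A)
    max_n: int = 100,
    threshold: int = 20,
) -> Optional[Tuple[str, str, int]]:
    """Return (w1, w2, hits) for the first split whose pumped words w1^n w2
    are counterexamples for more than `threshold` values of n, else None."""
    for w1, w2 in splits(word):
        hits = sum(1 for n in range(1, max_n + 1) if is_counterexample(w1 * n + w2))
        if hits > threshold:
            return w1, w2, hits
    return None
```

Restricting the candidate prefixes to those that close a loop in H × A, as in step (2), would avoid querying the RNN for splits that cannot be pumped.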


Adversarial robustness certification
We also examined PDV for adversarial robustness certification, following the ideas explained in Sect. 5, both on synthetic and real-world examples.
Synthetic Benchmarks. For a given DFA (representing one of the languages described below), we randomly sampled words from Σ* by using the DFA and created a training set and a test set. For RNN training, we proceeded as in step (3) for the benchmarks in Sect. 6.1. Moreover, for certification, we randomly sampled 100 positive words and 100 negative words from the test set. For a given word w, we then let L = {w} and considered N_r(L) for r = 1, . . ., 5.
Given an RNN R, we checked whether R satisfies adversarial robustness using the certification methods PDV, SMC, and neighborhood-automata generation SMC (NAG-SMC), with ε = γ = 0.01. In SMC, we randomly modified the input word within a certain distance to generate words in the neighborhood. In NAG-SMC, on the other hand, we first generated a neighborhood automaton of the input word and then sampled words accepted by the automaton. Here, we followed the algorithm by Bernardi and Giménez [11], who introduce a method for generating a uniformly random word of length n in a given regular language with mean time bit-complexity O(n).
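As an illustration of the direct modification used in SMC, the sketch below draws a word within Hamming distance r of an input word by overwriting a few randomly chosen positions. The function names are hypothetical, the Hamming distance is only one possible distance function, and the resulting distribution over the neighborhood is not uniform.

```python
import random
from typing import Sequence


def hamming_distance(u: str, v: str) -> int:
    """Number of positions at which two words of equal length differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance requires words of equal length")
    return sum(a != b for a, b in zip(u, v))


def sample_hamming_neighbor(word: str, alphabet: Sequence[str], r: int,
                            rng: random.Random) -> str:
    """Return a word at Hamming distance at most r from `word`."""
    # Choose how many positions to touch, then which ones.
    k = rng.randint(0, min(r, len(word)))
    positions = rng.sample(range(len(word)), k)
    letters = list(word)
    for i in positions:
        # The new letter may coincide with the old one, so the
        # resulting distance is at most k (and hence at most r).
        letters[i] = rng.choice(alphabet)
    return "".join(letters)
```

Each sampled neighbor is then passed to the RNN, and certification fails as soon as one neighbor is classified differently from the input word.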
Figure 2, which is a set of scatter plots, shows the running times of the algorithms on the languages that we describe below. The x-axis and y-axis both show time in seconds, and each data point represents one adversarial robustness certification procedure. Word lengths range from 50 to 500 and follow a normal distribution.
Simple Regular Languages. As a sanity check of our approach, we considered the following two regular languages and distance functions:
The size of the Hamming neighborhood grows exponentially with the distance. The accuracies of the trained RNNs reached 100%. All three approaches successfully reported "adversarially robust" for the certified RNNs.
The first two diagrams on the first row of Fig. 2 compare the runtimes of PDV and SMC on the two regular-language datasets, whereas the first two diagrams on the second row compare the runtimes of PDV and NAG-SMC. We make two main observations. First, on average, PDV (avg. 15.70 s) is faster than SMC (avg. 24.04 s) and NAG-SMC (avg. 32.5 s), which shows clearly that combining symbolic robustness checking on the extracted model with statistical approximation checking is more efficient than pure statistical approaches. Second, although SMC and NAG-SMC are able to certify short words (whose length is smaller than 30) faster, they have to spend more time (more than 60 s) for certification when the words are longer. This is because, for short words, statistical approaches can easily explore the whole neighborhood, but this becomes infeasible as the neighborhood grows larger.
The first two diagrams on the third row of Fig. 2 compare the running times of SMC and NAG-SMC. In general, SMC is faster than NAG-SMC. This is mainly because, for sampling random words from the neighborhood, the algorithm of Bernardi and Giménez [11] is slower than combining the random.choice function of the Python standard library with the corresponding modification.
Real-World Dataset. We used two real-world examples considered by Mayr and Yovine [36]. The first one is the alternating-bit protocol (ABP) shown in Fig. 3. However, we add a special letter dummy to the alphabet and a self-loop transition labeled with dummy on every state. In the figure, for readability, we write d for the letter dummy. The second example is a variant of an example from an e-commerce website [38], shown in Fig. 4. There are seven letters in the original automaton. Similarly, we add the letter dummy to the alphabet and a self-loop transition labeled with dummy on every state (represented by d in the figure). In both examples, we use the number of insertions of the letter dummy as the distance function.
The accuracies of the trained RNNs also reach 100%. All three approaches certify the adversarial robustness of these RNNs as well.
The last two diagrams on the first (resp. second) row of Fig. 2 compare the runtimes of PDV and SMC (resp. PDV and NAG-SMC) on the ABP and the e-commerce datasets. The data points in the first and second rows have a vertical shape. The reason is that the running time of PDV is usually relatively stable (10-20 s), while the running times of SMC and NAG-SMC increase linearly with the word length.
The last two diagrams on the third row of Fig. 2 compare the runtimes of SMC and NAG-SMC on the two datasets. Here, the data points have a diagonal shape, but for long words (length more than 300), NAG-SMC usually spends more time than SMC. This is mainly because it is inefficient to construct the neighborhood automaton and sample random words from the neighborhood.

RNNs identifying contact sequences
Contact tracing [27] has proven to be increasingly effective in curbing the spread of infectious diseases. In particular, analyzing contact sequences (sequences of individuals who have been in close contact in a certain order) can be crucial in identifying individuals who might be at risk during an epidemic. We, thus, look at RNNs which can potentially aid contact tracing by identifying possible contact sequences. However, in order to deploy such RNNs in practice, one would require them to be verified adequately. One wants neither to alert safe individuals unnecessarily nor to overlook individuals who could be at risk.
In a real-world setting, one would obtain contact sequences from contact-tracing information available from, for instance, contact-tracing apps. However, such data is often difficult to procure due to privacy issues. Thus, in order to mimic a real-life scenario, we use data available from www.sociopatterns.org, which contains information about interactions of humans in public places (hospitals, schools, etc.) presented as temporal networks.
Fig. 4 Automaton for e-commerce example

Formally, a temporal network G = (V, E) [21] is a graph structure consisting of a set of vertices V and a set of labeled edges E, where the labels represent the timestamp during which the edge was active. Figure 5 is a simple temporal network, which can be perceived as the contact graph of four workers in an office, where edge labels represent the time of meeting between them. A time-respecting path π ∈ V*, that is, a sequence of vertices for which there exists a sequence of edges with increasing time labels, depicts a contact sequence in such a network. In this example, CDAB is a time-respecting path while ABCD is not.
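The definition of a time-respecting path can be checked mechanically, as in the following sketch. Two assumptions are made here that the definition above leaves open: contacts are treated as undirected, and time labels must be strictly increasing. The edge list in the usage note is a hypothetical network chosen for illustration; it is not the data of Fig. 5.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Sequence, Tuple

Edge = Tuple[str, str, int]  # (vertex, vertex, timestamp), assumed encoding


def is_time_respecting(path: Sequence[str], edges: Iterable[Edge]) -> bool:
    """Check whether consecutive vertices of `path` are connected by edges
    with strictly increasing timestamps (contacts assumed undirected)."""
    times: Dict[FrozenSet[str], List[int]] = defaultdict(list)
    for u, v, t in edges:
        times[frozenset((u, v))].append(t)

    last = float("-inf")
    for u, v in zip(path, path[1:]):
        later = [t for t in times.get(frozenset((u, v)), []) if t > last]
        if not later:
            return False  # no contact between u and v after the previous step
        # Taking the earliest admissible timestamp keeps the most options open.
        last = min(later)
    return True
```

For the hypothetical edge list `[("C", "D", 1), ("D", "A", 2), ("A", "B", 3), ("B", "C", 2)]`, the call `is_time_respecting("CDAB", edges)` holds, while `is_time_respecting("ABCD", edges)` fails because the only contact between B and C precedes the contact between A and B.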
Benchmarks. For our experiment, given a temporal network G, we generated an RNN R recognizing contact sequences as follows:
1. We created training and test data for the RNN by generating (1) valid time-respecting paths (of lengths between 5 and 15) using labeled edges from G, and (2) invalid paths, by considering a valid path and randomly introducing breaks in the path. The number of time-respecting paths in the training set is twice the number of labeled edges in G, while the test set is one-fifth the size of the training set.
2. We trained the RNN R with hidden dimension |V| (minimum 100) as well as 2 + |V|/100 layers on the training data. We considered only those RNNs that could be trained within 5 h with high accuracy (avg. 99%) on the test data.
3. We used a DFA that accepts all possible paths (disregarding the time labels) in the network as the specification, which would allow us to check whether the RNN learned unwanted edges between vertices.
Using this process, from the seven temporal networks, we generated seven RNNs and seven specification DFAs. We ran SMC, PDV, and AAMC on the generated RNNs, using the same parameters as used for the random instances.
Results. Table 3 reports the length of the counterexample, the extracted DFA size (only for PDV and AAMC), and the running time of the algorithms. We make three main observations. First, the counterexamples obtained by PDV and AAMC (avg. length 2) are much more succinct than those obtained by SMC (avg. length 13.1). Small counterexamples help in identifying the underlying error in the RNN, while long and random counterexamples provide much less insight. For example, from the counterexamples obtained by PDV and AAMC, we learned that the RNN overlooked certain edges or identified wrong edges. This result highlights the demerit of SMC, which has also been observed in [47]. Second, the running times of SMC and PDV (avg. 0.48 s and 0.41 s) are comparable, while that of AAMC is prohibitively large (avg. 655.68 s), indicating that model checking on small and rough abstractions of the RNN produces superior results. Third, the extracted DFA is always larger for AAMC (avg. size 124.14) than for PDV (avg. size 2), indicating that RNNs are quite difficult to approximate by small DFAs, which slows down the model-checking process as well. Again, our experiments confirm that PDV produces succinct counterexamples reasonably fast.

Conclusion
We proposed property-directed verification (PDV) as a new verification method for formally verifying RNNs with respect

1
In the index of the right congruence associated with L and in the size of the longest counterexample obtained as a reply to an EQ.

Fig. 1
Fig. Comparison of three algorithms on the regular languages

Fig. 5 Temporal network for contact between 4 people

Table 1
Comparison of verification algorithms

Table 2
Comparison of different equivalence queries (EQs) for PDV