Formal verification of neural agents in non-deterministic environments

We introduce a model for agent-environment systems where the agents are implemented via feed-forward ReLU neural networks and the environment is non-deterministic. We study the verification problem of such systems against CTL properties. We show that verifying these systems against reachability properties is undecidable. We introduce a bounded fragment of CTL, show its usefulness in identifying shallow bugs in the system, and prove that the verification problem against specifications in bounded CTL is in coNExpTime and PSpace-hard. We introduce sequential and parallel algorithms for MILP-based verification of agent-environment systems, present an implementation, and report the experimental results obtained against a variant of the VerticalCAS use-case and the frozen lake scenario.


Introduction
Forthcoming autonomous and robotic systems, including autonomous vehicles, are expected to use machine learning (ML) methods for some of their components. Differently from more conventional AI systems that are programmed directly by engineers, components based on ML are synthesised from data and implemented via neural networks. In an autonomous system these components could execute functions such as perception [38,48] and control [30,33]. Employing ML components has considerable attractions in terms of performance (e.g., image classifiers), and, sometimes, ease of realisation (e.g., non-linear controllers). However, it also raises concerns in terms of overall system safety. Indeed, it is known that neural networks, as presently used, are fragile and hard to understand [52].
If ML components are to be used in safety-critical systems, including various forthcoming autonomous systems, it is essential that they are verified and validated before deployment; standard practice for conventional software. In some areas of AI, notably multi-agent systems (MAS), considerable research has already addressed the automatic verification of AI systems. These concern the validation of either MAS models [20,35,41], or MAS programs [8,14] against expressive AI-inspired specifications, such as those expressible in epistemic and strategy logic. However, with the exceptions discussed below, there is little work addressing the verification of AI systems synthesised from data and implemented via neural networks. This paper makes a contribution in this direction.
Specifically, we formalise and analyse a closed-loop system composed of a reactive neural agent, synthesised from data and implemented by a feed-forward ReLU-activated neural network (ReLU-FFNN), interacting with a non-deterministic environment. Intuitively, the system follows the usual agent-environment loop of observations (of the environment by the agent) and actions (by the agent onto the environment). To model the complexity and partial observability of rich environments, we assume that the neural agent is interacting with a non-deterministic environment, where non-deterministic updates of the environment's state disallow the agent from fully controlling and fully observing the environment's state. Under these assumptions, differently from all related work, the system's evolution is not linear but branching in the future.
We study the verification problem of these systems against a branching time temporal logic. As is known, scalability is a concern in verification and is also an issue in the case of neural systems. To alleviate these difficulties, we are here concerned with a method that is aimed at finding shallow bugs in the system execution, i.e., malfunctions that are realised within a few steps from the system's initialisation. This kind of analysis has been shown to be of particular importance in applications, see, e.g., bounded model checking (BMC) [12], as, experimentally, bugs are often realised after a limited number of steps. Given this, we focus on a bounded version of CTL, i.e., a language expressing temporal properties realisable in a limited number of execution steps. This allows us to reason about applications where the agents ought to bring about a state of affairs within a finite number of steps, or to verify whether a system remains within safety bounds within a number of steps. This enables us to retain decidability even if we consider infinite domains over the reals for the system's state variables, whereas the verification problem for plain CTL is undecidable, as we show. To further alleviate the difficulty of the verification problem, we also introduce a novel algorithm that checks for the occurrence of bugs in parallel over the execution paths. As we show, in the case of bounded safety specifications, this enables us to return a bug to the user as soon as a violation is identified on any of the branching paths that are explored in parallel. This gives considerable advantages in applications, as we show in an avionics application.
A key feature of the parallel verification procedure that we introduce lies in its completeness: we can determine with precision when a potentially infinite set of states (up to a number of steps from the system's initialisation) satisfies a temporal formula. While this results in a heavier computational cost than some incomplete approaches, there are obvious benefits in precise verification, notably the lack of false positives and false negatives. To the best of our knowledge this is the first sound and complete verification framework for closed-loop neural systems that accounts for non-deterministic, branching temporal evolutions.
The rest of the paper is organised as follows. After discussing related work, in Sect. 4 we formally define systems composed of a neural agent, implemented by a ReLU-FFNN, interacting with non-deterministic environments. We analyse the resulting models built on branching executions and define a bounded version of the branching temporal logic CTL to express specifications of these systems. After defining the verification problem, Sect. 5 introduces monolithic and compositional verification algorithms with a complexity study. In this context we show results ranging from undecidability for unbounded reachability to a coNExpTime upper bound for bounded CTL. We present a toolkit for the practical verification of these systems in Sect. 7, implementing said procedure, providing additional functionalities, and reporting the experimental results obtained. We conclude in Sect. 8.

Related work
In [3] a closed-loop neural agent-environment system was put forward and analysed. Like the present contribution the agent was modelled via a ReLU-FFNN. However, differently from here, a simple deterministic environment was considered. As a consequence, the system executions were linear and only bounded reachability properties were analysed. [2] extended this work to neural agents formalised via recurrent ReLU-activated neural networks and verified the resulting linear system executions against bounded LTL properties. In contrast, the model put forward here can account for complex, partially observable environments resulting in branching traces, and the strictly more expressive specification language allows for existential and universal quantification over paths. In addition, while the papers above focus on sequential verification procedures, we here develop a parallel approach specifically tailored at identifying shallow bugs efficiently. This requires novel verification algorithms and mixed-integer linear programming [56] (MILP) encodings.
A number of other proposals have also addressed the issue of closed-loop systems. For example, [31] presents an approach based on hybrid systems to analyse a control-plant where neural networks are synthesised controllers. Their approach is incomparable with the one pursued here, since they target sigmoidal activation functions (while we focus on ReLU activation functions). Also, their verification procedure is not complete, while completeness is a key objective here. Similarly, [15,28,32,57] present work addressing closed-loop systems with learned controllers and focus on reachable set estimation and, hence, incomplete techniques for such systems.
Lastly, there has been recent activity on complete approaches for verifying standalone ReLU-FFNNs [6, 9-11, 17, 26, 27, 34, 36, 37, 40, 44, 53]. The systems considered in these approaches are not closed-loop and do not incorporate the environment. This makes the problems considered there different from those analysed here; for instance no temporal evolution can be considered for neural network-controlled agents interacting with an environment. We refer to [24,29,39] for surveys on the emerging area of verification of neural networks.
In comparison with [1], we here report several novel optimisations in the tool with a more efficient encoding of the discussed aircraft collision avoidance scenario. We also evaluate our tool on an additional reinforcement learning scenario.
More broadly, this line of work is related to long standing efforts in bounded model checking [7,47] that are tailored to finding malfunctions easily accessible from the initial states. While our approach is technically different from BMC, it shares with it the characteristic of being more efficient than full exploration methods when only a fraction of the model needs to be explored.

Background
In this section we summarise basic concepts pertaining to feed-forward ReLU networks and the formalisation of their verification problem in mixed integer linear programming.

Feed-forward ReLU networks
A feed-forward neural network (FFNN) with $L$ layers is a function $f$ that maps an input $x^{(0)}$ to an output $x^{(L)}$ by computing, for each layer $i \in \{1, \dots, L\}$, $z^{(i)} = W^{(i)} x^{(i-1)} + b^{(i)}$ and $x^{(i)} = \sigma^{(i)}(z^{(i)})$, where:
• $x^{(0)}$ is the input to the network and $x^{(i)}$ is the output of the i-th layer.
• $z^{(i)}$ is the result of the affine transformation of the i-th layer, known as the pre-activation of the layer, for a weight matrix $W^{(i)} \in \mathbb{R}^{s_i \times s_{i-1}}$ and a bias vector $b^{(i)} \in \mathbb{R}^{s_i}$.
• $\sigma^{(i)}$ is the activation function of the i-th layer.
Figure 1 gives a graphical representation of a FFNN. Each layer $i$, $i \in \{1, \dots, L-1\}$, is said to be a hidden layer; the last layer $L$ of the network is said to be the output layer. Each element of each layer $i$ is said to be a node (see Fig. 2). The weights and biases of the layers are determined during a training phase which aims at fitting a data set consisting of input-output pairs specifying how the network should behave (see, e.g., [21] for more details).
Here we are only concerned with FFNNs whose hidden layers use the Rectified Linear Unit (ReLU) and whose output layer uses the identity function as activation function; we abbreviate these networks by ReLU-FFNN. The ReLU activation function is widely used in supervised learning tasks because of its effectiveness in training [43]. The function is defined as $\mathrm{ReLU}(z) \triangleq \max(0, z)$ and is applied element-wise to a pre-activation vector $z^{(i)}$ (see Fig. 3). Since the function consists of two linear parts (0 for z < 0 and z for z ≥ 0), it is a piecewise-linear (PWL) function; that is, a function whose input domain can be split into a collection of subdomains on each of which it is an affine function. Consequently, since a ReLU-FFNN composes affine transformations with PWL activation functions, ReLU-FFNNs are also PWL functions.
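As an illustration of the definitions above, the following minimal sketch evaluates a ReLU-FFNN in plain Python. The helper names (`affine`, `relu_ffnn`) and the example network `ABS_NET`, which computes |x| as ReLU(x) + ReLU(−x), are assumptions made for this example and do not come from the paper's toolkit.

```python
def relu(v):
    """Apply ReLU(z) = max(0, z) element-wise."""
    return [max(0.0, t) for t in v]


def affine(W, b, x):
    """Pre-activation z = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]


def relu_ffnn(layers, x):
    """Evaluate a network given as a list of (W, b) pairs.

    Hidden layers use ReLU; the last (output) layer uses the identity.
    """
    for i, (W, b) in enumerate(layers):
        z = affine(W, b, x)
        x = z if i == len(layers) - 1 else relu(z)
    return x


# Hypothetical example network: |x| = ReLU(x) + ReLU(-x).
ABS_NET = [
    ([[1.0], [-1.0]], [0.0, 0.0]),  # hidden layer with two nodes
    ([[1.0, 1.0]], [0.0]),          # output layer (identity activation)
]

print(relu_ffnn(ABS_NET, [-3.0]))
```

The example network is affine on each of the two subdomains x < 0 and x ≥ 0, which is the piecewise-linear behaviour described above.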
In this paper we are concerned with the reachability problem for ReLU-FFNN. The problem is to establish whether there is an admissible input within a possibly uncountable set of inputs for which a given ReLU-FFNN computes an output within a given set of outputs (see, e.g., [4,5,16,34]). Formally, we have
Definition 1 (ReLU-FFNN reachability problem) Given a ReLU-FFNN $f : \mathbb{R}^{s_0} \to \mathbb{R}^{s_k}$, a set of inputs $X \subset \mathbb{R}^{s_0}$ and a set of outputs $Y \subset \mathbb{R}^{s_k}$, the neural network reachability problem is to determine whether there exists $x \in X$ such that $f(x) \in Y$.
The neural network reachability problem is known to be NP-complete [34].

Mixed integer linear programming
A mixed integer linear program (MILP) is an optimisation problem whereby a linear objective function over real- and integer-valued variables is sought to be minimised subject to a set of linear constraints.
For the purposes of this paper, we are interested in the MILP feasibility problem. This is concerned with checking whether a set of MILP constraints is feasible, i.e., whether there exists an assignment to the variables that satisfies all constraints. Therefore, we hereafter assume that the objective function is constant (i.e., it does not depend on the variables), and associate a MILP with a set of linear and typing constraints. It is known that the feasibility problem of MILP is NP-complete [46].
A PWL function can be MILP-encoded using the "Big-M" method. For instance, the pairs (z, x), where x = ReLU(z) and z ∈ [l, u], can be found as solutions to the following set of MILP constraints that use a binary variable $\delta$, real-valued variables z and x, and constants l and u:
$$x \ge 0, \qquad x \ge z, \qquad x \le z - l\,(1 - \delta), \qquad x \le u \cdot \delta.$$
Here, when $\delta = 1$, the constraints imply that x = z and z ≥ 0, and when $\delta = 0$, the constraints imply that x = 0 and z ≤ 0. This approach of "switching off" constraints using large enough constants (l and u in this case) is called the "Big-M" method [22]. Equivalently to this formulation, we can compute the same solutions by making use of indicator constraints. These have the form $(\delta = v) \Rightarrow c$, for a binary variable $\delta$, a binary value v ∈ {0, 1} and a linear constraint c:
$$(\delta = 1) \Rightarrow z \ge 0, \qquad (\delta = 1) \Rightarrow x = z, \qquad (\delta = 0) \Rightarrow z \le 0, \qquad (\delta = 0) \Rightarrow x = 0.$$
The indicator constraints are read as follows: if $\delta = 1$ then z ≥ 0 and x = z should hold, and if $\delta = 0$ then z ≤ 0 and x = 0 should hold. Indicator constraints are supported by all major commercial MILP solvers and can be seen as syntactic sugar for Big-M constraints, where one does not have to provide the big M constant in advance. In particular, indicator constraints can be used to naturally express disjunctive cases, cf. the monolithic encoding in Sect. 5.
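The following sketch checks numerically that a Big-M encoding captures exactly the graph of ReLU on [l, u]. It assumes the common formulation x ≥ 0, x ≥ z, x ≤ z − l(1 − δ), x ≤ u·δ with l ≤ 0 ≤ u; the function names and the grid-based check are illustrative assumptions, and no MILP solver is involved.

```python
def big_m_feasible(z, x, delta, l, u, eps=1e-9):
    """Check the four Big-M constraints for one (z, x, delta) assignment."""
    return (
        x >= -eps
        and x >= z - eps
        and x <= z - l * (1 - delta) + eps
        and x <= u * delta + eps
    )


def relu_pairs_match(l, u, steps=50):
    """Compare Big-M feasibility with x == ReLU(z) on a grid over [l, u]^2."""
    grid = [l + (u - l) * i / steps for i in range(steps + 1)]
    for z in grid:
        for x in grid:
            # A pair (z, x) is a solution if some value of the binary
            # variable delta makes all constraints hold.
            feasible = any(big_m_feasible(z, x, d, l, u) for d in (0, 1))
            is_relu = abs(x - max(0.0, z)) < 1e-9
            if feasible != is_relu:
                return False
    return True


print(relu_pairs_match(-2.0, 3.0))
```

A grid check of this kind is only a sanity test of the encoding, not a proof; the bounds l and u must be valid for the pre-activation, otherwise the constraints cut off genuine solutions.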
Since ReLU-FFNNs are PWL, the ReLU-FFNN reachability problem has an exact MILP representation: a feasible solution of the corresponding MILP can be used to find an input $x^{(0)}$ from the given set of inputs X so that $f(x^{(0)})$ belongs to the given set of outputs Y [4,5,9,40,53]. To define the associated MILP, we assume the following: (i) X and Y can be respectively expressed as sets of linear constraints over variables of the inputs and outputs of the network; (ii) a lower bound $l^{(i)}_j$ and an upper bound $u^{(i)}_j$ for each pre-activation $z^{(i)}_j$ have been computed (the bounds can be computed from X via bound propagation methods, see, e.g., [50,55]).
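One simple way to obtain pre-activation bounds of the kind assumed in (ii) is interval arithmetic. The sketch below is an illustration under the assumption that X is a box given by per-component lower and upper bounds; it is not necessarily the bound propagation method of [50,55], and the function names are hypothetical.

```python
def affine_bounds(W, b, lo, hi):
    """Interval image of z = W x + b for x in the box [lo, hi]."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = h = bias
        for w, a, c in zip(row, lo, hi):
            # A positive weight maps lower to lower; a negative one swaps.
            l += w * a if w >= 0 else w * c
            h += w * c if w >= 0 else w * a
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi


def preactivation_bounds(layers, lo, hi):
    """Return a list of (lower, upper) pre-activation bounds, one per layer."""
    bounds = []
    for W, b in layers:
        z_lo, z_hi = affine_bounds(W, b, lo, hi)
        bounds.append((z_lo, z_hi))
        # ReLU is monotone, so it can be applied to both interval ends.
        lo = [max(0.0, v) for v in z_lo]
        hi = [max(0.0, v) for v in z_hi]
    return bounds


# Hypothetical network computing |x| = ReLU(x) + ReLU(-x) on x in [-3, 2].
ABS_NET = [([[1.0], [-1.0]], [0.0, 0.0]), ([[1.0, 1.0]], [0.0])]
print(preactivation_bounds(ABS_NET, [-3.0], [2.0]))
```

The bounds are sound but may be loose: on this example the output interval is [0, 5], while the true range of |x| on [−3, 2] is [0, 3]. Looser bounds do not affect the exactness of a Big-M encoding, only its tightness for the solver.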

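Definition 2 (MILP formulation)
The MILP formulation of the ReLU-FFNN reachability problem for a ReLU-FFNN $f$, an input set X and an output set Y consists of the following constraints: the linear constraints expressing $x^{(0)} \in X$ and $x^{(L)} \in Y$; the affine constraints $z^{(i)} = W^{(i)} x^{(i-1)} + b^{(i)}$ for every layer i; the Big-M constraints $x^{(i)}_j \ge 0$, $x^{(i)}_j \ge z^{(i)}_j$, $x^{(i)}_j \le z^{(i)}_j - l^{(i)}_j (1 - \delta^{(i)}_j)$ and $x^{(i)}_j \le u^{(i)}_j \cdot \delta^{(i)}_j$, with a binary variable $\delta^{(i)}_j$, for every node j of every hidden layer i; and $x^{(L)} = z^{(L)}$ for the output layer.
There exists an input $x^{(0)} \in X$ such that $f(x^{(0)}) \in Y$ iff the MILP above is feasible.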

Neural agent-environment systems
In this section we introduce systems with a neural agent operating on a non-deterministic environment (NANES ). These are an extension to non-deterministic environments of the deterministic neural agent-environment systems put forward in [3]. In contrast to traditional models of agency, where the agent's behaviour is given in an agent-based programming language, a NANES accounts for the recent shift to synthesise the agents' behaviour from data [33]; we consider agent protocol functions implemented via ReLU-FFNNs [25]. Differently from [3], following the dynamism and unpredictability of the environments where autonomous agents are typically deployed [42], a NANES models interactions of an agent with a partially observable environment. In this setting an agent cannot observe the full environment state, and therefore cannot deterministically predict the effect of any of its actions.
We now proceed to give a formal description of NANES components: a neural agent and a non-deterministic environment. The description closely follows the formalism of interpreted systems, a mainstream semantics for multi-agent systems [19]. To this end, we fix a set S ⊆ ℝ m of environment states and a set Act ⊆ ℝ n of actions, for m, n ∈ ℕ . We assume that the agent is stateless and that its protocol (also known as action policy) has already been synthesised, e.g., via reinforcement learning [51], and is implemented via a ReLU-FFNN or via a PWL combination of them.

Definition 3 (Neural Agents) Let S be a set of environment states. A neural agent (or simply an agent) Ag acting on an environment is defined as the tuple Ag = (Act, prot), where:
• Act is a set of actions;
• prot ∶ S → Act is a protocol function that determines the action the agent will perform given the current state of the environment. Specifically, given ReLU-FFNNs $f_1, \dots, f_h$, h ≥ 1, prot is a PWL combination of the latter.
The environment is stateful and non-deterministically updates its state in response to the actions of the agent.
Definition 4 (Non-deterministic environment) An environment is a tuple $E = (S, t_E)$, where:
• $S \subseteq \mathbb{R}^m$ is a set of states.
• $t_E : S \times Act \to 2^S$ is a transition function that determines the temporal evolution of the state of the environment. Specifically, given the current state of the environment and the current action of the agent, the transition function returns the set of next possible environment states.
Given the above we can now define a closed-loop system comprising an agent interacting with an environment.

Definition 5 (NANES) A Neural Agent operating on a Non-Deterministic Environment System (NANES) is a tuple $\mathcal{S} = (Ag, E, I)$ where Ag = (Act, prot) is a neural agent, $E = (S, t_E)$ is an environment, and I ⊆ S is a closed set of initial states for the environment.
Hereafter we assume the environment's transition function is PWL and its set of initial states is expressible as a set of linear constraints over integer and real-valued variables. Also, to enable its finite MILP representation, we assume that the function's branching factor is bounded, i.e., there is an (arbitrarily large) b ∈ ℕ such that the cardinality of $t_E(s, a)$ is bounded by b for all s ∈ S and a ∈ Act.
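The closed loop of Definitions 3 to 5 can be sketched as follows. The class and the one-dimensional toy protocol and transition function are assumptions introduced for illustration only; in particular, the set I of initial states may be infinite, whereas the sketch works on a finite sample of it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


@dataclass
class NANES:
    prot: Callable[[State], Action]                # agent protocol
    t_env: Callable[[State, Action], List[State]]  # non-deterministic t_E
    initial: List[State]                           # finite sample of I

    def successors(self, s: State) -> List[State]:
        """Next environment states: t_E(s, prot(s))."""
        return self.t_env(s, self.prot(s))

    def states_at_depth(self, k: int) -> List[State]:
        """End states of all k-step runs (at most b**k per initial state)."""
        frontier = list(self.initial)
        for _ in range(k):
            frontier = [s2 for s in frontier for s2 in self.successors(s)]
        return frontier


def toy_prot(s: State) -> Action:
    # A PWL protocol: ReLU(1 - s), i.e. push towards 1 from below.
    return (max(0.0, 1.0 - s[0]),)


def toy_t_env(s: State, a: Action) -> List[State]:
    # Branching factor b = 2: the action is perturbed by -0.25 or +0.25.
    return [(s[0] + a[0] - 0.25,), (s[0] + a[0] + 0.25,)]


system = NANES(prot=toy_prot, t_env=toy_t_env, initial=[(0.0,)])
print(system.states_at_depth(2))
```

Explicit enumeration like this grows as b^k and cannot cover an infinite set of initial states, which is why the procedures studied in this paper rely on symbolic MILP encodings instead.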
Example 1 Consider as a running example a variant of the FrozenLake scenario [45], where an agent navigates in a grid world consisting of walkable tiles (frozen surface) and non-walkable ones (holes in the ice) that lead to the agent falling into the water. The goal of the agent is to reach the goal tile while avoiding the holes. At each step the agent chooses a direction to walk, but since walkable tiles are slippery, the resulting movement direction is uncertain and does not depend only on the agent's choice. Namely, the agent may end up moving in any of three directions: the chosen one, or the ones to its left or right; see Fig. 4 for an illustration. We formalise this scenario as follows.
The agent is defined as Ag = (Act, prot), where Act is the set of directions in which the agent can choose to walk and prot is the agent's protocol. The non-deterministic environment models the slippery nature of ice. Here we assume a 3 × 3 grid world, so we formalise the environment as $E = (S, t_E)$, where $S = \{e_1, \dots, e_9\}$ and $t_E(s, a) = \{mv(s, a_{-1}),\ mv(s, a),\ mv(s, a_{+1})\}$, where $e_i$ is the vector with 1 at the i-th position and 0 everywhere else, mv(s, d) returns the state s′ resulting from moving from state s in direction d, $a_{-1}$ is the direction to the right of a and $a_{+1}$ is the direction to the left of a when looking in the direction a (e.g., $a_{-1} = 1$ and $a_{+1} = 3$ for a = 2). To see that mv(s, d) is a PWL function, note that both S and Act are finite sets.
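A possible concrete reading of this environment is sketched below. Several details are assumptions made for the example and are not fixed by the text: tiles are numbered 1 to 9 row by row, directions are numbered 1 = west, 2 = south, 3 = east, 4 = north (which agrees with the example a = 2), a move into a wall leaves the agent in place, and holes and the goal receive no special treatment.

```python
N = 3  # the grid is N x N

# Assumed direction numbering, as (row offset, column offset).
MOVES = {1: (0, -1), 2: (1, 0), 3: (0, 1), 4: (-1, 0)}


def mv(tile, d):
    """Tile reached from `tile` by moving in direction d (stay at walls)."""
    row, col = divmod(tile - 1, N)
    d_row, d_col = MOVES[d]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N and 0 <= new_col < N:
        row, col = new_row, new_col
    return row * N + col + 1


def side_directions(a):
    """Directions to the right and to the left of a, when facing a."""
    right = (a - 2) % 4 + 1
    left = a % 4 + 1
    return right, left


def t_env(tile, a):
    """Slippery move: the chosen direction or one of its two neighbours."""
    right, left = side_directions(a)
    return {mv(tile, right), mv(tile, a), mv(tile, left)}


def one_hot(tile):
    """State vector e_i for tile i."""
    return tuple(1.0 if i == tile else 0.0 for i in range(1, N * N + 1))


print(t_env(1, 2))  # from the top-left tile, choosing direction 2
```

Under these assumptions the branching factor is at most 3; it can be smaller next to a wall, where two of the three moves may lead to the same tile.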
Finally, we define the set of initial states as $I = \{e_1\}$ and the FrozenLake system as $\mathcal{S} = (Ag, E, I)$. ◻
With each NANES $\mathcal{S}$ we can associate a temporal model $M_{\mathcal{S}}$ that is used to interpret temporal specifications. Figure 5 gives a graphical depiction of the temporal model of the FrozenLake system $\mathcal{S}$, assuming that prot

Example 2
In the rest of the paper, we assume to have fixed a NANES $\mathcal{S}$ and the associated model $M_{\mathcal{S}}$. An $M_{\mathcal{S}}$-path, or simply path, is an infinite sequence of states $s_1 s_2 \dots$ where $s_i \in R$ and $s_{i+1}$ is a successor of $s_i$, i.e. $(s_i, s_{i+1}) \in T$, for each i ≥ 1; here R denotes the set of states of $M_{\mathcal{S}}$ and T its successor relation. Given a path $\rho$ we use $\rho(i)$ to denote the i-th state in $\rho$. For an environment state $s = (c_1, \dots, c_m)$, we write $\Pi(s)$ to denote the set of all paths originating from s and we use s.d to denote its d-th component $c_d$.
We verify NANES against properties expressed in a bounded variant of the temporal logic CTL [13]. It is also possible to verify NANES against bounded versions of LTL, but this is not pursued here. Inspired by Real-Time Computation Tree Logic (RTCTL) [18], formulae of bounded CTL build upon temporal modalities indexed with natural numbers denoting the temporal depth up to which the formula is evaluated.
Definition 7 (Bounded CTL) Given a set of environment states $S \subseteq \mathbb{R}^m$, the bounded CTL specification language over linear inequalities, denoted $\mathrm{bCTL}_{\mathbb{R}^{<}}$, is defined by the following BNF:
$$\varphi ::= \alpha \mid \varphi \lor \varphi \mid \varphi \land \varphi \mid EX^k \varphi \mid AX^k \varphi, \qquad \alpha ::= c_1 (d_1) + \dots + c_l (d_l) \gtrless c,$$
where $\gtrless \in \{<, >\}$, $d_i \in \{1, \dots, m\}$, $c_i, c \in \mathbb{R}$ and $k \in \mathbb{N}$.
Here atomic propositions are linear constraints on the components of a state. For instance, the atomic proposition $(d_1) + (d_2) < 2$ states that "the sum of the $d_1$-st and $d_2$-nd components is less than 2." The temporal formula $EX^k \varphi$ stands for "there is a path such that $\varphi$ holds after k time steps", whereas $AX^k \varphi$ stands for "in all paths $\varphi$ holds after k time steps". Note that the restriction of $\mathrm{bCTL}_{\mathbb{R}^{<}}$ to strict inequalities is crucial to the verification algorithm introduced in the next section; the algorithm relies on encoding the negation of the specification to check into MILP, which does not support strict inequalities.

Example 3
Consider the FrozenLake scenario from Example 1. We are interested in assessing the safety of the scenario in terms of the following specifications:
$$\varphi^{k}_{\mathit{safe}} \triangleq AG^{k}\, \mathit{safe} \qquad \text{and} \qquad \varphi^{k}_{\mathit{goal}} \triangleq AG^{k-1}\, \mathit{safe} \land AX^{k}\, \mathit{goal},$$
for various values of k, where $\mathit{safe} \triangleq \bigwedge_{h \in \{3,7\}} (h) < 0.1$ (i.e., holes avoided) and $\mathit{goal} \triangleq (9) > 0.9$. The formula $\varphi^{k}_{\mathit{safe}}$ states that in every evolution of the system the agent always avoids a hole within the first k steps. The formula $\varphi^{k}_{\mathit{goal}}$ states that in every evolution of the system the agent always avoids a hole within the first k − 1 steps and reaches the goal state at the k-th step. Intuitively, the former means that all k-bounded runs are safe (no hole is reached), while the latter means that all k-bounded runs are safe and successful (the goal is reached in the end state).
We now define the logic $\mathrm{CTL}_{\mathbb{R}^{<}}$ built from the atoms of $\mathrm{bCTL}_{\mathbb{R}^{<}}$.

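Definition 8 (CTL) The branching-time logic $\mathrm{CTL}_{\mathbb{R}^{<}}$ is defined by the following BNF:
$$\varphi ::= \alpha \mid \neg\varphi \mid \varphi \land \varphi \mid AX\,\varphi \mid AF\,\varphi \mid E(\varphi\ U\ \varphi),$$
where $\alpha$ is an atomic proposition in $\mathrm{bCTL}_{\mathbb{R}^{<}}$.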
Comparing $\mathrm{bCTL}_{\mathbb{R}^{<}}$ to $\mathrm{CTL}_{\mathbb{R}^{<}}$, we observe that on the one hand $AX^k \varphi$ and $EX^k \varphi$ are expressible, respectively, as $AX(\cdots(AX\,\varphi)\cdots)$ and $\neg AX(\cdots(AX\,\neg\varphi)\cdots)$, where AX is applied k times. On the other hand, $\mathrm{CTL}_{\mathbb{R}^{<}}$ includes the AF ("in all paths eventually") and EU (unbounded until) modalities capable of expressing arbitrary reachability, whereas $\mathrm{bCTL}_{\mathbb{R}^{<}}$ admits bounded specifications only. Note that, while $\mathrm{bCTL}_{\mathbb{R}^{<}}$ is clearly less expressive than $\mathrm{CTL}_{\mathbb{R}^{<}}$, it still captures properties of interest. Notably, bounded safety is expressible in $\mathrm{bCTL}_{\mathbb{R}^{<}}$ as $AG^k\, \mathit{safe}$, stating that every state on every path is safe within the first k steps.
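We interpret $\mathrm{bCTL}_{\mathbb{R}^{<}}$ formulae on a temporal model as follows.

Definition 9 (Satisfaction)
For a model $M_{\mathcal{S}}$, an environment state s, and a $\mathrm{bCTL}_{\mathbb{R}^{<}}$ formula $\varphi$, the satisfaction of $\varphi$ at s in $M_{\mathcal{S}}$, denoted $(M_{\mathcal{S}}, s) \models \varphi$, or simply $s \models \varphi$ when $M_{\mathcal{S}}$ is clear from the context, is inductively defined as follows:
• $s \models \alpha$, for an atomic proposition $\alpha$, iff the linear constraint $\alpha$ holds when each (d) is interpreted as the component s.d;
• $s \models \varphi_1 \lor \varphi_2$ iff $s \models \varphi_1$ or $s \models \varphi_2$;
• $s \models \varphi_1 \land \varphi_2$ iff $s \models \varphi_1$ and $s \models \varphi_2$;
• $s \models EX^k \varphi$ iff there is a path $\rho$ originating from s such that $\rho(k+1) \models \varphi$, where $\rho(i)$ denotes the i-th state of $\rho$;
• $s \models AX^k \varphi$ iff for all paths $\rho$ originating from s we have $\rho(k+1) \models \varphi$.
We assume the usual definition of satisfaction for $\mathrm{CTL}_{\mathbb{R}^{<}}$; this can be given as standard by using the atomic case from Definition 9.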
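To make the semantics concrete, the sketch below evaluates bounded CTL formulae by explicit recursion over successors. It is only an illustration on a finite sample of initial states: the tuple-based formula representation, the function names and the toy successor function are assumptions of this example, and explicit enumeration is not the MILP-based procedure developed in this paper.

```python
def holds(succ, s, phi):
    """Decide s |= phi, where succ(s) lists the successors of state s.

    Formulae are tuples:
      ("atom", {d: c_d, ...}, "<" or ">", bound)  linear constraint on s
      ("or", f, g), ("and", f, g)
      ("EX", k, f), ("AX", k, f)
    """
    tag = phi[0]
    if tag == "atom":
        _, coeffs, op, bound = phi
        value = sum(c * s[d - 1] for d, c in coeffs.items())  # s.d is s[d-1]
        return value < bound if op == "<" else value > bound
    if tag == "or":
        return holds(succ, s, phi[1]) or holds(succ, s, phi[2])
    if tag == "and":
        return holds(succ, s, phi[1]) and holds(succ, s, phi[2])
    _, k, sub = phi
    if k == 0:
        return holds(succ, s, sub)
    # QX^k f is unfolded as one step followed by QX^(k-1) f.
    step = (tag, k - 1, sub)
    results = (holds(succ, s2, step) for s2 in succ(s))
    return any(results) if tag == "EX" else all(results)


def satisfies(succ, initial, phi):
    """The system satisfies phi if phi holds at every initial state."""
    return all(holds(succ, s, phi) for s in initial)


def toy_succ(s):
    # Hypothetical system: a counter that grows by 1 or by 2 at each step.
    return [(s[0] + 1,), (s[0] + 2,)]


below_5 = ("atom", {1: 1.0}, "<", 5.0)
print(satisfies(toy_succ, [(0.0,)], ("AX", 2, below_5)))
```

The recursion visits up to b^k states for temporal depth k, in line with the exponential growth of the relevant part of the model.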
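Although $\mathrm{bCTL}_{\mathbb{R}^{<}}$ does not include negation, it still allows us to express arbitrary CTL formulae of bounded temporal depth since it supports all Boolean and temporal operators with their duals. Useful abbreviations of $\mathrm{bCTL}_{\mathbb{R}^{<}}$ are the temporal modalities $EF^k$ ("possibly within k steps") and $EG^k$ ("possibly for k steps"):
$$EF^k \varphi \triangleq \varphi \lor EX^1(EF^{k-1} \varphi), \qquad EG^k \varphi \triangleq \varphi \land EX^1(EG^{k-1} \varphi), \qquad EF^0 \varphi \triangleq EG^0 \varphi \triangleq \varphi.$$
The dual temporal modalities $AG^k$ and $AF^k$, prefixed by the universal path quantifier, are analogously defined:
$$AG^k \varphi \triangleq \varphi \land AX^1(AG^{k-1} \varphi), \qquad AF^k \varphi \triangleq \varphi \lor AX^1(AF^{k-1} \varphi), \qquad AG^0 \varphi \triangleq AF^0 \varphi \triangleq \varphi.$$
Moreover, bounded until $E(\varphi_1\, U^k \varphi_2)$ ("there is a path such that $\varphi_2$ holds within k time steps, and where $\varphi_1$ holds up until then") can be defined by the abbreviations
$$E(\varphi_1\, U^k \varphi_2) \triangleq \varphi_2 \lor (\varphi_1 \land EX^1 E(\varphi_1\, U^{k-1} \varphi_2)), \qquad E(\varphi_1\, U^0 \varphi_2) \triangleq \varphi_2,$$
and analogously for $A(\varphi_1\, U^k \varphi_2)$.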
We also note that the formula $QX^k \varphi$, for Q ∈ {A, E}, is equivalent to $QX^1(\cdots(QX^1 \varphi)\cdots)$, where $QX^1$ is applied k times to $\varphi$.
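The bounded modalities can be unfolded mechanically into formulae that use only the one-step operators. The sketch below assumes the usual recursive reading, with the k = 0 case taken to be the formula itself (respectively the target formula for until); both this base case and the tuple representation are assumptions of the example.

```python
def EX1(f):
    return ("EX", 1, f)


def AX1(f):
    return ("AX", 1, f)


def EF(k, f):
    """EF^k f: f holds now or within k steps on some path."""
    return f if k == 0 else ("or", f, EX1(EF(k - 1, f)))


def AG(k, f):
    """AG^k f: f holds now and for the next k steps on all paths."""
    return f if k == 0 else ("and", f, AX1(AG(k - 1, f)))


def EU(k, f, g):
    """E(f U^k g): g holds within k steps on some path, f until then."""
    return g if k == 0 else ("or", g, ("and", f, EX1(EU(k - 1, f, g))))


def QXk(q, k, f):
    """QX^k f written as k nested applications of QX^1."""
    for _ in range(k):
        f = (q + "X", 1, f)
    return f


print(AG(2, "safe"))
```

The unfolded formula has size linear in k, so working with the one-step operators only, as the encodings of the next section do, costs at most a polynomial blow-up when k is given in unary.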
A specification $\varphi$ is said to be satisfied by $\mathcal{S}$ if $(M_{\mathcal{S}}, s) \models \varphi$ for all initial states s ∈ I. We denote this by $\mathcal{S} \models \varphi$. It follows that, for example, to check bounded safety we need to verify that from all (possibly infinitely many) initial states no state (out of possibly infinitely many) within the first k evolutions is an unsafe state. This is the basis of the verification problem that we define below.
Definition 10 (Verification problem) Given a NANES $\mathcal{S}$ and a formula $\varphi$, determine whether $\mathcal{S} \models \varphi$.

Remark 1
The verification problem is uniquely associated with a model checking problem, which is to check whether $M_{\mathcal{S}} \models \varphi$ given $M_{\mathcal{S}}$ and $\varphi$.
However, when the input is specified in terms of a NANES $\mathcal{S}$ and a specification $\varphi$, generally, the size (of the relevant part) of the model $M_{\mathcal{S}}$ grows exponentially in the size of the input.
In the next section we study the decidability and complexity of the verification problem here introduced.

The verification problem
In this section we study the verification problem for a NANES against CTL and $\mathrm{bCTL}_{\mathbb{R}^{<}}$ specifications. First, we show that verifying against CTL formulae is undecidable, even for deterministic environments and simple reachability properties. In the rest of the section, we focus on bounded CTL, where we develop a decision procedure for the verification problem based on producing a single MILP and checking its feasibility. Then we devise a parallelisable version of the procedure that produces multiple MILPs and that can be particularly efficient at finding counter-examples for bounded safety properties. Following this, we analyse the computational complexity of the verification problem against $\mathrm{bCTL}_{\mathbb{R}^{<}}$ formulae.

Unbounded CTL
In this subsection we show undecidability of the verification problem for deterministic NANES against simple reachability properties, where a deterministic NANES is a tuple $(Ag = (Act, prot), E = (S, t_E), I)$ with $|t_E(s, a)| = 1$ for all s ∈ S and a ∈ Act. The undecidability result for arbitrary NANES and full CTL follows.

Theorem 1 Verifying deterministic NANES against formulae of the form $AF\,\alpha$ is undecidable.
Proof We show the result by reduction from the Halting problem, which is known to be undecidable. Let $M = \langle Q, \Gamma, \Gamma_0, \delta, q_0, q_a \rangle$ be a Turing machine, where
• $Q$ is a finite set of states,
• $\Gamma = \{0, 1, 2\}$ is a finite tape alphabet, where we treat 0 as the blank symbol,
• $\Gamma_0 \subseteq \Gamma$ is the set of input symbols,
• $q_0, q_a \in Q$ are the initial and accepting states, respectively, and
• $\delta : (Q \setminus \{q_a\}) \times \Gamma \to Q \times \Gamma \times \{L, R\}$ is the transition function such that given a state $q \in Q \setminus \{q_a\}$ and a tape symbol $\gamma \in \Gamma$, $\delta(q, \gamma) = (q', \gamma', m)$ for $q' \in Q$, $\gamma' \in \Gamma$ and $m \in \{L, R\}$ with the following meaning: if the Turing machine is in the state q and currently reads symbol $\gamma$ (i.e., the content of the cell where the head currently is, is $\gamma$), then write $\gamma'$ in the current cell, move the head left if m = L and right if m = R, and change the state to $q'$.
We can assume that the tape of M is open-ended only on the right-hand side, that is, the head never goes to the left of the first cell (cell number 1). The Halting problem defined as "given an input string $w_0 \in \Gamma_0^h$, decide whether M halts on $w_0 = a_1 \dots a_h$, that is, whether M eventually enters $q_a$" for such Turing machines is known to be undecidable.
We construct a NANES $\mathcal{S} = ((Act, prot), (S, t_E), I)$ and an unbounded temporal formula $\varphi$ such that $\mathcal{S} \models \varphi$ iff M halts on $w_0 = a_1 \dots a_h$.
• each state of $\mathcal{S}$ encodes the current configuration of the Turing machine, that is, the current state, the content of the tape and the position of the head on the tape. We account for the position of the head implicitly by storing the content of the tape to the left and to the right of the head. Therefore, the state space S consists of tuples $g = (q, \tau_l, \gamma, \tau_r)$ where
  - $q \in Q$,
  - $\tau_l$ represents the left part of the tape and is a real number from [0, 0.3) whose i-th digit after the dot stores the content of the i-th cell to the left of the head, and hence is one of 0, 1 or 2,
  - $\gamma \in \Gamma$ is the symbol under the head,
  - $\tau_r$ represents the right part of the tape and is a real number from [0, 0.3) whose i-th digit after the dot stores the content of the i-th cell to the right of the head, and hence is one of 0, 1 or 2.
  The left $\tau_l$ and right $\tau_r$ parts of the tape can be seen as stacks with the symbols in the top part being closer to the head, and those in the lower part further from the head. In what follows, we refer to them as left stack and right stack, respectively. Moving the head to the right then corresponds to popping the top symbol from the right stack and pushing onto the left stack, and the other way around.
• the set of initial states is $I = \{(q_0, 0, a_1, 0.a_2 \dots a_h)\}$, the single state encoding the initial configuration of M on $w_0$ with the head on the first cell.
• the agent's network N computes the constant (hence, linear) function $prot(q, \tau_l, \gamma, \tau_r) = a$.
• the transition function is deterministic and follows closely the transition function of M. The main technicality here is to implement $\delta$ by suitably updating $\tau_l$, $\gamma$ and $\tau_r$. We set $t_E((q, \tau_l, \gamma, \tau_r), a) = \{(q', \tau_l', \gamma'', \tau_r')\}$, where $\gamma_l = \lfloor \tau_l \cdot 10 \rfloor$ and $\gamma_r = \lfloor \tau_r \cdot 10 \rfloor$ are the top symbols of $\tau_l$ and $\tau_r$, respectively, and
  - there exist $\gamma_n$ and m such that $\delta(q, \gamma) = (q', \gamma_n, m)$,
  - if m = R, then $\tau_l' = (\tau_l + \gamma_n)/10$, $\gamma'' = \gamma_r$ and $\tau_r' = \tau_r \cdot 10 - \gamma_r$, that is, we push $\gamma_n$ onto the left stack and pop the top symbol from the right stack,
  - if m = L, then $\tau_l' = \tau_l \cdot 10 - \gamma_l$, $\gamma'' = \gamma_l$ and $\tau_r' = (\tau_r + \gamma_n)/10$, that is, we pop the top symbol from the left stack and push $\gamma_n$ onto the right stack.
• $\varphi$ is the reachability specification $EF((1) > |Q| - 1)$, where we assume that all states in Q are numbered from 1 to |Q|, and $q_a$ is the state number |Q|.
It is straightforward to see that $\mathcal{S}$ and $\varphi$ are as required.
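The stack view of the tape can be checked with exact rational arithmetic. The sketch below assumes the decimal encoding described above (the i-th digit after the dot stores the i-th cell from the head) and implements push and pop as division by 10 and as removal of the leading digit; the function names are introduced for this example.

```python
import math
from fractions import Fraction


def top(stack):
    """Top symbol: the first digit after the dot, floor(10 * stack)."""
    return math.floor(stack * 10)


def pop(stack):
    """Remove the top symbol by shifting the digits one place left."""
    return stack * 10 - top(stack)


def push(stack, symbol):
    """Add a symbol in {0, 1, 2} as the new first digit."""
    return (stack + symbol) / 10


def encode(symbols):
    """Encode a list of tape symbols; symbols[0] is closest to the head."""
    stack = Fraction(0)
    for symbol in reversed(symbols):
        stack = push(stack, symbol)
    return stack


def move_right(left, right, written):
    """Write `written` under the head, then move the head one cell right."""
    return push(left, written), top(right), pop(right)


def move_left(left, right, written):
    """Write `written` under the head, then move the head one cell left."""
    return pop(left), top(left), push(right, written)


# Head on the first cell, cells 2 and 3 hold 1 and 2; write 2, move right.
print(move_right(Fraction(0), encode([1, 2]), 2))
```

Exact fractions are used because binary floating point cannot represent decimals such as 0.12 exactly, whereas the reduction itself works over the reals.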
It can also be seen that $t_E$ is a linearly definable function. We only show an implementation of the function computing the integer part ⌊x⌋ of a non-negative real number x ∈ [0, 3) via a ReLU-FFNN $N_{\lfloor\cdot\rfloor}$. Then $t_E$ can be computed by combining it with appropriate linear expressions and conditional statements.
(Figure: the network $N_{\lfloor\cdot\rfloor}$.) In the depiction of the network $N_{\lfloor\cdot\rfloor}$, each hidden neuron is split into two nodes, one resulting from the linear transformation of the previous layer and one labelled with the result of the ReLU activation; the weights are drawn on the edges and the biases below the nodes. To see that this network computes what is intended, observe the intermediate values $y_i$ and $z_i$.
Conversely, we can construct a NANES $\mathcal{S}' = ((Act', prot'), (S', t'_E), I')$ where $t'_E$ is a linear function and $prot'$ is a PWL function such that $\mathcal{S}' \models \varphi$ iff M halts on $w_0$. Namely,
• $t'_E(s, a) = a$, and
• $prot'(s) = t_E(s, \_)$.

◻
We observe that the above result holds even for strongly restricted NANES where either the protocol or the transition function is linear (but not both at the same time). Intuitively, since every piecewise-linear function can be exactly represented by a ReLU-activated neural network, NANES (even with a deterministic environment transition function) are able to simulate recurrent neural networks, which are known to be Turing-complete [49].
As a corollary, we obtain undecidability of the verification problem against full CTL.

Bounded CTL
We now proceed to investigate the verification problem for the bounded CTL specification language. We start by showing an auxiliary result that allows us to assume without loss of generality that the cardinality of $t_E(s, a)$ is the same for each state s and action a.
Lemma 1 Given a NANES $\mathcal{S} = ((Act, prot), (S, t_E), I)$ and a specification $\varphi \in \mathrm{bCTL}_{\mathbb{R}^{<}}$, there exist a NANES $\mathcal{S}'$ with set of environment states $S'$ and transition function $t'_E$ such that $|t'_E(s_1, a_1)| = |t'_E(s_2, a_2)|$ for all $s_1, s_2 \in S'$ and $a_1, a_2 \in Act$, and a specification $\varphi' \in \mathrm{bCTL}_{\mathbb{R}^{<}}$ such that $\mathcal{S} \models \varphi$ iff $\mathcal{S}' \models \varphi'$.
Proof Now, suppose that $S = \mathbb{R}^m$. We set the specification $\varphi'$ to be the formula in $\mathrm{bCTL}_{\mathbb{R}^{<}}$ obtained from $\varphi$ by replacing each atomic proposition $\alpha$ with $\alpha \land ((m + 1) > 0.9)$. It is straightforward to see that $\mathcal{S} \models \varphi$ iff $\mathcal{S}' \models \varphi'$, and hence, $\varphi'$ is as required. ◻
In the rest of this section we assume that $|t_E(s, a)| = b$ for all s and a, and that $t_E$ is given as b piecewise-linear (PWL) functions $t_i : \mathbb{R}^{m+n} \to \mathbb{R}^m$. To see that the latter is possible, note that, assuming an ordering (e.g., lexicographical) on the elements of $\mathbb{R}^m$, we can define $t_i$ to return the i-th element of the result by $t_E$, which is clearly a PWL function. Also note that this assumption is used when devising the verification procedure presented below.
The procedures that we put forward recast the verification problem to the MILP feasibility problem (see Sect. 3 for preliminaries on MILP). Given a MILP program , we use vars( ) to denote the set of variables in . Denote by the assignment function ∶ vars( ) → ℝ , which defines the specific (binary, integer or real) value assigned to a MILP program variable. We write ⊧ if satisfies , i.e., if ( ) ∈ {0, 1} for each binary variable , ( ) ∈ ℕ for each integer variable , and all constraints in are satisfied. Hereafter, we denote by boldface font tuples of MILP variables (of length m for S ⊆ ℝ m the set of environment states) representing an environment state and call them state variables.
As a stepping stone in our procedures, we encode the computation of a successor environment state as a composition of the protocol function prot and of the transition functions t i . By assumption, prot and each t i are PWL functions, and so the predicate y = t i (x, prot(x)) is expressible in MILP.

Lemma 2 For each i, the predicate y = t i (x, prot(x)) can be encoded as a set C i (x, y) of MILP constraints.

Proof The result follows from the fact that t i (x, prot(x)) is a PWL function and that every PWL function can be encoded into a set of MILP constraints. Details can be found in [3]. ◻
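To make the claim that PWL functions admit MILP encodings concrete, the following is a minimal sketch of the standard Big-M constraints for a single ReLU node y = max(0, x), assuming known bounds lo < 0 < hi on x. It only checks candidate assignments against the constraints and does not call a solver; the function name `bigm_feasible` is hypothetical and is not taken from the paper or from [3].

```python
def relu(x):
    return max(0.0, x)


def bigm_feasible(x, y, delta, lo, hi, tol=1e-9):
    """Check the Big-M constraints for y = ReLU(x), assuming lo <= x <= hi and lo < 0 < hi.

    delta is the binary variable: 1 selects the active phase (y = x),
    0 selects the inactive phase (y = 0).
    """
    return (
        delta in (0, 1)
        and y >= -tol                          # y >= 0
        and y >= x - tol                       # y >= x
        and y <= x - lo * (1 - delta) + tol    # y <= x - lo * (1 - delta)
        and y <= hi * delta + tol              # y <= hi * delta
    )


# The only feasible (y, delta) pairs are those where y equals ReLU(x).
for x in (-3.0, -1.0, 0.5, 2.0, 5.0):
    phase = 1 if x > 0 else 0
    assert bigm_feasible(x, relu(x), phase, lo=-3.0, hi=5.0)
```

Composing such blocks layer by layer yields a MILP encoding of a whole ReLU network; tighter bounds lo and hi give tighter formulations.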

Monolithic encoding
First, we devise a recursive encoding of a NANES and a formula into a single MILP, referred to as the monolithic encoding, and define the corresponding monolithic verification procedure. Denote by bCTL ℝ ≤ the bounded CTL language over atomic propositions where ≷ ∈ {≤, ≥} (i.e., linear constraints with non-strict inequalities). Given a NANES S and a formula φ ∈ bCTL ℝ ≤ , we construct a MILP program π S,φ whose feasibility corresponds to the existence of a state in M S that satisfies φ. For ease of presentation, and without loss of generality, we assume that φ may contain only the temporal modalities EX 1 and AX 1 , for which we write EX and AX, respectively.
We now define the monolithic encoding π S,φ .

Definition 11
Given a NANES S and a formula φ ∈ bCTL ℝ ≤ , their monolithic MILP encoding π S,φ is defined as the MILP program π S,φ (x), where x is a tuple of fresh state variables, and π S,φ (x) is built inductively using the rules in Fig. 6.
The monolithic encoding creates one single MILP that entirely accounts for the semantics of the formula; in particular, every kind of disjunction (in φ 1 ∨ φ 2 and EX φ) is handled by appropriate MILP constraints. Intuitively, the variables x in the program π S,φ (x) refer to the states that satisfy the formula φ. In Fig. 6, the base case π S,α (x) for an atom α produces the MILP program consisting of a single linear constraint corresponding to α and using variables in x. Each inductive case depends on the state variables x but might in turn generate programs for subformulas which depend on freshly created state variables different from x (such as y, x 1 , x 2 , etc.). All other auxiliary variables employed in the encoding are also fresh, preventing undesirable interactions between unrelated branches of the program.
• Disjunction uses a binary variable δ and two sets of indicator constraints. In a feasible assignment a, when δ is 1, φ 1 is satisfied and the values of x are assigned according to x 1 , while when δ is 0, φ 2 holds and the values of x are assigned as per x 2 .
• We encode conjunction as the union of the constraints for each of the conjuncts, which all must be satisfied at the same time.
• We encode an operator EX φ by a b-ary disjunction using binary variables δ 1 , … , δ b . Each of the possible b next states is chosen by activating one of the δ i , hence ensuring that the relevant C i (x, y) is satisfied. The variables y refer to the successor state, which must satisfy φ; therefore, the subprogram for φ depends on y. Notably, only one copy of π S,φ is required.
• To satisfy AX φ, all b possible successor states should satisfy φ, and so we take the union of all C i (x, y i ) and of b copies of π S,φ , each depending on one of the successor state variables y i .
Note that the size of π S,φ may grow exponentially due to the b repetitions of π S,φ in π S,AX φ (x); for φ = AX k ψ, the size of π S,φ is O(k ⋅ b^k ⋅ |S|). The same estimate works in the general case for the temporal bound k of φ. On the other hand, when φ contains no AX operator, the size of π S,φ remains polynomial, O(k ⋅ b ⋅ |S|).
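The growth pattern described above can be illustrated by counting how many successor-constraint blocks the monolithic encoding instantiates. The sketch below assumes a hypothetical tuple representation of formulas and counts one block per successor function at each temporal operator: EX reuses a single copy of the subprogram, whereas AX needs b copies. It counts blocks only and ignores the size of each block.

```python
def successor_blocks(phi, b):
    """Count successor-constraint blocks in the monolithic encoding of phi.

    phi is a nested tuple: ("atom",), ("and", f, g), ("or", f, g),
    ("EX", f) or ("AX", f). This representation is an illustrative assumption.
    """
    op = phi[0]
    if op == "atom":
        return 0
    if op in ("and", "or"):
        return successor_blocks(phi[1], b) + successor_blocks(phi[2], b)
    if op == "EX":
        # b blocks for the b-ary disjunction, one shared copy of the subprogram
        return b + successor_blocks(phi[1], b)
    if op == "AX":
        # b blocks and b copies of the subprogram, one per successor
        return b + b * successor_blocks(phi[1], b)
    raise ValueError(f"unknown operator: {op}")


def nest(op, k):
    """Build the formula op^k applied to an atom."""
    phi = ("atom",)
    for _ in range(k):
        phi = (op, phi)
    return phi


print(successor_blocks(nest("AX", 3), b=3))  # 3 + 9 + 27 = 39
print(successor_blocks(nest("EX", 3), b=3))  # 3 + 3 + 3 = 9
```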
We can prove that π S,φ is as intended.

Lemma 3 Given a NANES S, a formula φ ∈ bCTL ℝ ≤ and a state s in M S , the following are equivalent: (1) s ⊧ φ; (2) there is an assignment a to vars(π S,φ (x)) such that s = a(x) and a ⊧ π S,φ (x).

Finally, we are ready to devise a procedure that solves the verification problem by checking feasibility of the monolithic MILP encoding π S,(¬φ)∧φ I for the negation of the property to be verified together with a restriction to the initial states of S. The idea here is to look for a proof that the property is not satisfied by a state s ∈ I. A feasible solution to π S,(¬φ)∧φ I then provides such a proof in the form of a counter-example. Conversely, infeasibility implies that no counter-example could be found, and so the property is satisfied.
The procedure is given by Algorithm 1. Recall that strict inequalities are not supported by the MILP solver. Note that by our assumption the set I of initial states is closed and expressible as a set of linear constraints. Therefore, we can represent I by a Boolean formula φ I from bCTL ℝ ≤ (i.e., a formula without temporal operators). For instance, the hyperrectangle [l 1 , u 1 ] × ⋯ × [l m , u m ] is represented by the formula ((1) ≥ l 1 ) ∧ ((1) ≤ u 1 ) ∧ ⋯ ∧ ((m) ≥ l m ) ∧ ((m) ≤ u m ). Further note that we only pass ¬φ (the negation of the specification) to the encoding in negation normal form (NNF). In this process, negation is eliminated by pushing it down and through the atoms, resulting in all strict inequalities of atoms of the original specification being converted to non-strict inequalities. Therefore, φ′ = (¬φ) ∧ φ I is a formula from bCTL ℝ ≤ , so π S,φ′ is well-defined and can be processed by a MILP solver.
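The NNF step can be sketched as a short recursive rewrite. The tuple representation of formulas below is a hypothetical choice made for illustration: negation swaps ∧ with ∨ and EX with AX, and at an atom it flips the comparison, so that a negated strict inequality becomes a non-strict one.

```python
# Negating a comparison: not (t < c) is (t >= c), and so on.
FLIP = {"<": ">=", ">": "<=", "<=": ">", ">=": "<"}
# Duals used when a negation is pushed through an operator.
DUAL = {"and": "or", "or": "and", "EX": "AX", "AX": "EX"}


def nnf(phi, negate=False):
    """Push negations down to the atoms and absorb them into the comparison.

    Formulas are nested tuples (an illustrative assumption):
    ("atom", term, rel, const), ("not", f), ("and", f, g), ("or", f, g),
    ("EX", f), ("AX", f).
    """
    op = phi[0]
    if op == "not":
        return nnf(phi[1], not negate)
    if op == "atom":
        _, term, rel, const = phi
        return ("atom", term, FLIP[rel] if negate else rel, const)
    if op in ("and", "or"):
        new_op = DUAL[op] if negate else op
        return (new_op, nnf(phi[1], negate), nnf(phi[2], negate))
    if op in ("EX", "AX"):
        new_op = DUAL[op] if negate else op
        return (new_op, nnf(phi[1], negate))
    raise ValueError(f"unknown operator: {op}")


# Negating "on all successors, component (1) > 100" gives
# "on some successor, component (1) <= 100".
spec = ("AX", ("atom", "(1)", ">", 100))
print(nnf(("not", spec)))
```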
Soundness and completeness of the verification procedure rely on Lemma 3 and are shown by the following.

Theorem 2 Given a NANES S and a formula φ ∈ bCTL ℝ < , Algorithm 1 returns False iff S ̸⊧ φ.
Proof Suppose that Algorithm 1 returns False. It follows that π S,¬φ∧φ I (x) is feasible, and there exists an assignment a to vars(π S,¬φ∧φ I (x)) such that a ⊧ π S,¬φ∧φ I (x). Take s = a(x). Since φ I is a Boolean formula (without temporal operators), it is straightforward to see that s ∈ I. Therefore s is a state in M S , and by Lemma 3 we have that s ⊧ φ I and s ⊧ ¬φ. It follows that s ̸⊧ φ, and consequently, S ̸⊧ φ. Conversely, if there exists s ∈ I such that s ̸⊧ φ, we obtain that there is an assignment a satisfying π S,¬φ∧φ I (x), and therefore Algorithm 1 returns False. ◻

Compositional encoding
Observe that, due to its handling of disjunctions, the previously introduced monolithic encoding π S,φ might result in excessively large programs whose feasibility is computationally expensive to check. We now propose a different encoding that, instead of delegating disjunction to the MILP solver (the φ 1 ∨ φ 2 and EX φ cases), creates a separate program for each disjunct, whose feasibility results can be combined to solve the verification problem. More specifically, for a formula φ, we define a set Π S,φ of MILP programs with the property that there exists a state s in M S such that s ⊧ φ iff at least one of the programs in Π S,φ is feasible. Below, given a set C of linear constraints, we write [C] to denote the respective MILP program.

Definition 12 Given a NANES S and a formula φ ∈ bCTL ℝ ≤ , their compositional MILP encoding Π S,φ is defined as the set of MILP programs Π S,φ (x), where x is a tuple of fresh state variables, and Π S,φ (x) is built inductively using the rules in Fig. 7.
Following the monolithic encoding in Fig. 6, in Fig. 7 C α (x) is the linear constraint corresponding to the atomic proposition α defined over x. We use the same convention regarding the state and auxiliary variables of subprograms. In Π S,φ every program represents one of the encodings of φ.
• For disjunction we take the union of the two sets of encodings.
The set Π S,φ grows exponentially with the temporal depth of φ; however, each program in the set can be smaller than the monolithic MILP π S,φ . Similarly to Lemma 3, we can prove that Π S,φ is as intended.
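As an illustration of how the set of programs can be enumerated, the sketch below generates each program as a tuple of symbolic constraint labels rather than real MILP constraints. Only the union rule for disjunction is stated in the text; treating EX as a union over the b successor functions, and conjunction and AX as Cartesian products of sub-encodings, is an assumption made here for illustration, as are the tuple representation of formulas and the label format.

```python
from itertools import product


def compositional(phi, b, path="x"):
    """Lazily yield the programs for phi as tuples of symbolic constraint labels.

    path names the state variables the program is rooted at; "x.2" denotes
    the successor of x obtained through the second transition function.
    """
    op = phi[0]
    if op == "atom":
        yield (f"{phi[1]}@{path}",)
    elif op == "or":
        # union of the two sets of encodings
        yield from compositional(phi[1], b, path)
        yield from compositional(phi[2], b, path)
    elif op == "and":
        # assumed: every pairing of a program for each conjunct
        for p, q in product(compositional(phi[1], b, path),
                            compositional(phi[2], b, path)):
            yield p + q
    elif op == "EX":
        # one family of programs per choice of successor
        for i in range(1, b + 1):
            succ = f"{path}.{i}"
            for p in compositional(phi[1], b, succ):
                yield (f"C{i}({path},{succ})",) + p
    elif op == "AX":
        # assumed: all successors constrained at once
        branches = [
            [(f"C{i}({path},{path}.{i})",) + p
             for p in compositional(phi[1], b, f"{path}.{i}")]
            for i in range(1, b + 1)
        ]
        for combo in product(*branches):
            yield sum(combo, ())
    else:
        raise ValueError(f"unknown operator: {op}")


programs = list(compositional(("EX", ("EX", ("atom", "p"))), b=3))
print(len(programs))   # 9 programs for two nested EX operators with b = 3
print(programs[0])
```

Because the function is a generator, programs are produced one at a time in depth-first order, so a consumer can start checking feasibility before the whole set has been built.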

Lemma 4
Given a NANES S, a formula φ ∈ bCTL ℝ ≤ and a state s in M S , the following are equivalent: (1) s ⊧ φ; (2) there is a MILP π(x) ∈ Π S,φ (x) and an assignment a to vars(π(x)) such that s = a(x) and a ⊧ π(x).
Proof Let s be a state in M S . We prove the statement by induction on the structure of φ. Suppose that the thesis holds for φ 1 and φ 2 . Consider the following cases:
• φ = φ 1 ∨ φ 2 . Then s ⊧ φ 1 ∨ φ 2 ⇔ there exist π(x) ∈ Π S,φ 1 (x) and an assignment a 1 such that a 1 ⊧ π(x) and s = a 1 (x), or there exist π(x) ∈ Π S,φ 2 (x) and an assignment a 2 such that a 2 ⊧ π(x) and s = a 2 (x) ⇔ there exist π(x) ∈ Π S,φ 1 (x) ∪ Π S,φ 2 (x) and an assignment a such that a ⊧ π(x) and a(x) = s ⇔ there exist π(x) ∈ Π S,φ 1 ∨φ 2 (x) and an assignment a such that a ⊧ π(x) and a(x) = s.
• φ = EX φ 1 . Then s ⊧ EX φ 1 ⇔ there exists a successor s′ of s in M S such that s′ ⊧ φ 1 ⇔ for some i there exists an assignment a i such that a i ⊧ C i (x, y), a i (x) = s and a i (y) = s′, and there exist π(y) ∈ Π S,φ 1 (y) and an assignment a′ such that a′ ⊧ π(y) and a′(y) = s′ ⇔ there exist π(x) ∈ Π S,EX φ 1 (x) and an assignment a such that a ⊧ π(x) and a(x) = s.
• φ = AX φ 1 . Then s ⊧ AX φ 1 ⇔ for every successor s i of s in M S it holds that s i ⊧ φ 1 ⇔ for each i = 1, … , b, there exists an assignment a i such that a i ⊧ C i (x, y i ), a i (x) = s, a i (y i ) = s i , and there exist π i (y i ) ∈ Π S,φ 1 (y i ) and an assignment a′ i such that a′ i ⊧ π i (y i ) and a′ i (y i ) = s i ⇔ there exist π(x) ∈ Π S,AX φ 1 (x) and an assignment a such that a ⊧ π(x) and a(x) = s. ◻

We now devise a compositional verification procedure that searches for a feasible MILP in the set of MILPs generated by the compositional encoding. Similarly to the monolithic procedure, we pass the negation of the property to be verified together with a restriction to the initial states of S to the compositional encoding. If all problems in Π S,(¬φ)∧φ I are infeasible, no initial state s ∈ I and no possible evolution of the system from s could be found that would make ¬φ true; therefore, the property is satisfied and the procedure returns True. However, if at least one of the programs has a solution, we can extract from it a counter-example for the specification in question, so the procedure returns False. The procedure is presented in Algorithm 2.
Since the feasibility checks of the generated programs can be executed independently of each other, they naturally lend themselves to parallelisation. The compositional procedure can thus be particularly efficient at finding bugs that can be reached within a few steps along some of the paths from the initial states. This has parallels with bounded model checking [12] since, once a bug has been detected, the whole procedure can be terminated. As we will see in the next section, this is particularly useful when verifying bounded safety.

Computational complexity of the verification problem
In this section we study the complexity of the verification problem for bCTL ℝ < . The upper bound follows from the monolithic verification procedure and the lower bound can be obtained by reduction from the validity problem of QBF.

Theorem 3 Verifying NANES against bCTL ℝ < is in coNExpTime and PSpace-hard in combined complexity.
Proof The coNExpTime upper bound follows from the fact that the MILP program in Algorithm 1 takes exponential time to construct and is of exponential size, and that infeasibility of MILP is a coNP-complete problem.
The PSpace lower bound can be shown by adapting the lower bound proof in [34]. The idea is to represent the full binary tree of variable assignments in the temporal model of NANES and then to use a bCTL ℝ < -specification to check validity of the QBF.
Let Φ be a QBF of the form Q 1 x 1 … Q m x m ψ, where ψ is in 3CNF. We now construct a NANES S = ((Act, prot), (S, t E ), I) and a bCTL ℝ < -specification φ such that S ⊧ φ iff Φ is valid.
Let n be the number of clauses in ψ. Each state is a tuple (a 1 , … , a m+1 ), where a i , i ≤ m, encodes an assignment to the variable x i , with −1 meaning that the value is not yet set, and 0 and 1 being False and True, respectively; a m+1 holds the number of satisfied clauses in ψ under the given assignment, or −1 if the assignment has not been set yet.
• The neural agent performs two different tasks depending on the input. Let (a 1 , … , a m+1 ) ∈ S. If at least one of a 1 , … , a m is −1, the network returns the vector (0, … , 1, … , 0) with 1 at position i, where i is the minimal index with a i = −1. Otherwise, the network computes the value of ψ for the assignment given by a 1 , … , a m . The network N is defined in detail below.
• I consists of one initial state (−1, … , −1), and
• the specification property Q is defined as 1
One can show that the agent's protocol function and the environment transition function are PWL. It is straightforward to see that S and φ are as required. Observe that PSpace-hardness holds already for a single initial state, i.e., when I is a singleton set.
We show how to construct the agent's neural network N. We do so by defining a number of gadgets that will constitute N.
• The undefined gadget, given a node a i ∈ {−1, 0, 1}, outputs a node b i that is 1 if a i = −1 (i.e., the value of variable x i is not set), and 0 otherwise.
• The i-th first undefined gadget, given nodes b 1 , … , b m ∈ {0, 1}, outputs a node c i that is 1 if b i = 1 and b j = 0 for all 1 ≤ j < i, and 0 otherwise.
• The SAT gadget evaluates ψ for the assignment provided by the values of a 1 , … , a m , and outputs a node y (see [34] for details). If the assignment is valid (i.e., all a i are from {0, 1}), the value of y is n iff it is a satisfying assignment.
• We are going to output y, but only if all a i , 1 ≤ i ≤ m, have been assigned a value, and hence all c i are 0. Otherwise, the value of y will be discarded with the help of the following evaluation result gadget.
To conclude with N, the output nodes of N are c 1 , … , c m , c . We can schematically depict N as follows: [Figure: schematic depiction of N] ◻

We also show that the complexity of the verification problem reduces to coNP for the bounded safety fragment of bCTL ℝ < .

Corollary 2 Verifying NANES against bounded safety properties is coNP-complete in combined complexity.
Proof The upper bound follows from the fact that we can check whether a property φ = AG k safe is not satisfied by S by guessing an initial state s and a path ρ of length k originating from s, and by verifying that ρ(i) ̸⊧ safe for some i = 1, … , k. If such an initial state s exists, then there exists an initial state s′ with the same properties of polynomial size. This follows from the encoding into MILP and the fact that if a MILP instance is feasible, there is a solution of polynomial size.
The lower bound can be adapted from the NP lower bound of the satisfiability problem of neural network properties [34], and holds already for one-step formulae. ◻

Implementation and experiments
We have implemented the verification procedures described in the previous section in an open source toolkit called VENmaS [54]. The tool takes as input a bCTL ℝ < specification φ and a NANES S. The top-level call to the tool returns True if φ is satisfied by S, and returns False if φ is not satisfied at some initial state of S. In the latter case, a trace in the form of state-action pairs is produced, giving an example run of the system which failed to satisfy the specification.
The user provides a parameter to determine whether the monolithic or compositional procedure with parallel or sequential execution is to be used. For monolithic verification VENmaS follows Algorithm 1. For compositional verification VENmaS performs the computation in line 7 of Algorithm 2 in a splitting process that adds subprograms to a jobs queue. The computation in line 9 is performed asynchronously across a specified number of worker processes (executing on the same machine) retrieving tasks from the jobs queue: one worker process under sequential execution and multiple worker processes under parallel execution (the default in the implementation is 8). The main process finishes either when a MILP query terminates with a feasible solution (i.e., a counter-example), or all the jobs return infeasible results, or no result is returned within a given time limit.
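The control flow of the worker pool can be sketched with the Python standard library. This is not the tool's implementation: it uses threads instead of worker processes, takes the feasibility check as a caller-supplied function (a stand-in for a MILP solver call), and omits the time limit. The function name `verify_compositional` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def verify_compositional(jobs, is_feasible, workers=8):
    """Return (False, job) for the first feasible job found, else (True, None).

    A feasible job corresponds to a counter-example, so the property fails;
    if every job is infeasible, the property holds.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(is_feasible, job): job for job in jobs}
        for future in as_completed(futures):
            if future.result():
                # Stop early: cancel the jobs that have not started yet.
                for other in futures:
                    other.cancel()
                return False, futures[future]
    return True, None


# Toy usage: job 7 plays the role of the only feasible program.
print(verify_compositional(range(10), lambda job: job == 7))
# Sequential execution corresponds to a single worker.
print(verify_compositional(range(10), lambda job: False, workers=1))
```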
In order to produce stronger MILP formulations, both in the case of monolithic and compositional encodings, we compute lower and upper bounds for all state and action variables. These are computed by propagating the bounds of the input states given by I through the networks using symbolic interval propagation. The bounds are used in the Big-M encoding of the ReLU nodes (see Sect. 3.2). For other kinds of constraints (e.g., those expressing the transition function), the bounds are propagated using standard interval arithmetic. The bounds of the EX successor state variables are taken as the widest bounds among all possible b successors.
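For intuition, bound propagation through one affine layer followed by ReLU can be sketched with plain interval arithmetic. This is the naive variant, which is generally looser than the symbolic interval propagation mentioned above; the function names and the list-based weight representation are assumptions made for illustration.

```python
def affine_bounds(weights, bias, lo, hi):
    """Interval bounds of W x + bias for x in the box [lo, hi].

    weights is a list of rows; lo and hi are lists of per-input bounds.
    """
    out_lo, out_hi = [], []
    for row, b0 in zip(weights, bias):
        # A positive weight attains its minimum at the lower input bound,
        # a negative weight at the upper input bound (and vice versa).
        out_lo.append(b0 + sum(w * (lo[j] if w >= 0 else hi[j]) for j, w in enumerate(row)))
        out_hi.append(b0 + sum(w * (hi[j] if w >= 0 else lo[j]) for j, w in enumerate(row)))
    return out_lo, out_hi


def relu_bounds(lo, hi):
    """ReLU is monotone, so it can be applied to both bounds directly."""
    return [max(0.0, l) for l in lo], [max(0.0, u) for u in hi]


pre_lo, pre_hi = affine_bounds([[1.0, -1.0]], [0.0], lo=[0.0, 0.0], hi=[1.0, 2.0])
print(pre_lo, pre_hi)                 # [-2.0] [1.0]
print(relu_bounds(pre_lo, pre_hi))    # ([0.0], [1.0])
```

The pre-activation bounds are exactly the values a Big-M encoding of the ReLU node needs.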
We observe that the compositional encoding in Fig. 7 requires computing the whole set Π S,φ in memory before the individual jobs can become available to the worker processes. This presents several drawbacks. First, it requires exponential memory and can be impractical for large networks and large values of the temporal depth. Second, the workers are idle at the beginning of the process. Therefore, for the experiments below, we have implemented and used a version of the compositional encoding that accepts the fragment of bCTL ℝ ≤ that consists of (arbitrary) disjunctions, conjunctions over atomic propositions, and the existential path quantifier EX. The compositional encoding computes the programs in Π S,φ and adds them to the jobs queue one by one in a depth-first fashion. The worker processes can then start to solve verification queries as soon as individual jobs become available. If a counter-example is detected in one of the first jobs, the whole verification procedure can terminate without computing all subproblems. Additionally, we integrated a further optimisation whereby we may discard particular jobs by examining the bounds of the state variables when encoding an atomic proposition: if the bounds (e.g., x 1 ∈ [−116, −112]) contradict the atomic formula (e.g., (1) ≥ −100), then the whole MILP instance is trivially infeasible and no verification job is created. Finally, disjunction of atomic propositions is handled as in the monolithic case; this limits the blow-up in the number of MILPs generated whilst not hindering the efficiency of the verification procedure.
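The bound-based pruning test is a simple interval check. The sketch below handles atoms that compare a single state component with a constant, which covers the example in the text; the function name `contradicts` is hypothetical, and general linear atoms would first need interval bounds on the whole linear expression.

```python
def contradicts(lo, hi, rel, const):
    """Return True if no value in [lo, hi] can satisfy `value rel const`."""
    if rel == ">=":
        return hi < const
    if rel == "<=":
        return lo > const
    raise ValueError(f"unsupported relation: {rel}")


# Example from the text: bounds [-116, -112] contradict the atom (1) >= -100,
# so the corresponding MILP is trivially infeasible and no job is created.
print(contradicts(-116, -112, ">=", -100))   # True
print(contradicts(-116, -90, ">=", -100))    # False: the job must be solved
```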
The tool is implemented in Python and uses Gurobi ver. 9.1 [23] as a back-end to solve the generated MILP problems. To evaluate our tool, we performed experiments on a machine equipped with an Intel Core i7-7700K CPU @ 4.20GHz with 16GB of RAM, running Ubuntu 20.04, kernel version 5.8.
We now describe the experimental results obtained on two scenarios.

FrozenLake scenario
We started by validating VENmaS on the FrozenLake scenario from Example 1 against the specifications reported in Example 3. We considered the grid world as in Fig. 4: there are two holes (tiles 3 and 7), tile 1 is the initial state, and tile 9 is the goal. We trained the agent's neural network using an actor-critic approach [51]. Each training episode ends when the agent reaches the goal or falls in a hole. A reward of 1 is received if the agent reaches the goal, otherwise no reward is given. So, the agent is trained to maximise the chance of reaching the goal, including avoiding the holes. The resulting network is a ReLU-FFNN with 9 inputs, 2 hidden layers of 32 neurons each, and 4 outputs. The environment transition function was implemented using appropriate MILP constraints.
To evaluate how apt the agent was at realising safe and successful runs, we verified the FrozenLake system S against the two specifications previously defined in Example 3, stating, respectively, that all k-bounded runs are safe and that all k-bounded runs are safe and successful. To derive more efficient encodings, the specifications were equivalently reformulated to minimise the number of AX operators. The experiments showed that the agent was always able to avoid holes, hence to realise safe runs; however, the agent was shown to be unable to ensure that the goal was reached in a given number of time steps. Table 1 reports the time (in seconds) taken to resolve the two specifications for k ∈ {1, … , 10} for each of the execution modes. The results for the monolithic procedure are denoted monolithic, and the results for the compositional procedure with parallel and sequential execution are denoted comp-par and comp-seq, respectively. All cases use a fixed timeout of one hour. Here we see that the compositional procedure, regardless of the execution mode, was very efficient both at finding counter-examples to the safe-and-successful specification and at proving that the safety specification is satisfied. There was no difference between the sequential and parallel executions when verifying the safety specification because all jobs were discarded by the splitting process.
The monolithic procedure also managed to locate counter-examples quickly. As for proving safety, it was significantly slower than the compositional procedure for k ≥ 6. The main reason for this is that it produced very loose MILP formulations due to over-approximated bounds. Note that in this scenario we start from a single state, so for every program in the compositional encoding we are able to compute each next state exactly. However, in the monolithic procedure we are forced to over-approximate the bounds of the state variables when encoding EX since we do not know in advance which of the b successors they refer to. This difficulty is further exacerbated by the discrete nature of the state space in this scenario. To see this, assume that we are in state 5, the chosen direction is down, and the state has exact bounds (0, 0, 0, 0, 1, 0, 0, 0, 0). Then the possible successors are 8, 4 and 6, and the bounds become very loose: the upper bounds of the state variables become (0, 0, 0, 1, 0, 1, 0, 1, 0) and the lower bounds are all zeros. These bounds are further loosened as they are propagated through the network, resulting in MILP formulations that are hard to solve.
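The loss of precision in this example can be reproduced by taking the component-wise hull of the one-hot encodings of the three successors. The helper names below are hypothetical; tiles are numbered 1 to 9 as in the text.

```python
def one_hot(tile, n=9):
    """One-hot encoding of a tile index in 1..n."""
    return [1 if i == tile else 0 for i in range(1, n + 1)]


def hull(states):
    """Component-wise lower and upper bounds over a collection of states."""
    lower = [min(column) for column in zip(*states)]
    upper = [max(column) for column in zip(*states)]
    return lower, upper


# From tile 5 moving down, the possible successors are tiles 8, 4 and 6.
lower, upper = hull([one_hot(8), one_hot(4), one_hot(6)])
print(lower)  # [0, 0, 0, 0, 0, 0, 0, 0, 0]
print(upper)  # [0, 0, 0, 1, 0, 1, 0, 1, 0]
```

The resulting box also contains vectors that are not valid one-hot states, such as the all-zero vector, which is what makes the monolithic formulation loose.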

The aircraft collision avoidance system VerticalCAS
For the second set of experiments, we consider a scenario involving two aircraft, the ownship and the intruder, where the ownship is equipped with the collision avoidance system VerticalCAS [32]. The intruder is assumed to follow a constant horizontal trajectory. Every second VerticalCAS issues vertical climbrate advisories to the ownship pilot. This is to avoid a near mid-air collision (NMAC), a region where the ownship and intruder are separated by less than 100 ft vertically and 500 ft horizontally. There are nine possible advisories, among them COC, DND and CL1500. Figure 8 illustrates the vertical geometry of the encounter, which is given by h, the altitude of the intruder relative to the ownship, ḣ 0 , the vertical climbrate of the ownship, and τ, the time until the ownship (black) and intruder (red) are no longer horizontally separated. VerticalCAS is implemented by nine neural networks, one for each advisory, each with three inputs (h, ḣ 0 , τ), five fully-connected hidden layers of 20 units each, and nine outputs representing the score of each possible advisory.

NANES encoding and specification
We now formalise the VerticalCAS scenario as a NANES S = (Ag, E, I). We model VerticalCAS as the neural agent Ag = (Act, prot) with the set of actions Act = [9] and the protocol function producing an action corresponding to the highest-scoring advisory, formally defined as prot(s) = arg max(eval(sel(s), s)), where for a state s = (h, ḣ 0 , τ, adv) ∈ S:
• sel ∶ S → F selects the neural network corresponding to the previous advisory adv, sel(s) = f adv ;
• eval ∶ F × ℝ 4 → ℝ 9 computes the output of a neural network for a given state, eval(f, s) = f(h, ḣ 0 , τ);
• arg max ∶ ℝ 9 → [9] returns the index of the score with the highest value from a neural network's output.
Since each of the above functions and the ReLU-FFNNs are PWL, the composition prot is also PWL. We model the ownship pilot's non-deterministic behaviour in the environment of S, defined as E = (S, t E ). Thus, the environment transition function t E "chooses" an acceleration for the ownship. The acceleration chosen by the pilot depends on the issued advisory adv′ and the current vertical climbrate ḣ 0 of the ownship. We bound the number of possible successor states of t E by 3, that is b = 3, so the set of next possible accelerations is Acc(ḣ 0 , adv′) = {ḧ 0 (1) , ḧ 0 (2) , ḧ 0 (3) }, defined as follows. If the current vertical climbrate ḣ 0 of the ownship is compliant with the advisory, the pilot maintains a constant climbrate, i.e., ḧ 0 (i) = 0. Otherwise, the pilot chooses an acceleration from the continuous interval defined for each advisory. We discretise the set of possible accelerations into b equally spaced cells, so for instance, Acc(ḣ 0 , DND) = {g/4, 7g/24, g/3}. Given the current state s ∈ S, the issued advisory adv′ = prot(s), and the set of next possible accelerations Acc(ḣ 0 , adv′) = {ḧ 0 (i) ∶ i ∈ [b]}, we define the transition functions t 1 , … , t b for t E so that each t i simulates a possible choice of acceleration by the pilot and computes the next state by taking into account the state transition dynamics.
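The discretisation in the DND example can be checked with exact rational arithmetic. The sketch below assumes that the b values are equally spaced points including both endpoints of the interval, and that the DND interval is [g/4, g/3]; both are inferred from the example rather than stated explicitly. Accelerations are expressed as multiples of g, and the function name `discretise` is hypothetical.

```python
from fractions import Fraction


def discretise(lo, hi, b):
    """Return b equally spaced values from lo to hi, endpoints included (b >= 2)."""
    step = (hi - lo) / (b - 1)
    return [lo + i * step for i in range(b)]


# Accelerations as multiples of g for the DND advisory, with b = 3.
accelerations = discretise(Fraction(1, 4), Fraction(1, 3), 3)
print(accelerations)  # [Fraction(1, 4), Fraction(7, 24), Fraction(1, 3)]
```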
The set I of initial states is defined in terms of Ḣ 0 , the set of all initial climbrates. It describes a potentially risky encounter with the intruder initially below the ownship, but with the ownship descending towards the intruder.
We are interested in checking whether, by following the advisories issued by VerticalCAS and independently of the acceleration chosen by the pilot, the ownship can manage to stay outside of the unsafe region (|h| ≤ 100), entering which may potentially lead to an NMAC for small values of τ. So we consider the safety specifications φ k for various values of k. Recall that the term (1) represents the first component of the state s and so refers to h, the intruder altitude relative to ownship. Thus, the formula φ k states that in every evolution of the system starting from every initial state in I, after k time steps the absolute value of the vertical separation is greater than 100 ft.

Implementation and experiments
Our implementation of the VerticalCAS system aims to compute the tightest possible bounds for all state and intermediate variables in an effort to produce efficient MILP encodings.
We implemented the agent as a combination of custom MILP constraints and of NN MILP encodings. Note that in the worst case we need to choose between 9 networks to compute the action (next advisory), requiring 9 additional binary variables (similarly to encoding EX) and the constraints for each of the 9 networks. We keep the number of MILP constraints as low as possible by omitting the encodings of the networks f adv for advisory values that lie outside of the current bounds of the (previous) advisory variable. For instance, if these bounds identify the advisory as COC (i.e., the lower and upper bounds are both 1), then only the encoding of f 1 is included. This also enables computing tighter bounds for the new advisory. The environment is implemented via a set of custom MILP constraints. Given a new advisory by the agent, we need to choose the acceleration depending on the 9 possible values of the advisory. Again, we compute tight acceleration bounds by considering the bounds of the new advisory. In particular, when the latter are exact (i.e., the exact value of the advisory is known), the former are exact as well. Tight acceleration bounds allow for computing tight bounds for the next state, and so on.
We verified the VerticalCAS system S against the specifications φ k for various values of k. The experiments showed that for high values of the descent rate the ownship enters the unsafe region for a number of steps, eventually managing to recover. As the descent rate decreases, the time spent in the unsafe region decreases, until, for the lowest value of −19.5, the ownship remains safe for the entire period. For instance, the trace produced by VENmaS for ḣ 0 = −22.5 and k = 3 shows that the agent issues the advisory CL1500 at each time step, thereby causing the pilot to accelerate at g/4 ft/s² so as to climb to avoid collision with the intruder. The descent rate was not reduced quickly enough to avoid the unsafe state (h, ḣ 0 , τ, adv) = (−97.725, 1.65, 22, CL1500) being reached by the third time step. Table 2 reports the performance of the tool in terms of time (in seconds) taken to resolve the specification φ k for k ∈ {1, … , 10} with initial climbrates ḣ 0 ∈ Ḣ 0 for each of the execution modes. For all cases we use a fixed timeout of one hour. Here we see that the monolithic procedure is the most performant method, both for proving safety and for finding counter-examples. We attribute this to the fact that, unlike in the FrozenLake scenario, here it was possible to compute tight bounds for the state variables even after 10 time steps, resulting in tight MILP formulations whose (in)feasibility can be decided efficiently by a MILP solver. The compositional procedure was penalised as it had to construct an exponential number of programs and analyse each of them for the configurations where the property was satisfied. As before, there is no difference between sequential and parallel executions when all created subproblems have been discarded early.
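The reported trace can be reproduced with a small simulation under assumptions that are not stated in this section: constant-acceleration kinematics with a one-second step in which the relative altitude h decreases as the ownship climbs, g = 32.2 ft/s², and an initial state with h = −129 ft and τ = 25 s. These values were chosen because they yield exactly the reported state; the function name `step` is hypothetical.

```python
G = 32.2  # ft/s^2, assumed value of gravitational acceleration


def step(h, hdot, tau, acc, dt=1.0):
    """One step of the assumed dynamics.

    h is the intruder altitude relative to the ownship, so it decreases
    when the ownship climbs (hdot > 0) and increases when it descends.
    """
    h_next = h - hdot * dt - 0.5 * acc * dt ** 2
    return h_next, hdot + acc * dt, tau - dt


# Assumed initial state; the pilot accelerates at g/4 at every step (CL1500).
state = (-129.0, -22.5, 25.0)
for _ in range(3):
    state = step(*state, acc=G / 4)
    print(tuple(round(v, 3) for v in state))
# After the third step the state is approximately (-97.725, 1.65, 22.0),
# which lies inside the unsafe region |h| <= 100.
```

Under these assumptions the ownship is still just outside the unsafe region after two steps (|h| ≈ 100.1) and enters it at the third, in line with the trace described above.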
We have also identified several more challenging configurations that are likely to lie close to the boundary between the safe and unsafe initial positions, such as ḣ 0 ∈ {−39, −39.5, −40, −40.5} for k ∈ {9, 10}. Because of the uncertainty, the bounds become looser. As a result, not all problems are discarded during the compositional encoding and the workers are assigned MILP problems; in this case the parallel execution is approximately twice as fast as the sequential execution. The monolithic procedure, in turn, required more branch-and-bound iterations to find a feasible assignment or to rule out its existence. Among these initial states we also found a few cases where the compositional procedure was more efficient than the monolithic one (ḣ 0 = −39, k = 9; ḣ 0 = −39.5, k = 10; and ḣ 0 = −40.5, k = 9). Lastly, we note that we used double-precision floating point numbers for representing real values. Gurobi, the back-end MILP solver that we used, uses a default tolerance of 10^−6, representing the amount of numerical error allowed on a constraint while still considering it "satisfied". We relied on Gurobi for dealing with any further numerical issues. Finally, note that the encoding presented here is more efficient than that of [3], which does not compute variable bounds for its MILP encoding.
In terms of comparisons, we are unable to present an evaluation with other tools because, as far as we are aware, no other tool supports branching models and CTL specifications as we do here.

Conclusions
As we argued in Sect. 1, forthcoming autonomous systems will make greater use of machine learning methods; therefore there is an urgent need to develop techniques aimed at providing guarantees on the resulting behaviour of such systems. While the benefits of formal methods have long been recognised, and they have found wide adoption in safety-critical systems as well as in industrial-scale software, there have been few efforts to introduce verification techniques for systems driven by neural networks. In this paper we defined a system composed of a neural agent driven by deep feed-forward neural networks interacting with a non-deterministic environment. The resulting system displays branching evolutions. We defined and studied the resulting verification problem. While the problem is undecidable for full reachability, we isolated a fragment of the temporal language and showed that its corresponding verification problem is in coNExpTime. We developed and reported on a toolkit which includes a novel parallel algorithm to verify temporal properties of the complex environment defined in the VerticalCAS scenario. As demonstrated, while the parallel algorithm remains complete, it offers considerable advantages over its sequential counterpart when searching for counter-examples to bounded safety specifications in concrete examples.
In future work we plan to tackle scalability issues by developing alternative encodings to the ones here presented.