Greedy Neural Network Verifier

Neural Networks (NNs) have increasingly apparent safety implications commensurate with their proliferation in real-world applications: both unanticipated and adversarial misclassifications can result in fatal outcomes. As a consequence, techniques of formal verification have been recognized as crucial to the design and deployment of safe NNs. In this paper, we introduce a new approach to formally verify the most commonly considered safety specifications for ReLU NNs, i.e. polytopic specifications on the input and output of the network. Like some other approaches, ours uses a relaxed convex program to mitigate the combinatorial complexity of the problem. However, unique in our approach is the way we use a convex solver not only as a linear feasibility checker, but also as a means of penalizing the amount of relaxation allowed in solutions. In particular, we encode each ReLU by means of the usual linear constraints, and combine this with a convex objective function that penalizes the discrepancy between the output of each neuron and its relaxation. This convex function is further structured to force the largest relaxations to appear closest to the input layer; this provides the further benefit that the most "problematic" neurons are conditioned as early as possible, when conditioning layer by layer. This paradigm can be leveraged to create a verification algorithm that is not only faster in general than competing approaches, but is also able to verify considerably more safety properties; we evaluated PEREGRiNN on a standard MNIST robustness verification suite to substantiate these claims.


Introduction
Neural Networks have become an increasingly central component of modern machine learning systems, including those that are used in safety-critical cyberphysical systems such as autonomous vehicles. The rate of this adoption has exceeded the ability to reliably verify the safe and correct functioning of these components, especially when they are integrated with other components such as controllers. Thus, there is an increasing need to verify that NNs reliably produce safe outputs, especially subject to malicious adversarial inputs [16,20,27,28].
In this paper, we propose PEREGRiNN, an algorithm for efficiently and formally verifying the input/output behavior of ReLU NNs. In this context, PEREGRiNN falls into the broad category of sound and complete search and optimization NN verifiers [22]. The search aspect of PEREGRiNN involves iterating over different combinations of neuron activation patterns to verify that each is compatible with the specified safety constraints (on the input and output of the network). Like other algorithms in this category, PEREGRiNN combines this search with optimization techniques to make inferences about the feasibility of full-network activation patterns on the basis of activation patterns of only a subset of neurons. The optimization in question reformulates the original NN feasibility problem into a relaxed convex feasibility problem to allow sound inferences: i.e. if the convex relaxation is infeasible, then the original NN problem may soundly be concluded to be infeasible. In this relaxed feasibility problem, the output of each individual neuron is assigned a relaxation variable that is decoupled from the actual output of that neuron. PEREGRiNN also uses a type of reachability analysis (symbolic interval analysis) both to enhance the optimization-based inference described above and as a source of additional sound inference itself. For this reason, PEREGRiNN's search procedure searches neurons in a layer-by-layer fashion, preferring to fix the phases of neurons closest to the input layer first.
In contrast to other search and optimization algorithms, however, PEREGRiNN augments each convex feasibility query with a (convex) penalty function in order to obtain better guidance on which activation patterns to search next. In particular, we note that the amount of relaxation needed on a neuron can be regarded as a quasi-measure of how close the convex solver came to operating the associated neuron in a valid regime, i.e. at a valid evaluation of that neuron on a particular input. In this sense, the amount of relaxation in aggregate can be regarded as a quasi-measure of how close the solver came to finding a valid evaluation of the network as a whole. Inversely, the largest distance between a relaxation variable and its neuron's closest ReLU constraint intuitively corresponds in some sense to how "problematic" that neuron is with regard to obtaining such a valid evaluation. We refer to these distances as the "slacks" for each neuron. Thus, PEREGRiNN may be regarded as greedily minimizing a slack-based penalty.
Finally, we evaluated the performance of PEREGRiNN by using it to verify the adversarial robustness of networks trained on the MNIST [21] dataset. Our experiments show that PEREGRiNN is on average 1.27× faster than Neurify [31], 1.24× faster than Venus [6], 1.15× faster than nnenum [4], and 1.65× faster than Marabou [19]. It also proves 27%, 19%, 10%, and 51% more properties than the other solvers, respectively. PEREGRiNN's unique convex penalty augmentations are also considered in ablation experiments to validate their benefits.
Related Work. Since PEREGRiNN is a sound and complete verification algorithm, we restrict our comparison to other sound and complete algorithms. NN verifiers can be grouped into roughly four categories: (i) SMT-based methods, which encode the problem into a Satisfiability Modulo Theory problem [11,18,19]; (ii) MILP-based solvers, which directly encode the verification problem as a Mixed Integer Linear Program [3,5–8,14,23,29]; (iii) reachability-based methods, which perform layer-by-layer reachability analysis to compute the reachable set [4,13,15,17,30,32,34,35]; and (iv) convex relaxation methods [10,31,33]. In general, (i), (ii) and (iii) suffer from poor scalability. On the other hand, convex relaxation methods depend heavily on pruning the search space of indeterminate neuron activations; thus, they generally depend on obtaining good approximate bounds for each of the neurons in order to reduce the search space (the exact bounds are computationally intensive to compute [9]). These methods are most similar to PEREGRiNN: for example, [7,25,32] recursively refine the problem using input splitting, and [31] does so via neuron splitting. Other search and optimization methods include: Planet [11], which combines a relaxed convex optimization problem with a SAT solver to search over neurons' phases; and Marabou [19], which uses a modified simplex algorithm.

Problem Formulation
In this paper, we will consider Rectified Linear Unit (ReLU) NNs. An $n$-layer ReLU network is a composition of $n$ ReLU layer functions, i.e. $\mathcal{NN} = f_n \circ f_{n-1} \circ \cdots \circ f_1$, where the $i$-th ReLU layer function is defined as $f_i : \mathbb{R}^{k_{i-1}} \to \mathbb{R}^{k_i}$, $f_i(z) = \max\{W_i z + b_i, 0\}$, with the maximum taken element-wise. We refer to $f_1$ as the input layer. Finally, to refer to individual neurons, we use the notation $(z)_j$ to indicate the $j$-th element of $z$.
Verification Problem. Let $\mathcal{NN}$ be an $n$-layer NN as defined above. Furthermore, let $P_{y_0} \subset \mathbb{R}^{k_0}$ be a convex polytope in the input space of $\mathcal{NN}$, and let $P_{y_n} \subset \mathbb{R}^{k_n}$ be a convex polytope in the output space of $\mathcal{NN}$. Finally, let $h_\ell : \mathbb{R}^{k_0} \times \mathbb{R}^{k_n} \to \mathbb{R}$, $\ell = 1, \dots, m$, be convex functions defining joint input/output constraints on $\mathcal{NN}$. Then the verification problem is to decide whether

PEREGRiNN Overview
The general structure of PEREGRiNN is depicted in Fig. 1. Like other search and optimization based NN verifiers, it has two main components: a search component and an inference component, and PEREGRiNN iterates back and forth between these two components until termination. In particular, the search and inference components interact in the following way. The search component successively iterates over all possible on/off activations for each neuron; this is done by fixing these activations one neuron at a time, starting from the input layer and working towards the output layer. The process of fixing a neuron's activation is referred to as conditioning its phase: each neuron can be in either its active phase (operating linearly) or inactive phase (outputting zero). Thus, the search component provides the inference component a subset of neurons, each of which has been conditioned; the inference component then attempts to soundly reason about whether the remaining, unconditioned neurons can be operated in such a way as to violate the safety constraint. If the inference component soundly concludes safety for all possible activations of the remaining unconditioned neurons, then the search component backtracks, reconditioning one of the already conditioned neurons to its opposite phase. Otherwise, if a sound safe conclusion is not made, then the search component uses information from the inference component to decide on a new neuron to condition, and the process repeats. The algorithm terminates if either a counterexample to safety is found, or else all possible neuron activations are considered without finding such a counterexample.
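The interaction between the two components can be illustrated as a depth-first search over neuron phases. The following is a minimal sketch under stated assumptions, not PEREGRiNN's implementation: the names `search`, `infer`, and `pick_neuron` are hypothetical, and the whole inference component is abstracted as a callback that returns `"safe"`, `"unknown"`, or a counterexample.

```python
from typing import Callable, Dict, Hashable, List, Optional, Tuple, Union

Phase = bool  # True = active (operating linearly), False = inactive (outputting zero)
Conditioning = Dict[Hashable, Phase]
# Hypothetical inference result: "safe", "unknown", or ("cex", counterexample).
Result = Union[str, Tuple[str, object]]


def search(
    neurons: List[Hashable],
    infer: Callable[[Conditioning], Result],
    pick_neuron: Callable[[Conditioning, List[Hashable]], Hashable],
) -> Optional[object]:
    """Return a counterexample, or None if every activation pattern is safe."""
    # Each stack entry is a partial conditioning that still has to be examined.
    stack: List[Conditioning] = [{}]
    while stack:
        cond = stack.pop()
        result = infer(cond)
        if result == "safe":
            # Sound "safe" conclusion: prune this branch. Popping the next
            # entry is the backtracking step (opposite phase of a neuron).
            continue
        if isinstance(result, tuple) and result[0] == "cex":
            return result[1]
        free = [n for n in neurons if n not in cond]
        if not free:
            # Assumption: inference is decisive once every neuron is conditioned.
            raise RuntimeError("inference was inconclusive on a full conditioning")
        n = pick_neuron(cond, free)
        # The inactive branch is pushed first, so the active branch is explored first.
        stack.append({**cond, n: False})
        stack.append({**cond, n: True})
    return None
```

Any neuron-selection rule can be plugged in through `pick_neuron`; the sketch only fixes the order in which conditioning and backtracking happen.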
The convex program inference block is at the heart of the inference component and PEREGRiNN itself. In this block, PEREGRiNN, like other search and optimization solvers, uses a relaxed linear feasibility program where the output of each individual neuron is assigned a relaxation variable that is decoupled from the actual output of that neuron. In the notation of Sect. 2, such a linear feasibility program can be written as follows, where the vector variables y i , i = 0 are the relaxation variables.
Importantly, if (2) is infeasible, then the original NN problem in (1) may be soundly concluded to be infeasible as well (and hence, safe). However, as described above, the primary function of the convex feasibility program is to use a set of conditioned neurons supplied by the search component in order to soundly reason about the remaining neurons. To do this, the conditioned neurons supplied by the search component are incorporated into the feasibility program (2) as equality constraints in the following way: Inferences created by the symbolic interval inference block using Symbolic Interval Analysis [32] are also incorporated using equality constraints like (3) and (4).
Of the remaining blocks, the "Backtracking & Reconditioning" block is essentially described above. The "Condition New Neuron" and "Sampling Inference" blocks have features unique to PEREGRiNN that are described in Sect. 4; the former implements a novel neuron prioritization, and the latter is a unique approach to quickly obtaining initial safety counterexamples.

Sum-of-Slacks Penalty
The core enhancement in PEREGRiNN is the inclusion of a specific objective function in the convex program used by the inference component. As per the discussion above, this objective function is interpreted as a penalty on how far away a particular solution is from a valid input/output response of the network (and activation pattern on all hidden neurons). Specifically, this penalty function penalizes the sum of all of the "slack" variables for the entire network, where each neuron's slack variable is defined as $s_i \triangleq y_i - (W_i \cdot y_{i-1} + b_i)$. That is, a slack is the distance between a relaxation variable $y_i$ and the linear response of its associated neuron. During each feasibility/inference call, this has the obvious effect of incentivizing the convex solver to choose an actual input/output response of the network.
In addition, this penalty is effectively the $L_1$-norm of the vector of all the slack variables, since the slack variables are non-negative. The $L_1$-norm of a vector, used as a penalty function, is well known to effectively encourage sparsity in the resulting optimal solution. Thus, the sum-of-slacks penalty effectively incentivizes the convex solver to leave as few neurons as possible indeterminate in the solution. That is, a sum-of-slacks penalty effectively encourages the convex solver to fix the phases of as many neurons as possible.
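As a small illustration of the quantities involved, the sketch below computes the slack vector of one layer and the unweighted sum of slacks for a candidate solution of the relaxed program. It uses plain Python lists rather than a convex solver; the helper names `affine`, `layer_slack`, and `sum_of_slacks` are hypothetical, and the numbers are made up for the example.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def affine(W: Matrix, b: Vector, y_prev: Sequence[float]) -> Vector:
    """Linear response W y_prev + b of one layer."""
    return [sum(w * y for w, y in zip(row, y_prev)) + b_j for row, b_j in zip(W, b)]


def layer_slack(W: Matrix, b: Vector, y_prev: Sequence[float], y: Sequence[float]) -> Vector:
    """Element-wise slack s_i = y_i - (W_i y_{i-1} + b_i)."""
    return [y_j - z_j for y_j, z_j in zip(y, affine(W, b, y_prev))]


def sum_of_slacks(weights: List[Matrix], biases: List[Vector], ys: List[Vector]) -> float:
    """ys[0] is the input y_0; ys[i] is the relaxation variable of layer i."""
    total = 0.0
    for i, (W, b) in enumerate(zip(weights, biases), start=1):
        total += sum(layer_slack(W, b, ys[i - 1], ys[i]))
    return total


# One layer with two neurons; y1 is a candidate value of the relaxation variable.
W1 = [[1.0, -1.0], [0.5, 0.5]]
b1 = [0.0, -1.0]
y0 = [1.0, 2.0]
y1 = [0.0, 0.75]
slack = layer_slack(W1, b1, y0, y1)        # linear response is [-1.0, 0.5]
penalty = sum_of_slacks([W1], [b1], [y0, y1])
```

Because every slack is non-negative whenever the relaxation variable lies on or above the linear response, the plain sum computed here coincides with the $L_1$-norm of the slack vector.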

Max-Slack Conditioning Priority
As noted above, the search component of PEREGRiNN operates layer-wise from input layer to output layer in order to leverage Symbolic Interval Analysis for additional inference. Hence, the search component always chooses the next neuron to be searched (i.e. conditioned) from among those as-yet-unconditioned neurons that are closest to the input layer. It further makes sense to only consider conditioning neurons that the convex solver was unable to operate at valid inputs/output. However, the convex solver typically returns several neurons to choose from with this property, and it is necessary to choose which of them to search next. Given the interpretation of a neuron's "slack" variable as a measure of how "problematic" that neuron was for the solver to obtain a valid evaluation of the network, PEREGRiNN's search component chooses the next neuron to condition based on slack-order ranking of those neurons that are not being operated at valid input/output points. This "max-slack" heuristic choice is unique to PEREGRiNN; compare to the output gradient heuristic employed in [31].
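The selection rule described above can be sketched as follows. This is an illustrative reading of the text, not the tool's code: the function name `pick_max_slack` is hypothetical, layers are indexed from 0 in the input lists, a neuron counts as "valid" when its relaxation variable equals the ReLU of its linear response (up to a tolerance), and the sketch assumes the search moves on to the next layer when the earliest layer has no unconditioned invalid neuron.

```python
from typing import List, Optional, Set, Tuple

Neuron = Tuple[int, int]  # (layer index, neuron index), both 0-based here


def pick_max_slack(
    pre: List[List[float]],       # linear responses W_i y_{i-1} + b_i, per layer
    relaxed: List[List[float]],   # relaxation variables y_i, per layer
    conditioned: Set[Neuron],
    tol: float = 1e-9,
) -> Optional[Neuron]:
    """Pick the max-slack invalid neuron in the earliest layer that has one."""
    for i, (z_layer, y_layer) in enumerate(zip(pre, relaxed)):
        best: Optional[Neuron] = None
        best_slack = float("-inf")
        for j, (z, y) in enumerate(zip(z_layer, y_layer)):
            if (i, j) in conditioned:
                continue
            if abs(y - max(z, 0.0)) <= tol:
                continue  # y equals ReLU(z): the neuron is at a valid point
            slack = y - z
            if slack > best_slack:
                best, best_slack = (i, j), slack
        if best is not None:
            return best  # earlier layers take priority over deeper ones
    return None  # every unconditioned neuron is at a valid ReLU point
```

Note that a deeper neuron with a larger slack is ignored as long as an earlier layer still contains an unconditioned invalid neuron; this is the layer-wise restriction that the weighted penalty of the next subsection is designed to compensate for.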

Layer-wise-Weighted Penalty
PEREGRiNN takes the "max-slack" neuron search priority one step further, though. Using techniques similar to those in [26], it is possible to show that there exist weights $q_1, \dots, q_n$ such that solving (2) with the penalty
$$\min_{y_0, \dots, y_n} \; \sum_{i=0}^{n} \sum_{j=1}^{k_i} q_i \, s_{ij} \qquad (5)$$
will result in a solution that is guaranteed to concentrate the most total slack in the earliest (unconditioned) layer. Thus, by using the layer-wise weighted sum-of-slacks penalty in (5), PEREGRiNN is uniquely able to force the (unconditioned) layer closest to the input layer to have the largest total slack among all the layers. As a consequence, PEREGRiNN effectively concentrates the most "problematic" neurons in the layer where the next conditioning choice will be made. This scheme makes it much more likely that the neuron with the highest slack among all of the neurons will be among the next neurons considered for conditioning; in effect, it often guides the search component to condition on the most problematic neuron in the whole network (although this is not guaranteed).
As noted above, SMC [26] can be used to obtain layer-wise weights that guarantee concentration of slack in the earliest (shallowest) layer. However, these weights are often very large, since they depend on bounding the slack variables (most readily by over-approximation); the effect of this is possible computational instability in the convex program. Thus, as an implementation matter, we instead select these weights using a heuristic scheme characterized by two real-valued hyperparameters, $\lambda_0$ and $\gamma$. In particular, the weight of the $i$-th layer, $q_i$, is selected as $q_i = \lambda_0 \cdot \gamma^i$. In our experiments, we found the values $\lambda_0 = 10^{-7}$ and $\gamma = 10^3$ to effectively achieve the maximum slack concentration in the earliest layers.
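The heuristic weight schedule is simple enough to work through numerically. The sketch below evaluates the geometric schedule with the reported hyperparameter values and applies the resulting weights to per-layer slacks; the function names are hypothetical, and indexing the layers from 1 to n is an assumption made for the example.

```python
from typing import List


def layer_weights(n_layers: int, lambda0: float = 1e-7, gamma: float = 1e3) -> List[float]:
    """Heuristic weights q_i = lambda0 * gamma**i for layers i = 1, ..., n_layers."""
    return [lambda0 * gamma ** i for i in range(1, n_layers + 1)]


def weighted_sum_of_slacks(slacks: List[List[float]], q: List[float]) -> float:
    """Layer-wise weighted penalty: sum over layers of q_i times that layer's total slack."""
    return sum(q_i * sum(s_layer) for q_i, s_layer in zip(q, slacks))


q = layer_weights(3)  # approximately [1e-4, 1e-1, 1e2]
penalty = weighted_sum_of_slacks([[1.0, 0.25], [0.5]], q[:2])
```

Since $\gamma > 1$, the weights grow by a factor of $10^3$ per layer, so a unit of slack is far cheaper in a shallow layer than in a deep one; this is consistent with the stated goal of concentrating slack near the input.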

Initial Counterexample Search by Sampling
Finally, PEREGRiNN extends a simple idea first introduced in [32] to rapidly identify counterexamples by means of sampling. The basic idea is to sample within a known region of the input to the NN (or the input to some deeper layer), and evaluate the NN (sub-NN) exactly on those samples in order to rapidly identify a counterexample; this approach helps identify unsafe networks/properties early on. However, whereas [32] samples from within hyper-rectangle sets derived by symbolic interval analysis, PEREGRiNN uses the Volesti [12] Python library to uniformly sample points within the polytopic input constraint set, $P_{y_0}$, and thus applies to the more general input constraint sets in (1).
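The sampling idea can be sketched without any external library. The example below is an illustration only: it does not use Volesti, and instead substitutes simple rejection sampling from a user-supplied bounding box as a stand-in for a dedicated polytope sampler. All function names, the `is_unsafe` predicate, and the bounding-box argument are assumptions introduced for the example.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def relu_forward(weights: List[Matrix], biases: List[Vector], x: Sequence[float]) -> Vector:
    """Evaluate the network exactly; every layer applies max(W y + b, 0)."""
    y = list(x)
    for W, b in zip(weights, biases):
        y = [max(sum(w * v for w, v in zip(row, y)) + b_j, 0.0) for row, b_j in zip(W, b)]
    return y


def in_polytope(A: Matrix, b: Vector, x: Sequence[float], tol: float = 1e-12) -> bool:
    """Check A x <= b row by row."""
    return all(sum(a * v for a, v in zip(row, x)) <= b_j + tol for row, b_j in zip(A, b))


def sample_counterexample(
    weights: List[Matrix],
    biases: List[Vector],
    A: Matrix,
    b: Vector,
    box: List[Tuple[float, float]],  # bounding box of the polytope (assumed known)
    is_unsafe: Callable[[Vector, Vector], bool],
    n_samples: int = 1000,
    seed: int = 0,
) -> Optional[Vector]:
    """Return a sampled input that violates safety, or None if none was found."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        x = [rng.uniform(lo, hi) for lo, hi in box]
        if not in_polytope(A, b, x):
            continue  # rejection step: keep only points inside the polytope
        if is_unsafe(x, relu_forward(weights, biases, x)):
            return x
    return None  # inconclusive: failing to find a sample proves nothing
```

A `None` result carries no safety guarantee; sampling can only ever confirm that a property is violated, which is why it serves as a fast first step before the search proper.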

Experiments
We evaluated the performance and effectiveness of PEREGRiNN at verifying the adversarial robustness of NNs trained to recognize digits using the standard MNIST dataset. This verification problem fits into the general NN verification problem described in Sect. 2, and it is described subsequently in detail. In this context, we evaluated PEREGRiNN with two objectives: to assess the contribution of each of its primary enhancements by means of ablation experiments, and to compare its performance with that of other state-of-the-art NN verifiers.
Each instance of any verifier was run within its own single-core VirtualBox VM with 30 GB of memory; no more than 4 VMs were run concurrently on a host machine with 48 hyperthreaded cores and 256 GB of memory.

Adversarial Robustness Verification Task
Subsequent experiments used the testbench we describe in this section; it is largely identical to the PAT-FCN test in the VNN-COMP 2020 competition [2].
Neural Networks. We used three ReLU NNs trained to recognize digits using the standard MNIST training database; these NNs are exactly as in the PAT-FCN portion of [2]. The sizes of these fully-connected networks are described in Table 1. Each entry in the "Architecture" column of Table 1 is the number of neurons in a layer, from input layer on the left to output layer on the right.

Verification Properties.
We created a number of NN verification tasks based on proving whether the above described networks were robust against max-norm perturbations of their inputs. In particular, each verification task involves proving whether a particular input image, $x$, always results in the same classification when it is subjected to a max-norm perturbation of at most some fixed size, $\epsilon > 0$. Thus, each such verification problem is parameterized by both the specified input image, $x$, and the maximum amount of perturbation, $\epsilon$.
Formally, let $x$ be a given image in category $t \in \{1, \dots, M\}$, and let $\epsilon > 0$ be a specified maximum amount of max-norm perturbation of $x$. Then we say that a NN with $M$ classification outputs, $\mathcal{NN}$, is robust if, for each classification category $m \in \{1, \dots, M\} \setminus \{t\}$, the set of perturbed inputs within max-norm distance $\epsilon$ of $x$ that are classified as $m$ is empty. Note that each instance of (6) is compatible with the problem in (1).
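To make the compatibility with polytopic input constraints concrete, the sketch below writes the max-norm ball around an image as a set of linear inequalities of the form A z <= b, using two inequalities per pixel. The function names and the tiny two-pixel "image" are hypothetical; the sketch covers only the input-side constraint, and it does not clip the ball to a valid pixel range.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def linf_ball_polytope(x: Sequence[float], eps: float) -> Tuple[Matrix, Vector]:
    """Return (A, b) such that {z : A z <= b} = {z : max_j |z_j - x_j| <= eps}."""
    n = len(x)
    A: Matrix = []
    b: Vector = []
    for j, x_j in enumerate(x):
        upper = [0.0] * n
        upper[j] = 1.0
        A.append(upper)
        b.append(x_j + eps)       #  z_j <= x_j + eps
        lower = [0.0] * n
        lower[j] = -1.0
        A.append(lower)
        b.append(-(x_j - eps))    # -z_j <= -(x_j - eps)
    return A, b


def contains(A: Matrix, b: Vector, z: Sequence[float]) -> bool:
    """Check A z <= b row by row."""
    return all(sum(a * v for a, v in zip(row, z)) <= b_j for row, b_j in zip(A, b))


A, b = linf_ball_polytope([0.2, 0.8], 0.05)
```

For an image with k pixels this yields 2k inequalities, so the input set is an axis-aligned box, which is a special case of a convex polytope.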
Adversarial Robustness Verifier Testbench. Our verification testbench was then constructed by selecting 50 test images from the MNIST test dataset; this set of test images includes the 25 used in the PAT-FCN portion of [2]. Each test instance was then a combination of one of those images, one of the networks from Table 1

Ablation Experiments
In this series of experiments we evaluated the contribution that each of the primary PEREGRiNN enhancements made to its overall performance. This was done by comparing the full PEREGRiNN algorithm (as described in Sect. 4) with altered versions that replace exactly one of those enhancements at a time. Note: removing core features of PEREGRiNN often resulted in much longer run times, so the experiments in this section use a testbench $TB' \subset TB$ that excludes all tests with one of the larger networks FC2 or FC3 and $\epsilon = 0.05$.

Penalty Function Ablation.
Our first ablation experiment evaluated the contribution of PEREGRiNN's unique penalty function features; see Sect. 4.1 and Sect. 4.3. In particular, we ran different variants of PEREGRiNN with the following penalty functions used inside the convex program inference block:
1. "Weighted sum of slacks": PEREGRiNN's own weighted sum-of-slacks penalty;
2. "Sum of slacks": a sum-of-slacks penalty with equal weighting on all layers;
3. "Feasibility": a feasibility-only convex program such as the one used in other tools, e.g. [31] (i.e. simply using a constant penalty function of 1);
4. "Inverted weighted sum of slacks": PEREGRiNN's own weighted sum-of-slacks penalty, except with the layer-wise weights applied in reverse order to force slack towards deeper layers rather than shallower ones (see also Sect. 4.3).
Figure 2a shows a cactus plot of the number of proved cases vs. the timeout permitted to the algorithm: i.e. to prove at least a specified number of the test cases, each algorithm must have its timeout set to the value of its curve in Fig. 2a. Figure 2b shows a histogram of the number of times each of the algorithm variants needed to call the convex solver in order to terminate; this quantifies each algorithm's cost in a well-known unit of computation, which is also the single most computationally costly part of PEREGRiNN. Figure 2b groups the test cases into evenly spaced bins according to the number of convex solver calls they required.
Conclusions: Figure 2a demonstrates that PEREGRiNN's weighted sum of slacks has a clear benefit over both a uniformly weighted sum-of-slacks penalty and a plain feasibility convex program. For timeouts longer than ≈ 1.2 s, PEREGRiNN overtakes the other two in terms of number of properties proved; even the uniform sum-of-slacks penalty considerably outperforms the feasibility convex program at similar timeouts.
Note that reversing the layer-wise weights of PEREGRiNN's penalty function incurs a performance hit, especially for timeouts > 1.2 s. This suggests that driving slacks toward shallower layers, where the next neuron is conditioned, is the correct heuristic to apply. Figure 2b also shows that going from feasibility to sum-of-slacks to weighted sum-of-slacks significantly reduces the number of test cases that require between 425 and 525 calls to the convex solver.
The performance of these algorithm variants is shown in Fig. 3a and Fig. 3b. As in the previous ablation experiment, Fig. 3a shows a cactus plot of the number of proved cases vs. the timeout, and Fig. 3b shows a histogram of the number of calls to the convex solver required under each of the conditioning priorities. Figure 3a shows that PEREGRiNN's max-slack neuron priority allows it to prove slightly more properties than either a random neuron choice priority or the minimum-slack priority. The maximum slack priority also required the fewest total convex calls across all instances: it used 178 fewer than minimum slack and 686 fewer than a random choice. Thus, we conclude PEREGRiNN's max-slack heuristic slightly improves performance on this testbench.
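Both ablation experiments report their results as cactus plots. For readers unfamiliar with the format, the sketch below shows how such a curve can be derived from raw per-instance run times; the function names and the sample run times are hypothetical, and `None` is used here to mark an instance that was never proved.

```python
from typing import List, Optional, Sequence


def cactus_curve(runtimes: Sequence[Optional[float]], timeout: float) -> List[float]:
    """Sorted run times of the proved cases.

    Entry k-1 of the result is the smallest timeout that lets the
    algorithm prove k test cases.
    """
    return sorted(t for t in runtimes if t is not None and t <= timeout)


def proved_within(runtimes: Sequence[Optional[float]], timeout: float) -> int:
    """Number of test cases proved when the algorithm is given `timeout` seconds."""
    return len(cactus_curve(runtimes, timeout))


# Made-up run times in seconds for five test cases.
runtimes = [0.4, None, 3.0, 1.2, 700.0]
curve = cactus_curve(runtimes, timeout=600.0)
```

Plotting the index of each entry against its value gives the cactus plot; a curve that extends further along the "cases proved" axis for the same timeout indicates the stronger variant.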

Comparison with Other NN Verifiers
In this experiment, we evaluated PEREGRiNN with respect to a number of state-of-the-art NN verifiers on our adversarial robustness testbench, $TB$. In particular, we ran the following tools on $TB$: Venus [6]; Marabou [19]; Neurify [31]; and nnenum [4]. Venus was run with st ratio=0.4, depth power=4, offline deps = True, online deps = True, and ideal cuts = True; Marabou and Neurify were used with default parameters but THREADS = 1; and nnenum had ADVERSARIAL SEARCH turned off. Each algorithm had its own one-core VM. Figure 4 contains a cactus plot showing the results for each of these algorithms, including PEREGRiNN. For a given number of test cases to be proved, Fig. 4 depicts the corresponding timeout required for each of the algorithms to prove that many cases. Of all the algorithms, PEREGRiNN was able to prove the most properties within the timeout limit of 600 s: PEREGRiNN was able to prove 190 properties; it was followed by nnenum, which proved 172; Venus, which proved 159; Neurify, which proved 149; and Marabou, which proved 125. Marabou consistently performed the worst, proving fewer cases than any other algorithm at every timeout. By contrast, Neurify was able to prove significantly more test cases than any other algorithm for extremely short timeouts, but it failed to prove more than 150 out of 300 test cases across the whole experiment. nnenum performed worse than Neurify on the way to proving 150 test cases, but it fared significantly better than either PEREGRiNN or Venus, which had more or less similar performance below this threshold. However, after ≈150 test cases, PEREGRiNN significantly outperformed all other algorithms: as the timeout was increased, PEREGRiNN proved additional properties at a rate significantly outpacing its closest competitor in this regime, nnenum. We further note that all algorithms proved a mixture of SAT and UNSAT properties.
This data, taken as a whole, suggests that PEREGRiNN suffers from a worse "best-case" performance than several other algorithms, especially nnenum and Neurify. However, PEREGRiNN's performance seems to be much more consistent across different test cases. This allows it to prove more properties in aggregate at the expense of being slower on a smaller subset of them. This further suggests that PEREGRiNN is significantly less sensitive to peculiarities of particular test cases on the $TB$ testbench. This will likely be a considerable advantage, on average, when faced with verifying unknown networks and properties of this type.

Discussion: Analogy to SAT Solvers
It is possible to draw a loose analogy between SAT solvers and search-and-optimization NN verifiers such as PEREGRiNN. Indeed, since each neuron has two phases, the operational phase of each neuron can be captured by a binary variable; then any valuation of all these variables can be interpreted as SAT or UNSAT based on the input/output properties to be verified on the network (subject to that conditioning). Thus, the neuron conditioning step in PEREGRiNN is analogous to variable splitting in a SAT solver, and the backtrack and re-condition block (see Fig. 1) functions analogously to backtracking. In this analogy, infeasibility of the convex program and symbolic interval analysis function roughly like unit resolution in a SAT solver: they soundly reason about the overall property before all neurons have been conditioned (i.e. variables split). However, the main contribution of PEREGRiNN is a heuristic for deciding which neuron to condition next: it is thus analogous to a heuristic for choosing the next variable to split in a SAT solver. Specifically, PEREGRiNN's heuristic provides a numerical ranking of the as-yet-unconditioned neurons, and therefore has a functional similarity to variable-ranking heuristics in SAT solvers (e.g. VSIDS [24]). On the other hand, PEREGRiNN's neuron ranking comes directly from the output of the convex solver, which we argued reveals some information about the underlying verification problem; this has no direct SAT-solver analog.

Conclusion
In this paper, we introduced PEREGRiNN, a new tool for formally verifying input/output properties for ReLU NNs. PEREGRiNN compares favorably with other state-of-the-art NN verifiers, thanks to a number of unique algorithmic features. The benefits of these features were established with ablation experiments.