1 Introduction

During the last few years, various approaches have been presented that aim to provide formal guarantees on the behaviour of neural networks. The use of such verification methods may be crucial to enable the secure and certified application of neural networks for safety-critical tasks. Moreover, based on first results of [32], awareness was raised that neural networks are prone to fail on so-called adversarial examples. These are created by small perturbations of input samples, such that the changes are (almost) imperceptible to humans. However, these perturbations are often sufficient to make a neural network fail on the input sample. The existence of such adversarial examples can be ruled out by methods of neural network verification. In fact, a closely related line of research termed robustness certification is focused explicitly on this topic.

In the following section we formally introduce the problem that we regard. In Sect. 3 we provide an overview of related work, and present formulations of the verification problem as MIP in Sect. 4. In the subsequent sections we consider approximation techniques, primal heuristics, and branching methods for verification of neural networks. Extensive computational results on the performance of our solver and others can be found in Sect. 8, and Sect. 9 concludes the paper with some final remarks. Additional material can be found in the appendices. Our solver, which is based on the academic MIP solver SCIP [13], is publicly available at https://github.com/roessig/verify-nn.

For ease of notation, we write [n] for \(n \in {\mathbb {N}}\) to denote the set \(\{1, \ldots , n\}\). In our work, we only regard trained neural networks, which can be seen as immutable and deterministic functions \(F: {\mathbb {R}}^n \rightarrow {\mathbb {R}}^m\). F is determined by its weights and biases \(((A_l, b_l))_{l=1}^L\), where \(A_l \in {\mathbb {R}}^{N_l \times N_{l-1}}\) and \(b_l \in {\mathbb {R}}^{N_l}\) for \(l \in [L]\), L is the number of layers in the neural network, and \(N_0, \ldots , N_L\) are the numbers of neurons per layer (cf. Bölcskei et al. [32], Definition 1).
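In code, this layer-wise definition of F can be sketched as follows (a minimal Python sketch with made-up toy weights; as in the rest of the paper, ReLU is applied after every layer except the affine output layer):

```python
import numpy as np

def relu(v):
    # componentwise ReLU(v) = max(v, 0)
    return np.maximum(v, 0.0)

def forward(params, x):
    """Evaluate F(x) for weights and biases params = ((A_1, b_1), ..., (A_L, b_L)).
    ReLU is applied after every layer except the output layer."""
    for l, (A, b) in enumerate(params, start=1):
        x = A @ x + b
        if l < len(params):
            x = relu(x)
    return x

# toy network with layer sizes N_0 = 2, N_1 = 2, N_2 = 1 (weights are made up)
params = [(np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([0.0, -1.0])),
          (np.array([[1.0, 1.0]]), np.array([0.5]))]
y = forward(params, np.array([1.0, 1.0]))   # array([2.0])
```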

2 Problem definition

Now we give a formal definition of the verification problem for ReLU neural networks and comment on some relevant properties of this problem. In the following we use the term (solving) model to refer to an algorithmic approach for solving the verification problem. This may encompass a range of choices in obtaining an actual algorithm.

Definition 1

(Verification Problem for ReLU Neural Networks) Assume that \(\emptyset \ne X \subset {\mathbb {R}}^n\) is a polytope, and let \(\emptyset \ne Y \subset {\mathbb {R}}^m\) be such that \(Y = \bigcap _{i=1}^k Q_i\) or \(Y = \bigcup _{i=1}^k Q_i\) where \(k \in {\mathbb {N}}\) and \(Q_i \subseteq {\mathbb {R}}^m\) is a halfspace for \(i \in [k]\). Given a neural network \(F: X \rightarrow {\mathbb {R}}^m\) with ReLU activation function, the verification problem consists in deciding whether \(F(X) \subseteq Y\) holds. A triple \((X, Y, F)\) will be called an instance of the verification problem (for ReLU neural networks). Furthermore, if \(F(X) \subseteq Y\), we say that the instance is verifiable, otherwise it is refutable.
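As a small illustration with hypothetical data, membership of an output value in Y can be tested directly from the halfspace description of Definition 1 (here with open halfspaces \(Q_i = \{v \ | \ q_i^T v > b_i\}\)):

```python
import numpy as np

def in_Y(y, halfspaces, union):
    """Y is given by halfspaces Q_i = {v | q_i^T v > b_i} (open here),
    combined as a union or an intersection."""
    hits = [float(q @ y) > b for q, b in halfspaces]
    return any(hits) if union else all(hits)

Q = [(np.array([1.0, 0.0]), 0.0),   # first output coordinate positive
     (np.array([0.0, 1.0]), 0.0)]   # second output coordinate positive
y = np.array([2.0, -1.0])
in_union = in_Y(y, Q, union=True)          # True: one halfspace suffices
in_intersection = in_Y(y, Q, union=False)  # False: both would be required
```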

The construction of the feasible input polytope X and the set of admissible outputs Y is solely based on the application for which the neural network shall be used. Depending on the algorithm which is used to solve the problem, the halfspaces \(Q_i\) can either be open or closed. However, either all of them must be closed or all of them must be open. The use of floating point arithmetic by a solver for the verification problem makes this distinction rather unimportant, since numerical comparisons require the use of a certain threshold difference. Moreover, Katz et al. [17] show that the verification problem for ReLU neural networks is NP-complete. Hence we cannot expect that the problem can be solved efficiently in general. We also follow the naming convention of Katz et al. [17] and refer to verifiable instances of the verification problem as UNSAT instances, and to refutable instances as SAT instances. This naming corresponds to the existence of a counterexample as defined in the following remark.

Remark 1

If an instance \((X, Y, F)\) is refutable, i.e. \(F(X) \not \subseteq Y\), we want to provide \(x \in X\) such that \(F(x) \notin Y\). We will refer to this \(x \in X\) as a counterexample for the instance.

Remark 2

More complex properties can be investigated by splitting them into separate instances. For example, if \(Y = \big ( \bigcup _{i=1}^k Q_i \big ) \cap \big ( \bigcup _{j=1}^l P_j \big )\) for halfspaces \(Q_i\) and \(P_j\) and \(k, l \in {\mathbb {N}}\), then \(F(X) \subseteq Y\) holds if and only if \(F(X) \subseteq \big ( \bigcup _{i=1}^k Q_i \big )\) and \(F(X) \subseteq \big ( \bigcup _{j=1}^l P_j \big )\).

Remark 3

Considering an instance \({\varPi }= (X, Y, F)\) of the verification problem with \(X \subset {\mathbb {R}}^n\), we will often assume the existence of bounds \(l_i\), \(u_i\) for \(i \in [n]\) such that \(l_i \le x_i \le u_i\) for \(x \in X\). Indeed, the requirements of Definition 1 justify this assumption. These bounds can be computed by solving one LP per bound. We set

$$\begin{aligned} l_i := \min _{x \in X} x_i \qquad \text {and} \qquad u_i := \max _{x \in X} x_i \qquad \text {for} \ i \in [n]. \end{aligned}$$

In fact, for all publicly available instances of the verification problem that we are aware of, the polytope X is actually a box which is directly given by the bounds \(l_i\), \(u_i\) for \(i \in [n]\). For these instances it is thus not necessary to solve any LP in order to obtain the bounds. However, in this paper we also consider instances where X is not a box, cf. Sect. 8 and “Appendix A”.
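For a non-box polytope \(X = \{x \ | \ Gx \le h\}\), these 2n LPs can be set up, for instance, with SciPy (a sketch under the assumption that X is given in inequality form; the triangle below is a made-up example):

```python
import numpy as np
from scipy.optimize import linprog

def input_bounds(G, h, n):
    """l_i = min_{x in X} x_i and u_i = max_{x in X} x_i for X = {x | G x <= h},
    computed with one LP per bound."""
    l, u = np.empty(n), np.empty(n)
    for i in range(n):
        c = np.zeros(n)
        c[i] = 1.0
        l[i] = linprog(c, A_ub=G, b_ub=h, bounds=[(None, None)] * n).fun
        u[i] = -linprog(-c, A_ub=G, b_ub=h, bounds=[(None, None)] * n).fun
    return l, u

# triangle {x_1 >= 0, x_2 >= 0, x_1 + x_2 <= 1}
G = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
h = np.array([0.0, 0.0, 1.0])
l, u = input_bounds(G, h, 2)   # l = (0, 0), u = (1, 1)
```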

Remark 4

Assume that we are given an instance \({\varPi }= (X, Y, F)\) of the verification problem as introduced in Definition 1. Some solving models of other authors are limited to instances where the input polytope X is in fact a box. For these models, the choice of output constraints as represented by Y is also more restricted: they require that \(Y = \bigcup _{i=1}^k Q_i \subseteq {\mathbb {R}}^m\), where \(k \in {\mathbb {N}}\) and \(Q_i \subseteq {\mathbb {R}}^m\) is an open halfspace for \(i \in [k]\). Indeed, this is the only one of the cases regarded in Definition 1 in which \({\mathbb {R}}^m \backslash Y\) is a polyhedron. Yet, it is possible to use such restricted solving models to solve an instance \({\varPi }= (X, Y, F)\) where \(Y = \bigcap _{i=1}^k Q_i \subseteq {\mathbb {R}}^m\) for open halfspaces \(Q_i \subseteq {\mathbb {R}}^m\). To this end, it is necessary to split the corresponding instance into k instances \((X, Q_i, F)\). Clearly, if \(F(X) \subseteq Q_i\) for all \(i \in [k]\), then \(F(X) \subseteq Y\) holds and \({\varPi }\) is verifiable. On the other hand, if there are \(x \in X\) and \(i \in [k]\) such that \(F(x) \notin Q_i\), we know that \({\varPi }\) is refutable since \(F(X) \not \subseteq Y\). We will refer to such an instance \({\varPi }\) as a conjunction instance. In contrast, an instance \({\varPi }= (X, Y, F)\) where \(Y = \bigcup _{i=1}^k Q_i \subseteq {\mathbb {R}}^m\) for open halfspaces \(Q_i \subseteq {\mathbb {R}}^m\) will be called a disjunction instance. We also regard as disjunction instances those instances that fulfill \(Y = Q\) for some open halfspace \(Q \subseteq {\mathbb {R}}^m\). In fact, all instances that we consider in our computational experiments (see Sect. 8) are based on open halfspaces. Closed halfspaces are only mentioned in some cases to provide a comprehensive explanation.

Often, we will regard constraints of the form \(y = \,\mathrm {ReLU}(x) := \max (\{x, 0\})\) for \(x \in [l, u]\), \(y \in {\mathbb {R}}\), that refer to a certain neuron with ReLU activation function. If the bounds \(l, u \in {\mathbb {R}}\) with \(l \le u\) are such that either \(l \ge 0 \) or \(u \le 0\), we say that the corresponding neuron is fixed (in its phase).

3 Related work

The key properties of the problem as given in Definition 1 are considered likewise in the literature for neural network verification [4,5,6, 9, 10, 17, 19, 21, 22, 33, 35, 36, 40, 41]. In view of an instance \({\varPi }= (X, Y, F)\) these can be summarized as follows.

A box, a polytope or a union of polytopes is defined as the feasible input domain X for the property which shall be verified. Then, linear properties are defined that we denote in terms of a set Y, such that \({\varPi }\) is verifiable if and only if \(F(X) \subseteq Y\). Complete algorithms (except in [9]) are employed to solve this problem, i.e. if there exists \({\tilde{x}} \in X\) such that \(F({\tilde{x}}) \notin Y\), this will be reported. Clearly, the verification problem is not necessarily limited to neural networks with ReLU activations, i.e. other activation functions are sometimes considered, too. Cheng et al. [32] and Narodytska et al. [32] regard the verification problem on binarized neural networks, which we do not investigate further.

First approaches to verification of neural networks [21, 22, 26] belong to the field of satisfiability modulo theories (SMT), which generalizes the Boolean satisfiability problem by replacing variables with predicates from various theories. The solver Reluplex [17] for verification of neural networks is also presented in this context, but solves instances which are significantly more difficult, using an extended version of the well-known simplex algorithm. Ehlers [10] presents the solver Planet, which is based on LP and SAT solving. Dvijotham et al. [32] formulate the verification problem as a non-convex optimization problem and obtain an incomplete algorithm. Xiang et al. [32] regard the propagation of an input polytope through a ReLU neural network, and Xiang et al. [32] propose to discretize the input domain in order to verify neural networks. However, their work remains limited to theoretical considerations and the presentation of numerical toy examples.

Various authors [6, 11, 19, 33] consider MIP models for the verification problem. The performance of such MIP models is predominantly determined by the quality of the bounds which are computed for the ReLU neurons in the neural network. For that reason, the computation of such bounds is discussed in more detail in Sect. 5. The use of appropriate branching schemes is also important for an MIP model of the verification problem; we will provide more details on this in Sect. 7. In fact, it is not necessary to solve the verification problem as an MIP if such approximation and branching methods are used. Bunel et al. [5] present such a branch-and-bound method without solving the verification problem as an MIP directly. Moreover, they provide a good comparison of various methods for neural network verification. Besides their own approach, their empirical evaluation includes Reluplex [17], Planet [10], and an MIP model based on the suggestions of various authors [6, 19, 33]. While we also implement an MIP model to solve the verification problem, it works more similarly to the branch-and-bound method of Bunel et al. [5] than to the MIP model they use in their comparison. Besides, we consider various additional aspects, and thereby speed up the solving process significantly. For a computational comparison of other solvers with ours, we select Reluplex [17] and the branch-and-bound method [5]. The other solvers regarded by Bunel et al. [5] are not competitive with these, as their experimental results show. Moreover, we regard the solvers ReluVal and Neurify as introduced by Wang et al. [35, 36]. Both solvers also follow a branch-and-bound scheme that repeatedly computes a linear approximation of the regarded neural network. Their approximation is weaker than that of Bunel et al. [5], but much faster to compute.

Anderson et al. [32] present an ideal MIP formulation for ReLU constraints which is closely related to the techniques used in our work. In particular, they present a separation routine which can be used to compute stronger neuron bounds. Optionally, we include this separator in our solving model as mentioned in Sect. 8.2. Nevertheless, it should be noted that the results of Anderson et al. [32] refer only to single ReLU neurons and at most the preceding layer. Hence we do not have an ideal formulation of the whole network, which implies that solving the verification problem cannot be reduced to solving an LP using their formulations.

The idea of output range or reachability analysis is in principle to compute the output range F(X) of a neural network F, given an input domain X. Since this is quite difficult, the relevant work of Dutta et al. [32] and Ruan et al. [32] is limited to computing the range g(F(X)), for some function \(g: F(X) \rightarrow {\mathbb {R}}\). The function g should then give some insights into the output of the neural network F on input domain X. Clearly, this problem is closely related to the verification problem.

Several authors [12, 23, 29,30,31, 34, 37,38,39, 42, 43] consider the problem of computing robustness guarantees for neural networks which are used for classification. Robustness means that the classification of an input sample should remain the same when the input is changed by small perturbations. The computation of certified robustness bounds should rule out the existence of adversarial examples. Indeed, this problem is a special case of neural network verification. Except for Tjeng et al. [32], this problem is solved by incomplete algorithms. That means, an algorithm either returns a guarantee that a region around an input sample is free of adversarial examples, or no result at all, due to the use of approximations.

Modelling ReLU neural networks as MIPs is also considered in the literature for other application domains. Grimstad and Andersson [32] investigate the usage of ReLU neural networks as surrogate models in MIPs and study various bound tightening techniques. Serra et al. [32] apply an MIP formulation of a ReLU neural network to enable a lossless pruning method. This way, equivalent neural networks of smaller size can be obtained. The computation of linear regions in ReLU neural networks is another field of application [27].

4 Neural network verification as MIP

It is straightforward to formulate the verification problem as a mixed integer program (MIP), see [6, 8]. We present a slightly improved formulation, as it can be found in [5, 34]. In this formulation, each neuron is represented by one or two (continuous) variables. The value of a neuron before application of the ReLU function is given as a linear combination of the output values of the predecessor neurons in the network plus the bias. This connection can thus be modelled simply by a linear equation in the MIP. We need two variables for neurons with ReLU activation function. Let variable x be the input value to the ReLU function and y be the output value. In this setting we will refer to x as the ReLU input variable and to y as the ReLU output variable. We want to model \(y = \max \{0, x\}\), which is represented using one additional binary variable d. Furthermore, we require that upper and lower bounds \( l \le x \le u \) are known. Then we obtain the following constraints which are equivalent to \(y = \max \{0, x\}\):

$$\begin{aligned} y&\ge x, \quad y \ge 0 \nonumber \\ y&\le x - (1-d) l, \quad y \le d \cdot u \nonumber \\ d&\in \{0,1\} \nonumber \\ x&\in [l, u], \ l< 0 < u \end{aligned}$$
(1)

Of course it is possible that we have \(l \ge 0 \) or \(u \le 0\) for the bounds. In these cases, we can omit the binary variable d and replace (1) as follows. If \(l \ge 0\), this implies \(y = \max \{0, x\} = x\), i.e. (1) is replaced by \(y = x\) for \(x \in [l, u]\). If \(u \le 0 \), we have \(y = \max \{0, x\} = 0\) and thus we can set \(y = 0\) for \(x \in [l, u]\). These cases correspond to fixing the binary variable d to 1 or 0, respectively.
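As a quick sanity check of formulation (1) with the example bounds \(l = -4\) and \(u = 6\), one can enumerate both values of d over a small grid and confirm that the only feasible y for a given x is \(\max \{0, x\}\) (a Python sketch):

```python
def feasible(x, y, d, l=-4.0, u=6.0):
    # all constraints of formulation (1) for one ReLU neuron with l < 0 < u
    return (l <= x <= u and d in (0, 1)
            and y >= x and y >= 0
            and y <= x - (1 - d) * l and y <= d * u)

# over both choices of d, the only feasible y for a given x is max(0, x)
for x in [-4.0, -1.5, 0.0, 2.0, 6.0]:
    ys = {y for y in [i * 0.5 for i in range(-8, 13)]
          for d in (0, 1) if feasible(x, y, d)}
    assert ys == {max(0.0, x)}
```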

Let \( {\varPi }= (X, Y, F)\) be a disjunction instance of the verification problem such that it holds \(Y = \bigcup _{i=1}^k Q_i \subseteq {\mathbb {R}}^m\) for certain open halfspaces \(Q_i\), \(i \in [k]\). Then it is straightforward to formulate an MIP which is feasible if and only if \({\varPi }\) is refutable. The instance \({\varPi }\) of the verification problem is represented by the following constraints:

$$\begin{aligned} x \in X, \ y \in {\mathbb {R}}^m \backslash Y \text { and } y = F(x) \end{aligned}$$
(2)

This is an MIP, since \(x \in X\) and \(y \in {\mathbb {R}}^m \backslash Y\) can be represented by linear constraints. In particular, \(y = F(x)\) can be expressed by linear constraints combined with integrality constraints for auxiliary binary variables that are used to model the ReLU function as shown in (1). Now, if the MIP (2) is feasible, there exists \(x \in X\) such that \(F(x) = y \notin Y\). This implies \(F(X) \not \subseteq Y\) and hence \({\varPi }\) is refutable. Otherwise, if the MIP (2) is infeasible, then for all \(x \in X\) it holds \(F(x) = y \in Y\), and thus \({\varPi }\) is verifiable.

For conjunction instances where \(Y = \bigcap _{i=1}^k Q_i \subseteq {\mathbb {R}}^m\) for open halfspaces \(Q_i\), \(i \in [k]\), we consider two options. Either we split instance \({\varPi }\) into k instances as mentioned in Remark 4, or we formulate the verification problem as an optimization problem as proposed by Bunel et al. [5]. In this setting, an instance \( {\varPi }= (X, Y, F)\) is verifiable if the optimum value of the corresponding optimization problem is greater than zero and refutable if it is lower than zero.

Assume that \(Y = \bigcup _{i=1}^k Q_i \subseteq {\mathbb {R}}^m\) where \(Q_i\), \(i \in [k]\) are open halfspaces. This implies the existence of \(q_i \in {\mathbb {R}}^m\) and \(b_i \in {\mathbb {R}}\) for \(i \in [k]\) such that we have halfspaces \(Q_i = \{ x \in {\mathbb {R}}^m \ | \ q_i^Tx > b_i \}\). Then we see that

$$\begin{aligned}&y \in \bigcup _{i=1}^k Q_i\\ \Leftrightarrow \quad&\exists j \in [k]: \ y \in Q_j = \{ x \in {\mathbb {R}}^m \ | \ q_j^Tx> b_j \} \\ \Leftrightarrow \quad&\exists j \in [k]: \ q_j^Ty - b_j> 0 \\ \Leftrightarrow \quad&\max _{i \in [k]} \ \big ( q_i^Ty - b_i \big ) > 0. \end{aligned}$$

The same holds for closed halfspaces \(Q_i\), \(i \in [k]\), if all inequalities “>” are replaced by their counterparts “\(\ge \)”. Analogously, with open halfspaces \(Q_i\) as before and \(Y = \bigcap _{i=1}^k Q_i\) the same can be shown with “min” instead of “max”. For the case \(Y = \bigcup _{i=1}^k Q_i \) we consider the following MIP:

$$\begin{aligned} \text {minimize} \quad t \nonumber \\ \text {s.t.} \quad x&\in X \nonumber \\ y&= F(x) \nonumber \\ z_i&= q_i^Ty - b_i \qquad&\forall i \in [k] \nonumber \\ t&= \max \{z_1, \ldots , z_k \} \end{aligned}$$
(3)

Indeed, (3) is an MIP since the constraint \(t = \max \{z_1, \ldots , z_k \}\) can be replaced by linear constraints using k additional binary variables as shown in [5, 34]. Since t is minimized in (3), we can alternatively relax this constraint to \(t \ge z_1, \ldots , t \ge z_k \).
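The reason this relaxation suffices can be illustrated on a toy LP with fixed, arbitrary values \(z_i\) (a SciPy sketch): since t is minimized, the constraints \(t \ge z_i\) push the optimum down onto \(\max _i z_i\).

```python
from scipy.optimize import linprog

z = [0.3, -1.2, 2.5]                       # arbitrary fixed values of z_i
# minimize t  s.t.  t >= z_i for all i  (written as -t <= -z_i)
res = linprog(c=[1.0],
              A_ub=[[-1.0]] * len(z),
              b_ub=[-zi for zi in z],
              bounds=[(None, None)])
t_opt = res.fun                            # equals max(z) = 2.5
```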

Theorem 1

Instance \({\varPi }= (X, Y, F)\), where \(Y = \bigcup _{i=1}^k Q_i \) for some open halfspaces \(Q_i = \{ x \in {\mathbb {R}}^m \ | \ q_i^Tx > b_i \}\), \(q_i \in {\mathbb {R}}^m\) and \(b_i \in {\mathbb {R}}\) for \(i \in [k]\), is verifiable if and only if the optimum value of (3) is greater than zero.

Proof

Assume that \({\varPi }\) is verifiable, i.e. \(F(X) \subseteq Y = \bigcup _{i=1}^k Q_i \). Hence, for any \(x \in X\) there exists \(j \in [k]\) such that \(y := F(x) \in Q_j\), i.e. \(q_j^T y - b_j > 0\). It follows \(t \ge z_j := q_j^T y - b_j > 0\), which implies the desired result since \(x \in X\) was arbitrary. Recall that we regard optimum solutions of an MIP, so it suffices to consider finitely many \(x \in X\).

For the opposite direction, assume that the optimum value \({\hat{t}}\) of (3) fulfills \({\hat{t}} > 0\). Let \(x \in X\) be arbitrary and \(y = F(x)\). With \(z_i = q_i^T y - b_i\) for \(i \in [k]\) it holds \(\max \{ z_1, \ldots , z_k \} \ge {\hat{t}} > 0\) since \({\hat{t}}\) is optimal. In other words, there is \(j \in [k]\) such that \(q_j^T y - b_j = z_j > 0\) and thus \(y \in Q_j \subseteq Y\). Since \(x \in X\) was arbitrary, \({\varPi }\) is verifiable. \(\square \)

The same works for the case \(Y = \bigcap _{i=1}^k Q_i \) by replacing “\(\max \)” with “\(\min \)” in (3), and similarly for closed halfspaces. In practice, the optimum value \({\hat{t}}\) of (3) will usually be significantly greater than zero if an instance is indeed verifiable. Clearly, it is not necessary to actually compute \({\hat{t}}\) in order to solve the verification problem, as Bunel et al. [5] point out. If the dual bound of (3) is greater than zero, the instance is verifiable. We mainly use this formulation in our implementation. On the other hand, if the primal bound of (3) is less than zero, we know that the corresponding instance of the verification problem is refutable as \({\hat{t}} < 0 \) is already implied. However, this case has less relevance since primal solutions are usually only found by specialized heuristics, which we describe in Sect. 6.

In addition, we note that the verification problem can be modelled as a quadratic program. This formulation does not require any integer or binary variables, as the nonlinear behaviour of the ReLU activations is modelled by the quadratic objective function and an optimality condition. Let \({\varPi }= (X, Y, F)\) be a disjunction instance, i.e. \(Y = \bigcup _{i=1}^k Q_i\) for open halfspaces \(Q_i\), \(i \in [k]\) and \(k \in {\mathbb {N}}\). Let \(((A_l, b_l))_{l=1}^L\) be the weights and biases corresponding to F. Here, L is the number of layers in the neural network and \(N_0, \ldots , N_L\) are the numbers of neurons per layer. This implies \(X \subseteq {\mathbb {R}}^{N_0}\) and \(Y \subseteq {\mathbb {R}}^{N_L}\), and we can state the formulation:

$$\begin{aligned} \text {minimize} \quad&\sum _{l=1}^{L-1} x_l^T(x_l - A_l x_{l-1} - b_l) \nonumber \\ \text {s.t.} \quad x_l&\ge A_l x_{l-1} + b_l, \quad x_l \ge 0, \quad x_l \in {\mathbb {R}}^{N_l}&\forall l \in [L-1] \nonumber \\ x_L&= A_L x_{L-1} + b_L \nonumber \\ x_0&\in X, \quad x_L \in {\mathbb {R}}^{N_L} \backslash Y \end{aligned}$$
(4)

Theorem 2

Instance \({\varPi }\) is refutable if and only if the quadratic program (4) is feasible and the optimum value is zero. Otherwise \({\varPi }\) is verifiable.

Proof

We first assume that \({\varPi }\) is refutable, so that we can find \(x \in X\) with \(F(x) \notin Y\). We set \(x_0 := x\), \(x_L := F(x) \in {\mathbb {R}}^{N_L} \backslash Y\) and for \(l \in [L-1]\) we let \(x_l := \,\mathrm {ReLU}(A_l x_{l-1} + b_l)\), which implies \(x_l \ge 0\) and \( x_l \ge A_l x_{l-1} + b_l\). Furthermore, \(x_L = A_L x_{L-1} + b_L\) holds, and for each \(l \in [L-1]\) we have for each \(i \in [N_l]\) that either \([x_l]_i = 0\) or \([x_l]_i = [A_l x_{l-1} + b_l]_i\). Since \(x_l \in {\mathbb {R}}^{N_l}\), this leads to the conclusion that \( x_l^T(x_l - A_l x_{l-1} - b_l) = 0\) for all \(l \in [L-1]\). Hence, the quadratic program (4) is feasible and its optimum value is zero.

On the other hand, if (4) is feasible and the optimum value is zero, we know that there is \(x_0 \in X\) such that \(F(x_0) = x_L \notin Y\), which means that \({\varPi }\) is refutable. Indeed, it holds \(F(x_0) = x_L\): for all \(l \in [L-1]\) we have \(x_l \ge 0 \) and \( x_l \ge A_l x_{l-1} + b_l \), i.e. \( x_l^T(x_l - A_l x_{l-1} - b_l) \ge 0 \) for all \(l \in [L-1]\). Hence we know \( x_l^T(x_l - A_l x_{l-1} - b_l) = 0 \) for all \(l \in [L-1]\) as the objective value of (4) is zero, and it follows that \([x_l]_i [(x_l - A_l x_{l-1} - b_l)]_i = 0 \) for all \(i \in [N_l]\) and \(l \in [L-1]\). Consequently, \([x_l]_i = [\,\mathrm {ReLU}(A_l x_{l-1} + b_l)]_i\) for all \(i \in [N_l]\) and \(l \in [L-1]\), and thus we can conclude that \(F(x_0) = x_L\). \(\square \)
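The complementarity argument used in both directions of the proof can be checked numerically (a small sketch; the vector a stands in for a pre-activation value \(A_l x_{l-1} + b_l\)): the point \(x_l = \mathrm {ReLU}(a)\) is feasible with a vanishing objective term, while feasible points away from it make the term strictly positive.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)            # stands in for A_l x_{l-1} + b_l
x = np.maximum(a, 0.0)            # x_l = ReLU(a)

# feasibility: x >= a and x >= 0; objective term: x^T (x - a) = 0
assert np.all(x >= a) and np.all(x >= 0.0)
term = float(x @ (x - a))         # 0.0 for the ReLU point

# a feasible perturbation away from ReLU(a) makes the term positive,
# since every summand x_i (x_i - a_i) is nonnegative
x_bad = x + 0.1
term_bad = float(x_bad @ (x_bad - a))
```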

To evaluate this formulation computationally, we tried a plain implementation in SCIP [13]. Within a time limit of one hour, SCIP is not able to solve any of the disjunction instances in our SAT and UNSAT evaluation sets. Only very easy MNIST instances could be solved with this formulation.

Anderson et al. [32] present an ideal MIP formulation for ReLU constraints which can replace (1). It should be noted that the formulation is ideal for a single ReLU neuron, but not for the whole neural network. As the formulation of Anderson et al. [32] has an exponential number of constraints, they also describe a separation routine that runs in linear time. This makes it possible to strengthen formulation (1) by adding the cuts obtained by the separation routine to the LP relaxation.

Fig. 1
figure 1

Naive approximation of ReLU function in one dimension. Here we have lower bound \(-4\) and upper bound 6 for the ReLU input variable x. The feasible domain of the ReLU output variable y is given by the solid black line for the actual ReLU function and by the shaded area for the naive approximation

5 Approximations of ReLU neural networks

Solving the problem of neural network verification requires modelling constraints of the form \(y = \max \{0, x\}\) for all ReLU input variables x and corresponding ReLU output variables y of each layer. It is crucial to obtain tight bounds \(l, \ u\) on the value of x before the application of the ReLU function. In particular, we regard the linear approximation of these constraints for a whole layer at once, an idea so far considered only briefly in [2, 5, 23].

Given an instance \({\varPi }= (X, Y, F)\) of the verification problem with \(X \subset {\mathbb {R}}^n\), we will assume the availability of input bounds \(l_i\), \(u_i\) with \(i \in [n]\) for the components of X throughout this section (cf. Remark 3). All approximation methods that we present are executed layer by layer. Based on the input bounds, we compute bounds for the neurons in the following layer. This process is iterated until the last layer is reached, i.e. the output layer. Depending on the instance and the bound computation approach, it may be possible to prove that \({\varPi }\) is verifiable using only these bounds for the output layer. Assume that we have a set A which approximates the neural network output F(X), i.e. \( F(X) \subseteq A\). In case that \(A \subseteq Y\), we have thus shown that \(F(X) \subseteq Y\), which means that \({\varPi }\) is verifiable.

5.1 Basic approximation methods for bound computations in neural networks

The simplest approximation approach is naive interval arithmetic as used in [10, 35]. Figure 1 provides a visual representation of this approximation, to which we will refer as the naive approximation. This simple approach mainly suffers from the fact that it assumes the independence of all predecessor neurons when computing a new bound. Therefore, the bounds computed with this method are so weak that they only suffice to solve tiny instances.
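A minimal sketch (with made-up weights) illustrates this independence assumption: two neurons that both copy the same input \(z \in [0, 1]\) are treated as unrelated, so their difference, which is identically zero, receives the naive bounds \([-1, 1]\):

```python
import numpy as np

def affine_interval(A, b, l, u):
    """Interval bounds for A x + b given componentwise bounds l <= x <= u."""
    Ap, An = np.maximum(A, 0.0), np.minimum(A, 0.0)
    return Ap @ l + An @ u + b, Ap @ u + An @ l + b

l0, u0 = np.array([0.0]), np.array([1.0])          # input z in [0, 1]
# layer 1: two neurons, both computing z (ReLU changes nothing since z >= 0)
l1, u1 = affine_interval(np.array([[1.0], [1.0]]), np.zeros(2), l0, u0)
l1, u1 = np.maximum(l1, 0.0), np.maximum(u1, 0.0)
# layer 2: their difference, which is 0 for every input
l2, u2 = affine_interval(np.array([[1.0, -1.0]]), np.zeros(1), l1, u1)
# l2 = [-1.0], u2 = [1.0] although the true output range is {0}
```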

Wang et al. [32] use symbolic interval arithmetic to keep track of some of the neuron dependencies in order to compute better bounds. The idea is to keep a symbolic equation, based on the input values of the network, for each neuron. This symbolic approach can only provide better bounds if at least some of the ReLU activations can be fixed positively, i.e. \(l \ge 0\). Otherwise, the symbolic interval arithmetic uses the same bounds as the naive method and computes new bounds in the same way.
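A simplified sketch of such symbolic propagation, restricted to the case where all ReLU neurons are fixed positively so that the symbolic equations pass through unchanged (the network and weights are made up): on the cancellation example \(z - z\), the symbolic bounds are exact.

```python
import numpy as np

def sym_affine(A, b, lo, hi):
    """Push symbolic bounds through x -> A x + b, where lo = (Wl, wl) and
    hi = (Wu, wu) mean Wl z + wl <= values <= Wu z + wu in the network input z."""
    Ap, An = np.maximum(A, 0.0), np.minimum(A, 0.0)
    return ((Ap @ lo[0] + An @ hi[0], Ap @ lo[1] + An @ hi[1] + b),
            (Ap @ hi[0] + An @ lo[0], Ap @ hi[1] + An @ lo[1] + b))

def concretize(W, w, l_in, u_in, lower):
    Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return Wp @ l_in + Wn @ u_in + w if lower else Wp @ u_in + Wn @ l_in + w

l_in, u_in = np.array([0.0]), np.array([1.0])      # input z in [0, 1]
lo = hi = (np.eye(1), np.zeros(1))
# layer 1 duplicates z; both neurons are fixed positively, ReLU keeps the equations
lo, hi = sym_affine(np.array([[1.0], [1.0]]), np.zeros(2), lo, hi)
# layer 2 takes the difference; the symbolic coefficients cancel to zero
lo, hi = sym_affine(np.array([[1.0, -1.0]]), np.zeros(1), lo, hi)
out_l = concretize(*lo, l_in, u_in, lower=True)    # [0.0]
out_u = concretize(*hi, l_in, u_in, lower=False)   # [0.0], vs naive [-1, 1]
```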

To overcome this drawback, Wang et al. [32] improve the method by introducing a different approximation for the case \(l< 0 < u\). The main idea is to maintain the symbolic dependencies also in this case. However, the linear equation for the value of a ReLU neuron with input bounds \(l< 0 < u\) cannot be kept. Instead, symbolic equations are introduced which provide a lower and an upper bound for the neuron value. These symbolic bounds can then be propagated through the network and have the advantage that the dependency information partially remains. For the propagation of the symbolic equations, an approximation is used as visualized in Fig. 2.

Fig. 2
figure 2

Approximation of ReLU function in one dimension as proposed by [32]. Here we have lower bound \(-4\) and upper bound 6 for the ReLU input variable x

Now we consider a linear approximation of ReLU constraints which was first proposed by Ehlers [10]. In fact, we will show that it is best possible in a certain sense, which we define in the following subsection. Given \(x \in [l, u]\), where \(l< 0 < u\), and \(y = \max \{0, x\}\), it holds that (i) \(y \ge 0 \), (ii) \(y \ge x\), and (iii) \(y \le \frac{u(x - l)}{u - l}\). We graphically depict this approximation in Fig. 3; it in fact coincides with the linear relaxation of the MIP formulation (1) for ReLU constraints, see [2, 5].

Fig. 3
figure 3

Approximation of ReLU function in one dimension as proposed by Ehlers [10]. Here we have lower bound \(-4\) and upper bound 6 for the ReLU input variable x
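With the bounds of Fig. 3, i.e. \(l = -4\) and \(u = 6\), constraints (i)-(iii) can be checked directly (a quick sketch): the relaxation contains the graph of the ReLU function but cuts off points that a naive box \([0, u]\) for y would still allow.

```python
l, u = -4.0, 6.0

def in_triangle(x, y):
    # constraints (i)-(iii): y >= 0, y >= x, y <= u (x - l) / (u - l)
    return y >= 0.0 and y >= x and y <= u * (x - l) / (u - l)

# every point of the ReLU graph satisfies the relaxation
assert all(in_triangle(x, max(0.0, x)) for x in [-4.0, -1.0, 0.0, 2.0, 6.0])
# but e.g. (x, y) = (-4, 1) lies in the naive box [0, 6] for y and is cut off here
cut_off = not in_triangle(-4.0, 1.0)
```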

Of course, the linear approximation of Ehlers [10] remains valid if either the constraint \(y \ge 0\) or the constraint \(y \ge x\) is removed. This enables the use of matrix multiplication (cf. Zhang et al. [32]) or static analyzers with abstract domains (cf. Singh et al. [32]) for the propagation of the inequalities.

Another approximation method is proposed by Raghunathan et al. [32] in the context of robustness certification. It consists in an SDP relaxation for ReLU neural networks that acts simultaneously on all neurons of a layer.

5.2 Comparison of linear ReLU approximations

In general, one ReLU layer contains several neurons, and we are interested in computing an approximation of the output range of the layer. This approximated output range can then be regarded as input to the next layer. As we want to achieve a quick propagation of the output ranges through the layers, it is important that the approximated output range is a polytope. This makes it possible to compute neuron bounds quickly using linear programming. In the following, we develop a theoretical framework to analyse different linear approximations.

Definition 2

(ReLU approximation) Let \(n, m \in {\mathbb {N}}\), \(A, B \in {\mathbb {R}}^{m \times n}\), \(c \in {\mathbb {R}}^n\), and \(P \subset {\mathbb {R}}^n\) be a polytope. We say that

$$\begin{aligned} Q := \Big \{\begin{pmatrix} x\\ y \end{pmatrix} \in P \times {\mathbb {R}}^n\ | \ Ax + By \le c \Big \} \subset {\mathbb {R}}^{2n} \end{aligned}$$

is a ReLU approximation (of P) if it holds that \( \{(x, \mathrm {ReLU}(x)) \ | \ x \in P\} \subseteq Q\), where \(\mathrm {ReLU}\) is applied componentwise. Q is called an independent ReLU approximation, if for all \(j \in [m]\), there exists \(i \in [n]\) such that \(A_{jl} = B_{jl} = 0\) for all \(l \in [n] \ \backslash \ \{i\}\). A polytope P is called ReLU proper, if for all \(i \in [n]\) it holds

$$\begin{aligned} \min _{x \in P} \ x_i< 0 < \max _{x \in P} \ x_i. \end{aligned}$$

The consideration of ReLU proper polytopes simplifies the formulation of statements, as fixed ReLU neurons are not regarded. If we apply the naive approximation to a ReLU proper polytope we obtain a box \([0, u_1] \times \ldots \times [0, u_n]\), where \(u_i\) is the upper bound for the corresponding variable. We see that this is an independent ReLU approximation. Let \(A = 0 \in {\mathbb {R}}^{2n \times n}\) and for each \(i \in [n]\), we add two rows to matrix B and vector c to enforce \(0 \le y_i \le u_i\) for \(i \in [n]\), i.e. \(m = 2n\) for the m in Definition 2. These rows are \(e_i^T \ y \le u_i\) and \(-e_i^T \ y \le 0\), where \(e_i\) is the i-th unit vector in \({\mathbb {R}}^n\). Hence, we have exactly one non-zero coefficient in each row of B and only zero coefficients in A, so that the property holds. In passing we notice that the approximation proposed in Wang et al. [32] is an independent ReLU approximation, too.

Now we use our definition of a ReLU approximation for a more thorough investigation of the possibilities to approximate ReLU constraints. Within the restrictions of the definition, we would like to find matrices A, B and c for a ReLU proper polytope P, such that Q is as small as possible (with respect to inclusion). First, we restrict our analysis to independent ReLU approximations and claim: the approximation proposed by Ehlers [10] is the best possible among all independent ReLU approximations of a ReLU proper polytope. We define this approximation formally as a ReLU approximation in order to state the result in Theorem 3.

Definition 3

Let \(P \subset {\mathbb {R}}^n\) be a ReLU proper polytope. The ReLU approximation of P corresponding to the approximation of Ehlers [10] will be denoted as \(Q_E\). In detail, for \(i \in [n]\), we set

$$\begin{aligned} A^{(i)} = \begin{bmatrix} 0\\ e_i^T \\ \frac{u_i}{l_i - u_i} e_i^T \end{bmatrix}, \ B^{(i)} = \begin{bmatrix} -e_i^T\\ -e_i^T \\ e_i^T \end{bmatrix} \text { and } c^{(i)} = \begin{bmatrix} 0\\ 0\\ \frac{u_i l_i}{l_i - u_i} \end{bmatrix}. \end{aligned}$$

For that, we use \( l_i := \min _{x \in P} \ x_i \text { and } u_i := \max _{x \in P} \ x_i\) and eventually define

$$\begin{aligned} A_E = \begin{bmatrix} A^{(1)}\\ \vdots \\ A^{(n)} \end{bmatrix}, \ B_E = \begin{bmatrix} B^{(1)}\\ \vdots \\ B^{(n)} \end{bmatrix} \text { and } c_E = \begin{bmatrix} c^{(1)}\\ \vdots \\ c^{(n)} \end{bmatrix}. \end{aligned}$$

Thus we obtain

$$\begin{aligned} Q_E := \Big \{\begin{pmatrix} x\\ y \end{pmatrix} \in P \times {\mathbb {R}}^n\ | \ A_E x + B_E y \le c_E \Big \} \subset {\mathbb {R}}^{2n}. \end{aligned}$$

Remark 5

Indeed, \(Q_E\) is an independent ReLU approximation: all rows of \(A_E\) and \(B_E\) are either 0 or a multiple of a transposed unit vector \(e_i^T \in {\mathbb {R}}^n\), and for a given row index, the same \(i \in [n]\) occurs in both \(A_E\) and \(B_E\).
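For illustration, the matrices of Definition 3 can be assembled mechanically. The following sketch (helper name ours) builds \(A_E, B_E, c_E\) for given bounds \(l_i< 0 < u_i\):

```python
import numpy as np

def ehlers_relaxation(l, u):
    """Assemble (A_E, B_E, c_E) of Definition 3 for neuron bounds
    l[i] < 0 < u[i].  Rows 3i..3i+2 encode
      -y_i <= 0,
       x_i - y_i <= 0,
       u_i/(l_i - u_i) * x_i + y_i <= u_i*l_i/(l_i - u_i)."""
    n = len(l)
    A, B = np.zeros((3 * n, n)), np.zeros((3 * n, n))
    c = np.zeros(3 * n)
    for i in range(n):
        A[3 * i + 1, i] = 1.0
        A[3 * i + 2, i] = u[i] / (l[i] - u[i])   # slope of the upper facet
        B[3 * i, i] = -1.0
        B[3 * i + 1, i] = -1.0
        B[3 * i + 2, i] = 1.0
        c[3 * i + 2] = u[i] * l[i] / (l[i] - u[i])
    return A, B, c

# one neuron with l = -1, u = 2
A, B, c = ehlers_relaxation([-1.0], [2.0])
```

Points \((x, \max \{0, x\})\) of the ReLU graph satisfy all three rows, while points above the upper facet or below the graph are cut off.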

Theorem 3

Let \(P \subset {\mathbb {R}}^n\) be a ReLU proper polytope and \(Q_E\) be the approximation of P as in Definition 3. For any independent ReLU approximation Q of P it holds \(Q_E \subseteq Q\).

For the proof see “Appendix C”. In the following section, we explain how the approximation of Ehlers [10] is used in [5, 10] and discuss possibilities to speed up the computation. In Sect. 5.4 we then present an approximation that is stronger than the one of Ehlers [10] and hence not independent.

5.3 Efficient optimization based bound tightening for neural network verification

If we build the MIP model using some preliminary lower and upper bounds for each ReLU neuron, we can use the LP relaxation of the model to approximate the output values of the neural network. As in [5, 10], we can also tighten the neuron bounds using the LP relaxation, which is identical to the approximation of Ehlers [10]. For each ReLU input variable x we compute an optimal solution of the LP relaxation for the objective functions x and \(-x\). The optimal objective values then give the new bounds for x in the neural network. In accordance with Gleixner et al. [32], we call this technique optimization based bound tightening (OBBT).
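A minimal sketch of OBBT on the triangle relaxation of a single neuron with bounds \(l = -1\), \(u = 2\); to keep the example self-contained, the tiny two-variable LPs are solved by vertex enumeration instead of calling an LP solver (helper name ours):

```python
import itertools
import numpy as np

def solve_lp_2d(c, A, b):
    """Maximize c @ v over {v : A v <= b} by enumerating vertices
    (intersections of constraint pairs); only meant for tiny examples."""
    best = None
    for i, j in itertools.combinations(range(len(b)), 2):
        M = A[[i, j]]
        if abs(np.linalg.det(M)) < 1e-12:
            continue
        v = np.linalg.solve(M, b[[i, j]])
        if np.all(A @ v <= b + 1e-9):          # vertex is feasible
            val = float(c @ v)
            if best is None or val > best:
                best = val
    return best

# triangle relaxation of y = max(0, x) for l = -1, u = 2, plus input bounds
A = np.array([[0.0, -1.0],     # -y <= 0
              [1.0, -1.0],     #  x - y <= 0
              [-2/3, 1.0],     #  y <= (2/3) x + 2/3  (upper facet)
              [1.0, 0.0],      #  x <= 2
              [-1.0, 0.0]])    # -x <= 1
b = np.array([0.0, 0.0, 2/3, 2.0, 1.0])

# OBBT on the output variable y: minimize and maximize it over the relaxation
y_lo = -solve_lp_2d(np.array([0.0, -1.0]), A, b)
y_hi = solve_lp_2d(np.array([0.0, 1.0]), A, b)
```

Minimizing and maximizing y over the relaxation recovers the tightened output bounds \([0, 2]\); a real implementation would of course delegate these LPs to an LP solver.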

After the bound update, it is crucial to improve the MIP formulation (1). This makes it possible to compute significantly tighter bounds in the next layer. Indeed, it is possible to build the approximation of the whole network only during the process of bound optimization. That means, each variable (corresponding to a neuron) is added separately to the relaxed MIP model, so that the LP is always as small as possible. We regard the ideas of Gleixner et al. [32], who implemented an OBBT propagator in SCIP [13], to reduce the computational cost. Gleixner et al. [32] treat two main topics: first, they show how to generate and propagate Lagrangian variable bounds (LVBs), and second, they propose methods for the acceleration of OBBT.

LVBs are valid inequalities with respect to the LP relaxations of mixed integer nonlinear problems (MINLP), which also includes LP relaxations of MIPs. Gleixner et al. [32] state that LVBs can be viewed as a one-row relaxation of the given MINLP that provides a local approximation of the effect of OBBT. They are obtained as a by-product of the LP solutions which are computed during the execution of OBBT. Dual multipliers of the LP relaxation and an objective cutoff constraint are used to create an LVB. For the actual definition and more details see Gleixner et al. [32]. If we model the neural network verification problem as an optimization problem as described in Sect. 4, we are only interested in whether there exists a solution with objective value smaller than or equal to zero. Hence, we can safely cut off all solutions with an objective value greater than some \(\varepsilon > 0\). For our experiments we set \(\varepsilon := 0.1\) to have a sufficiently large margin to zero in order to prevent erroneous results.

The advantage of LVBs is that they can be propagated efficiently through a branch-and-bound tree, while the frequent application of OBBT requires a great computational effort for each branch-and-bound node that is processed. We see in our experiments that for some instances the use of LVBs is able to speed up the solving process significantly. See Sect. 8.2 for an overview of the experiments.

Moreover, we consider ideas of Gleixner et al. [32] for accelerating the application of OBBT. One aspect is the filtering of bounds which can hardly be improved by executing OBBT. Assume that y is the value of a variable, which is a candidate for the application of OBBT, in a feasible solution of the LP relaxation. Moreover, let \(l \le y \le u\) be the bounds which are currently known for this variable. If then \(y - l \le \varepsilon \) or \(u - y \le \varepsilon \) for some \(\varepsilon > 0\), OBBT can strengthen the corresponding bound by at most \(\varepsilon \), as Gleixner et al. [32] point out. Yet, initial experiments showed that usually almost all bounds can be improved significantly by OBBT, so that filtering bounds is not useful for verification of neural networks. Another aspect is the order of the variables for which OBBT is executed. As OBBT is executed layer by layer in our case, the order of variables can only be changed within each layer. However, the various strategies of Gleixner et al. [32] did not show any advantage over a simple fixed order in our computational experiments; see Rössig [32] for details.

Finally, we consider another approach for bound computations in neural networks that is also a form of OBBT. Instead of using the LP relaxation to compute bounds, it is also possible to employ the exact MIP model and compute bounds for the neurons with OBBT. Computing the neuron bounds using the MIP formulation instead of the LP relaxation leads to strongly improved bounds. Although not all MIPs are solved to optimality, clear improvements of the corresponding bounds can be reached within a time limit of a few seconds per MIP. These improvements make it possible to also solve relevant instances without specialized branching rules for neural network verification; however, the bound computations are very time-consuming.

5.4 Optimization based relaxation tightening for two variables

In general we regard neural network layers that feature ReLU activations for all neurons of the layer. Hence, we investigate in more detail how the ReLU function behaves in higher dimensions, i.e. if the ReLU function is applied componentwise to layers with several neurons. The following theorem can be found in Xiang et al. [32] as Corollary 1:

Theorem 4

For a polytope \(P \subset {\mathbb {R}}^n\), \(\,\mathrm {ReLU}(P)\) is a finite union of polytopes.

Hence we see that the best possible convex approximation \(\,\mathrm {conv}(\,\mathrm {ReLU}(P))\) of \(\,\mathrm {ReLU}(P)\) is the convex hull of the union of polytopes in Theorem 4. We investigate a simple example to see how the approximation of Ehlers [10] differs from this best possible convex approximation. We consider a toy example as depicted in Fig. 8 of the “Appendix”. Figure 4 shows the feasible input polytope of the ReLU layer and the corresponding ReLU image. The same ReLU image can be seen in Fig. 5, together with a depiction of its convex hull, the approximation of Ehlers [10] and the naive approximation. Figure 5 clearly shows that even for only two variables the convex hull of the ReLU image is strictly smaller than the approximation of Ehlers [10]. It seems appealing to find an improved approximation of the ReLU image closer to the convex hull, which is the best possible convex approximation.

Fig. 4
figure 4

Example for the feasible set before (blue polytope) and after (set enclosed by red lines) application of the ReLU function

Fig. 5
figure 5

The red lines enclose the ReLU image and the black line (with the coordinate axes) indicates the convex hull of this ReLU image. The approximation of Ehlers [10] is bounded by the orange segments, the naive approximation by the green ones (and coordinate axes)

Subsequently, we propose an efficient method which can strengthen the approximation of Ehlers [10] by considering pairs of neurons jointly. This new ReLU approximation is not independent (cf. Definition 2). The depiction in Fig. 5 shows that in this situation we could actually add one inequality and would improve the approximation to be exactly the convex hull of the ReLU image. This inequality is induced by the connecting segment between the vertices of the convex hull that maximize \(y_1\) or \(y_2\), respectively. Of course, we cannot make this inequality tighter, since otherwise feasible points of the ReLU image would be cut off. However, the segment between the vertices that maximize \(y_1\) or \(y_2\), respectively, does not always induce a valid inequality, as we show in the following example. Figure 6 shows a polytope of feasible \(x_1\), \(x_2\) values and the corresponding ReLU image of feasible values for \(y_1\) and \(y_2\), such that \(y_1 =\max \{0, x_1\}\) and \(y_2 = \max \{0, x_2\}\). The polytope is two-dimensional, but can also be considered as the image of a higher dimensional polytope which is projected onto its variables \(x_1\) and \(x_2\). These two variables correspond to two neurons in one layer of a ReLU neural network. The dimension of the original polytope is then the number of all neurons in that layer. It should be noted that we use these projections to \({\mathbb {R}}^2\) only for the visualization of our method. The goal of our method is to obtain a tighter approximation without computing projections of higher dimensional polytopes. In Fig. 6, the segment (dashed line) between the vertices that maximize \(y_1\) or \(y_2\), respectively, does not induce a valid inequality with respect to the ReLU image.

Fig. 6
figure 6

Feasible set before (blue) and after ReLU application (red) for a different input polytope

Now the idea is to add an inequality to the model which partly cuts off the polytope resulting from the approximation of Ehlers [10], but leaves the ReLU image intact. The cut is parallel to the segment between the vertices that maximize \(y_1\) or \(y_2\), respectively. Depending on the situation, these vertices may or may not satisfy the inequality with equality. Figure 7 depicts this inequality and shows that adding this constraint considerably improves the approximation of the convex hull. In the following we describe how this constraint can be computed. A linear approximation of the ReLU neural network in question serves as a basis. Naturally, we can use the LP relaxation (which corresponds to the approximation of Ehlers [10]) if the verification problem is formulated as an MIP.

Fig. 7
figure 7

Here we see the ReLU image of the polytope depicted in Fig. 6 colored in red, its convex hull in black, the approximation of Ehlers [10] in orange, the inequality which we want to introduce as a black dashed line and the constraints of the naive approximation as a green dashed line. All sets are limited by the coordinate axes

Assume we want to tighten the approximation for the ReLU output variables \(y_1\) and \(y_2\) which correspond to ReLU input variables \(x_1\) and \(x_2\). All of these variables are contained in the LP relaxation of the neural network. In the final solution it must hold \(y_1 =\max \{0, x_1\}\) and \(y_2 = \max \{0, x_2\}\) due to the ReLU constraints. Let \({\hat{a}}\) and \({\hat{b}}\) be the optimum solutions when maximizing \(x_1\) or \(x_2\), respectively, in the current LP relaxation. Then we write \({\hat{a}}_1\) and \({\hat{a}}_2\) for the values of the variables \(x_1\) and \(x_2\) in the solution \({\hat{a}}\). Analogously we write \({\hat{b}}_1\) and \({\hat{b}}_2\) for the corresponding variable values in solution \({\hat{b}}\). It should be noted that these LP solutions are computed during the execution of OBBT, and can therefore be obtained at no additional cost. Obviously it holds \({\hat{a}}_1 \ge {\hat{b}}_1\) and \({\hat{b}}_2 \ge {\hat{a}}_2\) due to the choice of objective functions. Now we define \(a_1 := \max \{0, {\hat{a}}_1\}\) and analogously \(a_2, \ b_1\) and \(b_2\). We compute new objective coefficients as \(c_1 := b_2 - a_2\) and \(c_2 := a_1 - b_1\), i.e. \(c_1, \ c_2 \ge 0\). The latter holds due to the fact that \(\alpha \ge \beta \) implies \(\max \{0, \alpha \} \ge \max \{0, \beta \}\) for \(\alpha , \beta \in {\mathbb {R}}\). Again, we solve an LP using the current relaxation and maximize the objective function \(c_1 x_1 + c_2 x_2\). We denote the optimum objective value as \(\gamma \) and compute \(\delta := c_1 a_1 + c_2 a_2\). After this computation we can strengthen the LP relaxation by adding the constraint

$$\begin{aligned} c_1 y_1 + c_2 y_2 \le \max \{\gamma , \delta \}. \end{aligned}$$
(5)

Theorem 5

Constraint (5) is a valid inequality with respect to the ReLU image corresponding to \(y_1\) and \(y_2\). That means, constraint (5) can strengthen the LP relaxation of our MIP for the verification problem but cannot cut off any feasible solution.

Proof

Recall that \(y_1 = \max \{0, x_1\}\), \(y_2 = \max \{0, x_2\}\) due to the ReLU constraints, and \(a_1, a_2, b_1, b_2, c_1, c_2 \ge 0\), hence \(\delta \ge 0\). That means, if \((y_1, y_2) = (0, 0)\) we have \(c_1 y_1 + c_2 y_2 = 0 \le \delta \). If \((y_1, y_2) = (x_1, 0)\) it holds \(x_1 \le a_1\) and hence \(c_1 y_1 + c_2 y_2 = c_1 x_1 + 0 \le c_1 a_1 + c_2 a_2 = \delta \). On the other hand, the case \((y_1, y_2) = (0, x_2)\) implies \(x_2 \le b_2\) and subsequently we see \(c_1 y_1 + c_2 y_2 = 0 + c_2 x_2 \le c_1 b_1 + c_2 b_2 = \delta \); note that indeed \(c_1 b_1 + c_2 b_2 = a_1 b_2 - a_2 b_1 = c_1 a_1 + c_2 a_2 = \delta \) by the definition of \(c_1\) and \(c_2\). Otherwise it holds \((y_1, y_2) = (x_1, x_2)\), which implies \(c_1 y_1 + c_2 y_2 \le \gamma \), and we can conclude the proof. \(\square \)

Thus, the approximation of Ehlers [10] can be improved by adding constraints of type (5) to the LP relaxation of the model. In this way, we obtain a ReLU approximation which is not independent. Although we have to solve only one LP per pair of neurons, applying this method to all possible pairs of neurons would lead to an immense computational cost. Therefore, we select only some pairs of neurons for which the new inequality is likely to significantly strengthen the LP relaxation. However, our selection strategy as laid out in Rössig [32] was not able to outperform a baseline strategy, which selects neurons in a fixed, predetermined order. Still, this technique, which we abbreviate as OBBT2, can significantly strengthen the LP relaxation and reduce the number of nodes in the branch-and-bound tree (see Table 10 in the “Appendix”).
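The computation of a constraint of type (5) can be sketched for a two-dimensional input polytope. For a self-contained example we describe P by a hypothetical set of vertices, so that each LP maximization reduces to a maximum over the vertex set; in the solver, these values come from the LP solutions computed during OBBT:

```python
import numpy as np

# Hypothetical 2-D input polytope P, described by its vertices; a linear
# objective attains its maximum over P at one of them.
V = np.array([[-2.0, 1.0], [1.0, -2.0], [2.0, 1.0], [1.0, 2.0]])

def vertex_maximizing(c):
    return V[np.argmax(V @ c)]

a_hat = vertex_maximizing(np.array([1.0, 0.0]))   # maximizes x1
b_hat = vertex_maximizing(np.array([0.0, 1.0]))   # maximizes x2
a1, a2 = max(0.0, a_hat[0]), max(0.0, a_hat[1])
b1, b2 = max(0.0, b_hat[0]), max(0.0, b_hat[1])
c1, c2 = b2 - a2, a1 - b1                      # nonnegative by construction
gamma = float(np.max(V @ np.array([c1, c2])))  # max of c1*x1 + c2*x2 over P
delta = c1 * a1 + c2 * a2
rhs = max(gamma, delta)
# the cut of type (5):  c1 * y1 + c2 * y2 <= rhs
```

By Theorem 5, the resulting inequality is valid for the ReLU image of P.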

6 Primal heuristics

For the problem of neural network verification, the use of primal heuristics lies in the quick falsification of incorrect properties. Surprisingly, even a trivial heuristic, which only performs random sampling within the set of feasible inputs, can often find counterexamples to incorrect properties quickly, in contrast to standard MIP heuristics.

The idea of the random sampling heuristic as introduced by Bunel et al. [32] is plain and simple: given an instance \({\varPi }= (X, Y, F)\) of the verification problem, we randomly pick \(x \in X\) and check whether \(F(x) \in Y\). In case that \(F(x) \notin Y\), we know that \({\varPi }\) is refutable. Moreover, using the MIP formulation as optimization problem, the input vector x is also useful if it leads to a decrease of the primal bound, since this may help to tighten neuron bounds. In general it is not trivial to obtain \(x \in X\) if \(X \subset {\mathbb {R}}^n\) is an arbitrary polytope. However, as mentioned in Remark 3, many of the instances we regard feature a polytope X which is actually a box. In this case, we simply pick \(x_i \in [l_i, u_i]\) uniformly at random for \(i \in [n]\), where \(l_i\), \(u_i\) are the bounds of X for each component. This is performed similarly in [5, 36]. Otherwise, if X is not a box, we solve an LP with a random objective function to obtain \(x \in X\).
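For the box case, the random sampling heuristic might be sketched as follows; the toy network F and the output check in_Y are hypothetical placeholders of our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_counterexample(F, l, u, in_Y, trials=1000):
    """Random sampling heuristic for a box input set X = [l, u]: draw
    inputs uniformly at random, return the first x with F(x) not in Y."""
    for _ in range(trials):
        x = rng.uniform(l, u)
        if not in_Y(F(x)):
            return x          # the instance is refutable
    return None

# hypothetical toy network and property: the output should stay below 1.5
F = lambda x: np.array([np.maximum(0.0, x).sum()])
in_Y = lambda y: y[0] < 1.5
x = sample_counterexample(F, np.array([0.0, 0.0]), np.array([1.0, 1.0]), in_Y)
```

Since sampling and forward propagation are cheap, many trials per call are affordable, which is exactly how the heuristic is used below.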

We propose another heuristic that can be used in addition to the random sampling heuristic. It is based on the local search proposed by Dutta et al. [32] for output range analysis of ReLU neural networks. However, we omit the use of gradient information and fit the heuristic more naturally into the framework of MIP solving. The main idea is to fix all neurons in one of their phases, so that the optimization variant of neural network verification reduces to solving a linear program. We start with a feasible input \(x_0 \in X\) for the neural network and use forward propagation to compute the values of all neurons in the network. Then, for each ReLU neuron, we fix the binary variable d in (1) to zero or one, corresponding to the phase of the neuron that is determined by propagating \(x_0\) through the network. Furthermore, the binary variables in the formulation of the maximum function for objective variable t are also fixed, such that \(t = \max \{ z_1, \ldots , z_k \}\). With all binary variables fixed, the MIP as described in Sect. 4 becomes an LP.

This LP is minimized with respect to variable t as objective function. After the first minimization LP has been solved, we choose a ReLU input variable \({\bar{x}}\) (corresponding to one ReLU neuron) of value zero, if possible. For this variable, we switch the fixed value of the corresponding binary variable \({\bar{d}}\) from zero to one or vice versa. Then we optimize again and obtain a new input vector \({\hat{x}}_0 \in X\) for the neural network. After that, we switch the fixing of another binary variable whose corresponding ReLU input variable has value 0 in the solution. This process is iterated until we find a feasible counterexample, i.e. the optimal value of the LP is smaller than zero, or we reach a predefined iteration limit. In case that none of the ReLU input variables is equal to zero, we have to abort the procedure. It is easy to see that switching the fixings of the binary variables as described can only reduce the objective value of the optimal LP solution.
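To see why fixing all binary variables turns the MIP into an LP, note that under a fixed activation pattern the network is affine in its input. The following sketch (helper name and toy weights ours; for simplicity every layer is ReLU-activated) computes this affine map from a forward pass at \(x_0\):

```python
import numpy as np

def fixed_phase_affine(layers, x0):
    """Under the activation pattern induced by x0, the ReLU network
    restricted to that pattern is an affine map x -> M x + v."""
    n = len(x0)
    M, v = np.eye(n), np.zeros(n)
    z = np.asarray(x0, dtype=float)
    for W, b in layers:
        pre = W @ z + b
        D = np.diag((pre > 0).astype(float))   # fixed ReLU phases at x0
        M, v = D @ W @ M, D @ (W @ v + b)
        z = np.maximum(pre, 0.0)
    return M, v

layers = [(np.array([[1.0, 0.0], [0.0, 1.0]]), np.zeros(2)),
          (np.array([[1.0, 1.0]]), np.array([0.5]))]
x0 = np.array([1.0, -1.0])
M, v = fixed_phase_affine(layers, x0)
```

The objective variable t is then linear in the input over the region where the pattern is valid, which is what makes the repeated LP reoptimization after each phase switch cheap.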

In the following we describe how we combine our LP based heuristic with the random sampling heuristic. First we use the random sampling heuristic to find an input vector \(x_0 \in X\). The random sampling process and forward propagation are very fast, and therefore we try many (e.g. 1000) random inputs. Out of all sampled input vectors, we select \(x_0 \in X\) such that it corresponds to the lowest value of objective variable t. The hope is that \(x_0\) can be converted into an actual counterexample by computing a new input vector \({\hat{x}}_0\). This is given by the optimal LP solution after some ReLU phase switches as described. Instance \({\varPi }\) is shown to be indeed refutable if the value of t is below zero in this optimal LP solution.

Of course, both heuristics can be applied several times throughout the solving process in a branch-and-bound tree, which we enable in our implementation. Our experimental evaluation shows that the LP based heuristic works quite successfully. In fact, the mean runtime on our evaluation set of SAT instances, as mentioned in Sect. 8, drops from 330.1 to 71.7 seconds if our LP based heuristic is employed. On the other hand, the mean runtime on our evaluation set of UNSAT instances increases only slightly from 915.3 to 943.7 seconds due to the application of our LP based heuristic.

7 Branching for neural network verification

The verification problem can be solved with a generic branch-and-bound approach as described by Bunel et al. [32]. If the problem is solved as an MIP, specific branching rules for neural network verification can be integrated into the MIP solving process to strongly speed up the process. Initial bounds are necessary for the formulation of the verification problem as an MIP model and can be obtained by one of the approximation methods as introduced in Sect. 5. Many relevant instances of the verification problem cannot be solved if an approximation of the network is computed only once. Specific branching rules can be used to split an instance into simpler ones which can be approximated better. One option is to split the set of feasible input vectors for an instance of the verification problem as in [5, 35].

Given an instance \({\varPi }= (X, Y, F)\) of the verification problem, the design of the domain branching rule is based on the assumption that X is a box. However, the branching rule can also be applied if X is not a box. We assume the existence of bounds \(l_i \le x_i \le u_i\) for all \(x \in X \subset {\mathbb {R}}^n\) and \(i \in [n]\), cf. Remark 3. In case that X is a box, it holds \(X = [l_1, u_1] \times \ldots \times [l_n, u_n]\), otherwise \(X \subseteq [l_1, u_1] \times \ldots \times [l_n, u_n]\). Bunel et al. [32] propose to select \(j \in [n]\) and split the domain of variable \(x_j\) into the subdomains \([l_j, \frac{u_j + l_j}{2}]\) and \([\frac{u_j + l_j}{2}, u_j]\). The domains of all other variables \(x_i\), \(i \in [n] \backslash \{ j \}\) are left unchanged, so that we obtain two sub-instances with smaller input domains.

The selection of the branching variable is very important for the performance of the branching rule, cf. [5]. In Bunel et al. [32], the selection depends on the depth in the branch-and-bound tree and follows a fixed order. Bunel et al. [32] implement another selection rule, based on the approximation method of Wong and Kolter [32]. In our implementation we mainly use a selection rule “gradient” which is quite similar to the one used in Wang et al. [32]. For that, we extend the neural network F to another one \({\tilde{F}}\), which also encodes the properties that shall be verified. It has output dimension one, and for a fixed input \(x \in X\), the output is the same as the value of the objective variable t in the MIP formulation as optimization problem (3). We use a max-pooling layer to model the computation of the maximum in (3) in the neural network \({\tilde{F}}\) and refer to Bunel et al. [32] for more details on the construction. We compute the gradient of \({\tilde{F}}\) at the input vectors \(x_1 = (l_1, \ldots , l_n)\), \(x_2 = (\frac{u_1 + l_1}{2}, \ldots , \frac{u_n + l_n}{2})\), and \(x_3 = (u_1, \ldots , u_n)\) and let \( g := \nabla {\tilde{F}}(x_1) + \nabla {\tilde{F}}(x_2) + \nabla {\tilde{F}}(x_3) \in {\mathbb {R}}^n\). For \(i \in [n]\) we compute \(z_i := |g_i| \cdot (u_i - l_i)\) and choose the branching variable \(j \in [n]\) such that \(z_j = \max \{ z_1, \ldots , z_n \}\). The intuition is that verifiability of the instance depends mainly on the values of input neurons with a high (averaged) absolute gradient value and a sufficiently large input range.
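The selection rule “gradient” can be sketched as follows; grad_F is a hypothetical stand-in for the gradient of the extended network \({\tilde{F}}\):

```python
import numpy as np

def gradient_branching(grad_F, l, u):
    """Selection rule 'gradient': sum the gradients of the extended network
    at the lower corner, midpoint and upper corner of the input box, then
    branch on the j maximizing z_j = |g_j| * (u_j - l_j)."""
    g = grad_F(l) + grad_F((l + u) / 2.0) + grad_F(u)
    z = np.abs(g) * (u - l)
    return int(np.argmax(z))

# hypothetical stand-in for the gradient of F~ (here: a constant gradient)
grad_F = lambda x: np.array([3.0, -1.0])
l, u = np.array([0.0, 0.0]), np.array([1.0, 5.0])
j = gradient_branching(grad_F, l, u)
```

In this toy setting the second input wins despite its smaller gradient magnitude, because its much wider domain makes \(|g_2| \cdot (u_2 - l_2)\) the largest score.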

Another natural possibility is to perform branching over the two phases of a ReLU neuron, which is used in [6, 10, 17, 36]. If the verification problem is formulated as an MIP, this corresponds exactly to branching over the binary variables which correspond to the ReLU neurons. Given a ReLU input variable x and the corresponding output variable y, we can branch the constraint \(y = \max \{0, x \}\) into two cases: either \(x \le 0\), \(y = 0\), or \(x > 0\), \(y = x\). The main question is how to select the branching variables. As in Cheng et al. [32], in the selection rule “standard” we prioritize ReLU neurons which are located in the early layers of the neural network. Similarly to Wang et al. [32], we implement a selection rule “gradient” that picks ReLU neurons which have a large gradient with respect to the outputs of the network. However, in our experiments the rule “standard” performed better.

8 Computational evaluation

In this section we discuss computational results for various experiments that we conducted on a diverse set of test instances which we shortly present in the following. We empirically compare several methods for the computation of neuron bounds. Furthermore, we report computational results for various configurations of our solver and compare it with the programs of Bunel et al. [32], Wang et al. [32], and Katz et al. [32].

As we are not aware of any publicly available instances \({\varPi }= (X, Y, F)\) of the verification problem where X is not a box (cf. Remark 4), we define such instances to show the capabilities of our solving model. As a basis we use the neural networks of the ACAS Xu system, which were used by Katz et al. [32] to create instances of the verification problem. These are described in “Appendix A”. The ACAS Xu system is designed to prevent collisions of (autonomous) aircraft. We also perform our evaluations on the original instances published by Katz et al. [32]. Furthermore, we use test instances with neural networks that are trained on the well known MNIST [18] handwritten digit data set. In fact, we use two of the trained neural networks published by Wang et al. [32] and verify robustness of classifications using the \(L_\infty \) norm. Each input neuron may have a value between 0 and 255 and we investigate four different perturbation radii (1, 5, 10, and 20). These networks have two layers, each of them with 24 or 512 neurons, respectively. The smaller one reaches an accuracy of 96.59 %, the one with 512 neurons per layer reaches 98.27 %. In contrast to the neural networks of the ACAS Xu system with input dimension five, the input dimension of the MNIST networks is 784. This difference is especially interesting with respect to the performance of input domain branching as presented in Sect. 7. To evaluate various settings of our solving model, we use two evaluation subsets that contain a diverse selection of all the instances. One contains 13 SAT instances and the other one 23 UNSAT instances. In general, we compute average values for runtime and number of solving nodes as shifted geometric means, which reduces the sensitivity to outliers, cf. [1, 16].

8.1 Empirical comparison of bound computation approaches

We use our evaluation set of UNSAT instances for a numeric comparison of the bounds which are obtained by various bound computation approaches. For each instance \({\varPi }= (X, Y, F)\) we compute lower and upper bounds \([l_i, u_i]\) for all neurons \(i \in [N]\) (before application of the ReLU function), based on the feasible input domain X. The bounds are computed layerwise from the first to the last layer and branching is not applied. Then we compute the shifted geometric mean of \(\{ u_1 - l_1, \ldots , u_N - l_N \}\) with shift value 1 for each instance. The mean value indicates how good the corresponding bound computation approach is. Clearly, it is desirable that the difference \(u_i - l_i\) is as small as possible for all neurons \(i \in [N]\), and we use the mean value to compare the different approximation methods. See Table 8 in the “Appendix” for detailed results; here we report the overall mean.
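The shifted geometric mean with shift value 1 might be computed as follows (function name ours):

```python
import numpy as np

def shifted_geomean(values, shift=1.0):
    """Shifted geometric mean: add the shift, take the geometric mean and
    subtract the shift again; this reduces the sensitivity to outliers."""
    v = np.asarray(values, dtype=float) + shift
    return float(np.exp(np.mean(np.log(v))) - shift)

# hypothetical bound widths u_i - l_i for one instance
widths = [0.5, 1.0, 7.0]
m = shifted_geomean(widths, shift=1.0)   # (1.5 * 2 * 8)**(1/3) - 1
```

Compared to the arithmetic mean (here about 2.83), the single large width 7.0 pulls the shifted geometric mean up far less.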

Table 1 Averaged differences between neuron bounds in our UNSAT test set for different bound computation methods

As a second measure we report how many neurons can be fixed in their phase by the respective bound computation method. Detailed results on this can be found in Table 9 in the “Appendix”.

Table 2 Average of the number of fixed neurons per instance in our UNSAT test set for different bound computation methods

The results in Tables 1 and 2 show that the best bounds are computed by OBBT on the MIP model. While OBBT2 computes better bounds compared to OBBT on the LP relaxation, the improvements are unfortunately quite minor. We also see that the approaches of Wang et al. [35, 36] are clearly superior to naive interval arithmetic. Yet, they are not competitive with OBBT when only the quality of the computed bounds is regarded.

8.2 Comparison of different techniques in our model

In this section we provide an overview of the performance of our solving model in various configurations. The experiments in this section are run for the instances of our evaluation set of UNSAT instances. All results are obtained on cluster nodes with Intel Xeon CPUs E5-2670 which have a clock rate of 2.5 GHz. Each experiment is run exclusively on one cluster node and a memory limit of 32 GB is set. In general we set a time limit of 7200 seconds. If the time limit is hit during the solving process, we assume the time limit as runtime for the corresponding instance.

We report results both for the whole UNSAT test set, and separately for the ACAS and MNIST based instances. Comparing ACAS and MNIST based instances, the main differences are the number of input neurons (5 vs. 784) and the number of layers in the neural networks (6 vs. 2).

For the experiments in this section we use a baseline configuration “no_heur_base” of our solving model. In this configuration, domain branching (with selection rule “gradient”) is used up to a depth of 20 in the branch-and-bound tree. It should be noted that this depth is usually not exceeded if domain branching is used. Furthermore, OBBT is applied to the LP relaxation at each solving node up to a depth of 20 in the branch-and-bound tree. The primal heuristic is enabled only at the root node. Further techniques (e.g. our separator based on the work of Anderson et al. [32]) are not applied.

In Table 3 we report mean runtimes on the UNSAT test set for several different configurations of our solving model. Here we try to give an impression of the differences between the configurations which we tested. We are mostly interested in the computation of the dual bound by various methods and therefore the primal heuristic is only employed at the root node. Table 3 shows that there is a clear difference between solving MNIST and ACAS instances. In fact, the configuration “no_heur_base_genv” is the fastest in total because it performs well on both types of instances. However, it is not the best configuration for either of the two subsets (although close to the best configuration for ACAS instances). The best configurations with respect to the MNIST instances (“no_heur_relu_genv” with respect to number of timeouts, “mnist_base” with respect to mean runtime) do not perform well on the ACAS instances. The configurations in Table 3 are sorted by the total number of timeouts. In the following, we give short descriptions of the configurations in this order. The configuration “no_heur_base_genv” corresponds to the baseline configuration with the additional use of Lagrangian variable bounds (LVBs) as described in Sect. 5.3. Obviously, the LVBs are quite beneficial for solving MNIST instances. The configuration “no_heur_sepa0_freq5” sometimes calls an additional separator, as suggested by Anderson et al. [32]. Next in Table 3 follows our baseline configuration, which is hence quite good already. Indeed, “no_heur_base_obbt2_nosort” is the best of our configurations that use OBBT2, and it has a higher mean runtime. Regarding the mean runtime compared to the number of timeouts, the configuration “no_heur_base_mip” is not in line with the other configurations. Indeed, this configuration applies OBBT to the MIP model at the beginning of the solving process to compute initial neuron bounds. This comes with a high computational cost, also for rather easy instances. However, for more difficult instances, this strategy is not that bad, as indicated by the relatively low number of timeouts.

Table 3 Runtime results on the UNSAT test set using the formulation as optimization problem and various configurations

In the configuration “no_heur_relu_genv”, ReLU branching is combined with OBBT on the LP relaxation and the creation of LVBs. Most notably, this configuration is the only one that solves all MNIST instances in the evaluation set within the time limit. However, its performance on the ACAS-based instances is rather mediocre. Due to the low number of input neurons of the ACAS neural networks, input domain branching is more efficient for these instances. The same holds for the configuration “no_heur_relu”, which does not include the LVBs. Finally, in the configuration “mnist_base” the solving process is limited to the application of standard MIP techniques. Notably, this configuration has the lowest mean runtime on the subset of MNIST instances. On the other hand, it times out on all ACAS instances. This vast difference can be attributed to the different number of layers in the neural networks (two for MNIST, six for ACAS), as deeper networks are more difficult to approximate.
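The contrast between the two branching rules discussed above can be sketched as follows. This is an illustrative, simplified representation, not our implementation; the data structures and names are hypothetical.

```python
# Illustrative sketch (not the paper's implementation) contrasting the two
# branching rules. A subproblem is represented here by input-variable bounds
# plus fixed phases for ReLU neurons; all names are hypothetical.

def domain_branch(input_bounds, i):
    """Split the input box along dimension i at its midpoint."""
    lo, hi = input_bounds[i]
    mid = 0.5 * (lo + hi)
    left = dict(input_bounds);  left[i] = (lo, mid)
    right = dict(input_bounds); right[i] = (mid, hi)
    return left, right

def relu_branch(fixed_phases, neuron):
    """Split on an unstable ReLU neuron: inactive branch (y = 0, x <= 0)
    versus active branch (y = x, x >= 0)."""
    inactive = dict(fixed_phases); inactive[neuron] = "inactive"
    active = dict(fixed_phases);   active[neuron] = "active"
    return inactive, active

# Domain branching partitions the input box, so it suits the 5-input ACAS
# networks; ReLU branching scales with the number of unstable neurons, which
# favors the 784-input MNIST networks.
box_lo, box_hi = domain_branch({0: (-1.0, 1.0)}, 0)
```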

8.3 Comparison with other solvers

In this section we provide detailed results of computational experiments which we conducted to investigate the performance of different solvers for neural network verification. Besides our own solving model, which we regard in various configurations, we include the programs of Bunel et al. [32], Wang et al. [32], and Reluplex by Katz et al. [32]. The work of Wang et al. [32] is implemented in two solvers, Neurify and ReluVal: ReluVal is used on the ACAS instances, whereas Neurify is applied to the MNIST instances. We always check whether an alleged counterexample presented by some solver is indeed a feasible counterexample for the corresponding instance. In general, we perform these checks with an absolute numerical tolerance of \(10^{-8}\). Some counterexamples are feasible only if a larger numerical tolerance is allowed; we report such cases explicitly.
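Such a feasibility check amounts to evaluating the network on the candidate input and testing the claimed property violation up to the given absolute tolerance. The following sketch illustrates this for a made-up two-layer ReLU network and a made-up property; it is not the checking code used in our experiments.

```python
# Sketch of checking an alleged counterexample: evaluate the network on the
# candidate input and test the property violation up to an absolute numerical
# tolerance. The two-layer network and the property below are made up.
import numpy as np

def forward(x, weights_biases):
    """Forward pass through a ReLU network; the last layer is affine only."""
    for l, (A, b) in enumerate(weights_biases):
        x = A @ x + b
        if l < len(weights_biases) - 1:
            x = np.maximum(x, 0.0)
    return x

def is_counterexample(x, weights_biases, tol=1e-8):
    """Made-up property: output neuron 0 should stay nonnegative.
    x is accepted as a counterexample if the output drops below -tol."""
    y = forward(x, weights_biases)
    return y[0] < -tol

net = [(np.array([[1.0, -1.0]]), np.array([0.0])),
       (np.array([[-2.0]]), np.array([0.5]))]
```

A tolerance as large as \(10^{-3}\) in place of `tol` accepts candidates whose violation may stem purely from floating-point error, which is why we report such cases separately.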

Each experiment is run exclusively on one cluster node with an Intel Xeon Gold 5122 CPU which operates at a clock rate of 3.6 GHz, and we set a memory limit of 32 GB. Besides the program names ReluVal, Neurify and Reluplex, we use the following terms to denote the various solving models. Adv refers to ReluVal or Neurify using their adversary check mode, which is focused on finding counterexamples. BaB denotes the branch-and-bound method of Bunel et al. [32]. Finally, we use NonOpt to indicate that our solving model is used with the formulation of the verification problem as feasibility problem. Joint refers to our solving model using the formulation as optimization problem, which can solve conjunction instances (cf. Remark 4). Separate indicates that we solve the verification problem as optimization problem, but conjunction instances are split into several disjunction instances as explained in Remark 4. Conjunction instances likewise have to be split for BaB and Reluplex. Based on the results of our evaluation experiments in Sect. 8.2, we choose the configuration “no_heur_sepa0_freq5” as the best for the ACAS instances. However, to achieve good performance on refutable instances, we adapt this configuration to execute the primal heuristic also locally up to a depth of eight in the branch-and-bound tree.
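The Separate approach just described can be sketched schematically: the conjunction instance is refuted as soon as any conjunct admits a counterexample, but the split parts are solved sequentially, so time spent on an early verifiable part counts against the overall limit. This is an illustrative sketch with hypothetical names, not our scheduling code.

```python
# Sketch of the Separate approach: a conjunction instance (the property holds
# iff all conjuncts hold) is split into one disjunction instance per conjunct,
# and these are solved sequentially within a shared time budget. Names are
# hypothetical.

def solve_separately(conjuncts, solve_one, time_limit):
    """solve_one(c, budget) -> ("UNSAT" | "SAT" | "TIMEOUT", elapsed_seconds).
    Here SAT means a counterexample violating conjunct c was found."""
    remaining = time_limit
    for c in conjuncts:
        status, elapsed = solve_one(c, remaining)
        remaining -= elapsed
        if status == "SAT":        # any violated conjunct refutes the conjunction
            return "SAT"
        if status == "TIMEOUT" or remaining <= 0:
            return "TIMEOUT"
    return "UNSAT"                 # every conjunct verified
```

The sequential order matters: if a hard verifiable conjunct is processed before an easily refutable one, the budget may be exhausted before the counterexample is ever searched for.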

Table 4 Mean runtime on ACAS properties 1, 2, 3, 4, and 8

Table 4 contains the runtime results for all ACAS instances of Properties 1, 2, 3, 4, and 8 as defined by Katz et al. [32]. It should be noted that these are all disjunction instances. The mean runtime of Reluplex is at least one order of magnitude larger than the mean runtimes of all other solvers. ReluVal is clearly superior to all other solvers, especially if its adversary check mode is applied. We see that our solving model, which uses SCIP [13] to strongly integrate the bound computations into a MIP framework, performs significantly better than the rather similar approach of Bunel et al. [32].

Table 5 Runtimes on ACAS properties

A similar picture of the performance of the various solvers can be seen in Table 5, which shows the results on the ACAS instances of Properties 5, 6, 7, 9, and 10. Recall that we underestimate the runtime of instances for which the solving process is stopped due to the time limit.

In Tables 6 and 7 we report runtimes for the MNIST instances; recall that all of these are conjunction instances by definition. For our solving model we use the configuration “mnist_base” as presented in Sect. 8.2. The instances based on neural networks with two layers of 24 neurons are solved very quickly by most solvers, as can be seen in Table 6. However, domain branching is apparently not a good strategy for these instances, which feature 784 input neurons. This is shown by the high number of timeouts of the solving method BaB [5] on the MNIST instances in Table 6.

Table 6 Runtimes on the MNIST 24 data set

Clearly, the MNIST instances with 512 neurons per layer are much more challenging. In Table 7 we see that our solving model performs quite well when the approach Joint is taken, i.e. conjunction instances are solved directly as one optimization problem. In several cases Neurify aborts the solving process with no result; for the corresponding instances we assume the time limit of 7200 seconds as runtime in order to compute the mean runtime. If the conjunction instances are split into separate disjunction instances, our solving model mostly fails to find counterexamples within the time limit (see NonOpt and Separate). This can be explained by the fact that the instances obtained after splitting are solved sequentially one after the other; if a verifiable instance is processed first, this may already lead to a timeout. Reluplex is only able to solve two of the easiest instances within the time limit, and BaB also performs considerably worse than Neurify and our solving model. For all MNIST instances, however, it should be noted that the counterexamples produced by Neurify are feasible only if one applies a large numerical tolerance of \(10^{-3}\). It is easy to see that the instances with the lowest perturbation radius of 1 (in \(L_\infty \) norm) are easy to verify. On the other hand, for the instances with a high perturbation radius of 10 or 20, counterexamples are found in most cases. Apparently, the perturbation radius 5 poses the biggest difficulties for the solvers, as the corresponding instances are probably on the borderline between SAT and UNSAT.

Table 7 Runtimes on the MNIST 512 data set, where the last number of the instance name is the perturbation radius

In the “Appendix” we present our self-defined instances, where the input polytope X is not a box, together with computational results of our solving model on them, as the other solvers cannot process such instances. See Table 12 for the results obtained with the formulation as optimization problem, and Table 13 for the results corresponding to the formulation as feasibility problem.

9 Conclusions

Our solving model shows a solid performance in all categories of our benchmark set. This highlights the success of our approach, which combines MIP solving, using the solver SCIP [13], with specialized bound computation and branching techniques. In particular, we can solve instances with general polytopes as input domains, which is not possible with the other verification algorithms [5, 17, 36]. Moreover, our solving model clearly outperforms the solvers of Bunel et al. [32] and Katz et al. [32]. Consequently, Reluplex can no longer be regarded as a state-of-the-art solver for neural network verification. While ReluVal and Neurify by Wang et al. [32] show impressive runtime results for most instances, they rely primarily on branching. For big instances, this can become problematic due to limited numerical accuracy and memory capacity.

Moreover, we developed a theoretical framework for the comparison of linear approximation methods and presented the novel approximation technique OBBT2 which is able to improve the linear relaxation of a ReLU neural network. Besides that, we showed how the local search procedure of Dutta et al. [32] can be implemented differently in an MIP solving context such that it serves as a primal heuristic. Additionally, we described a novel formulation of the verification problem as quadratic program. All newly proposed techniques were evaluated computationally within our newly implemented solving model.

Although our implementation is limited to neural networks with the ReLU activation function, the approach could easily be extended to other piecewise-linear activation functions, for example the leaky ReLU function or max-pooling layers. In addition, the great flexibility of our solving model allows the integration of further techniques, so that future improvements may render it even more efficient.

In conclusion, many challenges remain in the field of neural network verification. Improving the scalability of current verification approaches is still an open task. Moreover, new approaches for the falsification of incorrect properties would be highly interesting; although we present a new heuristic for this purpose (based on ideas of Dutta et al. [32]), better algorithms could probably be developed.