1 Introduction

Deep Neural Networks (DNNs) and Machine Learning (ML) models are increasingly being utilised thanks to their performance and versatility across different domains. Their application spans diverse fields, including financial analysis (Chen et al 2022; Sako et al 2022), stock price prediction (Yang et al 2023), analysis and control of industrial processes (Zhang et al 2023; Pavithra et al 2023; Jahanbakhti et al 2023; Guidotti et al 2023b, a), medical analysis (Dash et al 2023; Gul et al 2023), and recently, interpretation of MRI images within the context of the COVID-19 pandemic (Al-Waisy et al 2023; Dansana et al 2023). However, their application in safety- or security-sensitive environments (Katz et al 2017; Demarchi et al 2022) raises significant concerns. Since the seminal works by Goodfellow et al (2015) and Szegedy et al (2014), it has been demonstrated repeatedly over the past decade that DNNs may exhibit vulnerabilities in terms of their robustness: minimal alterations to correctly classified input data may lead to unexpected and incorrect responses from the network. Consequently, verification has emerged as a powerful approach to provide formal assurances regarding the behaviour of neural networks (Ferrari et al 2022; Demarchi et al 2022; Wang et al 2021; Katz et al 2019; Bak et al 2020; Kouvaros et al 2021; Singh et al 2019a; Tran et al 2020; Guidotti et al 2021; Henriksen and Lomuscio 2020; Guidotti et al 2019b; Henriksen and Lomuscio 2021; Guidotti et al 2020; Eramo et al 2022; Guidotti 2022; Leofante et al 2023; Guidotti et al 2023c, e, d). Furthermore, significant research efforts have been directed towards modifying networks to conform to specified criteria (Guidotti et al 2019b; Kouvaros et al 2021; Sotoudeh and Thakur 2021; Henriksen et al 2022; Guidotti et al 2019a), as well as exploring training methods that adhere to specific behavioural constraints (Cohen et al 2019; Hu et al 2016; Eaton-Rosen et al 2018; Giunchiglia and Lukasiewicz 2021; Giunchiglia et al 2022).

In this paper, we introduce NeVer2, a comprehensive tool that seamlessly integrates design, training, and verification functionalities for DNNs. At present, NeVer2 offers an environment where:

  • Users can visually design the structure of a neural network as a block diagram using a graphical interface.

  • Network parameters can be trained using a wrapper around the pyTorch library. Both dataset sources and hyperparameters can be supplied via dialog boxes, effectively concealing the intricacies of pyTorch from the user.

  • Properties regarding the network can be associated with it visually, through the addition of special blocks in the graphical interface. The network can then be verified using our backend, pyNeVer (Guidotti et al 2021), in a straightforward “push-button” manner. Additionally, the user can configure pyNeVer through dialog boxes designed to simplify interaction.

  • Networks in the ONNX and pyTorch file formats, as well as existing properties in the VNN-LIB (Demarchi et al 2023) standard, are supported. The VNN-LIB standard is the input format in the annual competition of verification tools for neural networks (VNN-COMP) (Müller et al 2022). If a trained network is available in one of the formats accepted by NeVer2, it can be paired with properties and verified without the need for retraining.

To the best of our knowledge, no other tool combines the design, learning, and verification of DNNs within an open-source platform featuring a graphical interface. Furthermore, while most existing toolsets target users from the verification community, NeVer2 is tailored for domain experts with little to no experience in verification. Its aim is to empower them to design robust networks without being burdened by numerous technical details, which are often irrelevant given the scope of their tasks.

Finally, we present an experimental evaluation across various learning domains and verification tasks to compare the performance of NeVer2 against two prominent tools: \(\alpha \),\(\beta \)-CROWN, the winner of the last three VNN-COMPs, and NNV, another VNN-COMP contestant leveraging fundamental algorithms similar to NeVer2. Our findings reveal that while NeVer2 may exhibit slower performance compared to \(\alpha \),\(\beta \)-CROWN in most cases, it outperforms NNV in terms of speed and also demonstrates capability in handling certain verification tasks that currently exceed the capabilities of both tools.

The remainder of the paper is organized as follows. In Sect. 2, we conduct an extensive survey of related works, encompassing tools for visualization, learning, and verification of DNNs. Section 3 introduces definitions and notation to be utilized throughout the paper. In Sect. 4, we elaborate on the verification algorithms implemented in NeVer2. Section 5 offers an overview of the system architecture, while Sect. 6 presents examples of its usage via the graphical interface. Section 7 presents the results of our experimental evaluation. Finally, we conclude the paper in Sect. 8 with some closing remarks about the capabilities of the tool.

2 Related work

2.1 Learning tools

Tools for configuring a neural network for a specific task typically offer interfaces for defining the network’s layers and their arrangement, as well as algorithms for learning the hyperparameters. The most widely used toolkits include pyTorch (Paszke et al 2019), TensorFlow (Abadi et al 2016), and Keras (Joseph et al 2021), among others tailored for specific case studies or programming languages.

pyTorch, initially developed by Facebook AI Research (now Meta AI) and first released in 2017, is a Python library that serves as a modernized version of the older Torch library for machine learning. It is freely available and open source, serving as the foundation for numerous popular deep learning applications, including Tesla Autopilot, Uber’s Pyro, and PyTorch Lightning.

TensorFlow, created by Google Brain, is a comprehensive machine learning platform offering both high-level and low-level interfaces for various tasks. It is used as the backbone for numerous commercial AI-enabled products, such as voice recognition, search engines, and email services.

Keras is a high-level neural network Python library that operates atop TensorFlow, providing an optimized interface for defining neural network models and facilitating their learning process. Additionally, it supports multiple libraries as back-ends, including Microsoft’s CNTK (Seide and Agarwal 2016).

2.2 Verification tools

Following the first release of NeVer (Pulina and Tacchella 2010), numerous other automated verification tools have emerged. Since 2020, an international competition (Müller et al 2022) has been held to stimulate development and promote collaboration within the verification community. In this summary, we outline the main tools that have yielded successful results in the competition, offering a comprehensive overview of the techniques developed for verification of neural networks.

Marabou (Katz et al 2019) is a user-friendly Neural Network Verification toolkit that addresses queries about a network’s properties by encoding and solving them as constraint satisfaction problems. It offers both Python and C++ APIs, enabling users to load neural networks and define arbitrary linear properties over them.

ERAN (Singh et al 2019b) is a neural network verifier that leverages abstract interpretations to encode the pre- and post-conditions of a network as Linear Programming (LP) or Mixed Integer Linear Programming (MILP) problems. It offers support for both complete verification, where a yes/no answer can be provided, and incomplete verification, where the property is verified on an over-approximation of the network, potentially resulting in false negatives. ERAN is capable of handling fully-connected, convolutional, and residual network architectures, encompassing various non-linearities such as ReLU, Sigmoid, Tanh, and Maxpool.

MIPVerify (Tjeng et al 2019) is a tool designed to assess the robustness of neural networks using Mixed Integer Linear Programming (MILP). It transforms queries regarding a neural network’s robustness for specific inputs into MILP problems. This approach employs efficient solvers facilitated by tight specifications of ReLU and maximum constraints. Additionally, it uses a progressive bounds tightening strategy, focusing on refining bounds only when it can enhance the problem formulation with additional information.

NNV (Tran et al 2020) is primarily implemented using Matlab and applies reachability analysis techniques for verifying neural networks, particularly focusing on closed-loop neural network control systems in autonomous cyber-physical systems. NNV employs geometric representations enabling a layer-by-layer computation of the exact reachable set for feed-forward DNNs.

Venus (Botoeva et al 2020) is a verification toolkit that incorporates a dependency analysis procedure and enhances it with symbolic interval arithmetic and domain splitting techniques. Domain splitting methods partition the input domain into sub-domains, thereby refining the bound intervals of the nodes. Symbolic interval arithmetic techniques efficiently and accurately approximate these refined intervals.

nnenum (Bak 2021) employs various levels of abstraction to achieve high-performance verification of ReLU networks while maintaining completeness. The analysis combines three types of zonotopes with star set over-approximations and utilizes efficient parallelized ReLU case splitting.

VeriNet (Henriksen and Lomuscio 2020, 2021) is a comprehensive verification toolkit based on Symbolic Interval Propagation (SIP) for feed-forward neural networks. The core algorithm employs SIP to construct a linear abstraction of the network, which is then utilized in an LP-encoding to address the verification problem. To ensure completeness, a refinement phase based on a branch and bound methodology is incorporated.

\(\alpha \),\(\beta \)-CROWN is an efficient neural network verifier based on the linear bound propagation framework. It builds upon prior research on bound-propagation-based verifiers such as CROWN (Zhang et al 2018), \(\alpha \)-CROWN (Xu et al 2021), \(\beta \)-CROWN (Wang et al 2021), and GCP-CROWN (Zhang et al 2022). Their most recent work, GCP-CROWN, represents the most comprehensive formulation of the linear bound propagation framework for neural network verification currently available.

Debona (Brix and Noll 2020) is a verification toolkit developed based on the 2020 edition of VeriNet. In addition to the Error-based Symbolic Interval Propagation (ESIP), which utilizes parallel upper and lower bounds for each neuron, it also supports Reversed Symbolic Interval Propagation (RSIP), which employs independent lower and upper bounds.

CGDTest (Nagisetty 2021) is a testing algorithm for DNNs that aims to identify an input compliant with user-defined constraints. It functions akin to a gradient-descent optimization method: initially, CGDTest interprets user-defined constraints and transforms them into a differentiable constraint loss function. Subsequently, starting from a random input, it leverages gradient descent to adjust it until the termination criteria are satisfied.

MN-BaB (Ferrari et al 2022) is an open-source neural network verifier that uses precise multi-neuron constraints in conjunction with efficient GPU-enabled linear bound-propagation within a branch and bound framework. MN-BaB offers completeness for piece-wise linear activation functions and is capable of handling fully-connected, convolutional, and residual network architectures containing ReLU, Sigmoid, Tanh, and Maxpool non-linearities.

PyRAT (Girard-Satabin et al 2022) is an acronym for Python Reachability Assessment Tool and serves as a neural network verification tool based on abstract interpretation techniques. Due to its tailored approach designed specifically for neural networks and their high dimensionality, PyRAT effectively applies abstract interpretation techniques while fully leveraging tensor operations. It supports a broad spectrum of neural networks and layers, ranging from simple and small tabular problems and networks to complex architectures with convolutional layers and skip connections.

2.3 Visualization tools

Creating graphical visualizations of neural network architectures is often a laborious and challenging task when done manually. NN-SVG (LeNail 2019) and Netron (Roeder 2023) stand out as the most popular tools offering this kind of capability. NN-SVG serves as a parametric designer tailored for generating high-quality examples of feed-forward and convolutional neural network models, primarily catering to researchers. Conversely, Netron functions as a visualizer for various formats of neural network models, providing comprehensive insights into the architecture and parameters of all layers within a network. More recently, the tool AIFiddle (Chappat 2023) has emerged with a modern interface, enabling users to design neural networks with a three-dimensional representation of data. Additionally, it offers features such as training on pre-defined datasets and in-depth analysis of the training procedure.

3 Background

In this section, we provide the primary background and context regarding neural networks and their verification. We explain our notation, define the behavior of neural networks, and outline the task of neural network verification.

3.1 Basic notation and definitions

We denote n-dimensional vectors of real numbers \(x \in \mathbb {R}^n\)—also points or samples—with lowercase letters like x, y, z. We write \(x = (x_1, x_2, \ldots , x_n)\) to denote a vector with its components along the n coordinates. We denote by \(x \cdot y\) the scalar product of two vectors \(x, y \in \mathbb {R}^n\), defined as \(x \cdot y = \sum _{i=1}^n x_i y_i\). The norm \(\Vert x \Vert \) of a vector is defined as \(\Vert x \Vert = \sqrt{x \cdot x}\). We denote sets of vectors \(X \subseteq \mathbb {R}^n\) with uppercase letters like X, Y, Z. A set of vectors X is bounded if there exists \(r \in \mathbb {R}, r > 0\) such that \(\forall x, y \in X\) we have \(d(x, y) < r\), where d is the Euclidean distance \(d(x, y) = \Vert x - y \Vert \). A set X is open if for every point \(x \in X\) there exists a positive real number \(\epsilon _x\) such that a point \(y \in \mathbb {R}^n\) belongs to X as long as \(d(x,y) < \epsilon _x\). The complement of an open set is a closed set—intuitively, one that includes its boundary, whereas open sets do not; closed and bounded sets are compact. A set X is convex if for any two points \(x,y \in X\) we also have \(z \in X\) for all \(z = (1 - \lambda )x + \lambda y\) with \(\lambda \in [0,1]\), i.e., all the points falling on the segment joining x and y are also in X. Notice that the intersection of any family, either finite or infinite, of convex sets is convex, whereas the union, in general, is not. Given any non-empty set X, the smallest convex set \(\mathcal {C}(X)\) containing X is the convex hull of X, defined as the intersection of all convex sets containing X. A hyperplane \(H \subseteq \mathbb {R}^n\) can be defined as the set of points

$$\begin{aligned} \qquad \quad H = \{x \in \mathbb {R}^n \mid a_1x_1 + a_2x_2 + \ldots + a_n x_n = b\} \end{aligned}$$

where \(a \in \mathbb {R}^n\), \(b \in \mathbb {R}\) and at least one component of a is non-zero. Let \(f(x) = a_1x_1 + a_2x_2 + \ldots + a_n x_n - b\) be the affine form defining H. The closed half-spaces associated with H are defined as

$$\begin{aligned} \qquad \quad H_{+}(f) = \{x \in X \mid f(x) \ge 0 \} \\ \qquad \quad H_{-}(f) = \{x \in X \mid f(x) \le 0 \} \end{aligned}$$

Notice that both \(H_{+}(f)\) and \(H_{-}(f)\) are convex. A polyhedron \(P \subseteq \mathbb {R}^n\) is a set of points defined as \(P= \bigcap _{i=1}^p C_i\), where the \(C_i\) are \(p \in \mathbb {N}\) closed half-spaces. A bounded polyhedron is a polytope: from the definition, it follows that polytopes are convex and compact in \(\mathbb {R}^n\).
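To make the half-space machinery concrete, the following minimal sketch (our illustration in plain numpy; the function name is ours, not part of NeVer2) encodes a polytope in its H-representation \(\{x \mid Cx \le d\}\) and tests point membership.

```python
import numpy as np

def in_polytope(C: np.ndarray, d: np.ndarray, x: np.ndarray, tol: float = 1e-9) -> bool:
    """Check whether x lies in the polytope {x | Cx <= d}."""
    return bool(np.all(C @ x <= d + tol))

# The unit square in R^2 as the intersection of four closed half-spaces.
C = np.array([[ 1.0,  0.0],   #  x1 <= 1
              [-1.0,  0.0],   # -x1 <= 0
              [ 0.0,  1.0],   #  x2 <= 1
              [ 0.0, -1.0]])  # -x2 <= 0
d = np.array([1.0, 0.0, 1.0, 0.0])

print(in_polytope(C, d, np.array([0.5, 0.5])))  # True
print(in_polytope(C, d, np.array([1.5, 0.5])))  # False
```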

3.2 Neural networks

Given a finite number p of functions \(f_1: \mathbb {R}^n \rightarrow \mathbb {R}^{n_1}, \ldots , f_p: \mathbb {R}^{n_{p-1}} \rightarrow \mathbb {R}^{m}\)—also called layers—we define a feed-forward neural network as a function \(\nu : \mathbb {R}^n \rightarrow \mathbb {R}^m\) obtained through the composition of the layers, i.e., \(\nu (x) = f_p(f_{p-1}( \ldots f_1(x) \ldots ))\). The layer \(f_1\) is called the input layer, the layer \(f_p\) is called the output layer, and the remaining layers are called hidden layers. For \(x \in \mathbb {R}^n\), we consider only two types of layers:

  • \(f(x) = Ax + b\) with \(A \in \mathbb {R}^{m \times n }\) and \(b \in \mathbb {R}^m\) is an affine layer implementing the linear mapping \(f: \mathbb {R}^n \rightarrow \mathbb {R}^m\);

  • \(f(x) = (\sigma _1(x_1), \ldots , \sigma _n(x_n))\) is a functional layer \(f: \mathbb {R}^n \rightarrow \mathbb {R}^n\) consisting of n activation functions—also called neurons; usually \(\sigma _i = \sigma \) for all \(i \in [1,n]\), i.e., the function \(\sigma \) is applied component-wise to the vector x.

In this work, we consider two widely adopted activation functions \(\sigma : \mathbb {R} \rightarrow \mathbb {R}\): the Rectified Linear Unit (ReLU) function, defined as \(\sigma (r) = \max (0,r)\), and the logistic function, defined as \(\sigma (r) = \frac{1}{1 + e^{-r}}\). Although not discussed here, convolutional layers with one or more filters can be represented as affine mappings (Gehr et al 2018).

For a neural network \(\nu : \mathbb {R}^n \rightarrow \mathbb {R}^m\), the task of classification involves assigning one out of m labels to every input vector \(x \in \mathbb {R}^n\): an input x is assigned to class k if \(\nu (x)_k > \nu (x)_j\) for all \(j \in [1,m]\) and \(j \ne k\). It is important to mention that in the majority of neural network applications, a single SoftMax neuron is typically appended after the m outputs to yield a single value corresponding to the chosen class. However, existing verification benchmarks do not require the presence of this neuron; instead, they impose conditions directly on the m outputs. The task of regression aims to approximate a functional mapping from \(\mathbb {R}^n\) to \(\mathbb {R}^m\). In this context, neural networks comprising affine layers coupled with either ReLUs or logistic layers offer universal approximation capabilities (Hornik et al 1989).
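As a concrete illustration of the definitions above, the sketch below (ours, in plain numpy rather than pyNeVer) composes an affine layer, a ReLU functional layer, an affine output layer, and a logistic functional layer, and classifies an input by the largest of the m outputs.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)        # sigma(r) = max(0, r), applied component-wise

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))  # sigma(r) = 1 / (1 + e^-r)

# nu(x) = f4(f3(f2(f1(x)))): affine -> ReLU -> affine -> logistic, R^2 -> R^3.
A1, b1 = np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([0.0, 1.0])
A2, b2 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]), np.zeros(3)

def nu(x):
    return logistic(A2 @ relu(A1 @ x + b1) + b2)

y = nu(np.array([0.3, -0.7]))
print("class:", int(np.argmax(y)))  # the class k such that nu(x)_k is largest
```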

3.3 Verification task

Given a neural network \(\nu : \mathbb {R}^n \rightarrow \mathbb {R}^m\), our objective is to algorithmically verify its adherence to specified post-conditions on the output, provided it satisfies certain pre-conditions on the input. To ensure practical implementation on digital hardware, input domains must be bounded. Hence, even data from potentially unbounded physical processes are typically normalized within small ranges in practical applications. Consequently, we can assume without loss of generality that the input domain of \(\nu \) is a bounded set \(I \subset \mathbb {R}^n\). This assumption leads to the corresponding output domain being a bounded set \(O \subset \mathbb {R}^m\) because: (i) affine transformations of bounded sets remain bounded sets, (ii) ReLU is a piece-wise affine transformation of its input, (iii) the output of logistic functions is always bounded in the set [0, 1], and (iv) the composition of bounded functions remains bounded. We stipulate that the logic formulas defining pre- and post-conditions should be interpretable as finite unions of bounded sets in the input and output domains.

Formally, given p bounded sets \(X_1, \ldots , X_p\) in I such that \(\Pi = \bigcup _{i=1}^p X_i\) and s bounded sets \(Y_1, \ldots , Y_s\) in O such that \(\Sigma = \bigcup _{i=1}^s Y_i\), we wish to prove that

$$\begin{aligned} \forall x.\, (x \in \Pi \Rightarrow \nu (x) \in \Sigma ). \end{aligned}$$
(1)

While this query cannot represent certain properties related to neural networks, such as invertibility or equivalence, it is able to represent the general task of testing resilience against adversarial perturbations. For instance, considering a network \(\nu : I \rightarrow O\), where \(I \subset \mathbb {R}^n\) and \(O \subset \mathbb {R}^m\), which performs a classification task, an input vector \({\hat{x}} \in I\), and a corresponding output \({\hat{y}} \in O\) correctly classified with label \(\lambda \), the formal definition of the safety property is as follows:

$$\begin{aligned} \begin{gathered} \forall x. \forall y. (\Vert x - {\hat{x}} \Vert _{\infty } \le \varepsilon \wedge y = \nu (x)) \implies \\ \Vert y - {\hat{y}} \Vert _{\infty } < \delta \end{gathered} \end{aligned}$$
(2)

where \(\varepsilon \) and \(\delta \) are, respectively, the maximum perturbation admitted on the input vector and the maximum acceptable variation on the output vector such that the output label is still \(\lambda \), and \(\Vert \cdot \Vert _{\infty }\) is the Chebyshev norm measuring the difference between the original and perturbed vectors. Here \(\Pi \) corresponds to the \(\ell _{\infty }\) ball of radius \(\varepsilon \) around the given point and \(\Sigma \) to the \(\ell _{\infty }\) ball of radius \(\delta \) around the corresponding output.

Due to the presence of universal quantifiers, proving Equation (2) is challenging. As a consequence, like all verification tools and benchmarks participating in VNN-COMP (Müller et al 2022), we focus on verifying an unsafety property defined as follows:

$$\begin{aligned} \begin{aligned} \exists x. \exists y. (\Vert x - {\hat{x}} \Vert _{\infty } \le \varepsilon \wedge y = \nu (x) \, \wedge \\ \Vert y - {\hat{y}} \Vert _{\infty } \ge \delta ) \end{aligned} \end{aligned}$$
(3)

It should be clear that, if we are able to certify that a solution to Eq. (3) exists, then Equation (2) is falsified and therefore the network is proven to be unsafe. That is, the solution of Equation (3) serves as a counterexample for the given safety property.
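Note that Eq. (3) can be attacked by testing as well as by verification: the sketch below (our illustration, with a toy affine map standing in for \(\nu \)) samples the \(\ell _{\infty }\) ball and reports a counterexample if it stumbles upon one. Unlike the abstraction-based analysis discussed next, sampling can only falsify safety, never prove it.

```python
import numpy as np

def find_counterexample(nu, x_hat, eps, delta, n_samples=10_000, seed=0):
    """Sample the l-inf ball of radius eps around x_hat and return an x
    with ||nu(x) - nu(x_hat)||_inf >= delta, if one is found."""
    rng = np.random.default_rng(seed)
    y_hat = nu(x_hat)
    for _ in range(n_samples):
        x = x_hat + rng.uniform(-eps, eps, size=x_hat.shape)
        if np.max(np.abs(nu(x) - y_hat)) >= delta:
            return x  # witness of Eq. (3), falsifying the safety property (2)
    return None  # inconclusive: failing to sample a witness proves nothing

# Toy network: a single affine layer stands in for nu.
nu = lambda x: np.array([[2.0, 0.0], [0.0, -1.0]]) @ x
print(find_counterexample(nu, np.zeros(2), eps=0.1, delta=0.15))
```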

4 Verification methodology

To enable algorithmic verification of neural networks, we consider the abstract domain \(\langle \mathbb {R}^n \rangle \subset 2^{\mathbb {R}^n}\) of polytopes defined in \(\mathbb {R}^n\) to abstract (families of) bounded sets into (families of) polytopes. We provide corresponding abstract transformers for affine and functional layers and we prove that their composition provides a consistent over-approximation of corresponding concrete networks.

Definition 1

(Abstraction) Given a bounded set \(X \subset \mathbb {R}^n\), an abstraction is defined as a function \(\alpha : 2^{\mathbb {R}^n} \rightarrow \langle \mathbb {R}^n \rangle \) that maps X to a polytope P such that \(\mathcal {C}(X) \subseteq P\).

The function \(\alpha \) maps a bounded set X to a corresponding polytope in the abstract space such that the polytope always contains the convex hull of X. As shown in Zheng (2019), we can always start with an axis-aligned regular n-simplex consisting of \(n+1\) facets—e.g., the triangle in \(\mathbb {R}^2\) and the tetrahedron in \(\mathbb {R}^3\)—and then refine the abstraction as needed by adding facets, i.e., adding half-spaces to make the abstraction more precise.

Definition 2

(Concretization) Given a polytope \(P \in \langle \mathbb {R}^n \rangle \) a concretization is a function \(\gamma : \langle \mathbb {R}^n \rangle \rightarrow 2^{\mathbb {R}^n}\) that maps P to the set of points contained in it, i.e., \(\gamma (P) = \{ x \in \mathbb {R}^n \mid x \in P \}\).

The function \(\gamma \) simply maps a polytope P to the corresponding (convex and compact) set in \(\mathbb {R}^n\) comprising all the points contained in the polytope. As opposed to abstraction, the result of concretization is uniquely determined. We extend abstraction and concretization to finite families of sets and polytopes, respectively, as follows. Given a family of p bounded sets \(\Pi = \{X_1, \ldots , X_p \}\), the abstraction of \(\Pi \) is a set of polytopes \(\Sigma = \{P_1, \ldots , P_s\}\) such that \(\alpha (X_j) \subseteq \bigcup _{i=1}^s P_i\) for all \(j \in [1,p]\); when no ambiguity arises, we abuse notation and write \(\alpha (\Pi )\) to denote the abstraction corresponding to the family \(\Pi \). Given a family of s polytopes \(\Sigma = \{P_1, \ldots , P_s\}\), the concretization of \(\Sigma \) is the union of the concretizations of its elements, i.e., \(\bigcup _{i=1}^s \gamma (P_i)\); also in this case, we abuse notation and write \(\gamma (\Sigma )\) to denote the concretization of a family of polytopes \(\Sigma \).

Given our choice of abstract domain and a concrete network \(\nu : I \rightarrow O\) with \(I \subset \mathbb {R}^n\) and \(O \subset \mathbb {R}^m\), we need to show how to obtain an abstract neural network \({\tilde{\nu }}: \langle I \rangle \rightarrow \langle O \rangle \) that provides a sound over-approximation of \(\nu \). To frame this concept, we introduce the notion of consistent abstraction.

Definition 3

(Consistent abstraction) Given a mapping \(\nu : \mathbb {R}^n \rightarrow \mathbb {R}^m\), a mapping \({\tilde{\nu }}: \langle \mathbb {R}^n \rangle \rightarrow \langle \mathbb {R}^m \rangle \), an abstraction function \(\alpha : 2^{\mathbb {R}^n} \rightarrow \langle \mathbb {R}^n \rangle \) and a concretization function \(\gamma : \langle \mathbb {R}^m \rangle \rightarrow 2^{\mathbb {R}^m}\), the mapping \({\tilde{\nu }}\) is a consistent abstraction of \(\nu \) over a set of inputs \(X \subseteq I\) exactly when

$$\begin{aligned} \{ \nu (x) \mid x \in X \} \subseteq \gamma ({\tilde{\nu }}(\alpha (X))) \end{aligned}$$
(4)

The notion of consistent abstraction can be readily extended to families of sets as follows. The mapping \({\tilde{\nu }}\) is a consistent abstraction of \(\nu \) over a family of sets of inputs \(X_1, \ldots , X_p\) exactly when

$$\begin{aligned} \{ \nu (x) \mid x \in \cup _{i=1}^p X_i \} \subseteq \gamma ({\tilde{\nu }}(\alpha (X_1, \ldots , X_p))) \end{aligned}$$
(5)

where we abuse notation and denote with \({\tilde{\nu }}(\cdot )\) the family \(\{ {\tilde{\nu }}(P_1), \ldots , {\tilde{\nu }}(P_s) \}\) with \(\{P_1, \ldots , P_s\} = \alpha (X_1, \ldots , X_p)\).

To represent polytopes and define the computations performed by abstract layers we resort to a specific subclass of generalized star sets, introduced in Bak and Duggirala (2017) and defined as follows—notation and proofs are derived from Tran et al (2019).

Definition 4

(Generalized star set) Given a basis matrix \(V \in \mathbb {R}^{n \times m}\) obtained arranging a set of m basis vectors \(\{v_1, \ldots v_m\}\) in columns, a point \(c \in \mathbb {R}^n\) called center and a predicate \(R: \mathbb {R}^m \rightarrow \{\top , \bot \}\), a generalized star set is a tuple \(\Theta = (c,V,R)\). The set of points represented by the generalized star set is given by

$$\begin{aligned} [\![ \Theta ]\!] \equiv \{z \in \mathbb {R}^n \mid z = Vx + c \text { s.t. } R(x_1, \ldots , x_m) = \top \} \end{aligned}$$
(6)

In the following we denote \([\![ \Theta ]\!]\) also as \(\Theta \). Depending on the choice of R, generalized star sets can represent different kinds of sets, but we consider only those such that \(R(x):= Cx \le d\), where \(C \in \mathbb {R}^{p \times m}\) and \(d \in \mathbb {R}^p\) for \(p \ge 1\), i.e., R is a conjunction of p linear constraints as in Tran et al (2019); we further require that the set \(Y = \{y \in \mathbb {R}^m \mid C y \le d\}\) is bounded.
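Under these restrictions a star is fully described by the tuple (c, V, C, d), and checking whether a point z belongs to \([\![ \Theta ]\!]\) reduces to an LP feasibility problem. The sketch below is our illustration with scipy, not pyNeVer’s Star class.

```python
import numpy as np
from scipy.optimize import linprog

def star_contains(ctr, V, C, d, z):
    """z is in the star (ctr, V, C, d) iff some x satisfies
    Vx + ctr = z and Cx <= d: a feasibility LP with zero objective."""
    m = V.shape[1]
    res = linprog(np.zeros(m), A_ub=C, b_ub=d, A_eq=V, b_eq=z - ctr,
                  bounds=[(None, None)] * m, method="highs")
    return res.status == 0  # 0 = solved (feasible), 2 = infeasible

# The unit square [0,1]^2 as a star: center 0, basis I, predicate Cx <= d.
C = np.vstack([np.eye(2), -np.eye(2)])
d = np.array([1.0, 1.0, 0.0, 0.0])
print(star_contains(np.zeros(2), np.eye(2), C, d, np.array([0.5, 0.5])))  # True
print(star_contains(np.zeros(2), np.eye(2), C, d, np.array([2.0, 0.5])))  # False
```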

Proposition 1

Given a generalized star set \(\Theta = (c,V,R)\) such that \(R(x):= Cx \le d\) with \(C \in \mathbb {R}^{p \times m}\) and \(d \in \mathbb {R}^p\), if the set \(Y = \{y \in \mathbb {R}^m \mid C y \le d\}\) is bounded, then the set of points represented by \(\Theta \) is a polytope in \(\mathbb {R}^n\), i.e., \(\Theta \in \langle \mathbb {R}^n \rangle \).

This definition allows us to represent polytopes as generalized star sets. Henceforth, we will refer to generalized star sets adhering to these constraints simply as “stars”. The most straightforward abstract layer to obtain is the one abstracting affine transformations. Since affine transformations of polytopes are still polytopes, we just need to define how to apply an affine transformation to a star—the definition is adapted from Tran et al (2019).

Definition 5

(Abstract affine mapping) Given a star set \(\Theta = (c,V,R)\) and an affine mapping \(f: \mathbb {R}^n \rightarrow \mathbb {R}^m\) with \(f(x) = Ax + b\), the abstract affine mapping \({\tilde{f}}: \langle \mathbb {R}^n \rangle \rightarrow \langle \mathbb {R}^m \rangle \) of f is defined as \({\tilde{f}}(\Theta ) = ({\hat{c}},{\hat{V}},R)\) where

$$\begin{aligned} \qquad \quad {\hat{c}} = Ac + b \qquad {\hat{V}} = AV \end{aligned}$$

Intuitively, the center and the basis vectors of the input star \(\Theta \) are affected by the transformation of f, while the predicates remain the same.
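Definition 5 translates directly into two matrix products, as in this sketch (ours; the star is the (center, basis, predicate) tuple of the previous snippet):

```python
import numpy as np

def abstract_affine(ctr, V, C, d, A, b):
    """Abstract affine mapping of Definition 5: center and basis are
    pushed through f(x) = Ax + b, the predicate (C, d) is unchanged."""
    return A @ ctr + b, A @ V, C, d

# Push the unit-square star through f(x) = Ax + b.
C = np.vstack([np.eye(2), -np.eye(2)])
d = np.array([1.0, 1.0, 0.0, 0.0])
A, b = np.array([[1.0, 2.0], [0.0, 1.0]]), np.array([0.5, -0.5])
new_ctr, new_V, _, _ = abstract_affine(np.zeros(2), np.eye(2), C, d, A, b)
print(new_ctr)  # the translated center Ac + b
print(new_V)    # the sheared basis AV
```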

Proposition 2

Given an affine mapping \(f: \mathbb {R}^n \rightarrow \mathbb {R}^m\), the corresponding abstract mapping \({\tilde{f}}: \langle \mathbb {R}^n \rangle \rightarrow \langle \mathbb {R}^m \rangle \) provides a consistent abstraction over any bounded set \(X \subset \mathbb {R}^n\), i.e., \(\{ f(x) \mid x \in X \} \subseteq \gamma ({\tilde{f}}(\alpha (X)))\) for all \(X \subset \mathbb {R}^n\).

4.1 ReLU abstraction algorithms

The algorithm in Fig. 1 (Guidotti et al 2021) defines the abstract mapping of a functional layer with n ReLU activation functions and adapts the methodology proposed in Tran et al (2019). The function compute_layer takes as input an indexed list of N stars \(\Theta _1, \ldots , \Theta _N\) and an indexed list of n non-negative integers called refinement levels. For each neuron, the refinement level tunes the grain of the abstraction: level 0 corresponds to the coarsest abstraction that we consider—the greater the level, the finer the abstraction grain. In the case of ReLUs, all non-zero levels map to the same (precise) refinement, i.e., a piece-wise affine mapping. The output of function compute_layer is still an indexed list of stars, which can be obtained by independently processing the stars in the input list. For this reason, the for loop starting at line 3 can be parallelized to speed up actual implementations.

Fig. 1
figure 1

Abstraction of the ReLU activation function

Given a single input star \(\Theta _i \in \langle \mathbb {R}^n \rangle \), each of the n dimensions is processed in turn by the for loop starting at line 5 and involving the function compute_relu. Notice that the stars obtained processing the j-th dimension are fed again to compute_relu in order to process the \((j+1)\)-th dimension. For each star given as input, the function compute_relu first computes the lower and upper bounds of the star along the j-th dimension by solving two linear-programming problems—function get_bounds at line 11. Independently from the abstraction level, if \(lb_j \ge 0\) then the ReLU acts as an identity function (line 13), whereas if \(ub_j \le 0\) then the j-th dimension is zeroed (line 14). The \(*\) operator takes a matrix M and a star \(\Gamma = (c, V, R)\) and returns the star \((Mc, MV, R)\). In this case, M is composed of the standard orthonormal basis in \(\mathbb {R}^n\) arranged in columns, with the exception of the j-th dimension, which is zeroed.

When \(lb_j < 0\) and \(ub_j > 0\) we consider the refinement level. For any non-zero level, the input star is “split” into two new stars, one considering all the points \(z < 0\) (\(\Theta _{low}\)) and the other considering the points \(z \ge 0\) (\(\Theta _{upp}\)) along dimension j. Both \(\Theta _{low}\) and \(\Theta _{upp}\) are obtained by adding the appropriate constraints to the input star input[k]. If the analysis at lines 17–18 is applied throughout the network, and the input abstraction is precise, then the abstract output range will also be precise, i.e., it will coincide with the concrete one: we call the analysis of NeVer2 complete in this case. The number of resulting stars is worst-case exponential, therefore the complete analysis may become computationally infeasible.
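The split of an unstable neuron can be sketched as follows (our simplified rendering of the idea, not the pyNeVer implementation): each child star inherits the parent’s predicate plus one linear constraint on \(z_j = V[j,:]x + c[j]\), and \(\Theta _{low}\) additionally zeroes the j-th output dimension via the \(*\) operator.

```python
import numpy as np

def split_relu(ctr, V, C, d, j):
    """Split a star on ReLU neuron j, where z_j = V[j,:] x + ctr[j]."""
    n = V.shape[0]
    # Theta_upp: keep z_j >= 0, i.e. add -V[j,:] x <= ctr[j] to the predicate.
    upp = (ctr, V, np.vstack([C, -V[j, :]]), np.append(d, ctr[j]))
    # Theta_low: keep z_j <= 0, i.e. add V[j,:] x <= -ctr[j], then zero
    # dimension j through M * Theta = (M c, M V, R).
    M = np.eye(n)
    M[j, j] = 0.0
    low = (M @ ctr, M @ V, np.vstack([C, V[j, :]]), np.append(d, -ctr[j]))
    return low, upp
```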

Table 1 List of the concrete layer classes available in pyNeVer, grouped by functionality

The linear-programming problem solved by get_bounds can be formalized as follows:

$$\begin{aligned} \min / \max \quad&z_j = {\textbf {V}}[j,:] \, {\textbf {x}} + c[j] \\ \text {subject to} \quad&\textbf{C} {\textbf {x}} \le \textbf{d} \end{aligned}$$

The problem must be solved as both minimization and maximization to provide the lower and upper bound, respectively. It should be noted that the size of the problem increases with the number of variables in the predicate of the star of interest. As a consequence, the runtime of get_bounds increases as more over-approximation steps are applied to the stars of interest, whereas for split stars the size of the problem remains the same. Given a ReLU mapping \(f: \mathbb {R}^n \rightarrow \mathbb {R}^n\), the corresponding abstract mapping \({\tilde{f}}: \langle \mathbb {R}^n \rangle \rightarrow \langle \mathbb {R}^n \rangle \) defined in Fig. 1 provides a consistent abstraction over any bounded set \(X \subset \mathbb {R}^n\), i.e., \(\{ f(x) \mid x \in X \} \subseteq \gamma ({\tilde{f}}(\alpha (X)))\) for all \(X \subset \mathbb {R}^n\).
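With an off-the-shelf LP solver the bound computation can be sketched as follows (our illustration of the formulation above, not pyNeVer’s actual get_bounds; the star is assumed non-empty):

```python
import numpy as np
from scipy.optimize import linprog

def get_bounds(ctr, V, C, d, j):
    """Bounds of z_j = V[j,:] x + ctr[j] over the predicate Cx <= d,
    obtained by minimizing and maximizing the same linear objective."""
    m = V.shape[1]
    free = [(None, None)] * m  # predicate variables are otherwise unbounded
    lo = linprog(V[j, :], A_ub=C, b_ub=d, bounds=free, method="highs")
    hi = linprog(-V[j, :], A_ub=C, b_ub=d, bounds=free, method="highs")
    return lo.fun + ctr[j], -hi.fun + ctr[j]

# Bounds of z_0 = x_0 + x_1 over the unit square: (0.0, 2.0).
C = np.vstack([np.eye(2), -np.eye(2)])
d = np.array([1.0, 1.0, 0.0, 0.0])
print(get_bounds(np.zeros(1), np.array([[1.0, 1.0]]), C, d, 0))
```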

5 Verification backend

pyNeVer (Guidotti et al 2021) serves as the backend for NeVer2. It is conceptualized as a modular API for managing DNNs, encompassing tasks from their building and training to verification. It comprises seven packages, each dedicated to either providing models of neural networks or implementing strategies for their conversion, training, and verification. To ensure a precise semantics of the internal model and to abstract away implementation details, we structured our own internal network representation as a graph, where nodes correspond to distinct layers. Additionally, we provide a representation for the datasets used in the learning phase. The system’s primary capabilities revolve around abstraction, training, and verification. These functionalities are predominantly organized using Strategy patterns, defining general interfaces for network operations, with specialized subclasses offering support for these operations. Furthermore, to harness the capabilities of contemporary learning frameworks, we devised a set of conversion strategies to and from our internal representation and the representations accepted by the learning frameworks.

5.1 Representation

The internal representation of a neural network is managed through two abstract base classes: NeuralNetwork and LayerNode. In essence, NeuralNetwork serves as a container for LayerNode objects organized within it as a graph. For internal purposes, a list of AlternativeRepresentation objects is maintained—refer to subsection 5.2 for more details. In the current implementation, the only concrete subclass of NeuralNetwork is SequentialNetwork, representing networks where the corresponding graph is a list, meaning each layer is connected only to the next one. More intricate network topologies can be easily incorporated by creating other concrete subclasses of the abstract NeuralNetwork class. The concrete subclasses of LayerNode correspond to the network layers currently supported. Presently, based on the VNN-LIB (Demarchi et al 2023) specifications, the available layers, detailed in Table 1, suffice to encode sequential DNNs with the most commonly used activation functions and layers. Notably, our representation is not “executable”, that is, it lacks the capability to compute the output of a DNN given the input. As a consequence, our nodes contain only sufficient information to generate corresponding executable representations in different learning frameworks or to support encoding for verification purposes.

5.2 Conversion

The design of a model, aimed at generalizing those utilized in various learning frameworks, is based on the Adapter design pattern. We have introduced the abstract class AlternativeRepresentation, which is then specialized by ONNXNetwork and PyTorchNetwork to encode ONNX and pyTorch models, respectively. These concrete subclasses encapsulate the actual network model within the corresponding learning framework and facilitate the interchangeability of their formats, such as in the case of ONNX.

The capabilities for conversion between our internal representation and the concrete subclasses of AlternativeRepresentation are provided by the subclasses of ConversionStrategy. This can also be regarded as an implementation of the Builder pattern. ConversionStrategy defines an interface comprising two functions: one for converting from our internal representation to another specific representation, and the other for performing the inverse operation. The concrete subclasses of ConversionStrategy implement these functions for the corresponding concrete subclasses of AlternativeRepresentation. As new types of learning frameworks or architectures are integrated into pyNeVer, additional concrete subclasses of AlternativeRepresentation and ConversionStrategy will be introduced to support conversion.

5.3 Datasets

Datasets are managed through a versatile Dataset class, facilitating the direct encoding of Torch datasets for MNIST and fMNIST into the concrete classes MNISTDataset and FMNISTDataset. Additionally, the GenericFileDataset concrete class may be used to load any user-defined dataset, requiring specifications such as the dataset’s separator character, data type, and target index, which indicates the index distinguishing inputs from outputs within each row. A dataset Transform, comprising a series of functions to be applied to dataset elements, such as normalization or flattening, is an optional parameter, allowing for extensive customization using this specific dataset class.
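For illustration, a GenericFileDataset-style loader might look like the following hypothetical sketch, which mirrors the parameters described above (the actual class signature may differ):

```python
import numpy as np

def load_generic_dataset(path, data_type=float, delimiter=",", target_index=5):
    """Split each row at target_index: columns before it are the inputs,
    the remaining columns are the targets (e.g., 5 inputs for ACAS XU)."""
    data = np.loadtxt(path, dtype=data_type, delimiter=delimiter)
    return data[:, :target_index], data[:, target_index:]
```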

5.4 Abstraction

pyNeVer is a verifier which leverages abstract interpretation using star sets (Tran et al 2019; Demarchi et al 2022). To establish the abstract representation of a neural network, we use the same conceptual framework as the concrete network. This is achieved through the class AbstNeuralNetwork, which acts as a container for AbsLayerNode objects. Concrete subclasses such as AbsFullyConnectedNode, AbsReLUNode, AbsSigmoidNode, and AbsTanhNode define the algorithms for propagating bounded sets through the layers. For representing star sets, we utilize the Star class, which serves as the primary component for representing an abstract domain through a set of inequalities and an affine transformation. Additionally, the StarSet class is employed to create a collection of star elements for propagation throughout the network. The propagation algorithms, as detailed in Demarchi et al (2022), currently support only Fully Connected layers and ReLU, Sigmoid, or Tanh activation functions through the forward method of AbsLayerNode instances.

5.5 Training

To facilitate network training, we devised a training strategy that requires a NeuralNetwork and a Dataset instance. The strategy exposes a single function, train, whose application yields a trained, concrete NeuralNetwork object. Presently, we have implemented a single procedure, PyTorchTraining, which relies on pyTorch as the training backend and is tailored to our internal representation. To support diverse training procedures and backends, it is possible to implement new subclasses of the abstract class TrainingStrategy and tailor them accordingly.

5.6 Verification

The ultimate objective of pyNeVer is to leverage abstract interpretation for verifying a specified property. Built upon the VNN-LIB standard (Demarchi et al 2023), our verification framework features the abstract class VerificationStrategy, which represents a generic interface comprising a single function. This function requires a NeuralNetwork instance alongside a Property, and returns a Boolean value indicating whether the property is verified or not, along with a counterexample (if available). The abstract class Property presents two subclasses: NeVerProperty and LocalRobustnessProperty. NeVerProperty embodies a general property conforming to the VNN-LIB standard and, as a consequence, can also be parsed from an SMT-LIB (Barrett et al 2010) file. It comprises the input bounds and the output unsafe region(s). Conversely, LocalRobustnessProperty serves as a “template” property, encoding the search for an adversarial example corresponding to a specific data sample. Concrete subclasses of VerificationStrategy include NeVerVerification, which constitutes our principal contribution detailed in Guidotti et al (2021) and Demarchi et al (2022), along with a refinement-based variant named NeVerVerificationRef.

6 System overview

From a user’s perspective, the primary feature of NeVer2 is to offer a graphical interface for interacting with the functionalities of both pyTorch and pyNeVer. The graphical user interface (GUI) of NeVer2 is constructed using the PyQt6 graphical library, with its main architecture outlined in Fig. 2. The application is structured as a canvas where nodes representing the network layers can be displayed and organized. A sidebar on the left provides buttons for drawing the supported layers, while another sidebar on the right can be opened on demand to display detailed information on specific layers. The training and verification windows can be accessed via the menu bar, which also includes placeholders for future features. In the following, we provide a detailed description of the environment and the resources available. It is worth noting that the graphical interface of NeVer2 is shared with another tool of ours, CoCoNet (Demarchi et al 2023), which is used for visualizing and converting DNNs in various formats.

Fig. 2
figure 2

UML Class Diagram representing the main software components of NeVer2. Using the PyQt API we leverage the QGraphicsView and QGraphicsScene interfaces to build a workspace in the QMainWindow. On the other hand, the class Scene serves as a controller for the creation and display of graphics blocks and as an interface to pyNeVer components

6.1 Building

In Fig. 3, a screenshot of NeVer2’s graphical interface is displayed. The interface shows input and output blocks for defining the network input and corresponding labels (first and last blocks). Additionally, it shows the definition of a fully connected layer with 50 neurons (second block), which is automatically added sequentially to the network, followed by a ReLU activation layer (third block). Before adding a new layer, it is necessary to save the current layer’s parameters (including those of the input block, which specifies the input dimension) by clicking the block’s “Save” button. The “Restore defaults” option resets the values to their default settings without overwriting. At this stage, the neural network designed graphically is supported by the internal representation outlined in Sect. 5.1, which performs the checks needed to ensure a well-formed network. For instance, attempting to add incompatible layers, such as convolutional layers with a single-dimension shape, results in error messages explaining why the connection is not possible. Once the network is finalized, it can be saved in the ONNX or pyTorch file formats by navigating to the “File... \(\rightarrow \) Save/Save as...” menu.

Fig. 3
figure 3

Screenshot of the NeVer2 interface. The main component is the canvas, where layers are depicted as blocks. On the left side, there is a list of available layers. In this screenshot, we have added a fully connected layer followed by a ReLU layer

6.2 Training

Suppose we wish to train a neural network on the ACAS XU dataset (Katz et al 2017). In Fig. 4, we have clicked on the menu “Learning... \(\rightarrow \) Train...” and can see the window for setting up the procedure. Initially, we select the dataset as a “Custom data source”. As mentioned in Sect. 5.3, we directly provide access to the MNIST and fMNIST datasets. However, any dataset can be loaded through our interface as long as it is formatted as a text file. When importing a custom data source, the user is prompted to enter the expected data type, the delimiter character, and the target index. The default values for the data type and delimiter character, matching those used in ACAS XU, are “float” and the comma character, respectively.

Fig. 4
figure 4

Screenshot of the training window in NeVer2. Here we see the required parameters and the selection of the dataset as a custom data source

While some networks may include pre-processing layers for, e.g., normalization, in VNN-LIB we strongly discourage this behavior because the properties and verification algorithms are defined on already normalized networks. For this reason, it is possible to apply a transform to the dataset following pyTorch’s style. In Fig. 5, we demonstrate how a custom transform can be applied by combining one or more functions, such as normalization and flattening for the ACAS XU dataset. It is important to note that the normalization parameters must be provided by the user when selecting the corresponding transform function. We also provide two transforms whose parameters are already specifically tailored for convolutional and fully connected networks for the MNIST and fMNIST datasets.

Fig. 5
figure 5

Screenshot of the dataset transform builder in NeVer2. The left list shows the available transforms, which can be composed in the right list using the two keys in the middle

Finally, we can initiate the training procedure by specifying the remaining parameters in the window. The learning algorithm utilizes the Adam optimizer (Kingma and Ba 2015) for pyTorch networks, which means that the model is internally converted to pyTorch before training. The window provides selectors for the Optimizer and the Scheduler, although there is only one option available at this time. Both the optimizer and the scheduler have additional parameters accessible on the right side of the dialog. In Fig. 6, we can see the parameters related to the Adam optimizer. Next, we can choose the Loss function (either Cross Entropy or MSE loss) and the Precision metric (either Inaccuracy or MSE loss). Then, we specify the number of training Epochs, the portion of the dataset to be used as the Validation set, and the Training batch and Validation batch sizes. Additionally, we can utilize the CUDA cores of the GPU, if supported, set an early stopping criterion using the Train patience, adjust the directory for storing Checkpoints, and control the Verbosity level.

Fig. 6
figure 6

Screenshot of the filled training window in NeVer2

6.3 Verification

After training the network, we can specify VNN-LIB properties concerning both the input and the output. We show in Fig. 7 the input bounds for property P3 of the ACAS XU benchmark. Utilizing the “Generic SMT” option available in the input block, we simply select “Add property” to access a text area where we can input plain SMT-LIB constraints. This visualization is also the default when opening a property file.

Fig. 7
figure 7

Screenshot of a “Generic SMT” property definition in NeVer2

Fig. 8
figure 8

Property definition alternatives in NeVer2 with Box (left) and Polyhedral (right) interfaces. a Screenshot of a “Box” property definition in NeVer2. The number of lower and upper bounds must be consistent with the network input. b Screenshot of a “Polyhedral” property definition in NeVer2. Here each variable can be bounded separately

While the standard representation of properties in VNN-LIB follows the specific format presented, we also offer two more user-friendly interfaces to enable non-experts to define their own properties. In Fig. 8a and Fig. 8b, we present the same input bounds as depicted in Fig. 7, but defined using the “Box” option for lower and upper bounds, or the “Polyhedral” option for bounding variable-by-variable with constraints, respectively. Note that in these representations, bounding values are truncated for readability. Once a property is defined, a corresponding block is added to the canvas and connected to the input or output node, allowing visualization and modification of the property.
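For reference, the following sketch shows the kind of text the “Generic SMT” area accepts: a hypothetical property for a network with two inputs and two outputs, written in VNN-LIB/SMT-LIB syntax (the bounds and the unsafe region Y_0 ≥ Y_1 are invented for illustration).

```python
# A hypothetical VNN-LIB property: boxed inputs, unsafe region Y_0 >= Y_1.
property_smt = """\
(declare-const X_0 Real)
(declare-const X_1 Real)
(declare-const Y_0 Real)
(declare-const Y_1 Real)

(assert (>= X_0 -0.1))
(assert (<= X_0  0.1))
(assert (>= X_1 -0.1))
(assert (<= X_1  0.1))

(assert (>= Y_0 Y_1))
"""

with open("property.vnnlib", "w") as f:
    f.write(property_smt)
```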

Finally, we can verify the property by selecting “Verification... \(\rightarrow \) Verify...” from the menu. This action opens the verification dialog shown in Fig. 9, where we can choose the verification strategy based on different verification algorithms detailed in Demarchi et al (2022) and briefly described in the following.

Complete analysis With this algorithm, whenever a neuron is unstable for dimension j, the input star is “split” into two new stars: one considering all points where \(z < 0\) (\(\Theta _{low}\)), and the other considering points where \(z \ge 0\) (\(\Theta _{upp}\)) along dimension j. Both \(\Theta _{low}\) and \(\Theta _{upp}\) are derived by adding appropriate constraints to the input star input[k]. If the analysis at lines \(20 - 21\) of Algorithm 1 is applied throughout the network, and the input abstraction is precise, then the abstract output will also be precise, meaning it will coincide with the concrete one. In this case, we refer to the analysis of NeVer2 as “complete”. However, due to the worst-case exponential growth in the number of resulting stars, the complete analysis may become practically infeasible.

Over-approximate analysis With this algorithm, the ReLU function is abstracted using the over-approximation method proposed in Tran et al (2019). This approach is considerably less conservative compared to others, such as those based on zonotopes or abstract domains, and yields a tighter abstraction. If this analysis is applied across the entire network, the resulting output star will be a sound over-approximation of the concrete output range. In this case, we refer to the analysis of NeVer2 as “over-approximate”. Although the number of stars remains constant throughout the analysis, each unstable neuron introduces a new predicate variable, which linearly increases the size of the program to be solved by get_bounds.

Mixed analysis In Guidotti et al (2021), a novel approach is introduced to manage varying levels of abstraction during the analysis. Algorithm 1 is devised to control the abstraction at the individual neuron level, allowing for each neuron to have its own refinement level. This algorithm strikes a balance between the complete and over-approximate ones. To try to minimize the approximation error, neurons within each layer are ranked based on the area of their linear relaxation. Intuitively, neurons with wider bounds contribute to a broader area and consequently a larger approximation error. We then split the star along the neuron with the widest bounds and propagate the approximate method across the remaining neurons in the layer, ensuring that each layer undergoes at most a single split. This significantly reduces computational costs, as the growth becomes quadratic in the number of layers, and the complexity increase due to the approximation is contained. We refer to the analysis of NeVer2 in this case as “mixed”.
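The ranking heuristic can be sketched as below; we use the textbook area of the ReLU triangle relaxation over \([lb, ub]\), namely \(ub \cdot (-lb)/2\), as the proxy (our reading of the description above, not necessarily pyNeVer’s exact formula).

```python
def rank_unstable_neurons(bounds):
    """Rank neurons by the area of their ReLU triangle relaxation.

    bounds: list of (lb, ub) pairs, one per neuron. Only unstable
    neurons (lb < 0 < ub) are split candidates; widest area first."""
    areas = [(ub * -lb / 2.0, j)
             for j, (lb, ub) in enumerate(bounds) if lb < 0.0 < ub]
    return [j for _, j in sorted(areas, reverse=True)]

# Neuron 1, with bounds (-3, 4), has the largest relaxation area.
print(rank_unstable_neurons([(-1.0, 2.0), (-3.0, 4.0), (0.5, 2.0)]))  # [1, 0]
```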

The verification dialog records the outcomes of the verification process and returns either “True” or “False” based on whether the network is deemed safe or not.

Fig. 9
figure 9

Screenshot of the verification window in NeVer2. The three alternatives refer to different abstract propagation algorithms that are either faster or more precise

7 Evaluation and benchmarks

In this evaluation, we assess NeVer2 using a variety of verification tasks drawn from prior research studies, including some case studies proposed by us. It is worth noting that while the verification community has made significant strides in developing innovative methodologies, there remains a shortage of widely accepted benchmarks. Among the few available, the ACAS XU benchmark, introduced in 2017, remains the most prominent (Katz et al 2017). In our selection process, we aimed to choose tasks relevant to practical applications while ensuring that the neural networks involved were sufficiently small for existing verification techniques to be effectively applied.

Table 2 Details of the benchmarks on which we evaluate the performances of NeVer2 and other systems
Table 3 Result of the performance evaluation of NeVer2, NNV and \(\alpha \),\(\beta \)-CROWN

In Table 2, we present the selection of benchmarks considered in our evaluation. We include the ACAS XU benchmark both as a reference point and because it features the “deepest” models, comprising five layers with 30 neurons each. The safety properties for testing are the standard ones outlined in Katz et al (2017). Additionally, we introduce the ACC case study from our prior work Demarchi et al (2022). This case study involves training three network architectures with a total neuron count ranging from 30 to 90, and we assess five safety properties for each network. We also include three Reinforcement Learning benchmarks, namely Cartpole, Lunar Lander, and Dubins Rejoin, sourced from VNN-COMP 2022 (Müller et al 2022). These benchmarks feature networks of increasing complexity, with neuron counts ranging from 128 to 768, and entail a set of 100 local robustness properties, except for the Dubins Rejoin network, which has 96 properties. Lastly, we introduce the Drone hovering Reinforcement Learning benchmark, where the task is to control a drone to hover at a specified height by adjusting the RPM of its four rotors. We present eight architectures with neuron counts ranging from 48 to 448, along with two local robustness properties for each architecture (Demarchi 2023).

Fig. 10
figure 10

Graphical analysis of the performance evaluation of NeVer2, NNV and \(\alpha \),\(\beta \)-CROWN using cactus plots. For each benchmark class we report the CPU time taken by the verification algorithms to solve an instance. As in Table 3, for NeVer2 we consider only “True” answers as valid data when using abstract algorithms

Table 3 and Fig. 10 present the results of our experimental evaluation, conducted on a cluster comprising identical PCs featuring 12th Gen Intel Core i7-12700KF processors @3.61GHz and 32GB of RAM, running Ubuntu Linux 20.04.6 LTS. All experiments were executed with a timeout of 15 min, which exceeds the VNN-COMP timeout of 5 min to accommodate the substantial difference in computing infrastructure between our setup and that of VNN-COMP. We compare the performance of NeVer2 across three settings: Over-approx, Mixed, and Complete, against the solvers NNV and \(\alpha \),\(\beta \)-CROWN in a complete verification setting. We focus on these tools because NNV is the sole VNN-COMP contestant employing reachability analysis using stars, while \(\alpha \),\(\beta \)-CROWN has emerged as the winner in the last three VNN-COMP editions. This comparative analysis enables us to position NeVer2 in relation to other state-of-the-art tools.

In Fig. 10, we present cactus plots illustrating the performance of the five algorithms across each benchmark class outlined in Table 2. In cactus plots, the results of each solver on each benchmark (specific network and property) are arranged independently from the other solvers. Consequently, points at the same position on the x-axis may not correspond to the same benchmark, and benchmarks where solvers time out are excluded from consideration. Conceptually, a cactus plot provides an indication of “how far a solver can progress” relative to a family of benchmarks. Generally, the further the cactus arm extends to the right and the lower it is on the y-axis, the better the solver’s performance. In Fig. 10, we illustrate the outcomes of algorithms that returned either “True” or “False” in a complete verification setting, and only “True” for the incomplete verification settings of NeVer2. This distinction is made because in incomplete verification, “False” is not guaranteed to be a correct answer due to potential approximation errors. To complement our experimental findings, we present Table 3, which highlights the number of solved instances per benchmark. This tabular format is chosen as cactus plots are not conducive to displaying such detailed information.

Upon reviewing the plots, it becomes evident that the abstract algorithms employed by NeVer2 consistently yield responses in less time compared to the complete algorithm, whenever applicable. Conversely, the complete algorithm consistently delivers a conclusive answer, successfully identifying both safe and unsafe networks. Specifically, in both the ACC and Drones benchmarks, all properties are deemed unsafe, leading to the absence of records for abstract algorithms in the plots. For Cartpole, abstract algorithms verify 95 benchmarks, yet the difference in time between abstract and complete algorithms is negligible, making it challenging to distinguish between them. In Lunar Lander, abstract algorithms verify 18 instances, coinciding with the markers for \(\alpha \),\(\beta \)-CROWN in the plot. Under the 15-min timeout, the complete algorithm manages to verify all instances in the ACC and Cartpole classes, nearly all instances in the Drones and Lunar Lander classes, and more than half of the ACAS XU instances. However, the Dubins Rejoin class poses the greatest challenge, with only a few instances solved within the timeout period. This difficulty is expected due to the exponential growth in the number of stars corresponding to the increasing number of neurons, particularly when dealing with layers featuring 256 neurons right from the start, which significantly impact performance.

Upon evaluating the performances of NNV and \(\alpha \),\(\beta \)-CROWN, it becomes evident that NeVer2 performs quite well compared to NNV, albeit slower than \(\alpha \),\(\beta \)-CROWN. However, it demonstrates promising results, particularly in scenarios involving over-approximation and mixed algorithms. When examining the Cartpole benchmark, we can evaluate the setup overhead of the tools due to its simplicity. Here, NNV demonstrates that its MATLAB implementation incurs less setup overhead compared to NeVer2 and \(\alpha \),\(\beta \)-CROWN. The performance gap is nearly two orders of magnitude; however, it is important to note that the runtimes are still relatively small for all the tools in the comparison. Additionally, it is worth mentioning that on this benchmark, NNV scales slightly worse than both NeVer2 and \(\alpha \),\(\beta \)-CROWN. The findings from the case studies ACAS XU, Drones, and Lunar Lander exhibit notable similarities: \(\alpha \),\(\beta \)-CROWN, leveraging rapid bound propagation algorithms and GPU-based optimizations, effectively manages nearly every instance and typically outperforms both NNV and NeVer2 in terms of speed and instance resolution. NeVer2 consistently demonstrates faster performance than NNV, except in cases where its longer setup time hinders its efficacy. For completeness, it is worth noting that \(\alpha \),\(\beta \)-CROWN prematurely halted on two instances of ACAS XU due to a memory leak. The Dubins Rejoin benchmark clearly demonstrates the superiority of \(\alpha \),\(\beta \)-CROWN: while NeVer2 occasionally achieves comparable performance using incomplete abstract methods, \(\alpha \),\(\beta \)-CROWN consistently delivers conclusive results. Finally, the ACC case study presents a relatively straightforward benchmark. However, both NNV and \(\alpha \),\(\beta \)-CROWN struggled to address it because they cannot represent the required pre-conditions, which do not reduce to a hyper-rectangle in the input domain, that is, to the “standard” robustness-property pre-condition. In the case of NNV, the failure is due to a problem with the property parser, since the verification algorithm itself is quite similar to that of NeVer2 and should be capable of handling these benchmarks.

8 Conclusions

In this paper, we have presented NeVer2, the sole system presently integrating design, training, and verification functionalities for a significant subset of DNNs. Despite being a research prototype, NeVer2 facilitates the verification of small-to-medium scale networks which hold practical utility in control applications. Both NeVer2 and its verification backend, pyNeVer, are engineered to be easily extendable; their code is clear and well-documented to foster further contributions and extensions from the research community.

We assessed the performance of NeVer2 by comparing it with two other tools showcased in VNN-COMP: \(\alpha \),\(\beta \)-CROWN and NNV. The comparison results indicate that, while NeVer2 is slower than \(\alpha \),\(\beta \)-CROWN—the winner of the last three VNN-COMPs—it outperforms NNV, the sole other VNN-COMP contestant employing similar algorithms and data structures. Additionally, NeVer2 demonstrates the ability to handle more intricate safety and robustness specifications, as highlighted by the ACC case study.

Regarding verification capabilities, NeVer2 is applicable only to a subset of state-of-the-art verification benchmarks featuring feed-forward neural networks with ReLU activation functions. Furthermore, at this time NeVer2 does not support all operators available in pyTorch, meaning that certain DNNs trainable in pyTorch may not be visualized or verified in NeVer2. Nevertheless, commonly used operators are already accessible, and the set can be expanded to encompass additional ones.

NeVer2 and pyNeVer are open-source and can be freely downloaded for research and educational purposes from:

http://www.neuralverification.org/