
1 Introduction

Verifying that neural networks satisfy desirable properties has become crucial for ensuring the safety of learning-enabled autonomous systems. However, most existing approaches that provide guarantees on the satisfaction of a specification are designed for adversarial input uncertainties, and offer only qualitative assessments. Complete methods return whether or not the property is satisfied but are often not scalable, while sound methods return either that a property is satisfied or that the answer is unknown, due to over-approximation errors. For instance, the analyzers DeepZ [27], DeepPoly [28] and Verinet [13] propagate zonotopes, polyhedra, and symbolic intervals, respectively, through the layers of a neural network, to ensure that certain specifications are met. In addition, several analyzers also produce robustness bounds for networks, as done by CROWN [35], FCROWN [16] and CNN-Cert [5].

In contrast, quantitative verification has been little explored for neural networks, despite providing a better understanding of the system by refining information about property satisfaction. This is especially true for probabilistic verification. Some authors have considered estimating the statistics of the output of neural networks, given a multivariate probabilistic law for the inputs. This approach has been used in particular for assessing the robustness of neural networks [31, 36] and for probabilistically certifying their correctness under adversarial attacks [2, 3, 14, 20, 34]. But these estimates, even with improved sampling methods, do not come with guaranteed bounds. Even fewer articles have considered guaranteed probabilistic bounds. In [32], the authors describe the analyzer PROVEN, which provides probability certificates of neural network robustness when the input perturbation is given by a probabilistic distribution, based on abstractions. For networks with ReLU activation function, methods have been developed in [21, 22] to compute the probability of output events or of input-output relationships. In [8], the authors consider an ellipsoid input space with Gaussian random variables and compute confidence ellipsoids for the outputs of ReLU networks, using semi-definite programming. In [29], the authors consider truncated multivariate Gaussian distribution inputs and abstract them by probabilistic stars (ProbStars), a variation of the star set abstraction [1] recently introduced in the context of reachability analysis. They propagate them in a guaranteed manner through a network, and estimate the probability of violating a safety property on the output by computing each ProbStar's probability. In a way, ProbStar is a hybrid method, relying on guaranteed set-based computations, but estimating the probabilities in a non-guaranteed manner.

The works mentioned above rely on the hypothesis of either set-based or probabilistic information on the inputs. However, in real-world systems, precise models representative of the data are not always available. For instance, several probabilistic models may be plausible for describing a problem, or a probabilistic model may be known but with uncertain parameters. Therefore, we need to consider both aleatory information and epistemic uncertainty. Imprecise probabilities [4, 30] offer a framework that unifies probabilistic and set-based information. This framework includes a wide variety of mathematical models, among which probability boxes (p-boxes in short) [9], which characterize an uncertain random variable by all probability distributions consistent with lower and upper bounds on its cumulative distribution function (CDF in short). A p-box can be seen as interval bounds on a probability distribution. An interval-based discrete over-approximation of p-boxes, the Interval Dempster-Shafer structure [25] (DSI in short), has been proposed. Algorithms for arithmetic operations on DSI can be derived [9, 33]; they can be seen as a unification of standard interval analysis with traditional probability theory, allowing probability bound analysis on arithmetic expressions. This analysis gives the same answer as interval analysis when only range information is available, and it gives sound bounds on the distribution function, as a guaranteed counterpart of Monte Carlo simulation, when the information is precise enough to fully specify the input distributions and their dependencies. However, DSI arithmetic is expensive and suffers from the conservativeness of interval arithmetic, on which it relies. Probabilistic affine forms have been proposed as an alternative [6, 7], improving both precision and efficiency by combining affine forms (or zonotopes) with DSI structures.

In this work, we first extend Interval Dempster-Shafer arithmetic (Sect. 3) and probabilistic affine arithmetic (Sect. 4) to the analysis of ReLU neural networks. We then introduce in Sect. 5 a new abstraction, Zonotopic Dempster-Shafer structures, which exhibits much better computational properties (complexity and tightness of the approximations). This new abstraction is directly related to the general notion of random sets [24], which generalize one-dimensional Dempster-Shafer structures such as the DSI. Finally, in Sect. 6 we demonstrate on benchmarks from the state of the art that our approach improves on the most closely related work [29] in terms of the tradeoff between tightness and efficiency, while handling much more general classes of input uncertainties and providing fully guaranteed results.

2 Problem Statement

We consider an L-layer feedforward ReLU network with input \(x^0 \in \mathbb {R}^{h_0}\) and output \(y = f(x^0) = x^L \in \mathbb {R}^{h_L}\), with f being the composition of L layers, \(f = f^{L-1} \circ \ldots \circ f^0\). The k-th layer of the ReLU network is defined by \(f^k : \mathbb {R}^{h_k} \rightarrow \mathbb {R}^{h_{k+1}}\) of the form \( x^{k+1} =f^k(x^k)= \sigma (A^k x^k +b^k),\) where \(A^k \in \mathbb {R}^{h_{k+1} \times h_k}\) is the weight matrix, \(b^k \in \mathbb {R}^{h_{k+1}}\) is the bias, and \(\sigma (x^k_j) := \max (0,x^k_j)\) is the component-wise ReLU function, where \(x^k_j\) is the jth component of \(x^k \in \mathbb {R}^{h_k}\).

For a multi-dimensional random variable \(X \in \mathbb {R}^n\), we note \(\mathbb {P}(X \le x) := \mathbb {P}(X_1 \le x_1 \wedge X_2 \le x_2 \wedge \ldots \wedge X_n \le x_n)\), and in what follows, we systematically use \(\le \) as the componentwise order. We are interested in the following two problems, extending, in particular to a larger class of inputs, the quantitative verification properties of [29]:

Problem 1

(Probability Bounds Analysis). Given a ReLU network f and a constrained probabilistic input set \(\mathcal{X}= \{ X \in \mathbb {R}^{h_0} \mid C X \le d \wedge \underline{F}(x) \le \mathbb {P}(X \le x) \le \overline{F}(x), \forall x \}\) where \(\underline{F}\) and \(\overline{F}\) are two cumulative distribution functions, compute a constrained probabilistic output set \(\mathcal{Y}\) guaranteed to contain \(\{f(X), X \in \mathcal{X} \}\).

Problem 2

(Quantitative Property Verification). Given a ReLU network f, a constrained probabilistic input set \(\mathcal{X}\) and a linear safety property \(Hy \le w\), bound the probability of the network output vector y satisfying this property.

We illustrate the approach throughout the paper on the toy example below. The details of the analyses of this example, whose results are stated in later sections, are provided in the Appendix of the preprint [10].

Example 1

We consider the ReLU network defined by the matrices of weights and biases: \( A_1 = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}, b_1 = \begin{bmatrix} 0.0 \\ 0.0 \end{bmatrix}, A_2 = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}, b_2 = \begin{bmatrix} 0.0 \\ 0.0 \end{bmatrix}. \) We take only one ReLU layer and an affine output layer. After the ReLU layer, we note \(x^1 = \sigma (A_1 x^0 + b_1) = \sigma (x^0_1-x^0_2,x^0_1+x^0_2)\), and after the output layer \(x^2 = A_2 x^1 + b_2\).

The problem is to verify the network against the unsafe output set \(x^2_1 \le -2 \wedge x^2_2 \ge 2\) for an input \(x^0=(x^0_1,x^0_2) \in [-2,2] \times [-1,1]\). This can be written \(H x^2 \le w\) with \(H = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}, w = \begin{bmatrix} -2 \\ -2 \end{bmatrix}\).
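As a point of reference, here is a small NumPy sketch (ours, not part of the paper) of this toy network, together with a plain Monte Carlo estimate of the probability of reaching the unsafe set for uniform inputs; such an estimate carries no guarantee, which is precisely what the analyses of the following sections provide.

```python
import numpy as np

# Toy network of Example 1: one ReLU layer followed by an affine output layer.
A1 = np.array([[1.0, -1.0], [1.0, 1.0]]); b1 = np.zeros(2)
A2 = np.array([[1.0, -1.0], [1.0, 1.0]]); b2 = np.zeros(2)

def f(x0):
    x1 = np.maximum(0.0, A1 @ x0 + b1)   # ReLU layer
    return A2 @ x1 + b2                  # affine output layer

# Unsafe output set x^2_1 <= -2 and x^2_2 >= 2, written H x^2 <= w.
H = np.array([[1.0, 0.0], [0.0, -1.0]]); w = np.array([-2.0, -2.0])

# Plain Monte Carlo estimate (no guarantee) for uniform inputs on [-2,2] x [-1,1].
rng = np.random.default_rng(0)
xs = rng.uniform([-2.0, -1.0], [2.0, 1.0], size=(100_000, 2))
print(np.mean([np.all(H @ f(x) <= w) for x in xs]))
```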

3 Analysis with Interval Dempster-Shafer Structures

Probability-Boxes and Interval Dempster-Shafer Arithmetic. We characterize a real-valued random variable X by its cumulative probability distribution function (CDF in short) \(F: \mathbb {R}\rightarrow [0,1]\), defined by \(F(x) = \mathbb {P}(X \le x)\). A p-box [9] is defined by a pair of CDFs:

Definition 1

(P-box). Given two CDFs \(\overline{F}, \underline{F}\), the p-box \([\underline{F},\overline{F}]\) represents the set of distribution functions F such that \(\underline{F}(x) \le F(x) \le \overline{F}(x)\) for all \(x \in \mathbb {R}\).

P-boxes can be combined in mathematical calculations, but analytical solutions are usually not available. Interval Dempster-Shafer structures [25] provide a simple way to soundly over-approximate the set of CDFs using a discrete representation, for which the arithmetic operations can be converted into a series of elementary interval calculations.

Definition 2

(Interval Dempster-Shafer Structure). An interval Dempster-Shafer structure (DSI in short) is a finite set of intervals, named focal elements, associated with a probability, written \(d=\bigl \{\langle \boldsymbol{x_1},w_1\rangle ,\langle \boldsymbol{x_2},w_2\rangle ,\dots ,\langle \boldsymbol{x_n},w_n\rangle \bigr \}\), where \(\boldsymbol{x_i}\) is an interval and \(w_i\in (0,1]\) is its probability, with \(\sum _{k=1}^nw_k=1\).

Proposition 1

(CDF of an Interval Dempster-Shafer Structure). A DSI \(d=\bigl \{\langle \boldsymbol{x_1},w_1\rangle ,\langle \boldsymbol{x_2},w_2\rangle ,\dots ,\langle \boldsymbol{x_n},w_n\rangle \bigr \}\) defines the discrete p-box \(\bigl [\underline{F_d},\overline{F_d}\bigr ]\) representing the sets of distributions such that \(\underline{F_d}(u) \le \mathbb {P}(X \le u) \le \overline{F_d}(u)\) with \(\underline{F_d}(u)=\sum _{\overline{\boldsymbol{x_i}}< u}w_i\) and \(\overline{F_d}(u)=\sum _{\underline{\boldsymbol{x_i}}\le u}w_i\), noting \(\underline{\boldsymbol{x_i}}\) (resp. \(\overline{\boldsymbol{x_i}}\)) the greatest lower bound (resp. least upper bound) of set \(\boldsymbol{x_i}\).
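As an illustration, a minimal Python sketch (ours; the authors' implementation relies on the Julia library ProbabilityBoundsAnalysis.jl, see Sect. 6) of a DSI and of the CDF bounds of Proposition 1:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DSI:
    # Focal elements (lower, upper) and their probabilities, summing to 1.
    intervals: List[Tuple[float, float]]
    weights: List[float]

def cdf_bounds(d: DSI, u: float) -> Tuple[float, float]:
    """Bounds [F_lower(u), F_upper(u)] on P(X <= u), as in Proposition 1."""
    lower = sum(w for (lo, hi), w in zip(d.intervals, d.weights) if hi < u)
    upper = sum(w for (lo, hi), w in zip(d.intervals, d.weights) if lo <= u)
    return lower, upper

# DSI d^2_1 of the toy example analysed later in this section:
d = DSI([(-3.0, 1.0), (-1.0, 3.0)], [0.5, 0.5])
print(cdf_bounds(d, -2.0))   # (0.0, 0.5): P(x^2_1 <= -2) is in [0, 0.5]
```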

Conversely, discrete upper and lower approximations of distribution functions can be constructed, for instance using the inverse CDF as in [33]. Given a discretization size N, they define a DSI with N focal elements where all weights are equal to 1/N. The focal elements \(\boldsymbol{x_i}\), for \(i = 1, \ldots , N\), are defined by evaluating the quantiles, or inverse CDFs, at the uniformly spaced probability levels \(p_i = \frac{i-1}{N}\), \(i = 1, \ldots , N + 1\): \(\boldsymbol{x_i} = [\overline{F}^{-1}(p_i),\underline{F}^{-1}(p_{i+1})]\), where \(F^{-1}(p)=\inf \{x \mid F(x) \ge p \}\).
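A sketch of this construction for a single precisely known distribution (so that \(\underline{F} = \overline{F}\)); for a genuine p-box one would evaluate the inverse of the upper CDF at \(p_i\) and of the lower CDF at \(p_{i+1}\):

```python
def dsi_from_inverse_cdf(inv_cdf, N):
    """Discretize a distribution given by its quantile function into a DSI with N
    equal-weight focal elements (inverse-CDF construction)."""
    p = [i / N for i in range(N + 1)]                        # levels p_1, ..., p_{N+1}
    intervals = [(inv_cdf(p[i]), inv_cdf(p[i + 1])) for i in range(N)]
    return intervals, [1.0 / N] * N

# Uniform law on [-2, 2] (first input of Example 1), with 2 focal elements:
inv_uniform = lambda p: -2.0 + 4.0 * p
print(dsi_from_inverse_cdf(inv_uniform, 2))
# ([(-2.0, 0.0), (0.0, 2.0)], [0.5, 0.5]) -- the DSI d^0_1 used in the toy example below
```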

The arithmetic operators on DSI structures [9, 33] compute guaranteed enclosures of all possible distributions of an output variable if the input p-boxes enclose the input distributions. Let X and Y be two random variables represented by DSI structures \(d_X=\{\langle \boldsymbol{x_i},w_i\rangle ,\,i\in [1,n]\}\) and \(d_Y=\{\langle \boldsymbol{y_j},w'_j\rangle ,\ j\in [1,m]\}\), and let Z be the random variable \(Z=X+Y\) (the algorithms for other arithmetic operations are similar). In particular, these works define algorithms for the two extreme cases of unknown dependence and of independence between X and Y.

Definition 3

(Probabilistic Dependence and Dependence Graph). Two random variables \(X_1\) and \(X_2\) are independent if and only if their joint CDF can be decomposed as \(F(x_1,x_2)= F_1(x_1)F_2(x_2)\). Otherwise, the random variables are called correlated. The probabilistic dependence graph G over a set of n variables \(X_1, \ldots , X_n\) is an undirected graph where the \(X_i\) are the vertices and there exists an edge \((X_i,X_j)\) in the graph iff variables \(X_i\) and \(X_j\) are correlated.

The addition of independent variables represented by DSIs is obtained as a discrete convolution of the two input distributions [9, 33]:

Definition 4

(Addition of Independent DSIs). If X and Y are independent random variables, then the DSI for \(Z = X \oplus Y\) is \(d_Z=\{\langle \boldsymbol{z_{i,j}},r_{i,j}\rangle ,\ i\in [1,n],\ j\in [1,m]\}\) such that: \( \forall i\in [1,n],\ j\in [1,m],\ \boldsymbol{z_{i,j}}=\boldsymbol{x_i} + \boldsymbol{y_j}\text { and }r_{i,j}=w_i\times w'_j . \)

The number of focal elements grows exponentially with the number of such operations. In order to keep the computation tractable, the number of focal elements is usually bounded, at the cost of some over-approximations.
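A minimal sketch of Definition 4 together with the reduction step, using a plain list of ((lower, upper), weight) pairs for a DSI; the merging strategy below (sort by lower bound, take interval hulls of consecutive groups and sum their weights) is one simple sound choice among several:

```python
def dsi_add_independent(dx, dy):
    """Addition Z = X (+) Y of two independent DSIs (Definition 4): a discrete
    convolution producing n*m focal elements."""
    return [((xl + yl, xu + yu), wx * wy)
            for (xl, xu), wx in dx
            for (yl, yu), wy in dy]

def reduce_dsi(d, k):
    """Sound over-approximation with k focal elements (assumes len(d) >= k):
    join consecutive focal elements and add the corresponding weights."""
    d = sorted(d, key=lambda fw: fw[0][0])
    chunks = [d[i * len(d) // k:(i + 1) * len(d) // k] for i in range(k)]
    return [((min(lo for (lo, _), _ in c), max(hi for (_, hi), _ in c)),
             sum(w for _, w in c)) for c in chunks]

# d_{y_1} = d^0_1 - d^0_2 of the toy example analysed below, reduced to 2 focal elements:
d01 = [((-2.0, 0.0), 0.5), ((0.0, 2.0), 0.5)]      # uniform on [-2, 2]
neg_d02 = [((-1.0, 0.0), 0.5), ((0.0, 1.0), 0.5)]  # -x^0_2 for x^0_2 uniform on [-1, 1]
print(reduce_dsi(dsi_add_independent(d01, neg_d02), 2))
# [((-3.0, 1.0), 0.5), ((-1.0, 3.0), 0.5)]
```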

Different algorithms have been proposed for the addition of DSIs with unknown dependence, relying on the Fréchet-Hoeffding copula bounds or on linear programming; most of them produce the same result [9, 23, 33].

DSI Analysis of Neural Networks. We now define a sound probability bounds analysis of ReLU neural networks.

Modelling the Network Inputs. Consider an \(h_0\)-dimensional uncertain input vector \(x^0=(x^0_1,\ldots ,x^0_{h_0})\), which can be represented as a vector \(d^0 = (d^0_1, \ldots ,d^0_{h_0})\) of \(h_0\) DSI, each with the same number n of focal elements for simplicity of presentation: where \(\boldsymbol{x^0_{i,j}}\in \mathbb {I}\mathbb {R}\) is an interval and \(w^0_{i,j}\in [0,1]\) is the associated probability, with \(\sum _{j=1}^n w^0_{i,j}=1\), for all \(i \in 1,\ldots ,h_0\). A dependence graph is assumed to be known between the components of the input vector. Affine Transform of a Vector of DSI Structures. Given a vector of random variables \(X=(X_1,\ldots ,X_k)\) represented as a vector \(d = (d_1, \ldots ,d_k)\) of DSI structures, and a dependence graph G, we define a DSI \(d^y = \sum _{j=1}^k a_j d_j + b\) which includes the result of \(Y = \sum _{j=1}^k a_j X_j + b\) on the DSI d by:

  • we note \(a_j d_j\) where \(a_j \in \mathbb {R}\) and \(d_j\) is a DSI \(\left\{ \langle \boldsymbol{x_{j,i}},w_{j,i}\rangle \mid i=1,\ldots ,n\right\} \), the result of the multiplication of a constant by a DSI: \(\left\{ \langle a_j \boldsymbol{x_{j,i}},w_{j,i}\rangle \mid i=1,\ldots ,n\right\} \),

  • we arbitrarily choose to compute the sum \(\sum _{j=1}^k a_j d_j\) as \((((a_1 d_1 + a_2 d_2) + a_3 d_3) + \ldots ) + a_k d_k\), applying for the j-th sum the appropriate operator depending on the dependence between \(X_{j+1}\) and \(X_1, \ldots ,X_j\),

  • we note \(d + b\) where d is a DSI \(\left\{ \langle \boldsymbol{x_i},w_i\rangle \mid i=1,\ldots ,n\right\} \) and \(b \in \mathbb {R}\) the result of the addition of a constant to a DSI: \(\left\{ \langle b+ \boldsymbol{x_i},w_i\rangle \mid i=1,\ldots ,n\right\} \),

  • the dependence graph is updated by adding an edge between Y and all \(X_j\) such that \(a_j\) is nonzero.
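The following sketch (ours) implements this affine transform in the simplified case where all the \(X_j\) are mutually independent; the general algorithm would instead select, for each partial sum, the addition operator for independence or for unknown dependence according to the dependence graph G. The independent addition is restated here for self-containment.

```python
def scale_dsi(a, d):
    """a * X for a DSI d (interval bounds are swapped when a < 0)."""
    return [((min(a * lo, a * hi), max(a * lo, a * hi)), w) for (lo, hi), w in d]

def shift_dsi(d, b):
    """X + b for a DSI d and a constant b."""
    return [((lo + b, hi + b), w) for (lo, hi), w in d]

def add_independent(dx, dy):
    """Convolution-style addition, valid only for independent operands (Definition 4)."""
    return [((xl + yl, xu + yu), wx * wy) for (xl, xu), wx in dx for (yl, yu), wy in dy]

def affine_dsi(coeffs, dsis, b):
    """y = sum_j a_j X_j + b on a vector of DSIs, assuming the X_j independent."""
    acc = shift_dsi(scale_dsi(coeffs[0], dsis[0]), b)
    for a, d in zip(coeffs[1:], dsis[1:]):
        acc = add_independent(acc, scale_dsi(a, d))
    return acc

# First neuron of the toy example before ReLU: x^0_1 - x^0_2 with independent inputs.
print(affine_dsi([1.0, -1.0],
                 [[((-2.0, 0.0), 0.5), ((0.0, 2.0), 0.5)],
                  [((-1.0, 0.0), 0.5), ((0.0, 1.0), 0.5)]], 0.0))
# Four focal elements of weight 0.25 each, as in the toy example analysed below.
```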

Applying the ReLU function \(Y=\max (0,X)\) amounts to enforcing the constraints \(Y \ge X\) and \(Y \ge 0\). Concretely, Y is obtained by clamping the focal elements of the representation of X to \([0,\infty )\):

Lemma 1

(ReLU of a DSI). Given a random variable X represented by the DSI \(d= \{\langle \boldsymbol{x_i},w_i\rangle ,\,i\in [1,n]\}\), then the CDF of \(Y=\sigma (X)=\max (0,X)\) is included in the DSI \( \{\langle \boldsymbol{y_i},w_i\rangle ,\,i\in [1,n]\} \text{ with } y_i = [\max (0,\underline{x_i}), \max (0,\overline{x_i})]. \)
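In code, Lemma 1 is a one-line transformation on the list-of-pairs representation used in the sketches above:

```python
def dsi_relu(d):
    """ReLU of a DSI (Lemma 1): clamp every focal element to [0, +inf)."""
    return [((max(0.0, lo), max(0.0, hi)), w) for (lo, hi), w in d]

# Applied to d_{y_1} of the toy example below:
print(dsi_relu([((-3.0, 1.0), 0.5), ((-1.0, 3.0), 0.5)]))
# [((0.0, 1.0), 0.5), ((0.0, 3.0), 0.5)], i.e. d^1_1
```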

This leads to Algorithm 1 for the analysis of an L-layer ReLU network with the notations of Sect. 2.

Algorithm 1. ReLU feedforward neural network analysis by DSI arithmetic

Output. The output after propagation in the network consists of:

  • the vector of DSI \(d^L\) characterizing the network output (solving Problem 1)

  • interval bounds noted \(cdf(H d^L,w)\) on the probability \(\mathbb {P}(H x^L \le w)\) (solving Problem 2). Let \([\underline{P_m},\overline{P_m}]\), with m ranging over the rows of H and w, be the interval for the probability \(\mathbb {P}( \sum _{i=1}^{h_L} h_{mi} x^L_i \le w_m)\). It is obtained by applying Proposition 1 to compute the CDF at \(w_m\) of the m-th component of the vector \(H d^L\). We define \(cdf(H d^L,w) = [\min _m \underline{P_m}, \min _m \overline{P_m}].\)

The DSI computation encodes the marginal distribution of each component of a vector \(x^i\) as a DSI. The probability of the conjunction \(H x^L \le w\) is thus computed by considering each inequality separately and using the fact that the probability of a conjunction is lower than or equal to the probability of each of its terms.

Analysis of the Toy Example. Consider Example 1. A classical interval analysis of the network from the input set \(x^0=(x^0_1,x^0_2) \in [-2,2] \times [-1,1]\) yields the output ranges \(x^2_1 \in [-3,3]\) and \(x^2_2 \in [0,6]\). As these ranges have a non-empty intersection with the property \(x^2_1 \le -2 \wedge x^2_2 \ge 2\), this analysis does not allow us to conclude.

Uniform Distribution on Inputs Abstracted by DSI with 2 Focal Elements. Let us now suppose that we additionally know that the 2 components of the input follow a uniform distribution. We first choose a discretization of the inputs by DSI with 2 focal elements, \(d^0_1 = \{ \langle [-2,0], 0.5 \rangle , \langle [0,2], 0.5 \rangle \}\) and \(d^0_2 = \{ \langle [-1,0], 0.5 \rangle , \langle [0,1], 0.5 \rangle \}\). Assuming the inputs independent, the first output after the first affine layer, \(d_{y_1} = d^0_1 - d^0_2 \), computed following Definition 4, is \( \{ \langle [-2,1] , 0.25\rangle , \langle [-3,0] , 0.25\rangle , \langle [0,3], 0.25 \rangle , \langle [-1,2], 0.25 \rangle \}. \) In order to limit the complexity of the computation, the result of each operation on DSI can be reduced to a sound over-approximation with a fixed number of focal elements. This can be done by joining some focal elements and adding the corresponding weights. For instance here, reducing to 2 focal elements by joining the first 2 and the last 2 focal elements results in \(d_{y_1} = \{ \langle [-3,1] , 0.5\rangle , \langle [-1,3] , 0.5\rangle \}\). Then, applying the ReLU function of Lemma 1 to \(d_{y_1}\) produces \( d^1_1 = \{ \langle [0,1] , 0.5 \rangle , \langle [0,3], 0.5 \rangle \}\). The other output \(x^1_2\) of the first layer has the same DSI representation. After the output layer, the first output is \(d^2_1 = d^1_1 - d^1_2 = \{ \langle [-3,1], 0.5 \rangle , \langle [-1,3], 0.5 \rangle \}\). Here \(x^1_1\) and \(x^1_2\) can no longer be considered independent, as they are both correlated with \(x^0_1\) and \(x^0_2\); the subtraction of their DSI representations is computed accordingly. The second output is \(d^2_2= d^1_1 + d^1_2 = \{ \langle [0,4], 0.5 \rangle , \langle [0,6] , 0.5 \rangle \}\).

Take now the property \(x^2_1 \le -2 \wedge x^2_2 \ge 2\). Using Proposition 1, we deduce from \(d^2_1\) and \(d^2_2\) that \(\mathbb {P}(x^2_1 \le -2) \in [0, 0.5]\) and \(\mathbb {P}(x^2_2 \ge 2) \in [0.0, 1.0]\), from which \(\mathbb {P}(x^2_1 \le -2 \wedge x^2_2 \ge 2) \in [0, 0.5]\). Consider \(\mathbb {P}(x^2_1 \le -2) \) evaluated using \(d^2_1 = \{ \langle [-3,1], 0.5 \rangle , \langle [-1,3], 0.5 \rangle \}\). Its lower bound is obtained using Proposition 1 by \(\underline{P}(-2)=\sum _{\overline{\boldsymbol{x_i}}< -2}w_i = 0\), as the upper bounds of the 2 focal elements \([-3,1]\) and \([-1,3]\) are both greater than -2. The upper bound is \(\overline{P}(-2)=\sum _{\underline{\boldsymbol{x_i}}\le -2}w_i = 0.5\), as the lower bound of \([-3,1]\) is lower than -2, which is not the case for \([-1,3]\).

Increasing the Number of Focal Elements. Doing so refines the over-approximation of the input distributions and the sets of CDF obtained for the outputs. For instance, with 100 focal elements and inputs considered independent, we obtain \(\mathbb {P}(x^2_1 \le -2) \in [0, 0.07]\) and \(\mathbb {P}(x^2_2 \ge 2) \in [0.05, 0.52]\). In the case of inputs with unknown correlation, \(\mathbb {P}(x^2_1 \le -2) \in [0, 0.26]\) and \(\mathbb {P}(x^2_2 \ge 2) \in [0, 1]\). However, the supports of the sets of distributions remain unchanged and are equal to the ranges obtained through interval analysis. Indeed, the affine layers introduce some conservatism due to the wrapping effect of the intervals used as focal elements. Additionally, joint distributions are not naturally represented in the DSI framework, making it difficult to accurately verify general properties.

4 Analysis with Probabilistic Zonotopes

Probabilistic affine forms [6, 7] are affine forms where the symbolic variables, or noise symbols, are constrained by DSI structures instead of being simply bounded in [−1,1]. This can be seen as a simple way to encode affine correlations between uncertain variables abstracted by p-boxes, or as a quantitative version of affine forms. We first introduce probabilistic affine forms, presented in vector form as probabilistic zonotopes. Then we propose an analysis of neural networks relying on these probabilistic zonotopes.

Affine Forms, Zonotopes and Probabilistic Zonotopes. An affine form is a linear expression \(\alpha _0+\sum _{j=1}^p \alpha _j \varepsilon _j \) with real coefficients \( \alpha _j\) and symbolic variables \(\varepsilon _j \), called noise symbols, whose values range in [−1,1]. A zonotope is the geometric concretization of a vector of affine forms:

Definition 5

(Zonotope). An n-dimensional zonotope \(\mathcal Z\) with center \(c \in \mathbb {R}^n\) and a generator matrix \(\varGamma = \begin{bmatrix} g_1 \ldots g_p \end{bmatrix} \in \mathbb {R}^{n,p}\) whose columns are the p generators \(g_j \in \mathbb {R}^n\), \(j = 1,\ldots ,p\), is defined as \( \mathcal{Z} = \langle c, \varGamma \rangle = \{ c + \varGamma \varepsilon \mid \Vert {\varepsilon } \Vert _{\infty } \le 1 \}\).

We note \(\gamma _i(\mathcal{Z}) = c_i +\sum _{j=1}^p g_{ij} [-1,1] \) the range of its i-th component.

Zonotopes are closed under affine transformations:

Proposition 2

(Affine Transforms of a Zonotope). For \(A \in \mathbb {R}^{m,n} \) and \(b \in \mathbb {R}^{m}\) we define \( A \mathcal{Z} + b = \langle Ac+b, A \varGamma \rangle \) as the m-dimensional resulting zonotope.
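A NumPy sketch of Proposition 2 and of the component ranges \(\gamma _i\); a zonotope is stored here as the pair (c, \(\varGamma \)), and the names are ours:

```python
import numpy as np

def affine_zonotope(A, b, c, G):
    """Affine image A Z + b of the zonotope Z = <c, G> (Proposition 2)."""
    return A @ c + b, A @ G

def ranges(c, G):
    """Componentwise ranges gamma_i(Z) of Z = <c, G>."""
    r = np.abs(G).sum(axis=1)
    return np.stack([c - r, c + r], axis=1)

# First affine layer of the toy example applied to the input zonotope <0, diag(2, 1)>:
c1, G1 = affine_zonotope(np.array([[1.0, -1.0], [1.0, 1.0]]), np.zeros(2),
                         np.zeros(2), np.diag([2.0, 1.0]))
print(ranges(c1, G1))   # both components range over [-3, 3]
```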

Definition 6

(Probabilistic Zonotope). For \(\varepsilon \) a vector of random variables in \(\mathbb {R}^p\), a zonotope \(\mathcal{Z}= \langle c, \varGamma \rangle \) with \(c \in \mathbb {R}^n\) and \(\varGamma \in \mathbb {R}^{n,p}\) can be interpreted as a probabilistic zonotope, noted \(p\mathcal{Z}(\varepsilon ) = \langle c, \varGamma ,\varepsilon \rangle \), representing the \(n-\)dimensional random variable \(Z =c + \varGamma \varepsilon \). Let \(d_{\varepsilon }\) be a \(p-\)dimensional vector of DSI structures with support in \([-1,1]^p\) and G a dependence graph on the components \({\varepsilon }_1, \ldots , {\varepsilon }_p\). The marginal of each component of \(p\mathcal{Z}(d_{\varepsilon } )\) is the affine transform on DSI structures \(c_i + \sum _{j=1}^p g_{ij} d_{\varepsilon _j} , \; i = 1, \ldots ,n\), computed as in Sect. 3.

Zonotopes represent affine relations that hold between uncertain quantities. In the case of probabilistic zonotopes, imprecise affine relations hold:

Example 2

Let \( {x_1}=1+\varepsilon _1-\varepsilon _2\), \(x_2=-\frac{1}{2}\varepsilon _1+\frac{1}{4} \varepsilon _2\), \(d_{\varepsilon _1}=\{\langle [-1,0],\frac{1}{2} \rangle , \langle [0,1],\frac{1}{2} \rangle \}\), \(d_{\varepsilon _2}=\{\langle [-\frac{1}{10},0],\frac{1}{2} \rangle , \langle [0,\frac{1}{10}],\frac{1}{2} \rangle \}\). Then \(x_1+2 {x_2}=1-\frac{1}{2}\varepsilon _2\), represented by the DSI \(d=1-\frac{1}{2} d_{\varepsilon _2}=\{\langle [\frac{19}{20},1],\frac{1}{2} \rangle , \langle [1,\frac{21}{20}], \frac{1}{2} \rangle \}\), by (a simple version of) the Affine Transform of a Vector of DSI Structures. Thus the lower probability that \(x_1+2x_2\le \frac{21}{20}\) is 1, and the upper probability that \(x_1+2x_2 < \frac{19}{20}\) is 0. But \(x_1+2x_2 < 1\) has upper probability \(\frac{1}{2}\) and lower probability 0, and is thus an imprecise relation.

Probabilistic Zonotopes for the Analysis of Neural Networks. We detail hereafter Algorithm 2 using probabilistic zonotopes.

Algorithm 2. Neural network analysis by Probabilistic Zonotopes

Input and Initialization. The input of the algorithm is the same as in Sect. 3: the uncertain input is modelled as a vector \(d^0 = (d^0_1, \ldots ,d^0_{h_0})\) of \(h_0\) DSI. We can then define \(\boldsymbol{x}^0 \in \mathbb {I}\mathbb {R}^{h_0}\), the \(h_0\)-dimensional box obtained as the support of \(d^0\), computed for each DSI as the union of its focal elements with non-zero weight. Finally, we define \(p\mathcal{Z}^0(\varepsilon ) = \langle c^0, \varGamma ^0 ,d_{\varepsilon } \rangle \) in Line 1 of Algorithm 2 by:

  • \(\mathcal{Z}^0= \langle c^0, \varGamma ^0 \rangle \) is built from the box \(\boldsymbol{x}^0\),

  • \(d_{\varepsilon }\) is the vector of DSI obtained by rescaling \(d^0\) between -1 and 1 (see the sketch below).
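A sketch of this initialization (the helper names are ours): the input box becomes a zonotope with one noise symbol per input dimension, and each input DSI is rescaled to the range [-1,1] of its noise symbol.

```python
import numpy as np

def zonotope_from_box(lows, highs):
    """Box [lows, highs] as a zonotope <c, Gamma> with one noise symbol per dimension."""
    lows, highs = np.asarray(lows), np.asarray(highs)
    return (lows + highs) / 2.0, np.diag((highs - lows) / 2.0)

def rescale_dsi(d, lo, hi):
    """Rescale a DSI with support [lo, hi] to the noise-symbol range [-1, 1]."""
    c, r = (lo + hi) / 2.0, (hi - lo) / 2.0
    return [(((a - c) / r, (b - c) / r), w) for (a, b), w in d]

# Toy example: x^0 in [-2,2] x [-1,1], uniform inputs with 2 focal elements each.
c0, G0 = zonotope_from_box([-2.0, -1.0], [2.0, 1.0])
d_eps1 = rescale_dsi([((-2.0, 0.0), 0.5), ((0.0, 2.0), 0.5)], -2.0, 2.0)
print(c0, G0, d_eps1)   # d_eps1 = [((-1.0, 0.0), 0.5), ((0.0, 1.0), 0.5)]
```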

Propagation in the Layers. The propagation in the affine layers can be expressed directly as an affine transform of the zonotope by Proposition 2, later interpreted as a probabilistic zonotope. Proposition 3 introduces the ReLU transformer proposed in [26], encoded in zonotope matrix form. The ReLU transform is applied componentwise (on each row), and a new noise symbol (and thus a new column in the generator matrix) is added whenever an over-approximation is needed, that is, when the range of the input component contains both negative and positive values.

Proposition 3

(ReLU Transform of a Zonotope). Let \( \mathcal{Z} = \langle c, \varGamma \rangle \) with \(\varGamma = (g_{ij})_{i,j} \in \mathbb {R}^{n,p}\) be a zonotope, we note \([l_i,u_i]=\gamma _i(\mathcal{Z})\) the range of its i-th component. The result of applying componentwise the ReLU activation function is a zonotope \( \mathcal{Z'} = \langle c', \varGamma ' \rangle \) where \(c' \in \mathbb {R}^n\) and \(\varGamma ' \in \mathbb {R}^{n,p+n}\), with \(c'_i = \lambda _i c_i + \mu _i\) and

$$ \varGamma ' = \begin{bmatrix} \lambda _1 g_{11} & \ldots & \lambda _1 g_{1p} & \mu _1 & 0 & \ldots & 0 \\ \lambda _2 g_{21} & \ldots & \lambda _2 g_{2p} & 0 & \mu _2 & \ldots & 0 \\ \ldots \\ \lambda _n g_{n1} & \ldots & \lambda _n g_{np} & 0 & 0 & \ldots & \mu _n \\ \end{bmatrix}, (\lambda _i, \mu _i) = \left\{ \begin{array}{ll} (1,0) & \text{ if } l_i \ge 0, \\ (0,0) & \text{ if } u_i \le 0, \\ ( \frac{u_i}{u_i-l_i}, - \frac{u_i l_i}{2(u_i-l_i) } ) & \text{ otherwise. } \end{array} \right. $$
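A NumPy sketch of this transformer; the fresh noise symbols appear as the extra columns diag(\(\mu \)) appended to the generator matrix. Applied to the toy example below, it reproduces the zonotope \(\mathcal{Z}^2\) given there.

```python
import numpy as np

def relu_zonotope(c, G):
    """ReLU transform of a zonotope (Proposition 3): componentwise linear relaxation,
    with one fresh noise symbol per component (columns diag(mu))."""
    r = np.abs(G).sum(axis=1)
    l, u = c - r, c + r
    denom = np.where(u > l, u - l, 1.0)                  # guard for zero-width ranges
    lam = np.where(l >= 0, 1.0, np.where(u <= 0, 0.0, u / denom))
    mu = np.where((l < 0) & (u > 0), -u * l / (2.0 * denom), 0.0)
    return lam * c + mu, np.hstack([lam[:, None] * G, np.diag(mu)])

# Toy example: ReLU layer then output layer, reproducing Z^2 of the analysis below.
c1, G1 = np.zeros(2), np.array([[2.0, -1.0], [2.0, 1.0]])  # after the first affine layer
cr, Gr = relu_zonotope(c1, G1)
A2 = np.array([[1.0, -1.0], [1.0, 1.0]])
print(A2 @ cr, A2 @ Gr)  # center [0, 1.5], generators [[0,-1,0.75,-0.75],[2,0,0.75,0.75]]
```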

Output. The output zonotope after the L layers is \(\mathcal{Z}^{L}= \langle c^L, \varGamma ^L \rangle \) with \(c^L \in \mathbb {R}^{h_L}\) and \(\varGamma ^L \in \mathbb {R}^{h_L,\sum _{k=0}^L h_k}\). At line 5 of Algorithm 2, the probabilistic zonotope \(p\mathcal{Z}^{L}(d_{\varepsilon })\) is converted into a vector of DSI, following Definition 6. In this interpretation as a probabilistic zonotope, we must define the DSI structures corresponding to the \(\sum _{k=1}^L h_k\) new noise symbols introduced by the ReLU transformers. A sound, although conservative, interpretation is to take the interval \([-1,1]\), with weight 1, as the DSI for each of them. This corresponds to considering that there is no available information about the distribution of the variable represented by these new noise symbols.

At line 6, the transform \(H \mathcal{Z}^{L}\), interpreted as a probabilistic zonotope, is converted into a vector of DSI and used to bound the probability \(\mathbb {P}(Hy \le w)\).

Analysis of the Toy Example. We consider again Example 1.

Deterministic Zonotopes Analysis. From the input sets \(x^0 \in [-2,2] \times [-1,1]\), the zonotopic interpretation is initialized with the affine forms \(x^0_1 = 2 \varepsilon _1\), \(x^0_2 = \varepsilon _2\) with \(\varepsilon _1, \varepsilon _2 \in [-1,1]\), encoded: \(\mathcal{Z}^0= \langle c^0, \varGamma ^0 \rangle \text{ with } c^0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \varGamma ^0 = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}. \) Using the affine transforms on zonotopes and Proposition 3 for the ReLU layer with \((\lambda ,\mu )=(0.5,0.75)\) for both neurons, we obtain after the second affine layer:

$$ \mathcal{Z}^2 = A_2 \mathcal{Z}^1 + b_2 = \langle \begin{bmatrix} 0 \\ 1.5 \end{bmatrix} , \begin{bmatrix} 0 & -1 & 0.75 & -0.75 \\ 2 & 0 & 0.75 & 0.75 \end{bmatrix} \rangle \subseteq \begin{bmatrix} [-2.5,2.5] \\ [-2,5] \end{bmatrix} $$

The first output \(x^2_1\) is bounded in a tighter interval than with interval propagation ([-3,3]); the second output \(x^2_2\) is incomparable to the interval computation ([0,6]).

Probabilistic Zonotopes Analysis. Let us now suppose that the inputs \(x^0_1\) and \(x^0_2\) follow a uniform law, which can be abstracted as in Sect. 3 with DSI structures \(d^0_1\) and \(d^0_2\). Algorithm 2 produces the same input zonotope and propagation through the network as above. Let us discretize the inputs with 2 focal elements. The rescaling of the DSI \(d^0_1\) and \(d^0_2\) between -1 and 1 yields \(d_{\varepsilon _1} = \{ \langle [-1,0], 0.5 \rangle , \langle [0,1], 0.5 \rangle \} \) and \(d_{\varepsilon _2} = \{ \langle [-1,0], 0.5 \rangle , \langle [0,1], 0.5 \rangle \}\).

The concretization of the final probabilistic zonotope \( p\mathcal{Z}^2(d_{\varepsilon })\) to a vector of DSI yields: \(d^2_1 = - d_{\varepsilon _2} + 0.75 d_{\varepsilon _3} - 0.75 d_{\varepsilon _4}\) and \(d^2_2 = 1.5 + 2 d_{\varepsilon _1} + 0.75 d_{\varepsilon _3} + 0.75 d_{\varepsilon _4}\), where \(d_{\varepsilon _3}\) and \(d_{\varepsilon _4}\) are the DSI corresponding to the noise symbols introduced during the analysis by the ReLU function, with unknown distribution in \([-1,1]\). We get \(d^2_1 = \{ \langle [-2.5,1.5], 0.5 \rangle , \langle [-1.5,2.5], 0.5 \rangle \}\) and \(d^2_2 = \{ \langle [-2,3], 0.5 \rangle , \langle [0,5], 0.5 \rangle \}\), and deduce \(\mathbb {P}(x^2_1 \le -2) \in [0, 0.5]\) and \(\mathbb {P}(x^2_2 \ge 2) \in [0, 1]\).

The supports of the DSI are equal to the ranges obtained by the classical zonotopic analysis, and thus incomparable to the supports of the DSI obtained by Algorithm 1. More generally, the results are not strictly comparable to those of the DSI computation. For instance, with 100 focal elements, we obtain \(\mathbb {P}(x^2_1 \le -2) \in [0, 0.26]\) and \(\mathbb {P}(x^2_2 \ge 2) \in [0, 0.76]\), both for independent inputs \(x^0_1\) and \(x^0_2\) and for inputs with unknown correlation; this is better than the DSI analysis in the case of unknown correlation, while the DSI analysis is better for independent inputs. The reason why the results do not depend here on the correlation between inputs is that \(d_{\varepsilon _1}\) does not appear in the expression of \(d^2_1\), nor \(d_{\varepsilon _2}\) in that of \(d^2_2\), so the correlation information between inputs is never used.

5 Analysis with Zonotopic Dempster-Shafer Structures

In Sect. 4, a unique initial zonotope is built and propagated in the network. This propagation is exact through affine layers, but can be highly conservative for nonlinear operations such as the activation functions. In Algorithm 3, we suppose that the inputs are independent and perform the zonotopic propagation at a finer grain, on each tuple of focal elements of the inputs. This can be seen as using zonotopic focal elements to represent the input vector of a layer, instead of interval focal elements to represent each component of the input vector.

Algorithm 3. Neural network analysis by Dempster-Shafer zonotopic layers

Input and Initialization. The input is the same as in Sects. 3 and 4: the uncertain input is modelled as a vector \(d^0 = (d^0_1, \ldots ,d^0_{h_0})\) of \(h_0\) DSI. Assuming the input components independent, we perform the convolution of their distributions to build a DSZ abstraction of the input vector: we construct one zonotope per possible \(h_0\)-tuple of focal elements of the input vector of DSI \(d^0\), with weight the product of the weights of the interval focal elements. That is, for each \((i_1,i_2,\ldots ,i_{h_0}) \in [1,n]^{h_0}\) we define the zonotope \(\mathcal{Z}^0_{i_1 \ldots i_{h_0}} = \langle c^0_{i_1\ldots i_{h_0}}, \varGamma ^0_{i_1\ldots i_{h_0}} \rangle ,\) built from the box \(\boldsymbol{x}^0_{1 i_1} \times \boldsymbol{x}^0_{2 i_2}\times \ldots \times \boldsymbol{x}^0_{h_0 i_{h_0}}\), and define the input \(d_\mathcal{Z}^0\) as a Dempster-Shafer structure with zonotopic focal elements (DSZ in short): \( d_\mathcal{Z}^0 =\bigl \{\langle \mathcal{Z}^0_{i_1 \ldots i_{h_0}} , w^0_{1,i_1} w^0_{2,i_2} \ldots w^0_{h_0,i_{h_0}} \rangle , \,(i_1,i_2,\ldots ,i_{h_0}) \in [1,n]^{h_0} \bigr \}. \)
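A sketch of this initialization, enumerating the tuples of interval focal elements of the (assumed independent) input DSIs; the helper name is ours:

```python
import itertools
import numpy as np

def dsz_from_input_dsis(dsis):
    """One box zonotope per tuple of interval focal elements of the independent
    input DSIs, weighted by the product of the focal elements' weights."""
    dsz = []
    for combo in itertools.product(*dsis):
        lo = np.array([iv[0] for iv, _ in combo])
        hi = np.array([iv[1] for iv, _ in combo])
        weight = float(np.prod([w for _, w in combo]))
        dsz.append((((lo + hi) / 2.0, np.diag((hi - lo) / 2.0)), weight))
    return dsz

# Toy example inputs, 2 focal elements each -> 4 zonotopic focal elements of weight 0.25.
d01 = [((-2.0, 0.0), 0.5), ((0.0, 2.0), 0.5)]
d02 = [((-1.0, 0.0), 0.5), ((0.0, 1.0), 0.5)]
dsz0 = dsz_from_input_dsis([d01, d02])
print(len(dsz0), dsz0[0])   # 4; first element: center [-1, -0.5], generators diag(1, 0.5)
```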

The number of focal elements does not have to be identical for each component of the input vector \(d^0\); this choice was made here only for simplicity of notation.

The propagation in the layers then consists in propagating each zonotopic focal element. Note that the number of focal elements remains constant through the propagation in the layers because all convolutions were computed at initialization; only the size of the zonotopes evolves with the layer dimensions.

Output. The DSZ \(d_\mathcal{Z}^L\) is projected onto each output component, yielding for each \(i \in [1,h_L]\) the DSI \( d^L_i = \bigl \{ \langle \gamma _i(\mathcal{Z}^L_{i_1 \ldots i_{h_0}}) , w^0_{1,i_1} w^0_{2,i_2} \ldots w^0_{h_0,i_{h_0}} \rangle , \,(i_1,i_2,\ldots ,i_{h_0}) \in [1,n]^{h_0} \bigr \} .\)

The property can be assessed by evaluating the set of joint cumulative distributions represented by the DSZ \(H d_\mathcal{Z}^L\), by generalizing the definition of Proposition 1 from interval to zonotopic focal elements:

Proposition 4

(CDF of a Zonotopic Dempster-Shafer Structure). Let X be a random variable in \(\mathbb {R}^n\) and \(d_\mathcal{Z}=\bigl \{ \langle \mathcal{Z}_1, w_1 \rangle ,\langle \mathcal{Z}_2, w_2 \rangle ,\dots ,\langle \mathcal{Z}_u, w_u \rangle \bigr \}\) be a DSZ with \(\mathcal{Z}_k = \langle c_k, \varGamma _k \rangle \), \(c_k \in \mathbb {R}^n\), \(\varGamma _k \in \mathbb {R}^{n,p}\), \(w_k \in (0,1]\) and \(\sum _{k=1}^u w_k = 1\). The DSZ \(d_\mathcal{Z}\) defines a discrete p-box representing the sets of joint cumulative distribution functions such that for \(v \in \mathbb {R}^n\),

$$\sum _{k \in [1,u], \, \mathcal{Z}_k \subset \downarrow v} w_k = \underline{P}_v \le \mathbb {P}(X \le v) \le \overline{P}_v = \sum _{k \in [1,u], {\mathcal{Z}_k \cap \downarrow v \ne \emptyset }} w_k $$

where \(\downarrow v\) is the set of points in \(\mathbb {R}^n\) less than or equal to v in the componentwise ordering. Practically, we can use the ranges or projections of each component of \(\mathcal{Z}_k\) to get a conservative over-approximation of the p-box:

$$ \underline{P}_v \ge \sum _{k \in [1,u], \, \bigwedge _{ i \in [1,n]} \overline{\gamma _i(\mathcal{Z}_k)} < v_i } w_k \; \wedge \; \overline{P}_v \le \sum _{k \in [1,u], \bigwedge _{ i \in [1,n]} \underline{\gamma _i(\mathcal{Z}_k)} \le v_i } w_k $$

This proposition can be derived from the notion of CDF of a random set of [24]. The bounds obtained by Proposition 4 are always at least as good as those obtained by first converting the DSZ into a vector of DSI and then applying Proposition 1.
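A sketch of this conservative, projection-based version of Proposition 4, with a zonotopic focal element stored as ((c, \(\varGamma \)), w):

```python
import numpy as np

def dsz_cdf_bounds(dsz, v):
    """Bounds on P(X <= v) from a DSZ, using the componentwise projections of each
    zonotopic focal element (conservative version of Proposition 4)."""
    lower = upper = 0.0
    for (c, G), w in dsz:
        r = np.abs(G).sum(axis=1)
        lo, hi = c - r, c + r
        if np.all(hi < v):    # focal element entirely below v: counted in the lower bound
            lower += w
        if np.all(lo <= v):   # its bounding box meets the region below v: upper bound
            upper += w
    return lower, upper

# Two 1-D zonotopic focal elements of weight 0.5 each, evaluated at v = 0:
dsz = [((np.array([-1.0]), np.array([[0.5]])), 0.5),
       ((np.array([1.0]), np.array([[0.5]])), 0.5)]
print(dsz_cdf_bounds(dsz, np.array([0.0])))   # (0.5, 0.5)
```

Applied to the focal elements \(H \mathcal{Z}^2_{i_1 i_2}\) of the toy example below with \(v = w\), this computation yields the bound \(\mathbb {P}(Hy \le w) \in [0, 0.25]\) stated there.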

DSZ Analysis of the Toy Example. We consider again Example 1. With 2 focal elements for each input, we have \(d^0_1 = \{ \langle [-2,0], 0.5 \rangle , \langle [0,2], 0.5 \rangle \}\) and \(d^0_2 = \{ \langle [-1,0], 0.5 \rangle , \langle [0,1], 0.5 \rangle \}\). At Line 1 of Algorithm 3, \(d_\mathcal{Z}^0\) is a DSZ structure with 4 zonotopic focal elements, each with weight 0.25: \( \mathcal{Z}^0_{11}= \langle \begin{bmatrix} -1 \\ -0.5 \end{bmatrix} , \begin{bmatrix} 1 & 0 \\ 0 & 0.5 \end{bmatrix} \rangle , \, \mathcal{Z}^0_{12}= \langle \begin{bmatrix} -1 \\ 0.5 \end{bmatrix} , \begin{bmatrix} 1 & 0 \\ 0 & 0.5 \end{bmatrix} \rangle , \, \mathcal{Z}^0_{21}= \langle \begin{bmatrix} 1 \\ -0.5 \end{bmatrix} , \begin{bmatrix} 1 & 0 \\ 0 & 0.5 \end{bmatrix} \rangle , \, \mathcal{Z}^0_{22}= \langle \begin{bmatrix} 1 \\ 0.5 \end{bmatrix} , \begin{bmatrix} 1 & 0 \\ 0 & 0.5 \end{bmatrix} \rangle \). After the output layer, the 4 zonotopic elements, each with weight 0.25, are: \( \mathcal{Z}^2_{11}= \langle \begin{bmatrix} \frac{1}{6} \\ \frac{1}{6} \end{bmatrix} , \begin{bmatrix} \frac{1}{3} & -\frac{1}{6} & \frac{1}{3} \\ \frac{1}{3} & -\frac{1}{6} & \frac{1}{3} \end{bmatrix} \rangle , \, \mathcal{Z}^2_{12}= \langle \begin{bmatrix} -\frac{1}{6} \\ \frac{1}{6} \end{bmatrix} , \begin{bmatrix} -\frac{1}{3} & -\frac{1}{6} & -\frac{1}{3} \\ \frac{1}{3} & \frac{1}{6} & \frac{1}{3} \end{bmatrix} \rangle , \, \mathcal{Z}^2_{21}= \langle \begin{bmatrix} \frac{5}{6} \\ \frac{13}{6} \end{bmatrix} , \begin{bmatrix} \frac{1}{3} & -\frac{5}{6} & -\frac{1}{3} \\ \frac{10}{6} & -\frac{1}{6} & \frac{1}{3} \end{bmatrix} \rangle , \,\) \(\mathcal{Z}^2_{22}= \langle \begin{bmatrix} -\frac{5}{6} \\ \frac{13}{6} \end{bmatrix} , \begin{bmatrix} - \frac{1}{3} & -\frac{5}{6} & \frac{1}{3} \\ \frac{10}{6} & \frac{1}{6} & \frac{1}{3} \end{bmatrix} \rangle \). From these and their projected ranges for \(x^2_1\) and \(x^2_2\), we deduce (see Appendix of the preprint [10]) using Proposition 4 that \(\mathbb {P}(x^2_1 \le -2) \in [0,0.25]\) and \(\mathbb {P}(x^2_2 \ge 2) \in [0,0.5]\), and, for the conjunction, \(\mathbb {P}(Hy \le w) \in [0.0, 0.25]\).

6 Evaluation

Implementation. We implemented our approach (Footnote 1) using the Julia library ProbabilityBoundsAnalysis.jl [11] for the Interval Dempster-Shafer abstraction and arithmetic. In this library, the focal elements of a DSI structure all have the same weight. The result is reduced after each arithmetic operation to keep a constant number of focal elements. We rely on this DSI implementation, but our DSZ implementation does not present the same restrictions. The focal elements are bounded, but a flag allows the user to specify that a distribution may have unbounded support, and this knowledge is used to produce a sound CDF estimation for unbounded distributions. In our current implementation, we do not use this possibility, but we believe that the work presented here can be extended to unbounded supports. Timings for our analysis are measured on a MacBook Pro 2.3 GHz Intel Core i9 with 8 cores (the implementation is not parallel, although obviously parallelizable).

Comparing DSI, Probabilistic Zonotopes and DSZ on the Toy Example. We compare in Table 1 our 3 abstractions in the case of independent inputs, varying the number of focal elements and the input distributions: U(n) denotes a uniform distribution represented with n focal elements, and N(n) a truncated normal law in the same range with n focal elements.

On this example, the DSZ analysis is by far the most precise, followed by the DSI analysis and finally the probabilistic zonotopes. Refining the input discretization with more focal elements tightens the output of all analyses, but only the DSZ converges to actually tight bounds. In particular, for DSI and probabilistic zonotopes, the support of the output distribution is unchanged when the input is refined. The computation times are, on this example, of the same order of magnitude for all three analyses, slightly higher for DSZ and lower for probabilistic zonotopes. The probabilistic zonotopes have a lower cost because the affine transforms are computed on the zonotopes, and the costly operations between DSI are delayed until the final representation as a DSI. It is not surprising that the DSZ have a slightly higher cost, because of the exponential number of zonotopic focal elements. However, the computation is obviously parallelizable.

Table 1. Probability bounds for the toy example, independent inputs.

We can also note the strong impact of the input hypotheses on the results, which advocates the need for an approach that can account, within a single framework and computation, for large classes of inputs. For instance, changing the input distribution from a uniform law to a Gaussian truncated to the same support produces very different probability bounds. In Table 1, we supposed the inputs independent. For instance, the DSI analysis with 100 focal elements and a uniform law produces, for independent inputs, \(\mathbb {P}(x^2_1 \le -2) \in [0, 0.07]\) and \(\mathbb {P}(x^2_2 \ge 2) \in [0.05, 0.52]\), while for inputs with unknown dependence, \(\mathbb {P}(x^2_1 \le -2) \in [0, 0.26]\) and \(\mathbb {P}(x^2_2 \ge 2) \in [0, 1]\).

For independent inputs, the DSZ is the best choice among our approaches. In the case of correlated inputs, it is hard to conclude from such a simple example between DSI and probabilistic zonotopes. In the context of the discrete dynamical systems for which probabilistic zonotopes were proposed [6, 7], they were much better than DSI both in terms of efficiency and accuracy. The context of neural networks is less favorable, but probabilistic zonotopes may well be more interesting than DSI for larger networks. However, our focus for future work is to explore the encoding of multivariate probabilistic distributions as input distributions, and to lift the current independence restriction of the DSZ analysis.

Comparing DSZ to Probstar [29]. We now compare our approach to the results of the closely related approach [29] on their two benchmark examples. As in [29], we consider the inputs as independent.

ACAS Xu. We consider the ACAS Xu networks benchmark, where the networks have 5 inputs and 5 outputs, with the same input configurations and properties (\(P_2: y_1 > y_2 \wedge y_1 > y_3 \wedge y_1 > y_4 \wedge y_1 > y_5\), \(P_3/P_4: y_1 < y_2 \wedge y_1 < y_3 \wedge y_1 < y_4 \wedge y_1 < y_5\)) as in [29]. The lower and upper bounds on the inputs, lb and ub, depend on the property, and are used in [29] to define probabilistic input sets by Gaussian distributions with mean \(m=(ub+lb)/2\) and standard deviation \((ub-m)/a\), where \(a=3\), truncated between lb and ub. In our work, after creating the input DSI from the above Gaussian distribution, we truncate all focal elements so that the support of the DSI is restricted to the input range [lb, ub]. In [29], an argument is used to deduce bounds on the probability for non-truncated distributions. We could use a similar argument here, but we focus on the results for the truncated distributions, and compare our results to the interval \([US-Prob-LB,US-Prob-UB]\) with the notations of [29].

We choose for our approach an over-approximation of the input distributions using a different number of focal elements for each component of the input vector, based on the relative widths of the input intervals. We represent these as vectors of numbers of focal elements, taking [5, 80, 50, 6, 5] for Property 2 and [5, 20, 1, 6, 5] for Properties 3 and 4. In Table 2, we compare our results with the Probstar approach under two parametrizations: \(p_f=0\) corresponds to an exact set-based propagation, while \(p_f=e^{-5}\) corresponds to the level of over-approximation in the propagation most widely used in [29].

Table 2. Probability bounds for the ACAS Xu example.

The running times for Probstar in Table 2 are those of [29], hence not computed on the same computer. We reproduced Property 2 on Net 1–6 with Probstars on our MacBook: for \(p_f=0\), it takes 3614 s on 8 cores, 5045 s on 4 cores, and 12542 s on 1 core, to be compared to the 1424 s in Table 2; for \(p_f=e^{-5}\), it takes 425 s on 8 cores, 489 s on 4 cores, and 1489 s on 1 core, to be compared to the 206 s in Table 2 and to the 46 s with DSZ.

The tightness of the enclosures of the DSZ is comparable to Probstars with \(p_f=e^{-5}\), for an analysis that is generally an order of magnitude faster. The results look consistent between the two analyses for Property 2. Properties 3 and 4 (originally from [15]) are true on the whole input range, which can be proven by classical set-based analysis, and our approach accordingly produces a probability equal to 1. The approach of [29] produces more precise, 'exact' results than our approach when \(p_f=0\). However, only the set-based propagation is exact; there remains a part of probabilistic estimation. For instance, when reproducing Property 2 on Net 1–6 with Probstars, we obtained for \(p_f=0\) the probabilities 1.56119e-05 with 8 cores, 6.76052e-06 with 4 cores, and 7.22045e-06 with 1 core, to be compared to the 'exact' 1.87224e-05 in the table. In contrast, our approach produces fully guaranteed bounds while allowing a much richer class of inputs.

In Table 2, we manually chose the number of focal elements per input component. Although we refrained from optimizing too much, choosing for instance the same discretization for different networks, this impairs the practicality of the approach. As a first answer, we implemented a simple loop that automatically refines the discretization, starting from a very rough one, by some basic sensitivity analysis. For instance, for Property 2 and net-1-6, the total refinement process, with a probability interval of width lower than 0.05 as stopping criterion, takes 112 s and leads to the numbers of focal elements [5, 81, 38, 5, 5] and a probability in [0, 0.0276], with bounds twice as tight as Probstars with \(p_f=e^{-5}\).

Rocket Lander. Let us now consider the rocket lander example of [29], with the same inputs and properties. The networks have here 9 inputs. Taking the vector of numbers of focal elements [7, 12, 10, 17, 9, 7, 1, 1, 2, 1, 1] produces the results of Table 3.

Table 3. Comparing probability bounds for the rocket lander example.

Again, the timings for Probstars are those of [29]; we executed for instance on our MacBook the analysis of Property 1 on network 0 with 4 cores, and the running times were 1351.2 s for \(p_f=1e{-5}\) and 12127 s for \(p_f=0\).

7 Conclusion

A central notion for dealing with multivariate probability distributions is that of a copula [18], and in particular Sklar's theorem, which links a multivariate CDF with the CDFs of its marginals. Several authors have considered generalizing Sklar's theorem to imprecise probabilities [17, 19], with applications e.g. in [12] to the analysis of non-linear dynamical systems. In this work, we developed the case of multidimensional imprecise probabilities described by the independence copula. Future work includes the tractable treatment of other copulas in our framework. Finally, we focused here on ReLU-based networks, but the approach is by no means restricted to this activation function.