Abstract
Deep neural networks (DNNs) play a crucial role in the field of machine learning, demonstrating state-of-the-art performance across various application domains. However, despite their success, DNN-based models may occasionally exhibit challenges with generalization, i.e., may fail to handle inputs that were not encountered during training. This limitation is a significant challenge when it comes to deploying deep learning for safety-critical tasks, as well as in real-world settings characterized by substantial variability. We introduce a novel approach for harnessing DNN verification technology to identify DNN-driven decision rules that exhibit robust generalization to previously unencountered input domains. Our method assesses generalization within an input domain by measuring the level of agreement between independently trained deep neural networks for inputs in this domain. We also efficiently realize our approach by using off-the-shelf DNN verification engines, and extensively evaluate it on both supervised and unsupervised DNN benchmarks, including a deep reinforcement learning (DRL) system for Internet congestion control, thereby demonstrating the applicability of our approach to real-world settings. Moreover, our research introduces a fresh objective for formal verification, offering the prospect of mitigating the challenges linked to deploying DNN-driven systems in real-world scenarios.
1 Introduction
In the last decade, deep learning [61] has demonstrated state-of-the-art performance in natural language processing, image recognition, game playing, computational biology, and numerous other fields [5, 26, 35, 74, 81, 141, 142]. Despite its remarkable success, deep learning still faces significant challenges that restrict its applicability in domains involving safety-critical tasks or inputs with high variability.
One critical limitation lies in the well-known challenge faced by deep neural networks (DNNs) when attempting to generalize to novel input domains. This refers to their tendency to exhibit suboptimal performance on inputs significantly different from those encountered during training. Throughout the training process, a DNN is exposed to input data sampled from a specific distribution over a designated input domain (referred to as “in-distribution” inputs). The rules derived from this training may falter in generalizing to novel, unencountered inputs, due to several factors: (1) the DNN being invoked in an out-of-distribution (OOD) scenario, where there is a mismatch between the distribution of inputs in the training data and that in the DNN’s operational data; (2) certain inputs not being adequately represented in the finite training dataset (such as various, low-probability corner cases); and (3) potential “overfitting” of the decision rule to the specific training data.
The importance of establishing the generalizability of (unsupervised) DNN-based decisions is evident in recently proposed applications of deep reinforcement learning (DRL) [87]. Within the framework of DRL, an agent, implemented as a DNN, undergoes training through repeated interactions with its environment to acquire a decision-making policy achieving high performance concerning a specific objective (“reward”). DRL has recently been applied to numerous real-world tasks [30, 73, 86, 88, 103, 105–107, 159, 176]. In many DRL application domains, the learned policy is anticipated to perform effectively across a broad spectrum of operational environments, with a diversity that cannot possibly be captured by finite training data. Furthermore, the consequences of inaccurate decisions can be severe. This point is exemplified in our examination of DRL-based Internet congestion control (discussed in Sect. 4.3). Good generalization is also crucial for non-DRL tasks, as we shall illustrate through the supervised-learning example of Arithmetic DNNs.
We introduce a methodology designed to identify DNN-based decision rules that exhibit strong generalization across a range of distributions within a specified input domain. Our approach is rooted in the following key observation. The training of a DNN-based model encompasses various stochastic elements, such as the initialization of the DNN’s weights and the order in which inputs are encountered during training. As a result, even when DNNs with the same architecture undergo training to perform an identical task on the same training data, the learned decision rules will typically exhibit variations. Drawing inspiration from Tolstoy’s Anna Karenina [153], we argue that “successful decision rules are all alike; but every unsuccessful decision rule is unsuccessful in its own way”. To put it differently, we believe that when scrutinizing decisions made by multiple, independently trained DNNs on a specific input, consensus is more likely to occur when their (similar) decisions are accurate.
Given the above, we suggest the following heuristic for crafting DNN-based decision rules with robust generalization across an entire designated input domain: independently train multiple DNNs and identify a subset that exhibits strong consensus across all potential inputs within the specified input domain. According to our hypothesis, this implies that the learned decision rules of these DNNs generalize effectively to all probability distributions over this domain. Our evaluation, as detailed in Sect. 4, underscores the effectiveness of this methodology in distilling a subset of decision rules that truly excel in generalization across inputs within this domain. As our heuristic aims to identify DNNs whose decisions unanimously align for every input in a specified domain, the decision rules derived through this approach consistently achieve high levels of generalization across all benchmarks.
Since our methodology entails comparing the outputs of various DNNs across potentially infinite input domains, the utilization of formal verification is a natural choice. In this regard, we leverage recent advancements in the formal verification of DNNs [3, 14, 16, 20, 43, 96, 121, 143, 170]. Given a verification query comprising a DNN N, a precondition P, and a postcondition Q, a DNN verifier is tasked with determining whether there exists an input x to N such that P(x) and Q(N(x)) both hold.
To date, DNN verification research has primarily concentrated on establishing the local adversarial robustness of DNNs, i.e., identifying small input perturbations that lead to the DNN misclassifying an input of interest [55, 62, 97]. Our approach extends the scope of DNN verification by showcasing, for the first time (as far as we are aware), its utility in identifying DNN-based decision rules that exhibit robust generalization. Specifically, we demonstrate how, within a defined input domain, a DNN verifier can be employed to assign a score to a DNN that indicates its degree of agreement with other DNNs throughout the input domain in question. This, in turn, allows an iterative process for the gradual pruning of the candidate DNN set, retaining only those that exhibit strong agreement and are likely to generalize successfully.
To assess the effectiveness of our methodology, we concentrate on three widely recognized benchmarks in the field of deep reinforcement learning (DRL): (i) Cartpole, where a DRL agent learns to control a cart while balancing a pendulum; (ii) Mountain Car, which requires controlling a car to escape from a valley; and (iii) Aurora, designed as an Internet congestion controller. Aurora stands out as a compelling case for our approach. While Aurora is designed to manage network congestion in a diverse range of real-world Internet environments, its training relies solely on synthetically generated data. Therefore, for the deployment of Aurora in real-world scenarios, it is crucial to ensure the soundness of its policy across numerous situations not explicitly covered by its training inputs.
Additionally, we consider a benchmark from the realm of supervised learning, namely, DNN-based arithmetic learning, in which the goal is to train a DNN to correctly perform arithmetic operations. Arithmetic DNNs are a natural use case for demonstrating the applicability of our approach to a supervised-learning (and so, non-DRL) setting, since generalization to OOD domains is a primary focus in this context and is perceived as especially challenging [101, 156]. We demonstrate how our approach can be employed to assess the capability of Arithmetic DNNs to execute learned operations on ranges of real numbers not encountered in training.
The results of our evaluation indicate that, across all benchmarks, our verification-driven approach effectively ranks DNN-based decision rules based on their capacity to generalize successfully to inputs beyond their training distribution. In addition, we present compelling evidence that our formal verification method is superior to competing methods, namely gradient-based optimization methods and predictive uncertainty methods. These findings highlight the efficacy of our approach. Our code and benchmarks are publicly available as an artifact accompanying this work [10].
The rest of the paper is organized in the following manner. Section 2 provides background on DNNs and their verification procedure. In Sect. 3 we present our verification-driven approach for identifying DNN-driven decision rules that generalize successfully to OOD input domains. Our evaluation is presented in Sect. 4, and a comparison to competing optimization methods is presented in Sect. 5. Related work is covered in Sect. 6, limitations are covered in Sect. 7, and our conclusions are provided in Sect. 8. We include appendices with additional information regarding our evaluation.
Note. This is an extended version of our paper, titled “Verifying Generalization in Deep Learning” [9], which appeared at the Computer Aided Verification (CAV) 2023 conference. In the original paper, we presented a brief description of our method and evaluated it on two DRL benchmarks, while giving a high-level description of its applicability to additional benchmarks. In this extended version, we significantly enhance our original paper along multiple axes, as explained next. In terms of our approach, we elaborate on how to strategically design a DNN verification query for the purpose of executing our methods, and we also elaborate on the various distance functions leveraged in this context. We also incorporate a section on competing optimization methods, and showcase the advantages of our approach compared to gradient-based optimization techniques. We significantly enhance our evaluation in the following manner:

(i) we demonstrate the applicability of our approach to supervised learning, and specifically to Arithmetic DNNs (in fact, to the best of our knowledge, we are the first to verify Arithmetic DNNs); and

(ii) we enhance the previously presented DRL case study to include additional results and benchmarks.
We believe these additions merit an extended paper, which complements our original, shorter one [9].
2 Background
Deep Neural Networks (DNNs) [61] are directed graphs comprising several layers, each of which computes a mathematical operation. Upon receiving an input, i.e., an assignment of values to the nodes of the DNN’s first (input) layer, the DNN propagates these values, layer after layer, until eventually reaching the final (output) layer, which computes the DNN’s output for the received input. Each node computes its value based on the type of operation with which it is associated. For example, nodes in weighted-sum layers compute affine combinations of the values of the nodes in the preceding layer to which they are connected. Another popular layer type is the rectified linear unit (ReLU) layer, in which each node y computes the value \(y=\text {ReLU}\,{}(x)=\max (x,0)\), where x is the output value of a single node from the preceding layer. For more details on DNNs and their training procedure, see [61]. Fig. 1 depicts an example of a toy DNN. Given input \(V_1=[2, 1]^T\), the second layer of this toy DNN computes the weighted sum \(V_2=[7,-6]^T\). Subsequently, the ReLU functions are applied in the third layer, resulting in \(V_3=[7,0]^T\). Finally, the DNN’s single output is accordingly calculated as \(V_4=[14]\).
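The forward propagation described above can be sketched in a few lines of NumPy. Since Fig. 1 is not reproduced here, the weight matrices below are an assumption, chosen only so that the traced example holds (input \([2,1]^T\) yields hidden pre-activations \([7,-6]^T\), post-ReLU values \([7,0]^T\), and output 14):

```python
import numpy as np

def forward(x, weights):
    """Propagate an input through weighted-sum layers, applying ReLU between them."""
    v = np.asarray(x, dtype=float)
    for i, W in enumerate(weights):
        v = W @ v                      # weighted-sum (here: purely linear) layer
        if i < len(weights) - 1:       # ReLU on every hidden layer
            v = np.maximum(v, 0.0)
    return v

# Illustrative weights (assumed; not taken from Fig. 1) reproducing the traced example:
W1 = np.array([[3.0, 1.0], [-2.0, -2.0]])   # input -> hidden: [2,1] -> [7,-6]
W2 = np.array([[2.0, 5.0]])                 # hidden -> output: ReLU([7,-6]) = [7,0] -> [14]

print(forward([2, 1], [W1, W2]))  # -> [14.]
```

The same `forward` helper works for any list of weight matrices with compatible shapes.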
Deep Reinforcement Learning (DRL) [87] is a popular paradigm in machine learning, in which a reinforcement learning (RL) agent, realized as a DNN, interacts with an environment across multiple timesteps \(t\in \{0,1,2,\ldots \}\). At each discrete timestep, the DRL agent observes the environment’s state \(s_{t} \in \mathcal {S}\), and selects an action \(N(s_t)=a_{t} \in \mathcal {A}\) accordingly. As a result of this action, the environment may change and transition to its next state \(s_{t+1}\), and so on. During training, at each timestep, the environment also presents the agent with a reward \(r_t\) based on its previously chosen action. The agent is trained by repeatedly interacting with the environment, with the goal of maximizing its expected cumulative discounted reward \(R_t=\mathbb {E}\big [\sum _{t}\gamma ^{t}\cdot r_t\big ]\), where \(\gamma \in \big [0,1\big ]\) is a discount factor, i.e., a hyperparameter that controls the cumulative effect of past decisions on the reward. For additional details, see [65, 68, 137, 148, 149, 175].
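The discounted-return expression above can be made concrete with a minimal sketch (for a single, finite trajectory of observed rewards):

```python
def discounted_return(rewards, gamma):
    """Cumulative discounted reward of one trajectory: sum_t gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three timesteps with rewards [1, 1, 1] and gamma = 0.5
# give 1 + 0.5 + 0.25 = 1.75.
print(discounted_return([1, 1, 1], 0.5))  # -> 1.75
```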
Supervised Learning (SL) is another popular machine learning (ML) paradigm. In SL, the input is a dataset of training data comprising pairs of inputs and their groundtruth labels \((x_i, y_i)\), drawn from some (possibly unknown) distribution \(\mathcal {D}\). The dataset is used to train a model to predict the correct output label for new inputs drawn from the same distribution.
Arithmetic DNNs. Despite the success of DNNs in many SL tasks, they (surprisingly) fail to generalize even in the simple SL task of learning arithmetic operations [156]. When trained to perform such tasks, they often succeed on inputs sampled from the distribution on which they were trained, but their performance significantly deteriorates when tested on inputs drawn OOD, e.g., input values from another domain. This behavior indicates that Arithmetic DNNs tend to overfit their training data rather than systematically learning from it. This is observed even in the context of simple arithmetic tasks, such as approximating the identity function or learning to sum up inputs. A common belief is that the limitations of classic learning processes, combined with DNNs’ overparameterized nature, prevent them from learning to generalize arithmetic operations successfully [101, 156].
DNN Verification. A DNN verifier [76] receives the following inputs: (i) a (trained) DNN N; (ii) a precondition P on the inputs of the DNN, effectively limiting the possible assignments to be part of a domain of interest; and (iii) a postcondition Q on the outputs of the DNN. A sound DNN verifier can then respond in one of the following two ways: (i) SAT , along with a concrete input \(x'\) for which the query \(P(x') \wedge Q(N(x'))\) is satisfied; or (ii) UNSAT , indicating no such input \(x'\) exists. Typically, the postcondition Q encodes the negation of the DNN’s desirable behavior for all inputs satisfying P. Hence, a SAT result indicates that the DNN may err, and that \(x'\) is an example of an input in our domain of interest, that triggers a bug; whereas an UNSAT result indicates that the DNN always performs correctly.
For example, let us revisit the DNN in Fig. 1. Suppose that we wish to verify that for all nonnegative inputs the toy DNN outputs a value strictly smaller than 25, i.e., for all inputs \(x=\langle v_1^1,v_1^2\rangle \in \mathbb {R}^2_{\ge 0}\), it holds that \(N(x)=v_4^1 < 25\). This is encoded as a verification query by choosing a precondition restricting the inputs to be nonnegative, i.e., \(P= ( v^1_1\ge 0 \wedge v_1^2\ge 0)\), and by setting \(Q=(v_4^1\ge 25)\), which is the negation of our desired property. For this specific verification query, a sound verifier will return SAT , alongside a feasible counterexample such as \(x=\langle 1, 3\rangle \), which produces \(v_4^1=26 \ge 25\). Hence, this property does not hold for the DNN described in Fig. 1. To date, a plethora of DNN verification engines have been put forth [4, 55, 69, 76, 97, 162], mostly used in the context of validating the robustness of a general DNN to local adversarial perturbations.
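The SAT case can be illustrated with a small sketch. The network and predicates below are stand-ins (not the weights of Fig. 1), chosen only to show how a counterexample returned by a verifier is checked against the precondition P and postcondition Q:

```python
def is_sat_witness(N, P, Q, x):
    """A verifier answers SAT iff some input x satisfies P(x) and Q(N(x));
    this helper merely confirms that a *candidate* witness really does so."""
    return P(x) and Q(N(x))

# Stand-in single-output "DNN" and property (assumed, for illustration only):
N = lambda x: 8 * x[0] + 6 * x[1]       # toy network
P = lambda x: x[0] >= 0 and x[1] >= 0   # precondition: non-negative inputs
Q = lambda y: y >= 25                   # negation of the desired "output < 25"

print(is_sat_witness(N, P, Q, (1, 3)))  # -> True: (1, 3) violates the property
```

An UNSAT answer, by contrast, guarantees that no such witness exists anywhere in the domain encoded by P.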
3 Quantifying Generalizability via Verification
Our strategy for evaluating a DNN’s potential for generalization on out-of-distribution inputs is rooted in the “Karenina hypothesis”: while there might be numerous (potentially infinite) ways to generate incorrect results, correct outputs are likely to be quite similar.^{Footnote 1} Therefore, to pinpoint DNN-based decision rules that excel at generalizing to new input domains, we propose the training of multiple DNNs and assessing the learned decision models based on the alignment of their outputs with those of other models in the domain. As we elaborate next, this scoring procedure can be conducted using a backend DNN verifier. We show how to effectively distill DNNs that successfully generalize OOD, by iteratively filtering out models that tend to disagree with their peers.
3.1 Our Iterative Procedure
To facilitate our reasoning about the agreement between two DNN-based decision rules over an input domain, we introduce the following definitions.
Intuitively, a distance function allows us to quantify the (dis)agreement level between the decisions of two DNNs when fed the same input. We later elaborate on examples of the various distance functions used.
This definition captures the notion that for every possible input in our domain \(\Psi \), DNNs \(N_{1}\) and \(N_{2}\) produce outputs that are (at most) \(\alpha \)-distance apart from each other. A small \(\alpha \) value indicates that \(N_1\) and \(N_2\) produce “close” values for all inputs in the domain \(\Psi \), whereas a large \(\alpha \) value indicates that there exists an input in \(\Psi \) for which there is a notable divergence between the two decision models.
To calculate \(\text {PDT}\,\) values, our method utilizes verification to perform a binary search aiming to find the maximum distance between the outputs of a pair of DNNs; see Alg. 1.
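The binary search of Alg. 1 can be sketched as follows. Here `verifier_unsat` is a stand-in for the backend DNN-verifier call, returning True (UNSAT) iff no input in \(\Psi\) makes the two networks' outputs at least \(\alpha\) apart; the oracle at the bottom is assumed, purely for illustration:

```python
def pairwise_disagreement_threshold(verifier_unsat, M, eps):
    """Binary-search the maximal output distance between two DNNs over a domain
    (a sketch of Alg. 1). M is a domain-informed upper bound on the difference;
    eps is the search accuracy."""
    lo, hi = 0.0, M
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if verifier_unsat(mid):
            hi = mid   # UNSAT: no input reaches distance mid; the true max is below
        else:
            lo = mid   # SAT: some input witnesses a distance of at least mid
    return hi

# Stand-in oracle for testing: pretend the true maximal distance is 3.7.
oracle = lambda alpha: alpha > 3.7
pdt = pairwise_disagreement_threshold(oracle, M=10.0, eps=1e-3)
print(round(pdt, 2))  # -> 3.7
```

In the actual procedure, each oracle call dispatches one verification query over the concatenated network, so the number of queries per pair is logarithmic in \(M/\epsilon\).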
Once calculated, the pairwise disagreement thresholds can subsequently be aggregated to measure the overall disagreement between a decision model and a set of other decision models, as defined next.
Intuitively, a disagreement score of a single DNN decision model measures the degree to which it tends to disagree, on average, with the remaining models.
Iterative Scheme. Leveraging disagreement scores, our heuristic employs an iterative process (see Alg. 2) to choose a subset of models that exhibit generalization to out-of-distribution scenarios, as encoded by inputs in \(\Psi \). At first, k DNNs \(\{N_1, N_2,\ldots ,N_k\}\) are trained independently on the training data. Next, a backend verifier is invoked in order to calculate, for each of the \(\binom{k}{2}\) DNN pairs, their respective pairwise-disagreement threshold (up to some accuracy, \(\epsilon \)). Then, our algorithm iteratively: (i) calculates the disagreement score of each model in the remaining model subset; (ii) identifies models with (relatively) high DS scores; and (iii) removes them from the model set (Line 9 in Alg. 2). We also note that the algorithm is given an upper bound (M) on the maximum difference, as informed by the user’s domain-specific knowledge.
Termination. The procedure terminates after it exceeds a predefined number of iterations (Line 3 in Alg. 2), or alternatively, when all remaining models “agree” across the input domain \(\Psi \), as indicated by nearly identical disagreement scores (Line 7 in Alg. 2).
DS Removal Threshold. There are various possible criteria for determining the DS threshold above which models are removed, as well as the number of models to remove in each iteration (Line 8 in Alg. 2). In our evaluation, we used a simple and natural approach of iteratively removing the \(p\%\) of models with the highest disagreement scores, for some choice of p (\(p= 25\%\) in our case). A thorough discussion of additional filtering criteria (all of which proved successful, on all benchmarks) is relegated to Appendix D.
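The scoring-and-removal loop above can be sketched as follows. Here the disagreement score of a model is taken as its average PDT with the surviving models, and the toy PDT values are assumed; in the real procedure they come from verifier queries:

```python
def filter_models(models, pdt, p=0.25, max_iters=10, tol=1e-6):
    """Sketch of the iterative scheme (Alg. 2): repeatedly score each model by its
    average pairwise-disagreement threshold (PDT) with the surviving models, and
    drop the p% with the highest disagreement scores. pdt[(i, j)] (i < j) holds
    the precomputed PDT for each model pair."""
    survivors = list(models)
    for _ in range(max_iters):
        if len(survivors) <= 1:
            break
        ds = {m: sum(pdt[tuple(sorted((m, n)))] for n in survivors if n != m)
                 / (len(survivors) - 1)
              for m in survivors}
        if max(ds.values()) - min(ds.values()) < tol:   # remaining models "agree"
            break
        n_remove = max(1, int(p * len(survivors)))
        worst = sorted(survivors, key=lambda m: ds[m], reverse=True)[:n_remove]
        survivors = [m for m in survivors if m not in worst]
    return survivors

# Toy example (assumed PDT values): models 0-2 agree closely; model 3 diverges.
models = [0, 1, 2, 3]
pdt = {(0, 1): 0.1, (0, 2): 0.1, (1, 2): 0.1,
       (0, 3): 5.0, (1, 3): 5.0, (2, 3): 5.0}
print(filter_models(models, pdt))  # -> [0, 1, 2]
```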
3.2 Verification Queries
Next, we elaborate on how we encoded the queries, which we later fed to our backend verification engine (Line 4 in Alg. 1), in order to compute the PDT scores for a DNN pair.
Given a DNN pair, \(N_1\) and \(N_2\), we execute the following stages:

1. Concatenate \(N_1\) and \(N_2\) into a new DNN \(N_3=[N_1; N_2]\), which is roughly twice the size of each of the original DNNs (as both \(N_1\) and \(N_2\) have the same architecture). The input of \(N_3\) is of the same size as that of each single DNN and is connected to the second layer of each DNN, consequently allowing the same input to flow through the network to the output layers of \(N_1\) and \(N_2\). Thus, the output layer of \(N_3\) is a concatenation of the outputs of both \(N_1\) and \(N_2\). A scheme depicting the construction of a concatenated DNN appears in Fig. 2.

2. Encode a precondition P which represents the ranges of value assignments to the input variables. As mentioned before, the value-range bounds are supplied by the system designer, based on prior knowledge of the input domain. In some cases, these values can be predefined to match a specific OOD setting being evaluated; in others, they can be extracted from empirical simulations of the models post-training. For additional details, we refer the reader to Appendix C.

3. Encode a postcondition Q which encapsulates, for a fixed slack \(\alpha \) and a given distance function \(d: \mathcal {O}\times \mathcal {O}\mapsto \mathbb {R^+}\), the requirement that for an input \(x'\in \Psi \) the following holds:
$$\begin{aligned} d(N_{1}(x'),N_{2}(x')) \ge \alpha \end{aligned}$$
Examples of distance functions include:

(a) \(L_{1}\) norm:
$$\begin{aligned}d(N_{1}, N_{2}) = \max _{x\in \Psi }|N_{1}(x) - N_{2}(x)|\end{aligned}$$
This distance function is used in our evaluation of the Aurora and Arithmetic DNNs benchmarks.

(b) condition-distance (“c-distance”): This function returns the maximal \(L_{1}\) norm of the difference between the two DNNs’ outputs, over all inputs \(x \in \Psi \) such that both outputs \(N_{1}(x)\), \(N_{2}(x)\) comply with a constraint \(\textbf{c}\):
$$\begin{aligned}\text {c-distance}(N_{1}, N_{2}) \triangleq \max _{x\in \Psi \text { s.t. } N_{1}(x),N_{2}(x) \vDash c}|N_{1}(x) - N_{2}(x)|\end{aligned}$$
This distance function is used in our evaluation of the Cartpole and Mountain Car benchmarks. In these cases, for two constraints c and c', we defined the distance function to be:
$$\begin{aligned} d(N_{1}, N_{2}) =\min (\text {c-distance} (N_{1}, N_{2}), \text {c'-distance}(N_{1}, N_{2})) \end{aligned}$$
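As a sketch of stage 1 above, the concatenation \(N_3=[N_1; N_2]\) can be realized by stacking the first-layer weights (so both networks read the same input) and using block-diagonal matrices for the subsequent layers. The toy single-hidden-layer networks below are assumed, purely for illustration:

```python
import numpy as np

def concat_networks(weights1, weights2):
    """Build N3 = [N1; N2]: both networks share the input, and N3's output stacks
    N1's and N2's outputs (a sketch of the construction depicted in Fig. 2).
    Each network is a list of weight matrices, with ReLU between layers."""
    combined = [np.vstack([weights1[0], weights2[0]])]   # shared input layer
    for W1, W2 in zip(weights1[1:], weights2[1:]):       # then block-diagonal layers
        block = np.zeros((W1.shape[0] + W2.shape[0], W1.shape[1] + W2.shape[1]))
        block[:W1.shape[0], :W1.shape[1]] = W1
        block[W1.shape[0]:, W1.shape[1]:] = W2
        combined.append(block)
    return combined

def forward(x, weights):
    """Forward pass with ReLU on every hidden layer."""
    v = np.asarray(x, dtype=float)
    for i, W in enumerate(weights):
        v = W @ v
        if i < len(weights) - 1:
            v = np.maximum(v, 0.0)
    return v

# Two toy networks with identical architectures (weights assumed):
A = [np.array([[1.0, 2.0]]), np.array([[3.0]])]   # N1(x) = 3 * ReLU(x0 + 2*x1)
B = [np.array([[2.0, 0.0]]), np.array([[1.0]])]   # N2(x) = ReLU(2*x0)
N3 = concat_networks(A, B)
out = forward([1, 1], N3)
print(out)  # -> [9. 2.]: N1([1,1]) = 9 stacked on N2([1,1]) = 2
```

The PDT query is then encoded over \(N_3\) alone, e.g., by asking the verifier whether the absolute difference between its two stacked outputs can reach \(\alpha\).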
4 Evaluation
Benchmarks. We extensively evaluated our method using four benchmarks: (i) Cartpole; (ii) Mountain Car; (iii) Aurora; and (iv) Arithmetic DNNs. The first three are DRL benchmarks, whereas the fourth is a challenging supervised learning benchmark. Our evaluation of DRL systems spans two classic DRL settings, Cartpole [21] and Mountain Car [108], as well as the recently proposed Aurora congestion controller for Internet traffic [73]. We also extensively evaluate our approach on Arithmetic DNNs, i.e., DNNs trained to approximate mathematical operations (such as addition, multiplication, etc.).
Setup. For each of the four benchmarks, we initially trained multiple DNNs with identical architectures, varying only the random seed employed in the training process. Subsequently, we removed from this set all DNNs except those that achieved high reward values (in the DRL benchmarks) or high precision (in the supervised-learning benchmark) in-distribution, in order to rule out the chance that a decision model exhibits poor generalization solely because of inadequate training. Next, we specified out-of-distribution input domains of interest for each specific benchmark and employed Alg. 2 to choose the models deemed most likely to exhibit good generalization on those domains according to our framework. To determine the ground truth regarding the actual generalization performance of different models in practice, we applied the models to inputs drawn from the considered OOD domain, and ranked them based on empirical performance (average reward/maximal error, depending on the benchmark). To assess the robustness of our results, we performed the last step with different choices of probability distributions over the inputs in the domain.
Verification. All queries were dispatched using Marabou [77, 165], a sound and complete DNN verification engine capable of addressing queries regarding a DNN’s characteristics by converting them into SMT-based constraint satisfaction problems. The Cartpole benchmark included 48,000 queries (24,000 queries per each of the two platform sides), all of which terminated within 12 hours. The Mountain Car benchmark included 10,080 queries, all of which terminated within one hour. The Aurora benchmark included 24,000 verification queries, all but 12 of which terminated within 12 hours; the remaining ones hit the timeout threshold. Finally, the Arithmetic DNNs benchmark included 2,295 queries, running with a timeout value of 24 hours; all queries terminated, with over \(96\%\) running in less than an hour, and the longest non-DRL query taking slightly less than 13.8 hours. All benchmarks ran on a single CPU, with a memory limit of either 1 GB (for Arithmetic DNNs) or 2 GB (for the DRL benchmarks). We note that in the case of the Arithmetic DNNs benchmark, Marabou internally used the Gurobi LP solver^{Footnote 2} as a backend engine when dealing with these queries.
Results. The findings support our claim that models chosen using our approach are expected to significantly outperform other models for inputs drawn from the OOD domain considered. This is the case for all evaluated settings and benchmarks, regardless of the chosen hyperparameters and filtering criteria. We note that although our approach can potentially also remove some of the successful models, in all benchmarks, and across all evaluations, it managed to remove all unsuccessful models. Next, we provide an overview of our evaluation. A comprehensive exposition and additional details can be found in the appendices. Our code and benchmarks are publicly available online [10].
4.1 Cartpole
Cartpole [58] is a widely known RL benchmark where an agent controls the motion of a cart with an inverted pendulum (“pole”) affixed to its top. The cart traverses a platform, and the objective of the agent is to maintain balance for the pole for as long as possible (see Fig. 3).
Agent and Environment. The agent is provided with inputs, denoted as \(s=(x, v_{x}, \theta , v_{\theta })\), where x represents the cart’s position on the platform, \(\theta \) represents the angle of the pole (\(\theta \approx 0\) for a balanced pole and \(\theta \approx \pm 90^\circ \) for an unbalanced pole), \(v_{x}\) indicates the cart’s horizontal velocity, and \(v_{\theta }\) denotes the pole’s angular velocity.
In-Distribution Inputs. During the training process, the agent is encouraged to balance the pole while remaining within the boundaries of the platform. In each iteration, the agent produces a single output representing the cart’s acceleration (both sign and magnitude) for the subsequent step. Throughout the training, we defined the platform’s limits as \([-2.4, 2.4]\), and the initial position of the cart as nearly static and close to the center of the platform (as depicted on the left-hand side of Fig. 3). This was accomplished by uniformly sampling the initial state vector values of the cart from the range \([-0.05, 0.05]\).
(OOD) Input Domain. We examine an input domain with larger platforms compared to those utilized during training. Specifically, we extend the range of the x coordinate in the input vectors to cover \([-10, 10]\). The bounds for the other inputs remain the same as during training. For additional details, see Appendices A and C.
Evaluation. We trained a total of \(k=16\) models, all of which demonstrated high rewards during training on the short platform. Subsequently, we applied Alg. 2 until convergence (requiring 7 iterations in our experiments) on the aforementioned input domain. This resulted in a collection of 3 models. We then subjected all 16 original models to inputs drawn from the new, OOD domain. The generated distribution was crafted to represent a novel scenario: the cart is now positioned at the center of a considerably longer, shifted platform (see the red-colored cart depicted in Fig. 3).
All remaining parameters in the OOD environment matched those used for the original training. Figure 4 presents the outcomes of evaluating the models on 20,000 OOD instances. Out of the initial 16 models, 11 achieved low to mediocre average rewards, demonstrating their limited capacity to generalize to this new distribution. Only 5 models attained high reward values on the OOD domain, including the 3 models identified by our approach; thus indicating that our method successfully eliminated all 11 models that would have otherwise exhibited poor performance in this OOD setting (see Fig. 5). For more information, we refer the reader to Appendix E.
4.2 Mountain Car
For our second experiment, we evaluated our method on the Mountain Car [128] benchmark, in which an agent controls a car that needs to learn how to escape a valley and reach a target (see Fig. 6).
Agent and Environment. The car (agent) is placed in a valley between two hills (at \(x\in [-1.2, 0.6]\)), and needs to reach a flag on top of one of the hills. The state \(s=(x, v_{x})\) represents the car’s location (along the x-axis) and velocity. The agent’s action (output) is the applied force: a continuous value indicating the magnitude and direction in which the agent wishes to move. During training, the agent is incentivized to reach the flag (placed at the top of one of the hills, originally at \(x=0.45\)). For each timestep until the flag is reached, the agent receives a small negative reward; if it reaches the flag, the agent is rewarded with a large positive reward. An episode terminates when the flag is reached, or when the number of steps exceeds some predefined value (300 in our experiments). Good and bad models are distinguished by an average reward threshold of 90.
In-Distribution Inputs. During training (in-distribution), the car is initially placed on the left side of the valley’s bottom, with a low, random velocity (see Fig. 6a). We trained \(k=16\) agents (denoted as \(\{1, 2, \ldots , 16\}\)), all of which perform well, i.e., achieve an average reward higher than our threshold, in-distribution. This evaluation was conducted over 10,000 episodes.
(OOD) Input Domain. Based on the scenarios used by the training environment, we specified the (OOD) input domain by: (i) extending the x-axis from \([-1.2, 0.6]\) to \([-2.4, 0.9]\); (ii) moving the flag further to the right, from \(x=0.45\) to \(x=0.9\); and (iii) setting the car’s initial location further to the right of the valley’s bottom, with a large initial negative velocity (to the left). An illustration appears in Fig. 6b. These new settings represent a novel state distribution, which causes the agents to respond to states that they had not observed during training: different locations, greater velocity, and different combinations of location and velocity directions.
Evaluation. Out of the \(k=16\) models that performed well indistribution, 4 models failed (i.e., did not reach the flag, ending their episodes with a negative average reward) in the OOD scenario, while the remaining 12 succeeded, i.e., reached a high average reward when simulated on the OOD data (see Fig. 7). The large ratio of successful models is not surprising, as Mountain Car is a relatively easy benchmark.
To evaluate our algorithm, we ran it on these models and the aforementioned (OOD) input domain, and checked whether it removed the models that (although successful in-distribution) fail in the new, harder setting. Indeed, our method filtered out all unsuccessful models, leaving only a subset of 5 models (\(\{2,4,8,10,15\}\)), all of which perform well in the OOD scenario. For additional information, see Appendix F.
4.3 The Aurora Congestion Controller
In the third benchmark, we applied our methodology to an intricate system that enforces a policy for the real-world task of Internet congestion control. Congestion control aims to determine, for each traffic source in a communication network, the appropriate rate at which data packets should be dispatched into the network. Managing congestion is a notably challenging and fundamental issue in computer networking [95, 110]: transmitting packets too quickly can result in network congestion, causing data loss and delays, whereas employing low sending rates may result in the underutilization of available network bandwidth. Developed by [73], Aurora is a DNN-based congestion controller trained to optimize network performance. Recent research has delved into formally verifying the reliability of DNN-based systems, with Aurora serving as a key example [11, 46]. Within each timestep, an Aurora agent collects network statistics and determines the packet transmission rate for the next timestep. For example, if the agent observes poor network conditions (e.g., high packet loss), we expect it to decrease the packet sending rate; conversely, under good conditions, we expect it to increase the rate to better utilize the available bandwidth. We note that Aurora handles a much harder task than the previous RL benchmarks (Cartpole and Mountain Car): congestion controllers must gracefully respond to diverse potential events, interpreting nuanced signals presented by Aurora’s inputs. Unlike in prior benchmarks, determining the optimal policy in this scenario is not straightforward.
Agent and Environment. Aurora receives as input an ordered set of t vectors \(v_{1}, \ldots ,v_{t}\), that collectively represent observations from the previous t timesteps (each of the vectors \(v_{i}\in \mathbb {R}^3\) includes three distinct values that represent statistics on the network’s condition, as detailed in Appendix G). The agent has a single output indicating the change in the packet sending rate over the following timestep. In line with [11, 46, 73], we set \(t=10\) timesteps, hence making Aurora’s inputs of dimension \(3t=30\). During training, Aurora’s reward function is a linear combination of the data sender’s packet loss, latency, and throughput, as observed by the agent (see [73] for more details).
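To illustrate, the input construction and reward structure can be sketched as follows. The reward coefficients below are illustrative placeholders, not the actual coefficients used to train Aurora (those appear in [73]).

```python
T = 10  # history length t, as in our setup (input dimension 3t = 30)

def make_input(history):
    """Flatten the t most recent 3-dimensional statistic vectors into
    a single 30-dimensional input for the agent."""
    assert len(history) == T and all(len(v) == 3 for v in history)
    return [stat for vec in history for stat in vec]

def reward(throughput, latency, loss, a=10.0, b=1000.0, c=2000.0):
    """Linear combination of the sender's statistics. The coefficients
    a, b, c here are illustrative placeholders (see [73] for the reward
    actually used to train Aurora)."""
    return a * throughput - b * latency - c * loss
```

Note how higher latency or loss decreases the reward, incentivizing the agent to lower its sending rate when congestion signals appear.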
In-Distribution Inputs. During training, Aurora performs congestion control in basic network scenarios: a single sender node sends traffic to a single receiver node across a single network link. Aurora undergoes training across a range of options for the initial sending rate, link bandwidth, link packet-loss rate, link latency, and the size of the link’s packet buffer. During the training phase, data packets are initially sent by Aurora at a rate corresponding to 0.3 to 1.5 times the link’s bandwidth, leading mostly to low congestion, as depicted in Fig. 8a.
(OOD) Input Domain. In our experiments, the input domain represented a link with a limited packet buffer, indicating that the network can only store a small number of packets (with most surplus traffic being discarded), resulting in the link displaying erratic behavior. This is reflected in the initial sending rate being set to up to 8 times (!) the link’s bandwidth, simulating the potential for a significant reduction in available bandwidth (for example, due to competition, traffic shifts, etc.). For additional details, see Appendix G.
Evaluation. We executed our algorithm and evaluated the models by assessing their disagreement over this extensive domain, encompassing inputs that were not encountered during training and that represent the aforementioned conditions (depicted in Fig. 8b).
Experiment (1): High Packet Loss. In this experiment, we trained more than 100 Aurora agents in the original (in-distribution) environment. From this pool, we chose \(k=16\) agents that attained a high average reward in the in-distribution setting (see Fig. 9a), as evaluated over 40,000 episodes from the same distribution on which the models were trained. Subsequently, we assessed these agents using out-of-distribution inputs within the previously outlined domain. The primary distinction between the training distribution and the new (OOD) inputs lies in the potential occurrence of exceptionally high packet loss rates during initialization.
Our assessment of out-of-distribution inputs within the domain reveals that while all 16 models excelled in the in-distribution setting, only 7 agents demonstrated the ability to effectively handle such OOD inputs (see Fig. 9b). When Algorithm 2 was applied to the 16 models, it successfully identified and removed all 9 models that exhibited poor generalization on the out-of-distribution inputs (see Fig. 10). Additionally, it is worth mentioning that during the initial iterations, the four models chosen for exclusion were \(\{1, 2, 6, 13\}\), which constitute the poorest-performing models on the OOD inputs (see Appendix G).
Experiment (2): Additional Distributions over OOD Inputs. To further demonstrate that our method is able to retain superior-performing models and eliminate inferior ones within the given input domain, we conducted additional Aurora experiments by varying the distributions (probability density functions) over the OOD inputs. Our assessment indicates that all models filtered out by Algorithm 2 consistently exhibited low reward values for these alternative distributions as well (see Fig. 30 and Fig. 31 in Appendix G). These results highlight an important advantage of our approach: it applies to all inputs within the considered domain, and hence to all distributions over these inputs. We note again that our model filtering process is based on verification queries whose imposed bounds can represent infinitely many distribution functions over the corresponding input ranges. In other words, our method should also apply to additional OOD settings beyond the ones we originally considered, i.e., settings that share the specified input range but involve a different probability density function (PDF) over this range.
Additional Experiments. We additionally created a fresh set of Aurora models by modifying the training process to incorporate substantially longer interactions (increasing from 50 to 400 steps). Subsequently, we replicated the aforementioned experiments. The outcomes, detailed in Appendix G, affirm that our approach once again effectively identified a subset of models capable of generalizing well to distributions across the OOD input domain.
4.4 Arithmetic DNNs
In our last benchmark, we applied our approach to supervised-learning models, as opposed to models trained via DRL. In supervised learning, the agents are trained using inputs that have accompanying “ground truth” results, per data point. Specifically, we focused here on an Arithmetic DNNs benchmark, in which the DL models are trained to receive an input vector, and to approximate a simple arithmetic operation on some (or all) of the vector’s entries. We note that this supervised-learning benchmark is considered quite challenging [101, 156].
Agent and Environment. We trained a DNN for the following supervised task. The input is a vector of 10 real numbers, drawn uniformly at random from some range [l, u]. The output is a single scalar, representing the sum of two hidden (yet consistent across the task) indices of the input vector; in our case, the first 2 input indices, as depicted in Fig. 11. Differently put, the agent needs to learn to model the sum of the relevant (initially unknown) indices, while learning to ignore the rest of the inputs. We trained our networks for 10 epochs over a dataset consisting of 10,000 input vectors drawn uniformly at random from the range \([l=-10, u=10]\), using the Adam optimization algorithm [79] with a learning rate of \(\gamma = 0.001\) and the mean squared error (MSE) loss function. For additional details, see Appendix B.
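To give a feel for the task, the following sketch generates such a dataset and fits a plain linear model with batch gradient descent on the MSE loss (rather than training a full DNN with Adam, as in our actual setup); all hyperparameters here are illustrative. Since the target is exactly linear, the weights on the two relevant indices approach 1 while the remaining weights approach 0.

```python
import random

random.seed(0)
DIM, N = 10, 200

# Dataset: vectors drawn uniformly from [-10, 10]^10; label = x[0] + x[1].
xs = [[random.uniform(-10, 10) for _ in range(DIM)] for _ in range(N)]
ys = [x[0] + x[1] for x in xs]

# Full-batch gradient descent on the MSE loss for a linear model w·x.
w = [0.0] * DIM
lr = 0.01
for _ in range(300):
    grad = [0.0] * DIM
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i in range(DIM):
            grad[i] += 2.0 * err * x[i] / N
    w = [wi - lr * gi for wi, gi in zip(w, grad)]
# w[0] and w[1] converge to ~1; the remaining weights converge to ~0.
```

This captures the essence of the task: learning to sum the relevant indices while ignoring the rest.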
In-Distribution Inputs. During training, we presented the models with input values sampled from a multidimensional uniform distribution \([-10,10]^{10}\), resulting in a single output in the range \([-20,20]\). As expected, the models performed well over this distribution, as depicted in Fig. 46a of the Appendix.
(OOD) Input Domain. A natural OOD distribution is any d-dimensional distribution in which each input is drawn from a range different from \([l=-10, u=10]\), and hence can necessarily be assigned values on which the model was not originally trained. In our case, we chose the multidimensional uniform distribution over \([l=-1,000, u=1,000]^{10}\). Unlike the case for the in-distribution inputs, there was high variance among the performance of the models in this novel, unseen OOD setting, as depicted in Fig. 46b of the Appendix.
Evaluation. We originally trained \(n=50\) models. After validating that all models succeed in-distribution, we generated a pool of \(k=10\) models. This pool was generated by collecting the five best and five worst models OOD (based on their maximal normalized error over the same 100,000 points sampled OOD). We then executed our algorithm and checked whether it was able to identify and remove all unsuccessful models, which constituted half of the original model pool. Indeed, as can be seen in Fig. 12, all bad models were filtered out within three iterations. After convergence, three models remained in the model pool, including model {8}, which constitutes the best model OOD. This experiment was successfully repeated with additional filtering criteria (see Fig. 47 in Appendix H).
4.5 Averaging the Selected Models
To improve performance even further, it is possible to create (in polynomial time) an ensemble of the surviving “good” models, instead of selecting a single model. As DNN robustness is linked to uncertainty, and as ensembles are a prominent approach for uncertainty prediction, averaging ensembles has been shown to improve performance [85]. For example, in the Arithmetic DNNs benchmark, our approach eventually selected three models ({5}, {8}, and {9}, as depicted in Fig. 12). Subsequently, we generated an ensemble comprised of these three DNN models. When the ensemble evaluates a given input, that input is first independently passed to each of the ensemble members, and the final ensemble prediction is the average of the members’ outputs. We then sampled 5,000 inputs drawn in-distribution (see Fig. 13a) and 5,000 inputs drawn OOD (see Fig. 13b), and compared the average and maximal errors of the ensemble on these sampled inputs to those of its constituents. In both cases, the ensemble had a maximal absolute error significantly lower than each of its three constituent DNNs, as well as a lower average error (with the sole exception of the average error OOD, which was the second-smallest error, by a margin of only 0.06). Although the use of ensembles is not directly related to our approach, it demonstrates how our technique can be combined with additional robustness techniques to further improve performance.
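The averaging mechanism itself is straightforward; the following sketch uses three toy regressors with hand-picked, partly offsetting errors (purely illustrative, not our trained models) to show how averaging can shrink the worst-case error.

```python
# Three toy "models" approximating f(x) = x, with hand-picked biases.
models = [lambda x: x + 0.3, lambda x: x - 0.3, lambda x: x + 0.1]

def ensemble(x):
    """Pass the input to each member independently; average the outputs."""
    return sum(m(x) for m in models) / len(models)

inputs = [i / 10 for i in range(-50, 51)]
max_err_members = max(abs(m(x) - x) for m in models for x in inputs)
max_err_ensemble = max(abs(ensemble(x) - x) for x in inputs)
# The members' biases partly cancel: 0.3 vs. (0.3 - 0.3 + 0.1) / 3.
```

In this constructed example the ensemble's maximal error is an order of magnitude below that of its worst member; of course, such cancellation is not guaranteed in general.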
4.6 Analyzing the Eliminated Models
We conducted an additional analysis of the eliminated models, in order to compare the average PDT scores of eliminated “good” models to those of eliminated “bad” ones. For each of the five benchmarks, we divided the eliminated models into two separate clusters, of either “good” or “bad” models (note that the latter necessarily includes all bad models, as in all our benchmarks we return strictly “good” models). For each cluster, we calculated the average PDT score over all of its DNN pairs. The results, summarized in Table 1, demonstrate a clear decrease in the average PDT score for the cluster of DNN pairs comprising successful models, compared to their peers. This trend is observed across all benchmarks, resulting in an average PDT score difference of \(21.2\%\) to \(63.2\%\) between the clusters, per benchmark. We believe that these results further support our hypothesis that good models tend to make similar decisions.
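The cluster comparison can be sketched as follows; the PDT scores and model labels below are made-up numbers for illustration only.

```python
# Hypothetical PDT scores for eliminated DNN pairs (made-up numbers).
pdt = {("m1", "m2"): 0.12, ("m1", "m3"): 0.15,   # pairs of "good" models
       ("m4", "m5"): 0.31, ("m4", "m6"): 0.27}   # pairs with "bad" models
good_pairs = {("m1", "m2"), ("m1", "m3")}

def cluster_avg(pairs):
    return sum(pdt[p] for p in pairs) / len(pairs)

avg_good = cluster_avg(good_pairs)
avg_bad = cluster_avg(set(pdt) - good_pairs)
relative_gap = (avg_bad - avg_good) / avg_bad   # per-benchmark difference
```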
5 Comparison to Gradient-Based Methods & Additional Techniques
The methods presented in this paper build upon DNN verification (e.g., Line 4 in Alg. 1) in order to solve the following optimization problem: given a pair of DNNs, an input domain, and a distance function, what is the maximal distance between the DNNs’ outputs? In other words, verification is used to find an input that maximizes the difference between the outputs of two neural networks, under certain constraints. Although DNN verification requires significant computational resources [75], we demonstrate that it is crucial in our setting. To support this claim, we show the results of our method when verification is replaced with other, more scalable techniques, such as gradient-based algorithms (“attacks”) [84, 98, 133]. In recent years, these optimization techniques have become popular due to their simplicity and scalability, albeit at the cost of inherent incompleteness and reduced precision [13, 169]. As we demonstrate next, using gradient-based methods (instead of verification) at times produced suboptimal PDT values. This, in turn, resulted in retaining unsuccessful models that were successfully removed when using DNN verification.
5.1 Comparison to Gradient-Based Methods
For our comparison, we generated three gradient attacks:

Gradient attack # 1: a non-iterative Fast Gradient Sign Method (FGSM) [70] attack, used when optimizing linear constraints (e.g., the \(L_{1}\) norm), as in the case of Aurora and Arithmetic DNNs;

Gradient attack # 2: an Iterative PGD [100] attack, also used when optimizing linear constraints. We note that we used this attack in cases where the previous attack failed.

Gradient attack # 3: a Constrained Iterative PGD [100] attack, used when encoding nonlinear constraints (e.g., c-distance functions; see Sect. 3), as in the case of Cartpole and Mountain Car. This attack is a modified version of popular gradient attacks, altered so that they can succeed in our setting.
Next, we formalize these attacks as constrained optimization problems.
5.2 Formulation
Given an input domain \(\mathcal {D}\), an output space \(\mathcal {O}=\mathbb {R}\), and a pair of neural networks \(N_1: \mathcal {D} \rightarrow \mathbb {R}\) and \(N_2: \mathcal {D} \rightarrow \mathbb {R}\), we wish to find an input \(\varvec{x}\in \mathcal {D}\) that maximizes the difference between the outputs of these neural networks.
Formally, in the case of the \(L_{1}\) norm, we wish to solve the following optimization problem:
\[ \max_{\varvec{x}\in \mathcal {D}} \; |N_1(\varvec{x}) - N_2(\varvec{x})| \]
5.2.1 Gradient Attack # 1
In cases where only input constraints are present, a local maximum can be obtained via conventional gradient attacks, which maximize the following objective function:
\[ J(\varvec{x}) = |N_1(\varvec{x}) - N_2(\varvec{x})| \]
by taking steps in the direction of its gradient, and projecting them onto the domain \(\mathcal {D}\), that is:
\[ \varvec{x}_{t+1} = \left[ \varvec{x}_t + \epsilon \, \nabla_{\varvec{x}} J(\varvec{x}_t) \right]_{\mathcal {D}} \]
where \([\cdot ]_\mathcal {D}: \mathbb {R}^n \rightarrow \mathcal {D}\) projects the result onto \(\mathcal {D}\), and \(\epsilon \) is the step size. We note that \([\cdot ]_\mathcal {D}\) may be nontrivial to implement; however, in our cases, in which each input of the DNN is encoded as a range, i.e., \(\mathcal {D} \equiv \{\varvec{x}\in \mathbb {R}^n \mid \forall i\in [n]: l_i \le x_i \le u_i \}\), it can be implemented by clipping every coordinate to its appropriate range, and \(\varvec{x_0}\) can be obtained by taking \(\varvec{x_0} = \frac{\varvec{l} + \varvec{u}}{2}\).
In our context, the gradient attacks maximize a loss function for a pair of DNNs, relative to their input. The popular FGSM attack (gradient attack # 1) achieves this by moving in a single step toward the direction of the gradient. This simple attack has been shown to be quite efficient in causing misclassification [70]. In our setting, we can formalize this (projected) FGSM as follows:
\[ \varvec{x}' = \left[ \varvec{x}_0 + \epsilon \cdot \text {sign}\left( \nabla_{\varvec{x}} J(\varvec{x}_0) \right) \right]_{\mathcal {D}} \]
In the context of our algorithms, we define \(\mathcal {D}\) by two functions: INIT , which returns an initial value from \(\mathcal {D}\); and PROJECT , which implements \([\cdot ]_\mathcal {D}\).
5.2.2 Gradient Attack # 2
A more powerful extension of this attack is the PGD algorithm, which we refer to as gradient attack # 2. This attack iteratively moves in the direction of the gradient, often yielding superior results when compared to its single-step (FGSM) counterpart. The attack can be formalized as follows:
\[ \varvec{x}_{t+1} = \left[ \varvec{x}_t + \epsilon \cdot \text {sign}\left( \nabla_{\varvec{x}} J(\varvec{x}_t) \right) \right]_{\mathcal {D}} \]
We note that the case for using PGD in order to minimize the objective function is symmetric.
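As a minimal illustration of both attacks, the following sketch maximizes the disagreement objective for a pair of toy linear “networks” over a box domain, computing the (sub)gradient in closed form (possible here because the models are linear) and projecting by per-coordinate clipping. The networks, box, and step sizes are assumptions for illustration, not our benchmark models.

```python
# Toy linear "networks" over the box D = [0, 1]^3 (weights are illustrative).
w1, w2 = [1.0, -2.0, 0.5], [-1.0, 1.0, 0.0]
LO, HI = 0.0, 1.0

def out(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def J(x):
    """Objective: output disagreement |N1(x) - N2(x)|."""
    return abs(out(w1, x) - out(w2, x))

def grad_J(x):
    """Closed-form (sub)gradient of J for the linear case."""
    s = 1.0 if out(w1, x) - out(w2, x) >= 0 else -1.0
    return [s * (a - b) for a, b in zip(w1, w2)]

def project(x):
    """[.]_D: clip every coordinate to its range."""
    return [max(LO, min(HI, xi)) for xi in x]

def init():
    """Midpoint of the box, x0 = (l + u) / 2."""
    return [(LO + HI) / 2] * len(w1)

def sign(g):
    return 1.0 if g > 0 else -1.0 if g < 0 else 0.0

def fgsm(eps=1.0):
    """Attack #1: a single signed step from x0, then projection."""
    x = init()
    return project([xi + eps * sign(gi) for xi, gi in zip(x, grad_J(x))])

def pgd(eps=0.1, steps=50):
    """Attack #2: iterated signed steps, projecting after each one."""
    x = init()
    for _ in range(steps):
        x = project([xi + eps * sign(gi) for xi, gi in zip(x, grad_J(x))])
    return x
```

On this toy example both attacks reach a box corner maximizing the disagreement; in general, PGD's iterated steps match or improve on FGSM's single step, mirroring the relationship between attacks #1 and #2.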
5.2.3 Gradient Attack # 3
In some cases, the gradient attack needs to optimize a loss function that represents constraints on the outputs of the DNN pairs as well. For example, in the case of the Cartpole and Mountain Car benchmarks, we used the c-distance function. In this scenario, we may need to encode constraints of the form:
\[ N_1(\varvec{x}) \ge 0 \;\wedge \; N_2(\varvec{x}) \le 0 \]
resulting in the following constrained optimization problem:
\[ \max_{\varvec{x}\in \mathcal {D}} \; |N_1(\varvec{x}) - N_2(\varvec{x})| \quad \text {s.t.} \quad N_1(\varvec{x}) \ge 0, \;\; N_2(\varvec{x}) \le 0 \]
However, conventional gradient attacks are typically not geared toward solving such optimizations. Hence, we tailored an additional gradient attack (gradient attack # 3) that can efficiently bridge this gap and optimize the aforementioned constraints, by combining our Iterative PGD attack with Lagrange multipliers [129] \(\varvec{\lambda } \equiv (\lambda ^{(1)}, \lambda ^{(2)})\), allowing us to penalize solutions for which the constraints do not hold. To this end, we introduce a novel objective function:
\[ J(\varvec{x}, \varvec{\lambda }) = |N_1(\varvec{x}) - N_2(\varvec{x})| - \lambda ^{(1)} \max (0, -N_1(\varvec{x})) - \lambda ^{(2)} \max (0, N_2(\varvec{x})) \]
resulting in the following optimization problem:
\[ \max_{\varvec{x}\in \mathcal {D}} \; \min_{\varvec{\lambda } \ge 0} \; J(\varvec{x}, \varvec{\lambda }) \]
Next, we implemented a Constrained Iterative PGD algorithm that approximates a solution to this optimization problem:
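The following is a hypothetical sketch of such a constrained attack, not the exact implementation used in our experiments: it interleaves signed gradient steps on a Lagrangian-style objective with dual-ascent updates of the multipliers, on a pair of toy linear networks whose unconstrained optimum violates the output-sign constraints. All forms and constants here are assumptions for illustration.

```python
# Toy linear "networks" N1(x) = w1·x and N2(x) = w2·x over D = [0, 1]^2,
# chosen (illustratively) so that the unconstrained optimum of |N1 - N2|
# violates the output-sign constraints N1(x) >= 0 and N2(x) <= 0.
w1, w2 = [1.0, -1.0], [-1.0, 4.0]
LO, HI, EPS, ETA = 0.0, 1.0, 0.05, 1.0

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

def constrained_pgd(steps=100):
    """Maximize |N1(x) - N2(x)| subject to N1(x) >= 0 and N2(x) <= 0, via
    signed ascent steps on L = |N1-N2| - l1*max(0,-N1) - l2*max(0,N2),
    with dual-ascent updates of the multipliers l1, l2."""
    x, l1, l2 = [0.5, 0.5], 0.0, 0.0
    for _ in range(steps):
        n1, n2 = dot(w1, x), dot(w2, x)
        s = 1.0 if n1 - n2 >= 0 else -1.0
        g = [s * (a - b) for a, b in zip(w1, w2)]     # subgradient of |N1-N2|
        if n1 < 0:                                    # penalty for N1 < 0
            g = [gi + l1 * wi for gi, wi in zip(g, w1)]
        if n2 > 0:                                    # penalty for N2 > 0
            g = [gi - l2 * wi for gi, wi in zip(g, w2)]
        x = [max(LO, min(HI, xi + EPS * (1 if gi > 0 else -1 if gi < 0 else 0)))
             for xi, gi in zip(x, g)]
        l1 = max(0.0, l1 - ETA * dot(w1, x))          # grows while N1 < 0
        l2 = max(0.0, l2 + ETA * dot(w2, x))          # grows while N2 > 0
    return x
```

While a constraint is violated, its multiplier grows until the penalty gradient dominates and pushes the iterate back into the feasible region, after which the multiplier decays and the disagreement objective takes over.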
5.3 Results
We ran our algorithm on all original DRL benchmarks, with the sole difference being the replacement of the backend verification engine (Line 4 in Alg. 1) with the described gradient attacks. The first two attacks (i.e., FGSM and Iterative PGD) were used for both Aurora batches (“short” and “long” training), and the third attack (Constrained Iterative PGD) was used for Cartpole and Mountain Car, as these benchmarks required encoding a distance function with constraints on the DNNs’ outputs as well. We note that in the case of Aurora, we ran the Iterative PGD attack only when the weaker attack failed (hence, only on the models from Experiment (1)). Our results, summarized in Table 2, demonstrate the advantages of using formal verification over competing gradient attacks. These attacks, although scalable, in various cases produced suboptimal PDT values, and in turn retained unsuccessful models that were successfully removed when using verification. For additional results, we also refer the reader to Figs. 14, 15, and 16.
5.4 Comparison to Sampling-Based Methods
In yet another line of experiments, we again replaced the verification subprocedure of our technique, and calculated the PDT scores (Line 4 in Alg. 1) with sampling heuristics instead. We note that, as any sampling technique is inherently incomplete, this can be used solely for approximating the PDT scores.
In our experiment, we sampled 1,000 inputs from the OOD domain and fed them to all DNN pairs, per benchmark. Based on the outputs of the DNN pairs, we approximated the PDT scores, and ran our algorithm in order to assess whether scalable sampling techniques can replace our verification-driven procedure. Our experiment raised two main concerns regarding the use of sampling techniques instead of verification.
First, in many cases, sampling could not produce outputs satisfying the required constraints. For instance, in the Mountain Car benchmark, we use the c-distance function (see Sect. 3.2), which requires outputs with multiple signs. However, even extensive sampling cannot guarantee this: over a third (!) of all Mountain Car DNN pairs had nonnegative outputs for all 1,000 OOD samples, hence requiring the PDT scores to be approximated even more coarsely, based only on partial outputs. On the other hand, encoding the c-distance conditions in SMT is straightforward in our case, and guarantees the required constraints.
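The following sketch illustrates this failure mode on a toy DNN pair whose outputs are strictly positive over the entire domain: no amount of sampling can witness the sign pattern that the c-distance constraints require, whereas an SMT query would settle the matter directly.

```python
import random

random.seed(0)

# A toy DNN pair whose outputs are strictly positive over the whole domain.
N1 = lambda x: x * x + 1.0
N2 = lambda x: x * x + 2.0

# Sampling 1,000 inputs, as in the experiment: no sample can satisfy
# sign constraints such as N1(x) >= 0 and N2(x) <= 0, so the PDT score
# would have to be approximated from partial (unconstrained) outputs.
samples = [random.uniform(-2.0, 2.0) for _ in range(1000)]
feasible = [x for x in samples if N1(x) >= 0 and N2(x) <= 0]
```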
The second drawback of this approach is that, as in the case of gradient attacks, sampling may produce suboptimal PDT scores that skew the filtering process toward retaining unwanted models. For example, in our results (summarized in Table 3), in both the Mountain Car and Aurora (short-training) benchmarks, the algorithm returned unsuccessful (“bad”) models in some cases, while these models are effectively removed when using verification. We believe that these results further motivate the use of verification, instead of more scalable and simpler methods.
5.5 Comparison to Predictive Uncertainty Methods
In yet another experiment, we evaluated whether our verification-driven approach can be replaced with predictive uncertainty methods [1, 115]. These are online techniques that assess uncertainty, i.e., discern whether an encountered input aligns with the training distribution. Among these techniques, ensembles [39, 52, 82] are a popular approach for predicting the uncertainty of a given input, by comparing the variance among the ensemble members; intuitively, the higher the variance for a given input, the more “uncertain” the models are with regard to the desired output. We note that in Sect. 4.5 we demonstrated that after using our verification-driven approach, ensembling the resulting models may improve the overall performance relative to each individual member. Here, however, we set out to explore whether ensembles can not only extend our verification-driven approach, but also replace it completely. As we demonstrate next, ensembles, like gradient attacks and sampling techniques, are not a reliable replacement for verification in our setting. For example, in the case of Cartpole, we generated all possible k-sized ensembles (we chose \(k=3\) as this was the number of models selected via our verification-driven approach; see Fig. 5), resulting in \( {n \atopwithdelims ()k}={16 \atopwithdelims ()3}=560\) ensemble combinations. Next, we randomly sampled 10,000 OOD inputs (based on the specification in Appendix C) and utilized a variance-based metric (inspired by [94]) to identify ensemble subsets exhibiting low output variance on these OOD-sampled inputs. However, even the subset represented by the ensemble with the lowest variance included the “bad” model \(\{8\}\) (see Fig. 4), which was successfully removed by our equivalent verification-driven technique. We believe that this too demonstrates the merits of our verification-driven approach.
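The selection procedure we compared against can be sketched as follows; the toy policies and the variance metric below are illustrative assumptions, not the exact metric of [94].

```python
import itertools
import random
import statistics

random.seed(0)

# Five toy policies (scalar outputs); three behave similarly, two diverge.
policies = {1: lambda s: s, 2: lambda s: 1.01 * s, 3: lambda s: 0.99 * s,
            4: lambda s: 2.0 * s, 5: lambda s: 0.2 * s}

inputs = [random.uniform(-1, 1) for _ in range(100)]

def avg_variance(subset):
    """Mean output variance of one candidate ensemble over sampled inputs."""
    return statistics.fmean(
        statistics.pvariance([policies[i](s) for i in subset]) for s in inputs)

# Enumerate all k-sized subsets and keep the one with the lowest variance.
best = min(itertools.combinations(sorted(policies), 3), key=avg_variance)
```

Low mutual variance only indicates that the members agree on the sampled inputs; as the Cartpole example shows, an agreeing subset may still contain a poorly generalizing model.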
6 Related Work
Due to its widespread occurrence, the phenomenon of adversarial inputs has gained considerable attention [48, 60, 109, 117, 118, 150, 179]. Specifically, the machine learning community has dedicated substantial effort to measuring and enhancing the robustness of DNNs [32, 34, 53, 66, 91, 100, 125, 139, 140, 164, 173]. The formal methods community has also been looking into the problem, devising methods for DNN verification, i.e., techniques that can automatically and formally guarantee the correctness of DNNs [3, 17, 36, 37, 40, 41, 42, 51, 57, 59, 62, 63, 71, 72, 76, 80, 97, 104, 111, 120, 132, 138, 143, 144, 147, 152, 157, 158, 160, 166, 170, 171, 177]. These techniques include SMT-based approaches (e.g., [69, 75, 77, 83]), as used in this work; methods based on MILP and LP solvers (e.g., [28, 43, 93, 151]); methods based on abstract interpretation or symbolic interval propagation (e.g., [55, 154, 162, 163]); as well as abstraction-refinement (e.g., [14, 15, 45, 114, 121, 143, 174]), size reduction [122], quantitative verification [20], synthesis [3], monitoring [96], optimization [16, 146], and tools for verifying recurrent neural networks (RNNs) [72, 177].
In addition, efforts have been undertaken to offer verification with provable guarantees [71, 132], verification of DNN fairness [157], and DNN repair and modification after deployment [40, 59, 144, 158, 171].
We also note that some sound but incomplete techniques [24, 152] have put forth an alternative strategy for DNN verification, via convex relaxations. These techniques are relatively fast, and can also be used by our approach, which is generally agnostic to the underlying DNN verifier. In the specific case of DRL-based systems, various non-verification approaches have been put forth to increase the reliability of such systems [2, 54, 127, 161, 178]. These techniques rely mostly on Lagrange multipliers [90, 131, 145].
In addition to DNN verification techniques, another approach that guarantees safe behavior is shielding [6, 25], i.e., incorporating an external component (a “shield”) that enforces the safe behavior of the agent, according to a given specification on the input/output relation of the DNN in question.
Classic shielding approaches [6, 25, 123, 124, 168] focus on simple properties that can be expressed in Boolean LTL formulas. However, proposals for reactive synthesis methods within infinite theories have also emerged recently [31, 50, 99]. Yet another relevant approach is Runtime Enforcement [47, 89, 136], which is akin to shielding but incompatible with reactive systems [25].
In a broader sense, the aforementioned techniques can be viewed as part of ongoing research on improving the safety of Cyber-Physical Systems (CPS) [64, 92, 119, 135, 155].
Variability among machine learning models has been widely employed to enhance performance, often through the use of ensembles [39, 52, 82]. However, only a limited number of methodologies utilize ensembles to tackle generalization concerns [112, 113, 130, 172]. In this context, we note that our approach can also be used for additional tasks, such as ensemble selection [13], as it can identify subsets of models that have a high variance in their outputs. Furthermore, alternative techniques beyond verification for assessing generalization involve evaluating models across predefined new distributions [116].
In the context of learning, there is ample research on identifying and mitigating data drifts, i.e., changes in the distribution of inputs fed to the ML model during deployment [18, 49, 56, 78, 102, 134]. In addition, certain studies employ verification for novelty detection in DNNs with respect to a single distribution [67]. Other work has focused on applying verification to evaluate the performance of a model relative to fixed distributions [19, 167], while non-verification approaches, such as ensembles [112, 113, 130, 172], runtime monitoring [67], and other techniques [116], have been applied for OOD input detection. Unlike the aforementioned approaches, our objective is to establish verification-guided generalization scores over an input domain, spanning multiple distributions within this domain. Furthermore, as far as we are aware, our approach represents the first endeavor to harness the diversity among models to distill a subset with enhanced generalization capabilities. In particular, it is also the first endeavor to apply formal verification toward this goal.
7 Limitations
Although our evaluation results indicate that our approach is applicable to varied settings and problem domains, it has multiple limitations. First, by design, our approach assumes a single solution to a given generalization problem. This does not allow selecting DNNs with different generalization strategies for the same problem. We also note that although our approach builds upon verification techniques, it cannot by itself provide correctness or generalization guarantees for the selected models (although, in practice, the selected models often do generalize well, as our evaluation demonstrates).
In addition, our approach relies on the underlying assumption that the range of inputs is known a priori. In some situations, this assumption may turn out to be highly nontrivial; for example, in cases where the DNN’s inputs are themselves produced by another DNN, or by some other embedding mechanism. Furthermore, even when the range of inputs is known, bounding their exact values may require domain-specific knowledge for encoding various distance functions and the metrics that build upon them (e.g., PDT scores). For example, in the case of Aurora, routing expertise is required in order to translate various Internet congestion levels into actual bounds on Aurora’s input variables. Such knowledge may be highly nontrivial to obtain in various domains.
Finally, we note that other limitations stem from the use of the underlying DNN verification technology, which may serve as a computational bottleneck. Specifically, while our approach requires dispatching a polynomial number of DNN verification queries, solving each of these queries is NP-complete [76]. In addition, the underlying DNN verifier itself may limit the types of encodings it affords, which, in turn, restricts the use cases to which our approach can be applied. For example, sound and complete DNN verification engines are currently suitable solely for DNNs encompassing piecewise-linear activations. However, as DNN verification technology improves, so will our approach.
8 Conclusion
This case study presents a novel, verification-driven approach for identifying DNN models that effectively generalize to an input domain of interest. We introduced an iterative scheme that utilizes a backend DNN verifier, enabling us to score models based on their capacity to produce similar outputs across multiple distributions over a specified domain. We extensively evaluated our approach on multiple benchmarks of both supervised and unsupervised learning, and demonstrated that it is indeed able to distill models capable of successful generalization. As DNN verification technology advances, our approach will gain scalability and broaden its applicability to a more diverse range of DNNs.
Notes
Not to be confused with the “Anna Karenina Principle” in statistics, for describing significance tests.
References
Abdar, M., Pourpanah, F., Hussain, S., Rezazadegan, D., Liu, L., Ghavamzadeh, M., Fieguth, P., Cao, X., Khosravi, A., Acharya, U., Makarenkov, V., Nahavandi, S.: A review of uncertainty quantification in deep learning: techniques, applications and challenges. Inf. Fusion 76, 243–297 (2021)
Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: Proc. 34th Int. Conf. on Machine Learning (ICML), pp. 22–31 (2017)
Alamdari, P., Avni, G., Henzinger, T., Lukina, A.: Formal methods with a touch of magic. In: Proc. 20th Int. Conf. on Formal Methods in Computer-Aided Design (FMCAD), pp. 138–147 (2020)
Albarghouthi, A.: Introduction to Neural Network Verification. verifieddeeplearning.com (2021)
AlQuraishi, M.: AlphaFold at CASP13. Bioinformatics 35(22), 4862–4865 (2019)
Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding. In: Proc. of the 32nd AAAI Conference on Artificial Intelligence, pp. 2669–2678 (2018)
Amir, G., Corsi, D., Yerushalmi, R., Marzari, L., Harel, D., Farinelli, A., Katz, G.: Verifying learning-based robotic navigation systems. In: Proc. 29th Int. Conf. on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pp. 607–627 (2023)
Amir, G., Freund, Z., Katz, G., Mandelbaum, E., Refaeli, I.: veriFIRE: verifying an industrial, learning-based wildfire detection system. In: Proc. 25th Int. Symposium on Formal Methods (FM), pp. 648–656 (2023)
Amir, G., Maayan, O., Zelazny, T., Katz, G., Schapira, M.: Verifying generalization in deep learning. In: Proc. 35th Int. Conf. on Computer Aided Verification (CAV), pp. 438–455 (2023)
Amir, G., Maayan, O., Zelazny, T., Katz, G., Schapira, M.: Verifying the generalization of deep learning to out-of-distribution domains: Artifact. https://zenodo.org/records/10448320 (2024)
Amir, G., Schapira, M., Katz, G.: Towards scalable verification of deep reinforcement learning. In: Proc. 21st Int. Conf. on Formal Methods in Computer-Aided Design (FMCAD), pp. 193–203 (2021)
Amir, G., Wu, H., Barrett, C., Katz, G.: An SMT-based approach for verifying binarized neural networks. In: Proc. 27th Int. Conf. on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pp. 203–222 (2021)
Amir, G., Zelazny, T., Katz, G., Schapira, M.: Verification-aided deep ensemble selection. In: Proc. 22nd Int. Conf. on Formal Methods in Computer-Aided Design (FMCAD), pp. 27–37 (2022)
Anderson, G., Pailoor, S., Dillig, I., Chaudhuri, S.: Optimization and abstraction: a synergistic approach for analyzing neural network robustness. In: Proc. 40th ACM SIGPLAN Conf. on Programming Languages Design and Implementations (PLDI), pp. 731–744 (2019)
Ashok, P., Hashemi, V., Kretinsky, J., Mohr, S.: DeepAbstract: neural network abstraction for accelerating verification. In: Proc. 18th Int. Symp. on Automated Technology for Verification and Analysis (ATVA), pp. 92–107 (2020)
Avni, G., Bloem, R., Chatterjee, K., Henzinger, T., Könighofer, B., Pranger, S.: Runtime optimization for learned controllers through quantitative games. In: Proc. 31st Int. Conf. on Computer Aided Verification (CAV), pp. 630–649 (2019)
Bacci, E., Giacobbe, M., Parker, D.: Verifying reinforcement learning up to infinity. In: Proc. 30th Int. Joint Conf. on Artificial Intelligence (IJCAI) (2021)
Baena-García, M., Campo-Ávila, J., Fidalgo, R., Bifet, A., Gavaldà, R., Morales-Bueno, R.: Early drift detection method. In: Proc. 4th Int. Workshop on Knowledge Discovery from Data Streams, vol. 6, pp. 77–86 (2006)
Bagnall, A., Stewart, G.: Certifying the true error: machine learning in Coq with verified generalization guarantees. In: Proc. 33rd AAAI Conf. on Artificial Intelligence (AAAI), pp. 2662–2669 (2019)
Baluta, T., Shen, S., Shinde, S., Meel, K., Saxena, P.: Quantitative verification of neural networks and its security applications. In: Proc. ACM SIGSAC Conf. on Computer and Communications Security (CCS), pp. 1249–1264 (2019)
Barto, A., Sutton, R., Anderson, C.: Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. 13(5), 834–846 (1983)
Bassan, S., Amir, G., Corsi, D., Refaeli, I., Katz, G.: Formally explaining neural networks within reactive systems. In: Proc. 23rd Int. Conf. on Formal Methods in Computer-Aided Design (FMCAD), pp. 10–22 (2023)
Bassan, S., Katz, G.: Towards formal approximated minimal explanations of neural networks. In: Proc. 29th Int. Conf. on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pp. 187–207 (2023)
Benussi, E., Patane, A., Wicker, M., Laurenti, L., Kwiatkowska, M.: Individual fairness guarantees for neural networks. In: Proc. 31st Int. Joint Conf. on Artificial Intelligence (IJCAI) (2022)
Bloem, R., Könighofer, B., Könighofer, R., Wang, C.: Shield synthesis: runtime enforcement for reactive systems. In: Proc. of the 21st Int. Conf. on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), vol. 9035, pp. 533–548 (2015)
Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L., Monfort, M., Muller, U., Zhang, J., Zhang, X., Zhao, J., Zieba, K.: End to end learning for self-driving cars. Technical report. arXiv:1604.07316 (2016)
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. Technical report. arXiv:1606.01540 (2016)
Bunel, R., Turkaslan, I., Torr, P., Kohli, P., Mudigonda, P.: A unified view of piecewise linear neural network verification. In: Proc. 32nd Conf. on Neural Information Processing Systems (NeurIPS), pp. 4795–4804 (2018)
Casadio, M., Komendantskaya, E., Daggitt, M., Kokke, W., Katz, G., Amir, G., Refaeli, I.: Neural network robustness as a verification property: a principled case study. In: Proc. 34th Int. Conf. on Computer Aided Verification (CAV), pp. 219–231 (2022)
Chen, W., Xu, Y., Wu, X.: Deep reinforcement learning for multi-resource multi-machine job scheduling. Technical report. arXiv:1711.07440 (2017)
Choi, W., Finkbeiner, B., Piskac, R., Santolucito, M.: Can reactive synthesis and syntax-guided synthesis be friends? In: Proc. of the 43rd ACM SIGPLAN Int. Conf. on Programming Language Design and Implementation (PLDI), pp. 229–243 (2022)
Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., Usunier, N.: Parseval networks: improving robustness to adversarial examples. In: Proc. 34th Int. Conf. on Machine Learning (ICML), pp. 854–863 (2017)
Cohen, E., Elboher, Y., Barrett, C., Katz, G.: Tighter abstract queries in neural network verification. In: Proc. 24th Int. Conf. on Logic for Programming, Artificial Intelligence and Reasoning (LPAR) (2023)
Cohen, J., Rosenfeld, E., Kolter, Z.: Certified adversarial robustness via randomized smoothing. In: Proc. 36th Int. Conf. on Machine Learning (ICML), pp. 1310–1320 (2019)
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (Almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
Corsi, D., Amir, G., Katz, G., Farinelli, A.: Analyzing adversarial inputs in deep reinforcement learning. Technical report. arXiv:2402.05284 (2024)
Corsi, D., Marchesini, E., Farinelli, A.: Formal verification of neural networks for safety-critical tasks in deep reinforcement learning. In: Proc. 37th Conf. on Uncertainty in Artificial Intelligence (UAI), pp. 333–343 (2021)
Corsi, D., Yerushalmi, R., Amir, G., Farinelli, A., Harel, D., Katz, G.: Constrained reinforcement learning for robotics via scenario-based programming. Technical report. arXiv:2206.09603 (2022)
Dietterich, T.: Ensemble methods in machine learning. In: Proc. 1st Int. Workshop on Multiple Classifier Systems (MCS), pp. 1–15 (2000)
Dong, G., Sun, J., Wang, J., Wang, X., Dai, T.: Towards repairing neural networks correctly. Technical report. arXiv:2012.01872 (2020)
Dutta, S., Chen, X., Sankaranarayanan, S.: Reachability analysis for neural feedback systems using regressive polynomial rule inference. In: Proc. 22nd ACM Int. Conf. on Hybrid Systems: Computation and Control (HSCC), pp. 157–168 (2019)
Dutta, S., Jha, S., Sankaranarayanan, S., Tiwari, A.: Learning and verification of feedback control systems using feedforward neural networks. IFAC-PapersOnLine 51(16), 151–156 (2018)
Ehlers, R.: Formal verification of piecewise linear feedforward neural networks. In: Proc. 15th Int. Symp. on Automated Technology for Verification and Analysis (ATVA), pp. 269–286 (2017)
Elboher, Y., Cohen, E., Katz, G.: Neural network verification using residual reasoning. In: Proc. 20th Int. Conf. on Software Engineering and Formal Methods (SEFM), pp. 173–189 (2022)
Elboher, Y., Gottschlich, J., Katz, G.: An abstractionbased framework for neural network verification. In: Proc. 32nd Int. Conf. on Computer Aided Verification (CAV), pp. 43–65 (2020)
Eliyahu, T., Kazak, Y., Katz, G., Schapira, M.: Verifying learning-augmented systems. In: Proc. Conf. of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), pp. 305–318 (2021)
Falcone, Y., Fernandez, J., Mounier, L.: What can you verify and enforce at runtime? Int. J. Softw. Tools Technol. Transf. 14(3), 349–382 (2012)
Fawaz, H., Forestier, G., Weber, J., Idoumghar, L., Muller, P.A.: Adversarial attacks on deep neural networks for time series classification. In: Proc. Int. Joint Conf. on Neural Networks (IJCNN), pp. 1–8 (2019)
Fields, T., Hsieh, G., Chenou, J.: Mitigating drift in time series data with noise augmentation. In: Proc. Int. Conf. on Computational Science and Computational Intelligence (CSCI), pp. 227–230 (2019)
Finkbeiner, B., Heim, P., Passing, N.: Temporal stream logic modulo theories. In: Proc. of the 25th Int. Conf. on Foundations of Software Science and Computation Structures (FOSSACS). LNCS, vol. 13242, pp. 325–346 (2022)
Fulton, N., Platzer, A.: Safe reinforcement learning via formal methods: toward safe control through proof and learning. In: Proc. 32nd AAAI Conf. on Artificial Intelligence (AAAI) (2018)
Ganaie, M., Hu, M., Malik, A., Tanveer, M., Suganthan, P.: Ensemble deep learning: a review. Eng. Appl. Artif. Intell. 115, 105151 (2022)
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17(1), 2096–2030 (2016)
García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16(1), 1437–1480 (2015)
Gehr, T., Mirman, M., Drachsler-Cohen, D., Tsankov, P., Chaudhuri, S., Vechev, M.: AI2: safety and robustness certification of neural networks with abstract interpretation. In: Proc. 39th IEEE Symposium on Security and Privacy (S&P) (2018)
Gemaque, R., Costa, A., Giusti, R., Dos Santos, E.: An overview of unsupervised drift detection methods. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 10(6), 1381 (2020)
Geng, C., Le, N., Xu, X., Wang, Z., Gurfinkel, A., Si, X.: Toward reliable neural specifications. Technical report. arXiv:2210.16114 (2022)
Geva, S., Sitte, J.: A Cartpole Experiment Benchmark for Trainable Controllers. IEEE Control Syst. Magaz. 13(5), 40–51 (1993)
Goldberger, B., Adi, Y., Keshet, J., Katz, G.: Minimal modifications of deep neural networks using verification. In: Proc. 23rd Int. Conf. on Logic for Programming, Artificial Intelligence and Reasoning (LPAR), pp. 260–278 (2020)
Goodfellow, I., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. Technical report. arXiv:1412.6572 (2014)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge, MA (2016)
Gopinath, D., Katz, G., Pǎsǎreanu, C., Barrett, C.: DeepSafe: a data-driven approach for assessing robustness of neural networks. In: Proc. 16th Int. Symposium on Automated Technology for Verification and Analysis (ATVA), pp. 3–19 (2018)
Goubault, E., Palumby, S., Putot, S., Rustenholz, L., Sankaranarayanan, S.: Static analysis of ReLU neural networks with tropical polyhedra. In: Proc. 28th Int. Symposium on Static Analysis (SAS), pp. 166–190 (2021)
Gu, X., Easwaran, A.: Towards safe machine learning for CPS: infer uncertainty from training data. In: Proc. of the 10th ACM/IEEE Int. Conf. on Cyber-Physical Systems (ICCPS), pp. 249–258 (2019)
Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: Proc. 35th Int. Conf. on Machine Learning (ICML), pp. 1861–1870 (2018)
Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., Sugiyama, M.: Co-teaching: robust training of deep neural networks with extremely noisy labels. Technical report. arXiv:1804.06872 (2018)
Hashemi, V., Křetínský, J., Rieder, S., Schmidt, J.: Runtime monitoring for out-of-distribution detection in object detection neural networks. Technical report. arXiv:2212.07773 (2022)
Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: Proc. 30th AAAI Conf. on Artificial Intelligence (AAAI) (2016)
Huang, X., Kwiatkowska, M., Wang, S., Wu, M.: Safety verification of deep neural networks. In: Proc. 29th Int. Conf. on Computer Aided Verification (CAV), pp. 3–29 (2017)
Huang, S., Papernot, N., Goodfellow, I., Duan, Y., Abbeel, P.: Adversarial attacks on neural network policies. Technical report. arXiv:1702.02284 (2017)
Isac, O., Barrett, C., Zhang, M., Katz, G.: Neural network verification with proof production. In: Proc. 22nd Int. Conf. on Formal Methods in Computer-Aided Design (FMCAD), pp. 38–48 (2022)
Jacoby, Y., Barrett, C., Katz, G.: Verifying recurrent neural networks using invariant inference. In: Proc. 18th Int. Symposium on Automated Technology for Verification and Analysis (ATVA), pp. 57–74 (2020)
Jay, N., Rotman, N., Godfrey, B., Schapira, M., Tamar, A.: A deep reinforcement learning perspective on internet congestion control. In: Proc. 36th Int. Conf. on Machine Learning (ICML), pp. 3050–3059 (2019)
Julian, K., Lopez, J., Brush, J., Owen, M., Kochenderfer, M.: Policy compression for aircraft collision avoidance systems. In: Proc. 35th Digital Avionics Systems Conf. (DASC), pp. 1–10 (2016)
Katz, G., Barrett, C., Dill, D., Julian, K., Kochenderfer, M.: Reluplex: a calculus for reasoning about deep neural networks. Formal Methods in System Design (FMSD) (2021)
Katz, G., Barrett, C., Dill, D., Julian, K., Kochenderfer, M.: Reluplex: an efficient SMT solver for verifying deep neural networks. In: Proc. 29th Int. Conf. on Computer Aided Verification (CAV), pp. 97–117 (2017)
Katz, G., Huang, D., Ibeling, D., Julian, K., Lazarus, C., Lim, R., Shah, P., Thakoor, S., Wu, H., Zeljić, A., Dill, D., Kochenderfer, M., Barrett, C.: The Marabou framework for verification and analysis of deep neural networks. In: Proc. 31st Int. Conf. on Computer Aided Verification (CAV), pp. 443–452 (2019)
Khaki, S., Aditya, A., Karnin, Z., Ma, L., Pan, O., Chandrashekar, S.: Uncovering drift in textual data: an unsupervised method for detecting and mitigating drift in machine learning models (2023)
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: Proc. 3rd Int. Conf. on Learning Representations (ICLR) (2015)
Könighofer, B., Lorber, F., Jansen, N., Bloem, R.: Shield synthesis for reinforcement learning. In: Proc. Int. Symposium on Leveraging Applications of Formal Methods, Verification and Validation (ISoLA), pp. 290–306 (2020)
Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Proc. 26th Conf. on Neural Information Processing Systems (NeurIPS), pp. 1097–1105 (2012)
Krogh, A., Vedelsby, J.: Neural network ensembles, cross validation, and active learning. In: Proc. 7th Conf. on Neural Information Processing Systems (NeurIPS), pp. 231–238 (1994)
Kuper, L., Katz, G., Gottschlich, J., Julian, K., Barrett, C., Kochenderfer, M.: Toward scalable verification for safety-critical deep networks. Technical report. arXiv:1801.05950 (2018)
Kurakin, A., Goodfellow, I., Bengio, S.: Adversarial examples in the physical world. Technical report. arXiv:1607.02533 (2016)
Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. In: Proc. 30th Conf. on Neural Information Processing Systems (NeurIPS) (2017)
Lekharu, A., Moulii, K.Y., Sur, A., Sarkar, A.: Deep learning based prediction model for adaptive video streaming. In: Proc. 12th Int. Conf. on Communication Systems & Networks (COMSNETS), pp. 152–159 (2020)
Li, Y.: Deep reinforcement learning: an overview. Technical report. arXiv:1701.07274 (2017)
Li, W., Zhou, F., Chowdhury, K.R., Meleis, W.: QTCP: adaptive congestion control with reinforcement learning. IEEE Trans. Netw. Sci. Eng. 6(3), 445–458 (2018)
Ligatti, J., Bauer, L., Walker, D.: Runtime enforcement of nonsafety policies. ACM Trans. Inf. Syst. Secur. 12(3), 19:1–19:41 (2009)
Liu, Y., Ding, J., Liu, X.: IPO: interior-point policy optimization under constraints. In: Proc. 34th AAAI Conf. on Artificial Intelligence (AAAI), pp. 4940–4947 (2020)
Liu, H., Long, M., Wang, J., Jordan, M.: Transferable adversarial training: a general approach to adapting deep classifiers. In: Proc. 36th Int. Conf. on Machine Learning (ICML), pp. 4013–4022 (2019)
Liu, X., Xu, H., Liao, W., Yu, W.: Reinforcement learning for cyber-physical systems. In: Proc. IEEE Int. Conf. on Industrial Internet (ICII), pp. 318–327 (2019)
Lomuscio, A., Maganti, L.: An approach to reachability analysis for feedforward ReLU neural networks. Technical report. arXiv:1706.07351 (2017)
Loquercio, A., Segu, M., Scaramuzza, D.: A general framework for uncertainty estimation in deep learning. In: Proc. Int. Conf. on Robotics and Automation (ICRA), pp. 3153–3160 (2020)
Low, S., Paganini, F., Doyle, J.: Internet congestion control. IEEE Control Syst. Magaz. 22(1), 28–43 (2002)
Lukina, A., Schilling, C., Henzinger, T.: Into the unknown: active monitoring of neural networks. In: Proc. 21st Int. Conf. on Runtime Verification (RV), pp. 42–61 (2021)
Lyu, Z., Ko, C.Y., Kong, Z., Wong, N., Lin, D., Daniel, L.: Fastened crown: tightened neural network robustness certificates. In: Proc. 34th AAAI Conf. on Artificial Intelligence (AAAI), pp. 5037–5044 (2020)
Ma, J., Ding, S., Mei, Q.: Towards more practical adversarial attacks on graph neural networks. In: Proc. 34th Conf. on Neural Information Processing Systems (NeurIPS) (2020)
Maderbacher, B., Bloem, R.: Reactive synthesis modulo theories using abstraction refinement. In: Proc. 22nd Int. Conf. on Formal Methods in Computer-Aided Design (FMCAD), pp. 315–324 (2022)
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. Technical report. arXiv:1706.06083 (2017)
Madsen, A., Johansen, A.: Neural arithmetic units. In: Proc. 8th Int. Conf. on Learning Representations (ICLR) (2020)
Mallick, A., Hsieh, K., Arzani, B., Joshi, G.: Matchmaker: data drift mitigation in machine learning for large-scale systems. In: Proc. of Machine Learning and Systems (MLSys), pp. 77–94 (2022)
Mammadli, R., Jannesari, A., Wolf, F.: Static neural compiler optimization via deep reinforcement learning. In: Proc. 6th IEEE/ACM Workshop on the LLVM Compiler Infrastructure in HPC (LLVMHPC) and Workshop on Hierarchical Parallelism for Exascale Computing (HiPar), pp. 1–11 (2020)
Mandal, U., Amir, G., Wu, H., Daukantas, I., Newell, F., Ravaioli, U., Meng, B., Durling, M., Ganai, M., Shim, T., Katz, G., Barrett, C.: Formally verifying deep reinforcement learning controllers with Lyapunov barrier certificates. Technical report. arXiv:2405.14058 (2024)
Mao, H., Alizadeh, M., Menache, I., Kandula, S.: Resource management with deep reinforcement learning. In: Proc. 15th ACM Workshop on Hot Topics in Networks (HotNets), pp. 50–56 (2016)
Mao, H., Netravali, R., Alizadeh, M.: Neural adaptive video streaming with Pensieve. In: Proc. Conf. of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), pp. 197–210 (2017)
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning. Technical report. arXiv:1312.5602 (2013)
Moore, A.: Efficient memory-based learning for robot control. University of Cambridge (1990)
Moosavi-Dezfooli, S.M., Fawzi, A., Frossard, P.: DeepFool: a simple and accurate method to fool deep neural networks. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2016)
Nagle, J.: Congestion control in IP/TCP internetworks. ACM SIGCOMM Comput. Commun. Rev. 14(4), 11–17 (1984)
Okudono, T., Waga, M., Sekiyama, T., Hasuo, I.: Weighted automata extraction from recurrent neural networks via regression on state spaces. In: Proc. 34th AAAI Conf. on Artificial Intelligence (AAAI), pp. 5037–5044 (2020)
Ortega, L., Cabañas, R., Masegosa, A.: Diversity and generalization in neural network ensembles. In: Proc. 25th Int. Conf. on Artificial Intelligence and Statistics (AISTATS), pp. 11720–11743 (2022)
Osband, I., Aslanides, J., Cassirer, A.: Randomized prior functions for deep reinforcement learning. In: Proc. 31st Int. Conf. on Neural Information Processing Systems (NeurIPS), pp. 8617–8629 (2018)
Ostrovsky, M., Barrett, C., Katz, G.: An abstractionrefinement approach to verifying convolutional neural networks. In: Proc. 20th. Int. Symposium on Automated Technology for Verification and Analysis (ATVA), pp. 391–396 (2022)
Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B., Snoek, J.: Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In: Proc. 33rd Conf. on Neural Information Processing Systems (NeurIPS), pp. 14003–14014 (2019)
Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V., Song, D.: Assessing generalization in deep reinforcement learning. Technical report. arXiv:1810.12282 (2018)
Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z., Swami, A.: Practical black-box attacks against machine learning. In: Proc. ACM Asia Conf. on Computer and Communications Security (CCS), pp. 506–519 (2017)
Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z., Swami, A.: The limitations of deep learning in adversarial settings. In: IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387 (2016)
Pereira, A., Thomas, C.: Challenges of machine learning applied to safety-critical cyber-physical systems. Mach. Learn. Knowl. Extract. 2, 579–602 (2020)
Polgreen, E., Abboud, R., Kroening, D.: Counterexample guided neural synthesis. Technical report. arXiv:2001.09245 (2020)
Prabhakar, P., Afzal, Z.: Abstraction-based output range analysis for neural networks. Technical report. arXiv:2007.09527 (2020)
Prabhakar, P.: Bisimulations for neural network reduction. In: Proc. 23rd Int. Conf. on Verification, Model Checking, and Abstract Interpretation (VMCAI), pp. 285–300 (2022)
Pranger, S., Könighofer, B., Posch, L., Bloem, R.: TEMPEST – synthesis tool for reactive systems and shields in probabilistic environments. In: Proc. 19th Int. Symposium on Automated Technology for Verification and Analysis (ATVA), vol. 12971, pp. 222–228 (2021)
Pranger, S., Könighofer, B., Tappler, M., Deixelberger, M., Jansen, N., Bloem, R.: Adaptive shielding under uncertainty. In: American Control Conference, (ACC), pp. 3467–3474 (2021)
Qin, C., Martens, J., Gowal, S., Krishnan, D., Dvijotham, K., Fawzi, A., De, S., Stanforth, R., Kohli, P.: Adversarial robustness through local linearization. Technical report. arXiv:1907.02610 (2019)
Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., Dormann, N.: Stable-Baselines3: reliable reinforcement learning implementations. J. Mach. Learn. Res. 22, 1–8 (2021)
Ray, A., Achiam, J., Amodei, D.: Benchmarking safe exploration in deep reinforcement learning. Technical report. https://cdn.openai.com/safexpshort.pdf (2019)
Riedmiller, M.: Neural fitted Q iteration — first experiences with a data efficient neural reinforcement learning method. In: Proc. 16th European Conf. on Machine Learning (ECML), pp. 317–328 (2005)
Rockafellar, T.: Lagrange multipliers and optimality. SIAM Rev. 35(2), 183–238 (1993)
Rotman, N., Schapira, M., Tamar, A.: Online safety assurance for deep reinforcement learning. In: Proc. 19th ACM Workshop on Hot Topics in Networks (HotNets), pp. 88–95 (2020)
Roy, J., Girgis, R., Romoff, J., Bacon, P., Pal, C.: Direct behavior specification via constrained reinforcement learning. Technical report. arXiv:2112.12228 (2021)
Ruan, W., Huang, X., Kwiatkowska, M.: Reachability analysis of deep neural networks with provable guarantees. In: Proc. 27th Int. Joint Conf. on Artificial Intelligence (IJCAI) (2018)
Ruder, S.: An overview of gradient descent optimization algorithms. Technical report. arXiv:1609.04747 (2016)
Sahiner, B., Chen, W., Samala, R., Petrick, N.: Data drift in medical machine learning: implications and potential remedies. Br. J. Radiol. 96(1150), 20220878 (2023)
Sargolzaei, A., Crane, C., Abbaspour, A., Noei, S.: A machine learning approach for fault detection in vehicular cyber-physical systems. In: Proc. 15th IEEE Int. Conf. on Machine Learning and Applications (ICMLA), pp. 636–640 (2016)
Schneider, F.: Enforceable security policies. ACM Trans. Inf. Syst. Secur. 3(1), 30–50 (2000)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. Technical report. arXiv:1707.06347 (2017)
Seshia, S., Desai, A., Dreossi, T., Fremont, D., Ghosh, S., Kim, E., Shivakumar, S., VazquezChanlatte, M., Yue, X.: Formal specification for deep neural networks. In: Proc. 16th Int. Symposium on Automated Technology for Verification and Analysis (ATVA), pp. 20–34 (2018)
Shafahi, A., Najibi, M., Ghiasi, A., Xu, Z., Dickerson, J., Studer, C., Davis, L., Taylor, G., Goldstein, T.: Adversarial training for free! Technical report. arXiv:1904.12843 (2019)
Shafahi, A., Saadatpanah, P., Zhu, C., Ghiasi, A., Studer, C., Jacobs, D., Goldstein, T.: Adversarially robust transfer learning. Technical report. arXiv:1905.08232 (2019)
Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for largescale image recognition. Technical report. arXiv:1409.1556 (2014)
Singh, G., Gehr, T., Puschel, M., Vechev, M.: An abstract domain for certifying neural networks. In: Proc. 46th ACM SIGPLAN Symposium on Principles of Programming Languages (POPL) (2019)
Sotoudeh, M., Thakur, A.: Correcting deep neural networks with small, generalizing patches. In: Workshop on Safety and Robustness in Decision Making (2019)
Stooke, A., Achiam, J., Abbeel, P.: Responsive safety in reinforcement learning by PID Lagrangian methods. In: Proc. 37th Int. Conf. on Machine Learning (ICML), pp. 9133–9143 (2020)
Strong, C., Wu, H., Zeljić, A., Julian, K., Katz, G., Barrett, C., Kochenderfer, M.: Global optimization of objective functions represented by ReLU networks. J. Mach. Learn. 1–28 (2021)
Sun, X., Khedr, H., Shoukry, Y.: Formal verification of neural network controlled autonomous systems. In: Proc. 22nd ACM Int. Conf. on Hybrid Systems: Computation and Control (HSCC) (2019)
Sutton, R., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Proc. 12th Conf. on Neural Information Processing Systems (NeurIPS) (1999)
Sutton, R., Barto, A.: Reinforcement learning: An Introduction. MIT Press, Cambridge, MA (2018)
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. Technical report. arXiv:1312.6199 (2013)
Tjeng, V., Xiao, K., Tedrake, R.: Evaluating robustness of neural networks with mixed integer programming. In: Proc. 7th Int. Conf. on Learning Representations (ICLR) (2019)
Tolstoy, L.: Anna Karenina. The Russian Messenger (1877)
Tran, H., Bak, S., Johnson, T.: Verification of deep convolutional neural networks using ImageStars. In: Proc. 32nd Int. Conf. on Computer Aided Verification (CAV), pp. 18–42 (2020)
Tran, H., Cai, F., Diego, M., Musau, P., Johnson, T., Koutsoukos, X.: Safety verification of cyber-physical systems with reinforcement learning control. ACM Trans. Embed. Comput. Syst. 18 (2019)
Trask, A., Hill, F., Reed, S., Rae, J., Dyer, C., Blunsom, P.: Neural arithmetic logic units. In: Proc. 32nd Conf. on Neural Information Processing Systems (NeurIPS) (2018)
Urban, C., Christakis, M., Wüstholz, V., Zhang, F.: Perfectly parallel fairness certification of neural networks. In: Proc. ACM Int. Conf. on Object Oriented Programming Systems Languages and Applications (OOPSLA), pp. 1–30 (2020)
Usman, M., Gopinath, D., Sun, Y., Noller, Y., Pǎsǎreanu, C.: NNrepair: constraint-based repair of neural network classifiers. Technical report. arXiv:2103.12535 (2021)
Valadarsky, A., Schapira, M., Shahaf, D., Tamar, A.: Learning to Route with Deep RL. In: NeurIPS Deep Reinforcement Learning Symposium (2017)
Vasić, M., Petrović, A., Wang, K., Nikolić, M., Singh, R., Khurshid, S.: MoËT: Mixture of expert trees and its application to verifiable reinforcement learning. Neural Netw. 151, 34–47 (2022)
Wachi, A., Sui, Y.: Safe reinforcement learning in constrained Markov decision processes. In: Proc. 37th Int. Conf. on Machine Learning (ICML), pp. 9797–9806 (2020)
Wang, S., Pei, K., Whitehouse, J., Yang, J., Jana, S.: Formal security analysis of neural networks using symbolic intervals. In: Proc. 27th USENIX Security Symposium, pp. 1599–1614 (2018)
Weng, T.W., Zhang, H., Chen, H., Song, Z., Hsieh, C.J., Boning, D., Dhillon, I., Daniel, L.: Towards fast computation of certified robustness for ReLU networks. Technical report. arXiv:1804.09699 (2018)
Wong, E., Rice, L., Kolter, Z.: Fast is better than free: revisiting adversarial training. Technical report. arXiv:2001.03994 (2020)
Wu, H., Isac, O., Zeljić, A., Tagomori, T., Daggitt, M., Kokke, W., Refaeli, I., Amir, G., Julian, K., Bassan, S.: Marabou 2.0: a versatile formal analyzer of neural networks. In: Proc. 36th Int. Conf. on Computer Aided Verification (CAV) (2024)
Wu, H., Ozdemir, A., Zeljić, A., Irfan, A., Julian, K., Gopinath, D., Fouladi, S., Katz, G., Păsăreanu, C., Barrett, C.: Parallelization techniques for verifying neural networks. In: Proc. 20th Int. Conf. on Formal Methods in Computer-Aided Design (FMCAD), pp. 128–137 (2020)
Wu, H., Tagomori, T., Robey, A., Yang, F., Matni, N., Pappas, G., Hassani, H., Pasareanu, C., Barrett, C.: Toward certified robustness against real-world distribution shifts. Technical report. arXiv:2206.03669 (2022)
Wu, M., Wang, J., Deshmukh, J., Wang, C.: Shield synthesis for real: enforcing safety in cyber-physical systems. In: Proc. 19th Int. Conf. on Formal Methods in Computer-Aided Design (FMCAD), pp. 129–137 (2019)
Wu, H., Zeljić, A., Katz, G., Barrett, C.: Efficient neural network analysis with sum-of-infeasibilities. In: Proc. 28th Int. Conf. on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pp. 143–163 (2022)
Xiang, W., Tran, H., Johnson, T.: Output reachable set estimation and verification for multilayer neural networks. IEEE Trans. Neural Netw. Learn. Syst. (TNNLS) (2018)
Yang, X., Yamaguchi, T., Tran, H., Hoxha, B., Johnson, T., Prokhorov, D.: Neural network repair with reachability analysis. In: Proc. 20th Int. Conf. on Formal Modeling and Analysis of Timed Systems (FORMATS), pp. 221–236 (2022)
Yang, J., Zeng, X., Zhong, S.g., Wu, S.: Effective neural network ensemble approach for improving generalization performance. IEEE Trans. Neural Netw. Learn. Syst. (TNNLS) 24(6), 878–887 (2013) https://doi.org/10.1109/TNNLS.2013.2246578
Yu, X., Han, B., Yao, J., Niu, G., Tsang, I., Sugiyama, M.: How does disagreement help generalization against label corruption? In: Proc. 36th Int. Conf. on Machine Learning (ICML), pp. 7164–7173 (2019)
Zelazny, T., Wu, H., Barrett, C., Katz, G.: On reducing over-approximation errors for neural network verification. In: Proc. 22nd Int. Conf. on Formal Methods in Computer-Aided Design (FMCAD), pp. 17–26 (2022)
Zhang, J., Kim, J., O’Donoghue, B., Boyd, S.: Sample efficient reinforcement learning with REINFORCE. Technical report. arXiv:2010.11364 (2020)
Zhang, J., Liu, Y., Zhou, K., Li, G., Xiao, Z., Cheng, B., Xing, J., Wang, Y., Cheng, T., Liu, L.: An end-to-end automatic cloud database tuning system using deep reinforcement learning. In: Proc. of the 2019 Int. Conf. on Management of Data (SIGMOD), pp. 415–432 (2019)
Zhang, H., Shinn, M., Gupta, A., Gurfinkel, A., Le, N., Narodytska, N.: Verification of recurrent neural networks for cognitive tasks via reachability analysis. In: Proc. 24th European Conf. on Artificial Intelligence (ECAI), pp. 1690–1697 (2020)
Zhang, L., Zhang, R., Wu, T., Weng, R., Han, M., Zhao, Y.: Safe reinforcement learning with stability guarantee for motion planning of autonomous vehicles. IEEE Trans. Neural Netw. Learn. Syst. 32(12), 5435–5444 (2021)
Zügner, D., Akbarnejad, A., Günnemann, S.: Adversarial attacks on neural networks for graph data. In: Proc. 24th ACM SIGKDD Int. Conf. on Knowledge Discovery & Data Mining (KDD), pp. 2847–2856 (2018)
Acknowledgements
Amir, Zelazny, and Katz received partial support for their work from the Israel Science Foundation (ISF grant 683/18). Amir received additional support through a scholarship from the Clore Israel Foundation. The work of Maayan and Schapira received partial funding from Huawei. We thank Aviv Tamar for his contributions to this project.
Funding
Open access funding provided by Hebrew University of Jerusalem.
Ethics declarations
Conflict of interest
We made use of Large Language Models (LLMs) for assistance in rephrasing certain parts of the text. We do not have further disclosures or declarations.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
DRL Benchmarks: Training and Evaluation
In this appendix, we elaborate on the hyperparameters and the training procedure for reproducing all models and environments of all three DRL benchmarks. We also provide a thorough overview of various implementation details. The code is based on the Stable-Baselines3 [126] and OpenAI Gym [27] packages. Unless stated otherwise, the values of the various parameters used during training and evaluation are the default values (per training algorithm, environment, etc.).
1.1 Training Algorithm
We trained our models with Actor-Critic algorithms. These are state-of-the-art RL training algorithms that iteratively optimize two neural networks:

a critic network, which learns a value function [107] (also known as a Q-function) that assigns a value to each \(\langle \)state, action\(\rangle \) pair; and

an actor network, which is the DRL-based agent trained by the algorithm. This network iteratively maximizes the value function learned by the critic, thus improving the learned policy.
Specifically, we used two implementations of Actor-Critic algorithms: Proximal Policy Optimization (PPO) [137] and Soft Actor-Critic (SAC) [65].
Actor-Critic algorithms are considered highly advantageous because they typically require relatively few samples to learn from, and because they allow the agent to learn policies over continuous spaces of \(\langle \)state, action\(\rangle \) pairs.
In each training process, all models were trained using the same hyperparameters, with the exception of the Pseudo Random Number Generator's (PRNG) seed. Each training phase consisted of 10 checkpoints, each of which included a fixed number of environment steps, as described below. For model evaluation, we used the last checkpoint of each training process (per benchmark).
1.2 Architecture
In all benchmarks, we used DNNs with a feedforward architecture. We refer the reader to Table 4 for a summary of the chosen architecture for each benchmark.
1.3 Cartpole Parameters
1.3.1 Architecture and Training

1. Architecture
   - hidden layers: 2
   - sizes of hidden layers: 32 and 16, respectively
   - activation function: ReLU

2. Training
   - algorithm: Proximal Policy Optimization (PPO)
   - gamma (\(\gamma \)): 0.95
   - batch size: 128
   - number of checkpoints: 10
   - total timesteps (number of training steps per checkpoint): 50,000
   - PRNG seeds (each one used to train a different model): \(\{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16\}\)

1.3.2 Environment
We used the configurable CartPoleContinuous-v0 environment. Given lower and upper bounds for the x-axis location, denoted as [low, high], and \(mid=\frac{high+low}{2}\), the initial x position is drawn uniformly at random from the interval \([mid-0.05, mid+0.05]\).
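The initial-state sampling described above can be sketched in a few lines of Python (a minimal illustration; the function name `initial_x` is ours, not part of the environment's API):

```python
import random

def initial_x(low, high, seed=None):
    """Draw the cart's initial x position uniformly from [mid - 0.05, mid + 0.05],
    where mid is the center of the platform's x-axis range [low, high]."""
    rng = random.Random(seed)
    mid = (high + low) / 2.0
    return rng.uniform(mid - 0.05, mid + 0.05)
```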
An episode is a sequence of interactions between the agent and the environment, such that the episode ends when a terminal state is reached. In the Cartpole environment, an episode terminates after the first of the following occurs:

1. The cart's location exceeds the platform's boundaries (as expressed via the x-axis location); or
2. The cart was unable to balance the pole, which fell (as expressed via the \(\theta \) value); or
3. 500 timesteps have passed.
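The three termination conditions can be expressed as a single predicate. The sketch below is illustrative: the default bound values are taken from the domain definitions and verification queries in this paper, and `is_terminal` is a hypothetical helper rather than part of the environment's API:

```python
def is_terminal(x, theta, t, x_low=-2.4, x_high=2.4, theta_limit=0.23, max_steps=500):
    """True once the episode must end: the cart left the platform,
    the pole fell past the angle limit, or the step budget ran out."""
    return x < x_low or x > x_high or abs(theta) > theta_limit or t >= max_steps
```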
1.3.3 Domains

1. (Training) In-Distribution
   - action min magnitude: True
   - x-axis lower bound (x_threshold_low): \(-2.4\)
   - x-axis upper bound (x_threshold_high): 2.4


2. (OOD) Input Domain Two symmetric OOD scenarios were evaluated, in which the cart's x position represented significantly extended platforms in a single direction, hence including areas previously unseen during training. Specifically, we generated a domain of input points characterized by x-axis boundaries that were selected, with equal probability, either from \([-10, -2.4]\) or from \([2.4, 10]\) (instead of the in-distribution range of \([-2.4, 2.4]\)). The cart's initial location was uniformly drawn from the range's center \(\pm 0.05\): \([-6.4-0.05, -6.4+0.05]\) and \([6.4-0.05, 6.4+0.05]\), respectively. All other parameters were the same as the ones used in-distribution.

   OOD scenario 1
   - x-axis lower bound (x_threshold_low): \(-10.0\)
   - x-axis upper bound (x_threshold_high): \(-2.4\)

   OOD scenario 2
   - x-axis lower bound (x_threshold_low): 2.4
   - x-axis upper bound (x_threshold_high): 10.0

1.4 Mountain Car Parameters
1.4.1 Architecture and Training

1. Architecture
   - hidden layers: 2
   - sizes of hidden layers: 64 and 16, respectively
   - activation function: ReLU
   - clip mean parameter: 5.0
   - log std init parameter: \(-3.6\)

2. Training
   - algorithm: Soft Actor-Critic (SAC)
   - gamma (\(\gamma \)): 0.9999
   - batch size: 512
   - buffer size: 50,000
   - gradient steps: 32
   - learning rate: \(1\times 10^{-3}\)
   - learning starts: 0
   - tau (\(\tau \)): 0.01
   - train freq: 32
   - use sde: True
   - number of checkpoints: 10
   - total timesteps (number of training steps per checkpoint): 5,000
   - PRNG seeds (each one used to train a different model): \(\{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16\}\)

1.4.2 Environment
We used the MountainCarContinuous-v1 environment.
1.4.3 Domains

1. (Training) In-Distribution
   - min position: \(-1.2\)
   - max position: 0.6
   - goal position: 0.45
   - min action (if the agent's action is negative and below this value, this value is used): \(-2\)
   - max action (if the agent's action is positive and above this value, this value is used): 2
   - max speed: 0.4
   - initial location range (from which the initial location is uniformly drawn): \([-0.9, -0.6]\)
   - initial velocity range (from which the initial velocity is uniformly drawn): [0, 0] (i.e., the initial velocity in this scenario is always 0)
   - x scale factor (used for scaling the x-axis): 1.5

2. (OOD) Input Domain The inputs are the same as the ones used in-distribution, except for the following:
   - min position: \(-2.4\)
   - max position: 1.2
   - goal position: 0.9
   - initial location range: [0.4, 0.5]
   - initial velocity range: \([-0.4, -0.3]\)

1.5 Aurora Parameters
1.5.1 Architecture and Training

1. Architecture
   - hidden layers: 2
   - sizes of hidden layers: 32 and 16, respectively
   - activation function: ReLU

2. Training
   - algorithm: Proximal Policy Optimization (PPO)
   - gamma (\(\gamma \)): 0.99
   - number of steps to run for each environment, per update (n_steps): 8,192
   - number of epochs when optimizing the surrogate loss (n_epochs): 4
   - learning rate: \(1\times 10^{-3}\)
   - value function loss coefficient (vf_coef): 1
   - entropy loss coefficient (ent_coef): \(1\times 10^{-2}\)
   - number of checkpoints: 6
   - total timesteps (number of training steps per checkpoint): 656,000 (as used in the original paper [73])
   - PRNG seeds (each one used to train a different model): \(\{4, 52, 105, 666, 850, 854, 857, 858, 885, 897, 901, 906, 907, 929, 944, 945\}\). We note that for simplicity, these were mapped to indices \(\{1 \ldots 16\}\), accordingly (e.g., \(\{4\} \rightarrow \{1\}\), \(\{52\} \rightarrow \{2\}\), etc.).

1.5.2 Environment
We used a configurable version of the PccNs-v0 environment. For models in Exp. (1) (with the short training), each episode consisted of 50 steps. For models in Exp. (3) (with the long training), each episode consisted of 400 steps.
1.5.3 Domains

1. (Training) In-Distribution
   - minimal initial sending rate ratio (relative to the link's bandwidth) (min_initial_send_rate_bw_ratio): 0.3
   - maximal initial sending rate ratio (relative to the link's bandwidth) (max_initial_send_rate_bw_ratio): 1.5

2. (OOD) Input Domain To bound the latency gradient and latency ratio elements of the input, we used a shallow-buffer setup with a bounding parameter \(\delta >0\), such that latency gradient \(\in [-\delta , \delta ]\) and latency ratio \(\in [1.0, 1.0 +\delta ]\).
   - minimal initial sending rate ratio (relative to the link's bandwidth) (min_initial_send_rate_bw_ratio): 2.0
   - maximal initial sending rate ratio (relative to the link's bandwidth) (max_initial_send_rate_bw_ratio): 8.0
   - use shallow buffer: True
   - shallow buffer \(\delta \) bound parameter: \(1\times 10^{-2}\)

Arithmetic DNNs: Training and Evaluation
In this appendix, we elaborate on the hyperparameters and the training procedure for reproducing all models and environments of the supervised-learning Arithmetic DNNs benchmark. We also provide a thorough overview of various implementation details.
To train our neural networks, we used the PyTorch package, version 2.0.1. Unless stated otherwise, the values of the various parameters used during training and evaluation are the default values (per training algorithm, environment, etc.).
1.1 Training Algorithm
We trained our models with the Adam optimizer [79], for 10 epochs, and with a batch size of 32. All models were trained using the same hyperparameters, with the exception of the Pseudo Random Number Generator’s (PRNG) seed.
1.2 Architecture
In all benchmarks, we used DNNs with a fully connected feedforward architecture with ReLU activations.
1.3 Arithmetic DNNs Parameters
1.3.1 Architecture and Training

1. Architecture
   - hidden layers: 3
   - size of (each) hidden layer: 10
   - activation function: ReLU

2. Training
   - algorithm: Adam [79]
   - learning rate: \(\gamma = 0.001\)
   - batch size: 32
   - PRNG seeds (each one used to train a different model): [0, 49]. The 5 models with the best seeds OOD are (from best to worst): \(\{37, 4, 22, 20, 47\}\), and the 5 models with the worst seeds OOD are (from best to worst): \(\{15, 12, 11, 44, 30\}\). We note that for simplicity, these were mapped to indices \(\{1 \ldots 10\}\), based on their order (e.g., \(\{4\} \rightarrow \{1\}\), \(\{11\} \rightarrow \{2\}\), etc.).
   - loss function: mean squared error (MSE)

1.3.2 Domains

1. (Training) In-Distribution We generated a dataset of 10,000 vectors of dimension \(d=10\), in which every entry is sampled uniformly from \([l=-10, u=10]\); hence, \(x_1, x_2, \ldots , x_{10000} \sim [-10, 10]^{10}\), and the output label is \(y_i = x_i[0] + x_i[1]\). The random seed used for generating the dataset is 0.

2. (OOD) Input Domain We evaluated our networks on 100,000 input vectors of dimension \(d=10\), where every entry is uniformly distributed in \([l=-1000, u=1000]\). All other parameters were identical to the ones used in-distribution.
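A minimal sketch of this dataset construction, using only Python's standard library rather than PyTorch utilities (the function name `make_dataset` is ours, for illustration):

```python
import random

def make_dataset(n, d=10, low=-10.0, high=10.0, seed=0):
    """Generate n vectors of dimension d with entries ~ U[low, high];
    the label of each vector is the sum of its first two entries."""
    rng = random.Random(seed)
    xs = [[rng.uniform(low, high) for _ in range(d)] for _ in range(n)]
    ys = [x[0] + x[1] for x in xs]
    return xs, ys
```

For the OOD evaluation set, the same construction applies with low=-1000, high=1000, and n=100,000.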
Verification Queries: Additional Details
1.1 Precondition
In our experiments, we used the following bounds for the (OOD) input domain:

1. Cartpole:
   - x position: \(x \in [-10, -2.4]\) or \(x \in [2.4, 10]\). The PDT score was set to the maximum of the PDT scores of these two scenarios.
   - x velocity: \(v_{x} \in [-2.18, 2.66]\)
   - angle: \(\theta \in [-0.23, 0.23]\)
   - angular velocity: \(v_{\theta } \in [-1.3, 1.22]\)

2. Mountain Car:
   - x position: \(x \in [-2.4, 0.9]\)
   - x velocity: \(v_{x} \in [-0.4, 0.134]\)

3. Aurora:
   - latency gradient: \(x_{t} \in [-0.007, 0.007]\), for all t s.t. \((t \bmod 3) = 0\)
   - latency ratio: \(x_{t} \in [1, 1.04]\), for all t s.t. \((t \bmod 3) = 1\)
   - sending ratio: \(x_{t} \in [0.7, 8]\), for all t s.t. \((t \bmod 3) = 2\)

4. Arithmetic DNNs:
   - for all \(0\le i \le 9\): \(x_{i} \in [-1000, 1000]\)
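As a concrete illustration, the Aurora bounds above translate into a simple membership check on the flattened input history. In the actual queries, these bounds are encoded symbolically as a precondition for the verifier; the helper below (`in_aurora_ood_domain`, our name) merely evaluates the same bounds in Python:

```python
def in_aurora_ood_domain(obs):
    """Check the Aurora (OOD) precondition bounds, index by index:
    t % 3 == 0 -> latency gradient, 1 -> latency ratio, 2 -> sending ratio."""
    for t, v in enumerate(obs):
        kind = t % 3
        if kind == 0 and not (-0.007 <= v <= 0.007):
            return False
        if kind == 1 and not (1.0 <= v <= 1.04):
            return False
        if kind == 2 and not (0.7 <= v <= 8.0):
            return False
    return True
```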

1.2 Postcondition
As elaborated in Sect. 3.2, we encode an appropriate distance function on the DNNs' outputs.
Note. In the case of the c-distance function, we chose, for Cartpole and Mountain Car, \(c {:}{=} N_{1}(x)\ge 0 \wedge N_{2}(x) \ge 0\) and \(c'{:}{=} N_{1}(x)\le 0 \wedge N_{2}(x) \le 0\). This distance function is tailored to find the maximal difference between the outputs (actions) of two models over a given category of inputs (non-negative or non-positive actions, in our case). The intuition behind this function is that in some benchmarks, good and bad models may differ in the sign (rather than only the magnitude) of their actions. For example, consider a scenario of the Cartpole benchmark where the cart is located on the "edge" of the platform: an action in one direction (off the platform) will cause the episode to end, while an action in the other direction will allow the agent to increase its reward by continuing the episode, and possibly reaching the goal.
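An empirical, sampling-based analogue of this distance function can be sketched as follows. The actual method encodes the condition symbolically in a verification query; this sketch (with our illustrative name `c_distance`) only evaluates it on sampled inputs:

```python
def c_distance(n1, n2, inputs):
    """Largest output gap between two models, restricted to inputs on which
    both outputs fall in the same sign category (both >= 0 or both <= 0)."""
    gaps = [abs(n1(x) - n2(x)) for x in inputs
            if (n1(x) >= 0 and n2(x) >= 0) or (n1(x) <= 0 and n2(x) <= 0)]
    return max(gaps, default=0.0)
```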
1.3 Verification Engine
All queries were dispatched to Marabou [77, 165]—a sound and complete verification engine, previously used in other DNN-verification-related work [7, 8, 11,12,13, 22, 23, 29, 33, 38, 44, 45, 72, 114, 143, 162, 169].
Algorithm Variations and Hyperparameters
In this appendix, we elaborate on our algorithm's additional hyperparameters and the filtering criteria used throughout our evaluation. As the results demonstrate, our method is highly robust across a wide variety of settings.
1.1 Precision
For each benchmark and each experiment, we arbitrarily selected k models that reached our reward threshold on the in-distribution data. Then, we used these models for our empirical evaluation. The PDT scores were calculated up to a finite precision of \(0.5\le \epsilon \le 20\), depending on the benchmark (0.5 for Mountain Car, 1 for Cartpole and Aurora, and 20 for Arithmetic DNNs).
1.2 Filtering Criteria
As elaborated in Sect. 3, our algorithm iteratively filters out (Line 9 in Alg. 2) models with a relatively high disagreement score, i.e., models that may disagree with their peers in the input domain. We present three different criteria by which we may select the models to remove in a given iteration, after sorting the models based on their DS scores:

1. PERCENTILE: remove the top-\(p\%\) of models with the highest disagreement scores, for a predefined value p. In our experiments, we chose \(p=25\%\).

2. MAX:
   (a) sort the DS scores of all models in descending order;
   (b) calculate the difference between every two adjacent scores;
   (c) find the greatest difference between any two subsequent DS scores;
   (d) use the larger of the two DS scores forming this difference as a threshold; and
   (e) remove all models with a DS greater than or equal to this threshold.

3. COMBINED: remove models based on either MAX or PERCENTILE, depending on which criterion eliminates more models in the given iteration.
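The three criteria can be sketched directly over a list of disagreement scores (a minimal illustration; the function names are ours, and each function returns the surviving scores rather than mutating model sets):

```python
import math

def max_criterion(scores):
    """MAX: sort descending, find the biggest gap between adjacent scores,
    and drop every score >= the larger score of that gap."""
    s = sorted(scores, reverse=True)
    if len(s) < 2:
        return list(scores)
    gaps = [s[i] - s[i + 1] for i in range(len(s) - 1)]
    threshold = s[max(range(len(gaps)), key=gaps.__getitem__)]
    return [d for d in scores if d < threshold]

def percentile_criterion(scores, p=0.25):
    """PERCENTILE: drop the top-p fraction of scores (highest disagreement)."""
    k = math.ceil(p * len(scores))
    return sorted(scores)[: len(scores) - k]

def combined_criterion(scores, p=0.25):
    """COMBINED: apply whichever criterion eliminates more models."""
    a, b = max_criterion(scores), percentile_criterion(scores, p)
    return a if len(a) <= len(b) else b
```

For example, for scores [1, 2, 3, 10, 11], the largest gap in the descending order [11, 10, 3, 2, 1] lies between 10 and 3, so MAX thresholds at 10 and keeps [1, 2, 3].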
Cartpole: Supplementary Results
Throughout our evaluation of this benchmark, we use a threshold of 250 to distinguish between good and bad models; this threshold value induces a large margin from the rewards attained by poorly performing models (which usually reached rewards lower than 100).
Note that as seen in Fig. 5, our algorithm eventually also removes some of the more successful models. However, the final result contains only wellperforming models, as in the other benchmarks.
1.1 Result per Filtering Criteria
Mountain Car: Supplementary Results
1.1 The Mountain Car Benchmark
We note that our algorithm is robust to various hyperparameter choices, as demonstrated in Figs. 23, 24 and 25, which depict the results of each iteration of our algorithm when applied with different filtering criteria (elaborated in Appendix D).
1.2 Additional Filtering Criteria
1.3 Combinatorial Experiments
Since the initial set of candidates was biased (12 of the original 16 models are good in the OOD setting), we set out to validate that our algorithm's success in returning solely good models is indeed due to its correctness, and not due to the set's inherent bias toward good models. In our experiments (summarized below), we artificially generated new sets of models in which the ratio of good models is deliberately lower than in the original set. We then reran our algorithm on all possible combinations of the initial subsets and calculated, for each subset, the probability of selecting a good model in this new setting from the models surviving our filtering process. As we show, our method significantly improves the chances of selecting a good model even when good models are a minority in the original set. For example, the leftmost column of Fig. 26 shows that over sets consisting of 4 bad models and only 2 good ones, the probability of selecting a good model after running our algorithm exceeds \(60\%\) (!), almost double the probability of randomly selecting a good model from the original set before running our algorithm. These results were consistent across multiple subset sizes and various filtering criteria.
Note. For the calculations demonstrating the chance to select a good model, we assume random selection from a subset of models: before applying our algorithm, the subset is the original set of models; and after our algorithm is applied—the subset is updated based on the result of our filtering procedure. The probability is computed based on the number of combinations of bad models surviving the filtering process, and their ratio relative to all the models returned in those cases (we assume uniform probability, per subset).
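The probability computation described in this note can be sketched as follows, under the stated uniform-selection assumption (the function names and the toy filter in the example are ours, for illustration):

```python
from itertools import combinations

def prob_good(survivors, good):
    """Probability that a uniformly random pick from `survivors` is good."""
    return sum(m in good for m in survivors) / len(survivors) if survivors else 0.0

def average_prob(models, good, subset_size, n_good, filter_fn):
    """Average prob_good over all subsets of `subset_size` models containing
    exactly `n_good` good ones, after applying the filtering procedure."""
    probs = [prob_good(filter_fn(list(s)), good)
             for s in combinations(models, subset_size)
             if sum(m in good for m in s) == n_good]
    return sum(probs) / len(probs)
```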
Aurora: Supplementary Results
1.1 Additional Information

1. A detailed explanation of Aurora's input statistics:
   (i) Latency Gradient: the derivative of the latency (packet delays) over the recent MI ("monitor interval");
   (ii) Latency Ratio: the ratio between the average latency in the current MI and the minimum latency previously observed; and
   (iii) Sending Ratio: the ratio between the number of packets sent and the number of acknowledged packets over the recent MI.
   As mentioned, these metrics indicate the link's congestion level.

2. For all our experiments on this benchmark, we defined "good" models as those that achieved an average reward greater than or equal to a threshold of 99; "bad" models are those that achieved a reward lower than this threshold.

3. The average reward in-distribution is not necessarily correlated with the average reward OOD. For example, in Exp. (1), with the short episodes during training (see Fig. 9):
   (a) In-distribution, model \(\{4\}\) achieved a lower reward than models \(\{2\}\) and \(\{5\}\), but a higher reward OOD.
   (b) In-distribution, model \(\{16\}\) achieved a lower reward than model \(\{15\}\), but a higher reward OOD.
Experiment (3): Aurora with Long Training Episodes. Similarly to Experiment (1), we trained a new set of \(k=16\) agents. In this experiment, we increased each training episode to consist of 400 steps (instead of 50, as in the "short" training). The remaining parameters were identical to those used in Experiment (1). This time, 5 models performed poorly in the OOD environment (i.e., did not reach our reward threshold of 99), while the remaining 11 models performed well both in-distribution and OOD.
When running our method with the MAX criterion, our algorithm returned 4 models, all belonging to the group of 11 models that generalized successfully, after fully filtering out all the unsuccessful models. Running the algorithm with the PERCENTILE or COMBINED criteria also yielded a subset of this group, indicating that the filtering process was again successful (and robust to various algorithm hyperparameters).
1.2 Additional Probability Density Functions
Following are the results discussed in Sect. 4.3. To further demonstrate our method's robustness to different types of out-of-distribution inputs, we applied it not only to different values (e.g., high Sending Rate values) but also to various probability density functions (PDFs) over the (OOD) input domain in question. More specifically, we repeated the OOD experiments (Experiment (1) and Experiment (3)) with different PDFs. In their original settings, all of the environment's parameters (link bandwidth, latency, etc.) are uniformly drawn from a range [low, high]. In this experiment, however, we generated two additional PDFs: truncated normal distributions (denoted \(\mathcal{T}\mathcal{N}_{[low,high]}(\mu , \sigma ^{2})\)), truncated to the range [low, high]. The first PDF was used with \(\mu _{low}=0.3\cdot high+(1-0.3)\cdot low\), and the other with \(\mu _{high}=0.8\cdot high+(1-0.8)\cdot low\). For both PDFs, the variance (\(\sigma ^{2}\)) was arbitrarily set to \(\frac{high-low}{4}\). These new distributions are depicted in Fig. 29 and were used to test the models from both batches of Aurora experiments (Experiments (1) and (3)).
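Sampling from such truncated normal distributions can be sketched with simple rejection sampling (a minimal illustration; `trunc_normal` is our name for the helper, and production code would typically use a library routine such as scipy.stats.truncnorm instead):

```python
import math
import random

def trunc_normal(mu, var, low, high, rng=random):
    """Rejection-sample from a normal(mu, var) restricted to [low, high]."""
    sigma = math.sqrt(var)
    while True:
        x = rng.gauss(mu, sigma)
        if low <= x <= high:
            return x

# Parameters as in the experiments: means shifted toward `low` / `high`,
# and variance set to (high - low) / 4.
low, high = 0.0, 1.0
mu_low = 0.3 * high + (1 - 0.3) * low
mu_high = 0.8 * high + (1 - 0.8) * low
var = (high - low) / 4
```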
1.3 Additional Filtering Criteria: Experiment (1)
1.4 Additional Filtering Criteria: Experiment (3)
1.5 Additional Filtering Criteria: Additional PDFs
Arithmetic DNNs: Supplementary Results
1.1 Additional Filtering Criteria
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Amir, G., Maayan, O., Zelazny, T. et al. Verifying the Generalization of Deep Learning to Out-of-Distribution Domains. J. Autom. Reasoning 68, 17 (2024). https://doi.org/10.1007/s10817-024-09704-7
DOI: https://doi.org/10.1007/s10817-024-09704-7