Neural predictive monitoring and a comparison of frequentist and Bayesian approaches

Neural state classification (NSC) is a recently proposed method for runtime predictive monitoring of hybrid automata (HA) using deep neural networks (DNNs). NSC trains a DNN as an approximate reachability predictor that labels an HA state x as positive if an unsafe state is reachable from x within a given time bound, and labels x as negative otherwise. NSC predictors have very high accuracy, yet are prone to prediction errors that can negatively impact reliability. To overcome this limitation, we present neural predictive monitoring (NPM), a technique that complements NSC predictions with estimates of the predictive uncertainty. These measures yield principled criteria for the rejection of predictions likely to be incorrect, without knowing the true reachability values. We also present an active learning method that significantly reduces the NSC predictor’s error rate and the percentage of rejected predictions. We develop two versions of NPM based, respectively, on the use of frequentist and Bayesian techniques to learn the predictor and the rejection rule. Both versions are highly efficient, with computation times on the order of milliseconds, and effective, managing in our experimental evaluation to successfully reject almost all incorrect predictions. In our experiments on a benchmark suite of six hybrid systems, we found that the frequentist approach consistently outperforms the Bayesian one. We also observed that the Bayesian approach is less practical, requiring a careful and problem-specific choice of hyperparameters.


Introduction
Hybrid systems are a central model for many safety-critical, cyber-physical system applications [1]. Their verification typically amounts to solving a hybrid automata (HA) reachability checking problem [2]: given a model M of the system expressed as an HA, a set I of initial states of M, and a set D of unsafe states, check whether D is reached along any B Nicola Paoletti nclpltt@gmail.com (time-bounded) path of M starting from a state in I . Due to its high computational cost, reachability checking is usually limited to design-time (offline) analysis.
Our focus is on the online analysis of hybrid systems and, in particular, on the predictive monitoring (PM) problem [3], i.e., the problem of predicting, at runtime, whether or not an unsafe state can be reached from the current system state within a given time bound. PM is at the core of architectures for runtime safety assurance such as Simplex [4], where the system switches to a certified-safe baseline controller whenever PM indicates the potential for an imminent safety violation.
In such approaches, PM is invoked periodically and frequently. Thus, reachability needs to be determined rapidly, from a single state (the current system state), and typically for short time horizons. This is in contrast with offline reachability checking, where long or unbounded time horizons and sizable regions of initial states are typically considered. PM also differs from traditional runtime verification [5] in that PM is preemptive: it detects potential safety violations before they occur, not when or after they occur.
Any solution to the PM problem involves a trade-off between two main requirements: accuracy of the reachability prediction, and computational efficiency, as the analysis must execute within strict real-time constraints and typically with limited hardware resources.
In this paper, we present neural predictive monitoring (NPM), a machine-learning-based approach to PM that provides highly accurate predictions in a highly efficient manner. Moreover, NPM offers principled methods for detecting potential prediction errors, which significantly enhances the reliability of PM estimates.
NPM builds on neural state classification (NSC) [6], a recently proposed method for approximate HA reachability checking using deep neural networks (DNNs). NSC works by training a DNN as a state classifier using examples computed with an oracle (an HA model checker). For any state x of the HA, such a classifier labels x as positive if an unsafe state is reachable from x within a given time bound; otherwise, x is labeled as negative.
Executing a neural state classifier corresponds to computing the output of a DNN for a single input, and thus is extremely efficient. NSC has also demonstrated very high accuracy in reachability predictions, owing to the powerful approximation capabilities of DNNs. Some classification errors are, however, unavoidable, the most important being false negatives, in which positive states are misclassified as negative. Such errors may compromise system safety.
NPM overcomes this problem by extending NSC with rigorous methods for quantifying the uncertainty of the reachability estimates. NPM can consequently identify and reject predictions that are likely to produce classification errors. We investigate two alternative NPM methods: a frequentist approach that uses Conformal Prediction (CP) [7], a method that provides statistical guarantees on the predictions of machine-learning models with minimal assumptions on the data; 1 and a Bayesian approach that leverages uncertainty quantification via Bayesian neural networks (BNNs), rather than traditional (deterministic) DNNs. We consider two popular Bayesian inference methods: Hamiltonian Monte Carlo [8] and Variational Inference [9]. Figure 1 provides an overview of the NPM approach. We sample from a distribution of HA states to generate a training set Z t and a validation set Z v . An HA reachability oracle (a model checker or, for deterministic systems, a simulator) is used to label sampled states as positive or negative. A neural state classifier F (i.e., a DNN-based binary classifier) is derived from Z t via supervised learning, and is either a Fig. 1 Overview of the NPM framework. Double-bordered components denote extensions to the method of [6]. Training of the neural state classifier F and retraining via active learning are performed offline. The only components used at runtime are the classifier F and the error detection criterion deterministic neural network (in the frequentist approach) or a Bayesian neural network (in the Bayesian approach).
In the frequentist case, CP is used to estimate two statistically sound measures of prediction uncertainty: confidence and credibility. Informally, the confidence of a prediction is the probability that a reachability prediction for an HA state x corresponds to the true reachability value of x. Credibility quantifies how likely a given state is to belong to the same distribution of the training data. In the Bayesian case, we use an empirical approximation of the BNN output distribution and, to be precise, statistics of said distribution to quantify the model uncertainty about a specific HA state.
The measures described above, henceforth referred to as uncertainty measures, are used to derive criteria for error detection, i.e., for rejecting NSC predictions that are likely to be erroneous. The rejection criterion is based on identifying, via supervised learning on the validation set Z v , a decision rule that optimally separates incorrect and correct predictions. We again consider both a frequentist and a Bayesian approach, deriving a support vector classifier for the former, and a Gaussian process classifier for the latter.
Executing a neural predictive monitor corresponds to: (i) computing the output of a DNN, either deterministic or Bayesian, on a single input; (ii) computing the corresponding measure of uncertainty, whose computational cost depends on the size of the validation set for the frequentist case, and on the sample size used for the empirical distribution of the BNN for the Bayesian case; (iii) evaluate the rejection rule on the measure of uncertainty obtained via step (ii). For all of the approaches we consider, this whole process is very efficient, taking from 2 milliseconds to 0.3 seconds in our experiments. This makes our NPM method suitable for online analysis and PM.
Finally, our approach includes an active learning strategy to improve the reliability of the state classifier F. The idea is to employ the uncertainty-based rejection criterion to identify HA states for which F yields uncertain predictions, and augment the training and validation sets with those states. We then train a new state classifier with the augmented dataset, thus ensuring improved accuracy on the HA states where F performed poorly, and, in turn, a reduced rejection rate.
Compared to simple random sampling of the state distribution, our active learning strategy has the advantage of parsimony: by focusing on the states with uncertain predictions, it requires a significantly smaller number of additional retraining samples to achieve a given reduction in the rejection rate, and thus significantly reduces the cost of retraining. The active learning procedure can be iterated, as shown in Fig. 1. We stress that these retraining iterations are part of the training process, which is performed offline and hence does not affect runtime performance.
In summary, the main contributions of this work are the following: -We develop neural predictive monitoring, a framework for runtime predictive monitoring of hybrid automata that extends neural state classification with quantification of prediction reliability. -We instantiate the NPM framework in two variants, which, respectively, use frequentist and Bayesian learning techniques. For both approaches, we derive optimal criteria for rejecting unreliable NSC predictions by leveraging sound measures of prediction uncertainty. -We develop an active learning method designed to reduce both prediction errors and the rejection rate. -We evaluate our method on six hybrid automata case studies, demonstrating that our optimal rejection criteria successfully rejects almost all prediction errors, with 100% of the errors recognized in 5 out of 6 case studies for the frequentist method. Moreover, only a single active learning iteration is needed to significantly increase the prediction accuracy and reduce the rejection rate. -With our analysis and experimental comparison of frequentist and Bayesian variants of NPM, we show that the frequentist approach empirically outperforms the Bayesian one on all relevant metrics. Furthermore, the frequentist techniques require minimal assumptions and tuning, which makes them more practical. In contrast, Bayesian inference requires tuning of a number of hyperparameters, including the prior distribution for the DNN weights. If the hyperparameters are not optimally tuned, the performance may be extremely poor.
This work is an extended version of [10], where we first introduced the NPM method, but only the frequentist version. Here, we introduce a fully Bayesian variant of NPM, and compare the two versions in a new experimental evaluation section. In this version of the paper, we also include a more extensive treatment of the problem formulation and the NPM method itself.
The paper is structured as follows. Section 2 presents a rigorous formulation of the problem. Section 3 provides background on neural state classification and on the methods used to estimate predictive uncertainty. The uncertaintybased error detection rules are presented in Sect. 4. Section 5 presents the active learning algorithm. Results of the experimental evaluation are given in Sect. 6. Related work is discussed in Sect. 7. Section 8 offers concluding remarks.

Problem formulation
We describe the predictive monitoring problem for hybrid automata reachability and the related problem of finding an optimal criterion for rejecting erroneous reachability predictions.
is the flow function, defining the continuous dynamics at each location; Trans is the transition relation, consisting of tuples of the form (l, g, r , l ), where l, l ∈ Loc are source and target locations, respectively, g ⊆ V is the guard, and r : V → V is the reset; Inv : Loc → 2 V is the invariant at each location.
We also consider parameterized HA in which the flow, guard, reset and invariant may have parameters whose values are constant throughout an execution. We treat parameters as continuous variables with flow equal to zero and identity reset map. The behavior of an HA M can be described in terms of its trajectories. A trajectory may start from any state; it does not need to start from an initial state. For time bound T ∈ R ≥0 , we denote with T = [0, T ] ⊆ R ≥0 the time domain. For t ∈ T, let ρ(t) = (l(t), v(t)) be the state at time t, with l(t) being the location and v(t) the vector of continuous variables. Let (ξ i ) i=0,...,k ∈ T k+1 be the ordered sequence of time points where mode jumps happen, i.e., such that ξ 0 = 0, ξ k = T , and for all i = 0, . . . , k −1 and for all t ∈ [ξ i , ξ i+1 ), l(t) = l(ξ i ). Then, ρ is a trajectory of M if it is consistent with the invariants: ∀t ∈ T. v(t) ∈ Inv(l(t)); flows: ∀t ∈ T. v(t) = Flow(l(t))(v(t)); and transition relation: Example 1 (Spiking neuron HA) This model describes the evolution of a neuron's action potential. It is a deterministic HA with two continuous variables, one mode, one jump and nonlinear polynomial dynamics, defined by the ODE The jump condition is v 2 ≥ 30, and the associated reset is v 2 := c ∧ v 1 := v 1 + d, where, for any variable x, x denotes the value of x after the reset. We consider the unsafe set D defined by v 2 ≤ 68.5, expressing that the neuron should not undershoot its resting potential. The parameter values are given in Appendix C.
Reachability checking of HA is concerned with establishing whether, given an initial HA state x and a set of target states D -typically a set of unsafe states to avoid -the HA admits a trajectory starting from x that reaches D.
Definition 3 (Time-bounded reachability) Given an HA M with state space X , a set of states D ⊆ X , state x ∈ X , and time bound T , decide whether there exists a trajectory ρ of M starting from x and t ∈ [0, T ] such that ρ(t) ∈ D, denoted M | Reach(D, x, T ).
We aim to derive a predictive monitor for HA reachability, i.e., a function that can predict whether or not a state in D (an unsafe state) can be reached from the current system state within time T . In solving this problem, we assume a distribution X of HA states and seek the monitor that predicts HA reachability with minimal error probability w.r.t. X. The choice of X depends on the application at hand and can include a uniform distribution on a bounded state space or a distribution reflecting the density of visited states in some HA executions [6]. Problem 1 (Predictive monitoring for HA reachability) Given an HA M with state space X , a distribution X over X , a time bound T and set of unsafe states D ⊆ X , find a function F * : X → {0, 1} that minimizes the probability Any practical solution to the above PM problem must also assume a space of functions within which to restrict the search for the optimal predictive monitor F * . Following the neural state classification method of [6], in this work we consider functions described by deep neural networks (DNNs) 2 . Finding F * , i.e., finding a function approximation with minimal error probability, is indeed a classical machine-learning problem, a supervised classification problem in particular, with F * being the classifier, i.e., the function mapping HA state inputs x into one of two classes: 1 (x is positive, can reach D) and 0 (x is negative, cannot reach D).
Machine-learning classifiers often admit an underlying discriminant function, which is used to determine the final classifier output. In neural networks, the discriminant is the function mapping inputs into the class likelihoods (i.e., the softmax probabilities of the last network layer), such that the predicted class is the one with the highest likelihood.

Remark 1
If arg max y i ∈Y f i (x) contains more than one class, then the discriminant implies multiple possible predictions for x. To avoid this ambiguity, we will assume that f is always well defined, meaning that arg max y i ∈Y f i (x) is a singleton, and thus, only one prediction is possible for the classifier F. Any discriminant can be made well defined by, for instance, imposing a total ordering on the classes to select in such ambiguous cases and adequately adjusting the discriminant output 3 .
In what follows, we will interchangeably use the term "(reachability) predictor" for the classifier F and for its discriminant f . Likewise, we will also call F a state classifier, and in particular, neural state classifier (NSC) when its discriminant f is described by a deep neural network. Below we provide a definition of discriminant based on feed-forward neural networks.
Definition 5 (Deep Neural Network discriminant) A DNNbased discriminant over c classes can be defined as a function f w : X → [0, 1] c of the form: where L is the number of hidden layers, • is the function composition operator, f 0 is the input standardization function and, for i = 1, . . . , L, f i is the function computed by the i-th layer. In particular, f L is a function mapping the output of layer L − 1 to [0, 1] c .
Let n i indicate the number of neurons in layer i and let o i−1 ∈ R n i−1 be the output vector of layer i − 1. The output 3 Any discriminant f can be turned into a well-formed one by , ∈ R + , k = max y i ∈Ymax i (based on the ordering on Y ) and 0 k → ∈ R c is a vector whose components are all zeros but the k-th component equals to .
of layer i results from applying function f i : R n i−1 → R n i to the output of the previous layer: where w i,i−1 ∈ R n i ×n i−1 is the weight matrix that connects o i−1 to the neurons of layer i, b i ∈ R n i is the bias vector of layer i, and a i is the activation function of the neurons of layer i. Weights and biases are the parameters learned during training.
Training set. In supervised learning, one minimizes a measure of the empirical prediction error w.r.t. a training set. In our case, the training set Z is obtained from a finite sample X of X by labeling the training inputs x ∈ X using some reachability oracle, that is, a hybrid automata reachability checker like [12][13][14][15]. Hence, given a sample X of X, the training set is defined by Prediction errors. It is well known that neural networks are universal approximators, i.e., they are expressive enough to approximate arbitrarily well the output of any measurable mathematical function [16]. Even though arbitrarily high precision might not be achievable in practice, stateof-the-art optimization methods based on gradient descent via back-propagation [17] can effectively learn highly accurate neural network approximators. However, such methods cannot completely avoid prediction errors (no supervised learning method can). Therefore, we have to deal with predictive monitors F that are prone to prediction errors, which are of two kinds: false positives (FPs), when, for a state x ∈ X , F(x) = 1 (x is positive w.r.t. F) but M | Reach(D, x, T ), and false negatives (FNs), when F(x) = 0 (x is negative w.r.t. These errors are, respectively, denoted by predicates fn(x) and fp(x). In what follows, we consider general kinds of prediction error, described by predicate pe(x), and defined by any arbitrary combination of fn(x) and fp(x).
Remark 2 (Feature space and state space). Neural networks only admit inputs from some vector space ⊆ R n , which is also called feature space. However, the state space X of a hybrid automaton is not a vector space as it is given by X = Loc×V , where Loc is the finite set of HA locations and V is the domain of the continuous HA variables (see Definition 1). To this purpose, for x = (l, v) ∈ X , we apply the (straightforward) vector embedding (l, v) → [#(l) v] T , where # : Loc → R is a suitable injection (a typical choice is assigning an ordinal value to each HA location). In what follows, we keep this vector embedding implicit and work directly with inputs over X .

Uncertainty-based error detection
A central objective of this work is to derive, given a predictor f , a rejection criterion R f able to identify states x that are wrongly classified by F, i.e., FNs and FPs or any combination of the two, without knowing the true reachability value of x. Further, R f should be optimal, that is, it should ensure minimal probability of rejection errors w.r.t. the state distribution X. For this purpose, we propose to utilize information about the reliability of reachability predictions, so as to detect and reject potentially erroneous (i.e., unreliable) predictions. Our solution relies on enriching each prediction with a measure of predictive uncertainty: given f , we define a function u f : X → U mapping an HA state x ∈ X into some measure u f (x) of the uncertainty of f about x. The set U is called uncertainty domain. Defining u f and U is a non-trivial task, details are provided in Sect. 3. The only requirements are that u f is point-specific and should not use any knowledge about the true reachability value. Once defined, u f can be used to build an optimal error detection criterion, as explained below.
Problem 2 (Uncertainty-based error detection) Given a reachability predictor f , a distribution X over HA states X , a predictive uncertainty measure u f : X → U over some uncertainty domain U , and a kind of error pe find an optimal error detection rule G * f , pe : U → {0, 1}, i.e., a function that minimizes the probability Note that Problem 2 requires specifying the kind of prediction errors to reject. Indeed, depending on the application at hand, one might desire to reject only a specific kind of errors. For instance, in safety-critical applications, FNs are the most critical errors while FPs are less important. As for Problem 1, we can obtain a sub-optimal solution G f , pe to Problem 2 by expressing the latter as a supervised learning problem, where the inputs are, once again, sampled according to X and labeled using a reachability oracle. We call validation set the set of labeled observations used to learn G f , pe . These observation need to be independent from the above introduced training set Z , i.e., those used to learn the reachability predictor f . In the simplest scenarios, learning G f , pe reduces to identifying an optimal threshold. However, the proposed supervised learning solution is capable of identifying complex and multi-dimensional decision boundaries in an automatic fashion, making it suitable also for complex scenarios.
For an error pe, the final rejection rule R f , pe for detecting HA states where the reachability prediction should not be trusted, and thus rejected, is readily obtained by the composition of the uncertainty measure and the error detection rule where R f , pe (x) = 1 if the prediction on state x is rejected; R f , pe (x) = 0, otherwise. We remark that rejection rules for different kinds of errors could be combined together to express more sophisticated criteria.

Uncertainty quantification in neural predictive monitoring
We explore two approaches to quantify the uncertainty produced by a neural network-based reachability predictor f , i.e., to derive the uncertainty measures u f introduced in Section 2.1. The first approach, referred to as the frequentist approach, is based on Conformal Prediction [7,18,19]. The second one, referred to as the Bayesian approach, relies on Bayesian learning and employs probability distributions to express and measure uncertainty. In particular, we leverage the theory of Bayesian neural networks [20], which combines neural networks and probabilistic modeling.
We present the two uncertainty quantification approaches for a generic classification problem, where we consider an input space X , a set of classes Y = {y 1 , . . . , y c }, and a classifier F : For an input x, we will use the notationŷ as a shorthand for F(x), the classifier prediction on x. We define the data domain by Z = X ×Y , and we denote with Z the distribution of the data over Z .
In the context of PM of HA reachability, X is the HA state space, Y = {0, 1} (c = 2) is the set of possible reachability values, Z = Pr x∼X (x, 1(Reach(D, x, T ))), where X is the distribution of HA states and Reach(D, x, T ) is the reachability specification, and F is the reachability predictor, i.e., a (sub-)optimal solution of Problem 1.

Conformal prediction
Conformal Prediction (CP) is a technique that associates measures of reliability to any traditional supervised learning model. It is a very general approach that can be applied across all existing classification and regression methods [7,18,19]. CP produces prediction regions instead of single point predictions: given a significance level ε ∈ (0, 1) and a test point x * , its prediction region, Γ ε * ⊆ Y , is a set of classes that are guaranteed to contain the true class y * with probability 1 − ε, i.e., The CP method requires defining a so-called non-conformity function (NCF) h : Z → R: given a predictor f and an example z = (x, y), h(z) measures the "strangeness" of z, i.e., the deviation between the label y and the corresponding prediction f (x). The definition of a suitable NCF for our problem is discussed later. Now consider the distribution of NCF scores induced by the data distribution Z and the predictor f : Note that H precisely captures the distribution of the distances between true classes and corresponding predictions. The rationale behind CP is to construct the prediction region by "inverting" a suitable hypothesis test: given a test point x * and a tentative class y j ∈ Y , we exclude y j from the prediction region only if it appears unlikely that the NCF score h(x * , y j ) is distributed according to H, which implies that it is not likely that y j is the true class. In other words, for each y j ∈ Y , we perform the following hypothesis test: and include in Γ ε * all the y j values for which we fail to reject the null hypothesis H 0 at significance level ε (this is what we mean by "inverting the test"). In particular, H 0 is rejected if i.e., if the probability of observing a data point that is more "non-conforming" than (x * , y j ) is below . This probability is called p value in statistical jargon. Note that exact derivation of H is intractable, as it requires integration of h, which depends on the classifier and thus is typically nonlinear, over Z, a distribution that is seldom given in explicit form (but only empirically, through a set of observations). Therefore, we use instead an empirical approximation of H, derived by computing the NCF scores of a finite sample Z c of Z independent of the training set used to learn the predictor f . We call Z c the calibration set. This approach is called inductive CP [18]. We summarize it in the algorithm below. CP algorithm for classification. Given a sample Z of Z, a test input x * ∈ X , and a significance level ε ∈ (0, 1), CP computes a prediction region Γ ε * for x * as follows.

Divide Z into a training set Z t and a calibration set
5. Compute the NCF scores α j * = h(x * , y j ) for the test input x * and each possible class label j ∈ {1, . . . , c}. Then, compute the smoothed p value where θ ∈ U[0, 1] is a tie-breaking random variable. Note that p j * is the portion of calibration examples that are at least as non-conforming as the tentatively labeled test example (x * , y j ), i.e., an empirical approximation of the p value of Equation 3. 6. Return the prediction region together with the vector ( p 1 * , . . . , p c * ) of p values, one for each class.
Note that steps 1-4 are performed only once, while steps 5-6 are performed for each test point x * . Non-conformity function. A NCF function is well defined if it assigns low scores to correct predictions and high scores to wrong predictions. A natural choice for h, based on the 4 . Recall that, for an input x ∈ X , the output of f is a vector of class likelihoods, which we denote by In classification, a common well-defined NCF function is given by setting where f y (x) is the likelihood of class y when the predictor f is applied on x. If F correctly predicts y for input x, the corresponding likelihood f y (x) is high (the highest among all classes, see Definition 4) and the resulting NCF score is low. The opposite holds when F does not predict y. The NCF measure chosen for our experiments (Eq. 6) preserves the ordering of the class likelihoods predicted by f . Prediction uncertainty. A CP-based prediction region provides a set of plausible predictions with statistical guarantees, and as such, also captures the uncertainty about the prediction. Indeed, if CP produces a region Γ ε * with more than one class, then the prediction for x * is ambiguous (i.e., multiple predictions are plausible), and thus, potentially erroneous. Similarly, if Γ ε * is empty, then there are no plausible predictions at all, and thus, none can be trusted. The only reliable prediction is the one where Γ ε * contains only one class. In this case, Γ ε * = {ŷ * }, i.e., the region only contains the predicted class. This is always true for our NCF function. The proof is trivial and given in Appendix B.
The size of the prediction region is determined by the chosen significance level ε and by the p values derived via CP. Specifically, from Eq. 5 we can see that, for levels ε 1 ≥ ε 2 , the corresponding prediction regions are such that Γ ε 1 ⊆ Γ ε 2 . It follows that, given a test input x * , if ε is lower than all its p values, i.e., if ε < min j=1,...,c p j * , then the region Γ ε * contains all the classes, and Γ ε * shrinks as ε increases. In particular, Γ ε * is empty when ε ≥ max j=1,...,c p j * . We are now ready to introduce our frequentist uncertainty measures, called confidence and credibility, which we define in terms of two p values, independently of the significance level ε. The intuition is that these two p values identify the range of values for which the prediction is reliable, i.e., |Γ ε * | = 1.
Definition 6 (Confidence and credibility) Given a predictor F, the confidence of a point x * ∈ X , denoted by 1 − γ * , is defined as: and the credibility of x * , denoted by c * , is defined as: The confidence 1 − γ * is the highest probability value for which the corresponding prediction region contains onlyŷ * , and thus it measures how likely (according to the calibration set) our prediction for x * is. In particular, γ * corresponds to the second largest p value. The credibility c * is the smallest level for which the prediction region is empty, i.e., no plausible prediction is found by CP. It corresponds to the highest p value, i.e., the p value of the predicted class. Figure 2 illustrates CP p values and corresponding prediction region sizes.
In binary classification problems, like our predictive monitoring problem, each point x * has only two p values: c * (p value of the predicted class) and γ * (p value of the other class).
It follows that the higher 1 − γ * and c * are, the more reliable the predictionŷ * is, because we have an expanded range [γ * , c * ) of ε values by which |Γ ε * | = 1. Indeed, in the degenerate case where c * = 1 and γ * = 0, then |Γ ε * | = 1 for any value of ε < 1. This is why, as we will explain in the next section, our uncertainty-based rejection criterion relies on excluding points with low values of 1 − γ * and c * . Hence, our frequentist uncertainty measure associates with each input its confidence and credibility values. Definition 7 (Frequentist uncertainty measure) Given a predictor F with discriminant f , we define the frequentist uncertainty measure u f : X → U F = [0, 1] 2 as the function mapping inputs x into their corresponding confidence and credibility values, obtained as per Definition 6, i.e.,

Bayesian neural networks
A neural network is a function f w : X → [0, 1] c , which maps an input x ∈ X into a vector of class likelihoods, , depending on some parameters w, namely weights and biases. A trained neural network is typically a complex but deterministic model. The core idea of Bayesian neural networks (BNNs) is to place a probability distribution over its parameters w, thereby transforming the neural network into a stochastic model. The advantage of Bayesian methods, and BNNs in particular, is that they provide a distribution of predictions, called predictive distribution, rather than a single prediction like deterministic NNs. Such distribution captures both the aleatoric uncertainty, i.e., the noise inherent in the observations, and the epistemic uncertainty, i.e., model uncertainty about its prediction [21]. We will therefore leverage this predictive distribution (its mean and variance to be precise) to compute our Bayesian uncertainty measures.
The Bayesian learning process starts by defining a prior distribution for w that expresses our initial belief about the parameter values. As we observe data Z ∼ Z, we update this prior to a posterior distribution using Bayes' rule: Note that placing a prior distribution over w is analogous to the random weight initialization required to train a traditional (deterministic) neural network. A common choice is to choose a zero-mean Gaussian prior. Similarly, we assume that the conditional distribution p(y|x) is a softmax likelihood It follows that, given a set of i.i.d. observations Z , the likelihood function can be expressed as Note that, because of the nonlinearity introduced by the neural network function f w (x) and by the softmax likelihood, the posterior p(w|Z ) is non-Gaussian. Finally, in order to predict the value of the target for an unobserved input x * , we marginalize the predictions with respect to the posterior distribution of the parameters, obtaining The latter is called posterior predictive distribution and it can be used to retrieve information about the uncertainty of a specific predictionŷ * . Unfortunately, the integration is analytically intractable due to the nonlinearity of the neural network function [20,22]. Empirical approximation of the predictive distribution . Assume, for the moment, that we are able to sample from the posterior distribution (9) and let [w 1 , . . . , w N ] denote a vector of N realizations of the random variable w ∼ p(w|Z ). Each realization w i induces a deterministic function f w i that can be evaluated at x * , the unobserved input. The likelihood for a targetŷ * can be computed using (10). The empirical approximation of the predictive distribution (12) can be expressed as where f * w i denotes the component of f w i corresponding to classŷ * . By the strong law of large numbers, the empirical approximation converges to the true distribution as N → ∞ [23]. The sample size N can be chosen, for instance, to ensure a given width of the confidence interval for a statistic of interest [24] or to bound the probability that the empirical distribution differs from the true one by at most some given constant [25]. Bayesian inference techniques. Since precise inference is infeasible, various approximate methods have been proposed to infer a BNN. We consider two approximate solution methods: Hamiltonian Monte Carlo and Variational Inference.
Let w be a weight vector sampled from the posterior distribution p(w|Z ), i.e., a realization of the random variable w. We denote with f w (x) the corresponding deterministic neural network having weights fixed to w.
Hamiltonian Monte Carlo (HMC) [8] defines a Markov chain whose invariant distribution is exactly the posterior p(w|Z ) . The Hamiltonian dynamics is used to speed up the space exploration. HMC does not make any assumption on the form of the posterior distribution, and is asymptotically correct. After convergence, HMC returns a trace of explored network weights w 0 , w 1 , . . . , w N that, all together, can be interpreted as an empirical approximation of the posterior p(w|Z ). Controlling in a precise way the convergence rate and how well the chain explores the parameter space is, however, far from trivial.
Variational Inference (VI) [9] directly approximates the posterior distribution with a known parametric distribution q(w; ψ), typically a distribution easy to sample from. Its parameters ψ, called variational parameters, are learned by minimizing the Kullback-Leibler (KL) divergence between the proposed distribution and the posterior. The KL divergence between q(w; ψ) and p(w|Z ) is defined as Since the posterior distribution is not known, a different objective function, called Evidence Lower Bound (ELBO), is introduced. It is defined as ELBO [26]. In VI, the variational objective, i.e., the negative ELBO, becomes the loss function used to train of a Bayesian neural network [26]. A common choice for q(w; ψ) is the Gaussian distribution (where ψ are its mean and variance). The predictive distribution is a nonlinear combination of Gaussian distributions, and thus it is not Gaussian. However, samples can be easily extracted from q(w; ψ), which allows us to obtain an empirical approximation of the predictive distribution. Prediction uncertainty. Having showed how to derive empirical approximations of the predictive distribution (either with HMC or with VI) we can now extract statistics to characterize the latter. We stress that the predictive distribution, and hence its statistics, effectively captures prediction uncertainty. Our Bayesian uncertainty measure is, therefore, given by the empirical mean and variance of the predictive distribution.
Definition 8 (Bayesian uncertainty measure) Given observations Z ∼ Z and a Bayesian neural network f w with w ∼ p(w|Z ), we define the Bayesian uncertainty measure u f : X → U B ⊆ R 2 as the function mapping inputs x into the empirical mean and variance of the predictive distribution p(ŷ | x, Z ) (12). Formally, ∀x ∈ X , u f (x) = (μ, σ 2 ).

Remark 3 (Softmax probabilities as uncertainty measures).
A popular method to assess the quality of a prediction is to use the probabilities outputted by the softmax layer of the DNN (i.e., the output of the discriminant function). However, previous works have shown that such probabilities are not well calibrated, meaning that the probability associated with the predicted class label does not reflect its ground-truth correctness likelihood [27,28]. For example, in a binary classification problem, one can consider the difference between the probability of the two classes as a measure of uncertainty, where small differences indicate uncertain predictions. Previous experiments in [29] have shown that such a measure yields poor error detection in NPM, because the measure is overconfident in predictions that turn out being erroneous.
Such observation supports our claim that more principled and calibrated methods to measure uncertainty are needed. On one hand, our uncertainty measure based on Conformal Prediction overcomes this limitation, as it makes predictions with statistical evidence, rather than probabilistic evidence. On the other hand, our Bayesian measure of uncertainty remains consistent also in regions where no data has been observed and, in particular, where the deterministic DNN will behave almost randomly.

Uncertainty-based rejection criteria
In this section, we show how to leverage the measures of uncertainty introduced in Sect. 3 to learn an optimal, uncertainty-based error detection rule for reachability predictions, thereby solving Problem 2.
The rationale is that an unseen input x must have sufficiently low uncertainty values in order for the prediction to be accepted. However, manually determining such decision boundaries on the uncertainty domain U is a non-trivial task. As discussed in Sect. 2, optimal error detection thresholds can be automatically identified by solving an additional supervised learning problem.
For a type of error e ∈ {pe, fp, fn}, given a set of validation inputs X v , sampled from X, and an uncertainty measure u f , we build a validation set W e v , defined as The inputs of W e v , referred to as U v , are the uncertainty measures evaluated on X v : Similarly, each input u f (x) ∈ U v is then labeled, respectively, with 1 or 0 depending, respectively, on whether or not the classifier f makes an error of type e on x: For ease of notation, in the following we omit the type of error considered. However, it is important to keep it in mind when addressing a specific application. We seek to find those uncertainty values that optimally separate the points in U v in relation to their classes, that is, separate points yielding errors from those that do not. Finding such a separation corresponds to finding a (sub-)optimal solution to Problem 2.
As for the uncertainty measures of the previous section, we present two alternative solutions to this problem. The first one leverages support vector classification (SVC) and applies to the frequentist case, based on CP, where the uncertainty values are given by confidence and credibility. The second solution leverages Gaussian process classification (GPC) and applies to the Bayesian case, based on BNNs, where uncertainty values are given by mean and standard deviation of the predictive distribution.
Remark 4 (Highly unbalanced dataset). In predictive monitoring, the neural network-based reachability predictors that we use have typically very high accuracy (see results in Sect. 6). Therefore, the dataset W v is typically highly unbalanced, as it contains more examples of correct classifications (class 0) than of classification errors (class 1). In binary classification problems, in particular in both SVC and GPC, accuracy can be a misleading measure when dealing with imbalanced datasets. Indeed, for instance, a constant function mapping any input into the most frequent class will have high accuracy. Therefore, a model trained on accuracy maximization tends to misinterpret the behavior of the observations belonging to the minority class, causing misclassification. However, in our method, the less frequent class (i.e., the prediction errors) is actually the most interesting one, which we want to classify correctly.

Frequentist error detection via support vector classification (SVC)
For data parsimony, since calibration points were not used to train the reachability predictor, we build the validation set W v from the calibration set Z c . A cross-validation strategy is used to compute values of confidence and credibility for points in Z c . The cross-validation strategy consists of removing the jth score, α j , in order to compute γ j and c j , i.e., the p values at In this way, we can compute our frequentist uncertainty measure given by confidence 1 − γ and credibility c for every point in the calibration set. The support vector classifier (SVC) is then trained on pairs ((1 − γ, c), e), where e indicates whether or not a prediction error is observed. In a nutshell, SVC is a kernel-based method that maps the original data into a new space, called feature space, via a feature map φ. By doing so, patterns that are not linearly separable in the original data can be converted to be linearly separable in the feature space [20]. The linear decision boundary for a binary SVC is defined as for x in the original space. Training a SVC reduces to find the values of a and b that maximize the margin around the separating hyperplane and the decision function d(x). In the dual formulation of the SVC problem, whose details are out of the scope of this paper, the optimization is performed using kernel functions rather than feature maps. Recall that a kernel can be defined as k(u, u ) = φ(u) T · φ(u ). In practice, given a test point x * with predicted labelŷ * and uncertainty measure u f (x * ) = (1 − γ * , c * ), error detection at x * boils down to evaluating the learned SVC, i.e., its decision function d, at u f (x * ). If d(u f (x * )) > 0 the point x * is classified as potentially erroneous, class 1, and 0 otherwise. Tuning of SVC hyperparameters. A simple method to handle imbalanced classes in SVC is via cost-sensitive learning [30]. The aim is to find the classifier that minimizes the mean predictive error on the training set. Each misclassified example by a hypothetical classifier contributes differently to the error function. One way to incorporate such costs is the use of a penalty matrix, which specifies the misclassification costs in a class dependent manner [31]. We design an empirical penalty matrix P, as follows: the (i, j)-th entry of P gives the penalty for classifying an instance of class i as class j. Of course, when i = j, the penalty is null. The penalty matrix for dataset W v is defined as where n e is the number of points belonging to class 1 in W v , r e is a parameter influencing how many errors we are willing to accept and q = |W v |. The term r e q 2n e , which represents the penalty for wrongly classifying an error as correct, increases as n e decreases. Note that, when r e = 1 and the dataset is perfectly balanced (q = 2n e ), the penalties are equal: r e q 2n e = q 2r e (q−n e ) = 1. Further, if r e > 1, the penalty term increases, leading to more strict rejection thresholds and higher overall rejection rates. On the contrary, if r e < 1, the penalty decreases, leading to possibly missing some errors.
Definition 9 (Frequentist error detection criterion) Given a state x * ∈ X , a reachability predictor f and an uncertainty measure u f (x * ) = (1 − γ * , c * ) as per Definition 7, the frequentist error detection function G f : where d is the binary SVC of (19) trained on W v (16).

Bayesian error detection via Gaussian process classification (GPC)
Recall that we define our Bayesian uncertainty measure as the mean and variance of the empirical approximation of the BNN predictive distribution. Therefore, the prediction y * made on an unseen input x * is associated with a vector of uncertainty (μ * , σ 2 * ) ∈ U B ⊂ R 2 . To keep the approach fully Bayesian, we propose a probabilistic solution to the error detection problem based on Gaussian Processes [32].
Formally, a Gaussian Process (GP) is a stochastic process, i.e., a collection of random variables indexed by some input variable, in our case u ∈ U , such that every finite linear combination of them is normally distributed. In practice, a GP defines a distribution over real-valued functions of the form : U B → R and such distribution is uniquely identified by its mean and covariance functions, respectively, denoted by m(u) = E[ (u)] and k(u, u ). The GP can thus be denoted as GP (m(u), k(u, u )). This means that the function value at any point u, (u), is a Gaussian random variable with mean m(u) and variance k(u, u). Typically, the covariance function k(·, ·) depends on some hyperparameter γ .
As mentioned before, GPs can be used to perform probabilistic binary classification, i.e., learning a Bayesian classifier G f : U B → {0, 1} from a set of observations. GPs model the posterior probabilities by defining latent functions : U B → R, whose output values are then mapped into the [0, 1] interval by means of a so-called link function Φ. Typically, in binary classification problem, the logit or the probit function are used as Φ.
Given an input u i , let i = (u i ) denote its latent variable, i.e., the latent function evaluated at u i . Also denote The first step of a GPC algorithm is to place a GP prior over the latent function , defined by Let now consider a test input u * with latent variable * . In order to do inference, that is, predict its label e * , we have to compute where E v , see Eq. 18, denotes the set of labels corresponding to points in U v , see Eq. 17. The discriminant function g f : U B → [0, 1] 2 of the error detection classifier G f is defined as To compute Eq. (21), we have to marginalize the posterior over the latent Gaussian variables: where the posterior p( v |U v , E v ) can be obtained using the standard Bayes rule Therefore, performing inference reduces to solving two integrals (Eqs. 21 and 23). In classification, the first integral is not available in closed form since it is the convolution of a Gaussian distribution, p( v ), and a non-Gaussian one, p(E v | , U v ). Hence, we have to rely on approximations in order to compute and integrate over the posterior p( v |E v ).
In our experiments, we use the Laplace method, which provides a Gaussian approximation q( v |E v ) of the posterior p( v |E v ), which can then be easily computed and integrated over.
Tuning of GPC hyperparameters. In the prior distribution, the covariance for the latent variables depends on some hyperparameters γ . A classical strategy to select the optimal values for such parameters is to find the values of γ that maximize the marginal likelihood, which, intuitively, measures how likely the data are, given a certain value of γ . However, as mentioned in Remark 4, the marginal likelihood may be a poor choice because class 1 is very little represented. An alternative solution is to compute, for different values of γ , the confusion matrix of the GPC on the training set W v . The entries of such matrix can be used to define more clever measures of performance that apply to binary classification. In our experiments, we use the true positive rate (TPR), as it is well-suited for datasets presenting a strong disproportion. TPR measures the fraction of points in class 1 that have been correctly classified: where TP indicates the number of true positives and F N indicates the number of false negatives. Alternative measures, such as the Matthews correlation coefficient (MCC) [33], may apply. Note that, during the training phase, the GPC assigns to each point the class with highest likelihood.
Another key step is the following. The discriminant function, g f , returns a vector containing the probability of belonging to each of the two classes. However, such probabilities might not separate well in cases of highly unbalanced datasets. Therefore, choosing the class with the highest probability, as per Definition 4, may lead to bad performance. Therefore, after the GPC has been trained with an optimal value for γ , it may be useful to find the decision threshold that maximizes the GPC accuracy on the training set W v . This can be done, for instance, using the ROC curve. In other words, we classify as correct (class 0) only those points that have an extremely high probability of belonging to that class. We do so by searching for the threshold τ that maximizes the quantity T P R − F P R, i.e., the proportion of recognized errors minus the proportion of points wrongly rejected. Below we provide the formal definition of the Bayesian error detection function for a generic threshold τ . Definition 10 (Bayesian error detection criterion) Given a state x * ∈ X , a reachability predictor f , a Bayesian uncertainty measure u f (x * ) = (μ * , σ 2 * ) = u * as per Definition 8, and a decision threshold τ ∈ [0, 1), the error detection function G f : U B → {0, 1} rejects a reachability estimates F(x * ) if and only if where U v , E v are the inputs and outputs of the validation set, see (17) and (18).

Active learning
Recall that we are dealing with two related learning problems: learning a prediction rule (i.e., a reachability predictor) using the training set Z t , and learning a rejection rule using the validation set W v (via learning an adequate uncertainty measure first).
As the accuracy of a classifier increases with the quality and the quantity of observed data, adding samples to Z t will generate a more accurate predictor, and similarly, adding samples to W v will lead to more precise error detection. Ideally, one wants to maximize accuracy while using the least possible amount of additional samples, because obtaining labeled data is expensive (in NPM, labeling each sample entails solving a reachability checking problem), and the size of the datasets affects the complexity and the dimension of the problem. Therefore, to improve the accuracy of our learning models efficiently, we need a strategy to identify the most "informative" additional samples.
For this purpose, we propose an uncertainty-aware active learning solution, where the retraining points are derived by first sampling a large pool of unlabeled data, and then considering only those points where the current predictor f is still uncertain. The criterion used to decide whether a point is uncertain enough to be considered informative is indeed our rejection rule R f . In particular, recall that the uncertainty function u f : X → U maps input states to their level of uncertainty, and the error detection function G f : U → {0, 1} maps uncertainty values to a binary class interpreted as accepting/rejecting a prediction. The rejection rule R f : X → {0, 1}, introduced in Sect. 2, is defined as the combination of these two functions, R f = G f • u f . Therefore, such rejection rule provides an effective uncertainty-based query strategy. Points rejected by R f , i.e., points whose predictions are expected to be erroneous, are indeed the most uncertain ones.
The proposed active learning method should reduce the overall number of erroneous predictions, because it improves the predictor on the inputs where it is most uncertain, and, as a consequence, also reduces the overall rejection rate.
However, it cannot be excluded in general that the retraining process introduces new prediction errors. We stress that, with our method, these potential new errors can be effectively detected.

General active learning algorithm
Our active learning algorithm works as follows. The rejection rule R f is used as a query strategy to identify, from a batch of randomly selected unlabeled points, those with a high degree of uncertainty. We then query the oracle, i.e., the HA reachability checker, to label such uncertain points, and finally we divide them into two groups: one group is added to the training set Z t ⊆ X ×Y , producing the augmented dataset Z a t ; the other is added to the validation set Z v , producing Z a v . The set of uncertain points must be divided according to the splitting probability used to originally divide Z into Z t and Z v . The first step consists in retraining the reachability predictor on Z a t . Let f a denote the new predictor. The second step requires extracting the augmented validation dataset W a v from Z a v . To do so, we must first train a new uncertainty measure u f a so as to reflect the new predictor f a . Then, the new error detection rule G f a is trained on where e f a is the error predicate introduced in Sect. 2 (the f a index is added to stress dependency on the updated predictor).
In conclusion, this process leads to an updated rejection rule R f a = G f a • u f a , which is expected to have a reduced rate of incorrect rejections. We now describe in detail our uncertainty-aware active learning algorithm, which given an initial training set Z t , a predictor with discriminant f trained on Z t , an initial validation set Z v , and a rejection rule R f trained on Z v , computes an enhanced predictor f a and enhanced rejection rule R f a as follows.
The above algorithm is used as-is in the Bayesian framework. For the frequentist case, we present a refined version of the algorithm that overcomes issues with the sensitivity of CP-based measures. Sensitivity of CP-based uncertainty measures. The distribution of calibration scores depends both on the case study at hand and on the trained classifier. If such a classifier F has high accuracy, then most of the calibration scores α 1 , . . . , α q will be close to zero. Each p value p j * of an unseen test point x * counts the number of calibration scores greater than α j * , the non-conformity score for label j at x * . Credibility, which is the p value associated with the class predicted by F, is expected to have a small score and therefore a high p value. On the contrary, γ , which is the p value associated with the other (non-predicted) class, is expected to have a larger score. However, given the high accuracy of F, the number of cal-

Algorithm 1 Active Learning algorithm
Inputs: training set Z t , validation set Z v , predictor f , uncertainty function u f , rejection rule R f , maximum iterations n it . Outputs: enhanced predictor f a , enhanced rejection rule R a f . repeat n it times: // Select retraining inputs 1. Randomly sample a set of input points. 2. Identify the subset A of points rejected based on R f . // Derive augmented datasets 3. Invoke the reachability oracle to label the points in A. 4. Divide the data into two groups and add them, respectively, to Z t and Z v , obtaining an augmented training set, Z a t , and an augmented validation set, Z a v .
5. Train a new predictor f a from Z a t . 6. Build the training set W fa v using Z a v and f a . 7. Train a new error detection rule G fa , using u f and the method of Section 4, and obtain the enhanced rejection rule R fa .
ibration scores significantly greater than zero is very small. Therefore, the fraction of calibration scores determining γ is not very sensitive to changes in the value of α * , which is determined by f (x * ). On the contrary, credibility is extremely sensitive to small changes in α * . In general, the sensitivity of confidence with respect to α * increases as the accuracy of f decreases, and vice versa for credibility. Figure 3 shows the credibility landscapes for two different training instances of model f on the same training set for a concrete case study. We observe that even if regions where misclassifications take place are always assigned low credibility values, outside those regions credibility values are subject to high variance.
This sensitivity results in an over-conservative rejection criterion, leading to a high rejection rate and in turn, to an inefficient query strategy. However, if we enrich the calibration set using additional samples with nonzero α-scores, we can reduce such sensitivity, thereby making credibility more robust with respect to retraining. This process is illustrated in Figure 3, where the additional nonzero α-scores (right) lead to a more robust credibility landscape, where low-credibility regions are now more tightly centered around areas of misclassification.
Observing that samples with uncertain predictions will have nonzero α-scores 5 , we will use the original rejection rule to enrich the calibration set, thereby deriving a refined rejection rule and in turn, a refined and more effective query strategy for active learning.

Frequentist active learning algorithm
Provided that the credibility measure is extremely sensitive in our application, we found that dividing the frequentist active learning algorithm in two phases dramatically improves performances. In the first phase, we refine the query strategy: we use the current rejection rule R f to identify a batch of uncertain points, temporarily add these points to the calibration set, thereby obtaining an updated, more robust, rejection rule that we use as a query strategy. In the second phase, we simply perform an active learning iteration, i.e., steps 2-5 above) but using the refined query strategy to identify the retraining inputs.
We now describe the details of this variant of the active learning algorithm designed for the frequentist framework. Given an initial training set Z t , a prediction rule f trained on Z t , an initial calibration set Z c , a rejection rule R f trained on Z c and a rejection ratio r e , we proceed as follows.

Algorithm 2 Frequentist AL algorithm
Inputs: training set Z t , calibration set Z c , predictor f , uncertainty function u f , rejection rule R f , maximum iterations n it . Outputs: enhanced predictor f a , enhanced rejection rule R a f . repeat n it times: // Refine the query strategy 1. Randomly sample a set of input points. 2. Identify the subset Q of points rejected by R f . 3. Identify the subset A of points rejected based on R f . 4. Invoke the reachability oracle to label the points in Q. 5. Define a query set Z Q by adding these points to Z c . 6. Obtain an updated rejection rule R Q f from Z Q using the method of Section 4. // Active phase 7. Randomly sample a set of input points. 8. Identify the subset A of points rejected by R Q f . 9. Invoke the reachability oracle to label the points in A. 10. Divide the labeled data into two groups and add them, respectively, to Z t and Z c , obtaining an augmented training set, Z a t , and an augmented calibration set, Z a c . 11. Train a new predictor f a from Z a t . 12 It is important to observe that, in order for the active learning algorithm to preserve the statistical soundness of conformal prediction, the augmented training and calibration sets Z a t and Z a c must be sampled from the same distribution. This is guaranteed by the fact that, in the active learning phase, we add new points to both the training and the calibration dataset, and these points are sampled from the same distribution (in particular, we apply the same random sampling method and same rejection criterion). The only caveat Fig. 3 Credibility values in the spiking neuron case study. Calibration scores (first row) and credibility landscapes using the initial calibration set Z c (left column) versus the query set Z Q (right column). The landscapes are obtained for different instances of the predictor f , trained on the same dataset Z t is ensuring that the ratio between the number of samples in Z c and Z t is preserved on the augmented datasets.

Experimental results
We experimentally evaluate the proposed method and compare the frequentist and Bayesian approaches on a benchmark of six hybrid system models with varying degrees of complexity. We consider four deterministic case studies: the model of the spiking neuron (SN) action potential [6] introduced in Sect. 2 and the classic inverted pendulum (IP) on a cart, which are two-dimensional models with nonlinear dynamics, the artificial pancreas (AP) [34], which is a six-dimensional nonlinear model, and the helicopter model (HC) [6], which is a linear model with 29 state variables. In addition, we analyze two non-deterministic models with nonlinear dynamics: a cruise controller (CC) [6], whose input space has four dimensions, and a triple water tank (TWT) 6 , which is a three-dimensional model. Details of the case studies are available in Appendix C.

Experimental settings
The experiments were performed on a computer with a CPU Intel x86, 24 cores and a 128GB RAM. Table 1 compares the performances of DNN and BNN against different types, respectively, deterministic and Bayesian, of classifiers. In particular, in the deterministic case we compare: a sigmoid DNN (DNN-S) with 3 hidden  In the Bayesian scenario, we observe that GP and BNN have comparable performances on the simplest models. However, BNN works better as soon as the dimension of the system increases. Furthermore, BNNs offer better scalability than GPs. Indeed, the scalability of GP inference depends heavily on the size of the dataset n, with a time complexity of O(n 3 ), whereas for BNNs with VI this is O(n · m), where m is the number of epochs, and for BNN with HMC the complexity is O(n · k), where k is the number of steps of the Markov chain. In our experiments m k, and k and n have same order of magnitude. On the other hand, BLR shows limited performances for systems whose dynamics is intrinsically nonlinear. In general, despite the overall difference in performance may seem small, we would like to stress that we target safety-critical applications, for which we seek accuracies as close as possible to 100% and even small improvements become important.
Motivated by the results presented in Table 1, we choose the sigmoid DNN architecture described above for our reachability predictions. In particular, the output of the DNN with parameters w in a state x ∈ X , denoted by f w (x), is the likelihood of class 1, i.e., the likelihood that the hybrid automaton state is positive. Therefore, the discriminant function f evaluated at x returns a vector of probabilities To avoid overfitting, we did not tune the architecture (i.e., number of neurons and hidden layers) to optimize the performance for our data and, for the sake of simplicity, we choose the same architecture for all the case studies, as we found no specific DNN architecture with consistently better performances. See Appendix D for a detailed performance analysis for different choices of the DNN architecture. In particular, we use the same architecture for deterministic DNNs and their Bayesian counterpart.
The entire pipeline is implemented in Python, and the neural networks are trained with TensorFlow [35] and PyTorch [36]. More precisely, Keras [37], a Python deep learning library, is used to train the deterministic DNN, Edward [38], a Python library for probabilistic modeling built on Tensor-Flow, is used to train the BNN with HMC inference, and Pyro [39], a probabilistic programming library built on PyTorch, is used to train the BNN with VI. The source code for all the experiments can be found at the following link: https:// github.com/francescacairoli/NPM.
For every model, we generate an initial dataset Z of 20K samples and a test set Z test of 10K samples. The helicopter model is the only exception, where, due to the higher dimensionality, a set Z of 100K samples is used. Both Z and Z test are drawn from the same distribution Z; see [6] and Appendix E for more details on how data are labeled and on the distributions for each case study. The training and validation sets are two subsets of Z extracted as follows: a sample z ∈ Z has probability s of falling into Z t and probability 1 − s of falling into Z v , where s = 0.7 is the splitting ratio. Recall that the calibration set Z c of the frequentist approach coincides with the validation set Z v and that we use the same splitting rate s when augmenting the datasets during active learning. The dReal solver [12] is used as a reachability oracle to label the datasets for the non-deterministic case studies. For the deterministic case studies, we used an HA simulator implemented in MATLAB. Kernel choice. Both error detection rules, based on SVC for the frequentist approach and GPC for the Bayesian approach, are kernel-based methods. The radial basis function (RBF) kernel has been chosen in both cases, as it outperforms the polynomial and linear kernels. The RBF is defined as k γ (u, u ) = exp(γ u −u 2 ), where u −u 2 is the squared Euclidean distance between the two input vectors. Error type selection. Below we focus on detecting all kinds of prediction error, including false positives and false negatives. However, it is possible-and this is a very useful feature of our approach-to focus on a specific type of error. For example, in safety-critical applications, one could focus on detecting false negatives, which are the most critical kind of errors. An alternative solution, which we explored in [10], is to learn two distinct rejection rules, one for false positives and one for false negatives, and combine them into a global rejection rule that suits the case study at best.
In addition, in the frequentist case, one can tune the SVC penalty matrix P (see equation (20)) to penalize specific kinds of errors. For instance, setting r f n > 1 will result in a detection rule that is stricter on recognizing false negatives. In the Bayesian case, the GPC decision threshold can be tuned by maximizing scores other than T P R − F P R. Tuning of the Bayesian approach. Training the deterministic DNNs was straightforward in our experiments. All models share the same initialization settings and all reach an extremely high accuracy (always higher than 99%, see Table 3). On the contrary, training a Bayesian neural network requires a careful tuning of the inference hyperparameters, e.g., the choice of prior distributions and the sample size used to empirically approximate the predictive distribution. Furthermore, in the HMC framework, the parameters governing the Hamiltonian dynamics affect the capabilities of the Monte Carlo algorithm to explore and to eventually converge. In the VI framework, the hyperparameters by which we maximize the ELBO (15) may change from one model to the other. The main drawback is that this may limit the effectiveness of the active learning framework, as explained later in this section.

Performance measures
We want our method to be capable of working at runtime, which means it must be extremely fast in making predictions and deciding whether to trust them. We emphasize that the time required to train the reachability predictor and the error detection rule does not affect its runtime efficiency, as it is performed in advance (offline) only once. Also, we do not want an over-conservative rejection rule, as unnecessary rejections would reduce effectiveness of our predictive monitor 7 .
Keeping that in mind, the relevant performance metrics for NPM are the accuracy of the reachability predictor F, the error detection rate (or recognition rate) and the overall rejection rate of the rejection rule R f . The error detection rate measures the proportion of errors made on the test set by F that are actually recognized by R f , whereas, the rejection rate measures the overall proportion of test points rejected by R f . Clearly, we want our method to be reliable and thus, detect the majority of prediction errors (high detection rate) without being overly conservative, i.e., keeping a low rejection rate.
Another important remark is about the interaction of the two classifiers, F and G f : as the accuracy of F increases it commits fewer errors, which makes it harder for the detection rule G f to learn how to capture them because the validation set W v for training G f will contain few examples of prediction errors. The opposite holds as well: if F performs poorly, it produces a less unbalanced validation set W v , which may result in a more accurate rejection rule R f . These two behaviors are balanced against one another, as discussed above, by tuning the training of the rejection rules.

Computational performance
Offline cost. Training an NPM requires the following steps: (i) training the state classifier, (ii) generating the datasets W v , which requires computing the uncertainty values for each point in Z v , and (iii) training the error rejection rule. All these steps are performed offline. Executing the entire pipeline, i.e., learning a working NPM, when |Z | = 20K, takes around 3 minutes in the frequentist case and around 11 minutes in the Bayesian case. When |Z | = 100K, it takes around 6.5 (120-190) minutes in the frequentist (Bayesian) case. The time required to execute 20K VI epochs is comparable with the time required to perform 2K HMC steps (see Table 2, bottom-left frame). Online cost. Given a test input x * , it takes from 1.4 up to 31 milliseconds to evaluate the NPM, i.e., to make a prediction and choose whether to accept it or not (see Table 2, top frame). Importantly, this time does not depend on the dimension or dynamics of the hybrid system. However, in the Bayesian approach it depends on the number of observations used to empirically approximate the predictive distribution, which is a fixed cost, whereas, in the frequentist approach the evaluation time is affected by the size of the calibration set Z c , which may increase as we add observations 8 . On this aspect, the query strategy refinement we propose for active learning ensures that the augmented calibration set is as small as possible, which translates into runtime efficiency of our method. Active learning overhead (offline). Active learning carries two additional training costs: the time needed to compute uncertainty values for a large pool of data, and the time the oracle needs to compute labels for the most uncertain points. The latter dominates, especially for non-deterministic systems, since they require fully fledged reachability checking, which is more expensive than simulation of a deterministic system. Therefore, if the rejection rate is relatively high and we consider a large pool of randomly selected points, Time is measured in milliseconds (ms). (Bottom) Offline computational costs: time required to initially train the NPM. AL overhead: time to complete an active learning iteration (on average). Time is measured in minutes the procedure may take long. The pool of new inputs has indeed to be large in order to have good exploration and find significant instances. As we will show experimentally, our uncertainty-aware active learning approach results in a more precise rejection rule with a lower rejection rate. Therefore, the time spent for offline retraining pays off in improving the online performance of the NPM. In our study, the pool used to refine the query strategy (required only with CP) contains 50K samples, whereas the pool used for the active learning phase contains 100K samples. In particular, one iteration of the active learning procedure takes, for the simplest deterministic models, around 10 minutes in the frequentist scenario, and around 20 minutes in the Bayesian scenario (both VI and HMC approaches). The helicopter model needs longer time, as it is trained for a higher number of epochs: it takes around 1 hour in the frequentist case against the 10 hours of the Bayesian case. For the non-deterministic models (triple water tank and cruise controller), an active learning iteration takes approximatively the same time of a simple deterministic model, except for the overhead introduced in labeling new points (see Table 2: bottom-right frame). dReal, the non-deterministic reachability checker, takes around 1.5 mins to label 100 observations of the CC model and around 4 mins to label the same amount of points of the TWT model. In general, the time required for a single active learning iteration is expected to decrease for subsequent iterations, as the rejection rate will be lower (leading to fewer retraining samples). Note that retraining is performed offline and does not affect runtime performance of our approach.

Experiments
We evaluate our approach on three configurations: the initial configuration, where the predictor F and error detection rule G f are derived via supervised learning; the active configuration, where the initial models are retrained via our uncertainty-aware active learning; and the passive configuration, where the initial models are retrained using a uniform sampling strategy to augment the dataset and with the same number of observations of the active configuration.
In this way, we can evaluate the benefits of employing an uncertainty-based criterion to retrain our models. Table 3 presents the experimental performances (on the test set Z test ) in the initial configuration. The results are averaged over five runs, where in each run, we resample Z t and Z c from Z and retrain F. Table 4 compares the performances of the three configurations, only for one run in this case. Frequentist approach. The average NSC accuracy over the six case studies is 99.68%. The rejection criterion recognizes well almost all the errors, with an average error detection rate of 99.53%, but the overall rejection rate in the initial configu-  Results are over a single run. Legend is as in Table 3 ration is around 5%, a non-negligible amount. Table 4 shows that the passive learning approach provides little improvement: the NSC accuracy is similar to the initial one and the rejection rate is still relatively large. However, the active approach provides a significant improvement: the overall rejection rate and the number of errors made by the NSC falls dramatically, while preserving the ability of detecting almost all errors (with an error detection rate of 100%, except for the helicopter). In particular, rejection rates span from 3.46% to 6.75% with the initial rejection rule, but drop to between 0.51% and 4.52% after active learning, and the average NSC accuracy increases from 99.68% (initial) to 99.82% (active). Bayesian approach. The predictive distribution is approximated by samples of 100 observations. The BNN priors are chosen to be standard normal distributions. In the initial configuration, the NSC accuracy, averaged over all the case studies, is 98.95% with VI and 98.27 with HMC. The rejection criterion recognizes on average 86.18% of errors with VI and 95.27% with HMC. The overall rejection rate is approx. 5.19% with VI, spanning from 1.06% to 13.55%, and approx. 9.87% with HMC, spanning from 6.56% to 16.03%. Table 4 shows that, in the HMC framework, the passive learning approach happens to produce results that are even worse than the initial configuration. The reason might be that that once the training sets are extended with data that may come from a distribution different from X, the set of hyperparameters chosen to optimally solve the initial problem may become sub-optimal. Indeed, when the hyperparameters were tuned again, specifically for the passive learning dataset, high performances were reached. For instance, in the TWT model, the initial HMC performance rates (obtained with proper hyperparameter tuning) are: 97.5% accuracy, 96.8% recognition and 5.04% rejection. Passive learning (without re-tuning the hyperparameters) causes a significant drop in performance: 91.86% accuracy, 52.58% recognition and 11.31% rejection. However, once the HMC hyperparameters are tuned specifically for the passive learning dataset, the level of performance gets back to the initial one: 98.59% accuracy, 96.45% recognition and 3.63% rejection. On the other hand, active learning still yields improvements: the NSC accuracy rises from 98.27% to 99.30%, and the rejection rates, initially very high, are significantly reduced, which unfortunately causes a slight decrease in the error detection accuracy. The recognition rate falls from 95.27% to 91.68%.
We finally observe that VI outperforms inference via HMC, even though VI is not able to reach recognition rates as high as in the frequentist approach. In particular, on average, the initial VI approach yield an NSC accuracy of 98.95%, a rejection rate of 5.19% and a recognition rate of 86.18%. The passive results, as before, introduce only minor improvements, whereas active learning yields a significant reduction in the rejection rate (from 5.19% to 2.36%), an increase in the NSC accuracy (from 98.95% to 99.32%) and an increase in the overall recognition rate (from 86.175% to 91.85%). Discussion. A likely reason why the Bayesian approach falls behind the frequentist one is that the former introduces several levels of approximation. Indeed the BNN is trained using either VI or HMC, two approximate inference techniques, resulting in an approximation of the true posterior distribution. Moreover, the resulting uncertainty measures are defined by statistics of said distribution (mean and variance in our case), which introduces an additional error as these measures do not retain full information about the BNN posterior. The latter error propagates as we apply GP classification for error detection, which produces another approximate solution.
However, it is not to say that the Bayesian approach does not work well overall. Indeed, Bayesian NPM is capable of recognizing always at least the 85% of the prediction errors and the accuracy of the predictive monitor is always well above 98%. Moreover, we expect the Bayesian solution to work better in settings with noise and partial observability, settings where the Bayesian approach should provide a more robust performance than the frequentist one.
Another interesting aspect is that active learning seems to enhance the classifier confidence in its predictions, as demonstrated by an improved detection rate and a sensibly reduced rejection rate. This is the main advantage of active learning, besides providing a higher state classifier accuracy.
In summary, the main conclusions from our experimental analysis are: -Our reachability predictors attain high accuracies, consistently above 97.37% (above 99.43% for the frequentist case).
-The frequentist approach overall outperforms the Bayesian ones in all metrics and configurations, followed by VI. -Error detection rates stay approximately constant after retraining (active or passive). The frequentist approach achieves staggering performance on this metric. -The benefits of active learning are visible from an overall reduction of the rejection rate and an overall increase in the NSC accuracy.

CP over the error detection rule
Recall that our frequentist error detection rule G f builds on CP to quantify the reliability of the NSC predictions. In principle, CP can be applied to derive prediction regions with statistical guarantees to any supervised learning model. The SVC G f for error detection is no exception.
In this experiment, we show that we can apply CP to produce prediction regions Γ ε for G f , which, by definition, contain the correct rejection decision with probability 1 − ε. For this purpose, we apply CP to the SVC G a f obtained after one active learning iteration. In particular, we derive from Γ ε a so-called risky rejection strategy, aimed at reducing the rejection rate: we reject only those points by which the prediction region for G a f contains only class 1, that is, when rejecting is the only plausible decision (according to CP). We report the results for two case studies, the helicopter and the artificial pancreas, to show how the performances of the risky rejection strategy compare to the ones obtained via active learning. These two models are representative of cases where active learning could sensibly reduce the rejection rate (artificial pancreas) and where instead it could not (helicopter).
Note that applying CP to G a f requires a new calibration set, i.e., a set of points from W a v not used to train G a f but rather to calibrate its predictions. The CP framework needs a few further adjustments, discussed in detail in Appendix A.
Choosing an optimal value for ε, i.e., one that yields high detection rates and low rejection rates, is non-trivial and requires problem-specific tuning. In Fig. 4, we compare the above introduced risky strategy against the initial one at different ε levels and after one active learning iteration (results are reported only for the HC and AP case studies). We observe that, with a properly tuned ε, we can achieve the same detection rate of the initial approach (this occurs for ε ∈ [0.057, 0.1] in the AP model, and ε ∈ [0.03, 0.11] in the HC model), but at the same time a lower rejection rate. For instance, a sweet spot that most reduces the rejection rate without sacrificing detection is ε = 0.03 for the HC and ε = 0.057 for the AP. Rejection rate and recognition rate of initial rejection rule (initial), the rule obtained after one active learning iteration (active), and the "risky" rule obtained by applying CP to the latter (risk)

Related work
A number of methods have been proposed for online reachability analysis that rely on separating the reachability computation into distinct offline and online phases. However, these methods are limited to restricted classes of models [3,40], or require handcrafted optimization of the HA's derivatives [41], or are efficient only for low-dimensional systems and simple dynamics [42].
In contrast, NSC [6] is based on learning DNN-based classifiers, is fully automated and has negligible computational cost at runtime. In [43,44], similar techniques are introduced for neural approximation of Hamilton-Jacobi (HJ) reachability. Our methods for prediction rejection and active learning are independent of the class of systems and the machinelearning approximation of reachability, and thus can also be applied to neural approximations of HJ reachability.
In [45], Yel and others present a runtime monitoring framework that has similarities with our NPM approach, in that they also learn neural network-based reachability monitors (for UAV planning applications), but instead of using, like we do, uncertainty measures to pin down potentially erro-neous predictions, they apply NN verification techniques [46] to identify input regions that might produce false negatives. Thus, their approach is complementary to our uncertaintybased error detection, but, due to the limitations of the underlying verification algorithms, they can only support deterministic neural networks with sigmoid activations. On the contrary, our techniques support any kind of ML-based monitors, including probabilistic ones.
The work of [47,48] addresses the predictive monitoring problem for stochastic black-box systems, where a Markov model is inferred offline from observed traces and used to construct a predictive runtime monitor for probabilistic reachability checking. In contrast to NSC, this method focuses on discrete-space models, which allows the predictor to be represented as a look-up table, as opposed to a neural network.
In [49], a method is presented for predictive monitoring of STL specifications with probabilistic guarantees. These guarantees derive from computing prediction intervals of ARMA/ARIMA models learned from observed traces. Similarly, we use CP which also can derive prediction intervals with probabilistic guarantees, with the difference that CP supports any class of prediction models (including autoregressive ones). In [50], model predictions are used to forecast future robustness values of MTL specifications for runtime monitoring. However, no guarantee, statistical or otherwise, is provided for the predicted robustness. Deshmukh and others [51] have proposed an interval semantics for STL over partial traces, where such intervals are guaranteed to include the true STL robustness value for any bounded continuation of the trace. This approach can be used in the context of predictive monitoring but tends to produce overconservative intervals.
A related approach to NSC is smoothed model checking [52], where Gaussian processes [32] are used to approximate the satisfaction function of stochastic models, i.e., mapping model parameters into the satisfaction probability of a specification. Smoothed model checking leverages Bayesian statistics to quantify prediction uncertainty, but faces scalability issues as the dimension of the system increases. In contrast, computing our measure of prediction reliability is very efficient, because it is nearly equivalent to executing the underlying predictor.
In the field of computer security, a machine learning-based method for malware detection [27] is conceptually very similar to our NPM method. The authors develop a tool for assessing the performance of a classifier using the statistical guarantees of conformal predictions with a self-trained mechanism to filter out unreliable classification decisions.
Literature on uncertainty-based active learning in deeplearning models is small and sparse, mainly because deep learning methods rarely represent model uncertainty, and state-of-the-art deep learning techniques require large amounts of data, which makes active learning impractical. Several uncertainty-based acquisition functions (i.e., functions used to rank the informativeness of new observations for active learning) are reviewed in [21]. In [53], a Deep Ensemble active learning method is proposed, where uncertainty is estimated from a stochastic ensemble of BNN models (obtained via MC-Dropout, another approximate Bayesian inference technique). Some applications of BNN in active learning are presented in [54,55]. In these works, however, the decision threshold used to identify informative samples is always chosen empirically. An important contribution of our work is the automatic tuning of the decision rule.
A basic application of conformal predictors in active learning is presented in [56]. Our approach introduces three important improvements: a more flexible and meaningful combination of confidence and credibility values, automated learning of rejection thresholds (which are instead fixed in [56]), and refinement of the query strategy.
In [29], we presented a preliminary version of the frequentist approach. In [10], we added to it an automated and optimal method to select the rejection thresholds and an active learning framework.

Conclusion
We have presented neural predictive monitoring, an approach for runtime predictive monitoring of hybrid systems that complements reachability predictions with principled estimates of the prediction uncertainty. NPM uses these estimates to derive optimal rejection criteria that identify potentially erroneous predictions without knowing the true reachability values. We have further designed an active learning strategy that, leveraging such uncertainty-based rejection criteria, increases the accuracy of the reachability predictor and reduces the overall rejection rate. Our approach overcomes the computational footprint of reachability checking (infeasible at runtime), while improving on traditional runtime verification by being able to detect future violations in a preemptive way.
We have devised two alternative solution methods for NPM. The first one follows a frequentist approach, with state classifiers expressed as deterministic DNNs and rejection rules expressed as support vector classifiers, where the rejection rules are optimized to detect unreliable predictions from uncertainty measures derived via Conformal Prediction. The second one follows a Bayesian approach, with a probabilistic state classifier based on Bayesian neural networks, rejection rules given as Gaussian process classifiers, and uncertainty measures extracted from statistics of the BNN predictive distribution.
The strengths of our NPM technique are its effectiveness in identifying and rejecting prediction errors and its compu-tational efficiency: executing the classifier and the rule take on the order of milliseconds. NPM's efficiency is not directly affected by the complexity of the system under analysis but only by the complexity of the underlying learning problem and classifier.
Our experimental evaluation demonstrates that the frequentist approach outperforms the Bayesian one: the state classifier is simpler to train and faster to evaluate, and the error detection criteria are more accurate. Regarding BNN inference, we found that VI scales better than HMC with respect to the dimension of the system. The assumptions on the prior and the necessary tuning of hyperparameters represent important drawbacks of the Bayesian techniques, which, however, tend to be more consistent than deterministic models in regions with no data observed: here the posterior distribution will typically have high variance.
Among directions for future work, we plan to extend our approach to support more complex and real-world systems that include noise and partial observability, settings where the Bayesian approach can potentially provide more robust performance than the frequentist one.
is not in the prediction set. The validity property, as stated above, guarantees an error rate over all possible labels, not on per-label basis. The latter can be achieved with a CP variant, called label-conditional CP, which is in turn a variant of the Mondrian CP approach. The only change is in the calculation of the p values. The p value associated with class y j on a test point x * is defined as: In words, we consider only the α i corresponding to examples with the same label y j as the hypothetical label that we are assigning at the test point.
Label-conditional validity [57] is extremely important when the CP is applied to an unbalanced dataset, as in our case W v . It has been shown empirically, that, with the plain validity property, the overall error rates tend to the chosen significance level, but the minority class are disproportionally affected by errors. The Mondrian approach ensures that, even for the minority class, the expected error rate will tend to the chosen significance level ε. We refer the reader to the existing literature [58,59] for further details.

A.2 SVC non-conformity measure
CP relies on the definition of a non-conformity measure, which captures the extent to which a given input data conforms to the associated class. Every classification algorithm has a specific non-conformity measure, that makes the CP algorithm perform well in each framework. SVC is a kernelbased method that transforms the original data by mapping them into a new space, called feature space, via a feature map φ(x). By doing so, patterns that are not linearly separable, can be converted to be linearly separable in the feature space [20]. The linear decision boundary for a binary SVC is defined as d(x) = a · φ(x) + b = 0, for x in the original space. Support vectors are the data points that lie closest to the decision hyper-plane. SVC maximizes the margin around the separating hyperplane and the decision function, which depends only on the support vectors. For ease of notation, we are assuming Y = {−1, 1}, rather than {0, 1}. The distance of a point x * from the separating hyperplane of the SVC is given by: where d(x * ) is SVC decision function d(·) evaluated in x * and ||a|| is the weighted sum of the support vectors. Then, Fig. 5 Results as in Figure 4 for the IP model. A good choice for ε is around 0.08 the distance to the margin boundary of the class under consideration is given by: The non-conformity measure is thus defined as Such definition is presented in [60]. In case of transductive CP (TCP), the NCM can be derived directly from the value of Lagrange multipliers associated with the support vectors, as proposed in [59].

Remark 5
When CP are applied to the SVC, the input space is the uncertainty domain U, rather than X. However, for ease of presentation, we stick to the p values notation introduced in Section 3.1.
Proof Suppose by contradiction that Γ ε * = {y j 1 } and y j 1 = F(x * ) = y j 2 . Then, by Equation 5, this implies that p j 1 * > ε and that p j 2 * ≤ ε, i.e., p j 1 * > p j 2 * . In turn, this implies that the corresponding NCF scores are such that α j 1 * ≤ α j 2 * (the inequality is not strict due to the tie-braking factor θ in Equation 4). But according to the definition of our NCF function (6), this means that f y j 1 (x * ) ≥ f y j 2 (x * ), i.e., that the likelihood of the non-predicted class y j 1 is not below than that of the predicted class y j 2 , which, by Definition 4 and the assumption of well-formed discriminant, is a contradiction.

Appendix C Models and case studies
We briefly introduce the case studies used in our experimental evaluation. Spiking Neuron. This model describes the evolution of a neuron's action potential. It is a deterministic HA with two continuous variables, one mode, one transition and nonlinear polynomial dynamics. We consider the unsafe set D defined by v 2 ≤ 68.5, expressing that the neuron should not undershoot its resting potential. The time bound for the reachability property is T = 20. Inverted Pendulum. We consider the classic inverted pendulum on a cart nonlinear system. We consider the unsafe set D defined by |θ | > π/4, corresponding to the safety property that keeps the pendulum within 45 • of the vertical axis. The time bound is T = 5. Artificial Pancreas For the AP model, the unsafe set D corresponds to hypoglycemia states, i.e., D = BG ≤ 3.9 mmol/L, where BG is the blood glucose variable. The state distribution considers uniformly distributed values of plasma glucose and insulin. The insulin control input is fixed to the basal value. The time bound is T = 240. Triple Water Tank For the TWT model, D is given by states where the water level of any of the tanks falls outside a given safe interval I , i.e., D = ∨ 3 i=1 x i / ∈ I , where x i is the water level of tank i. The state distribution considers water levels uniformly distributed within the safe interval. The time bound is T = 1. Cruise Control. It is a non-deterministic HA with 3 continuous variables, 6 modes, 11 transitions, and nonlinear polynomial dynamics. The unsafe set D is defined by v ≤ −1, which expresses that the vehicle's speed should not be below a reference speed by 1m/s or more. The reachability time bound is T = 10. Helicopter Controller. We augment the 28-variable helicopter controller available on SpaceEx website 9 with a variable z denoting the helicopter's altitude. The dynamics of z is given byż = v z , where v z is the vertical velocity and represented by variable x 8 . The unsafe set D is defined by z ≤ 0. The time bound is T = 5. Since this model is large and publicly available on SpaceEx website, we do not provide the details here.

C.1 Spiking neuron
We consider the spiking neuron model on the Flow* website 10 . It is a hybrid system with one mode and one jump.  The dynamics is defined by the ODE v 2 = 0.04v 2 2 The jump condition is v 2 ≥ 30, and the associated reset is v 2 := c ∧ v 1 := v 1 + d, where, for any variable x, x denotes the value of x after the reset.
The parameters are a = 0.02, b = 0.2, c = −65, d = 8, and I = 40 as reported on the Flow* website. We consider the unsafe state set D = {(v 2 , v 1 ) | v 2 ≤ 68.5}. This corresponds to a safety property that can be understood as the neuron does not undershoot its resting-potential region of [−68.5, −60]. The domain for sampling is 68.5 < v 2 ≤ 30 ∧ 0 ≤ v 1 ≤ 25. The time bound for the reachability property was set to T = 20.

C.2 Inverted pendulum
We consider the control system for an inverted pendulum on a cart. This is a classic, widely used example of a nonlinear system. As shown in Fig. 6, the control input F is a force applied to the cart with the goal of keeping the pendulum in upright position, i.e., θ = 0. The dynamics is given by J ·θ = m · l · g · sin(θ ) − m · l cos(θ ) · F (28) where J is the moment of inertia, m is the mass of the pendulum, l is the length of the rod, and g is the gravitational acceleration. We set J = 1, m = 1/g, l = 1, and let u = F/g. Eq. 28 becomes We consider the control law of Eq. 30. Figure 7 shows an evolution of θ under this control law. We consider the unsafe state set D = {(θ, ω) | θ < −π/4 ∨ θ > π/4}. This unsafe region corresponds to the safety property that keeps the pendulum within 45 • of the vertical axis. The domain for sampling is θ ∈ [−π/4, π/4] ∧ ω ∈ [−1.5, 1.5]. We used time bound T = 5.

C.3 Cruise control
The cruise control is a non-deterministic HA with three continuous variables, six modes, eleven transitions, and nonlinear polynomial dynamics. It is shown in Fig. 8. The continuous variable v denotes the difference between the vehicle's speed and the cruise speed in m/s, x is the integral term for the proportional-integral (PI) controller in mode 5, and t is a clock. In mode 5, the PI controller tries to stabilize v to zero, i.e., to match the vehicle's speed with the cruise speed. Modes 3 and 4 represent the first level of brakes where deceleration increases smoothly from 1.2 to 2.5 m/s 2 in mode 4 and stays constant at 2.5 m/s 2 in mode 3. Modes 1 and 2 represent the second level of brakes and work in the same way but with higher starting and peak deceleration. Mode 6 constantly accelerates the vehicle. The guards are designed to prevent chattering or Zeno behavior. The unsafe set D is defined by v ≤ −1, which expresses that the vehicle's speed should not be below a reference speed by 1 m/s or more. The reachability time bound is T = 10 ( Fig. 9).