Unified Framework and Survey for Model Verification, Validation and Uncertainty Quantification

Simulation is becoming increasingly important in the development, testing and approval process in many areas of engineering, ranging from finite element models to highly complex cyber-physical systems such as autonomous cars. Simulation must be accompanied by model verification, validation and uncertainty quantification (VV&UQ) activities to assess the inherent errors and uncertainties of each simulation model. However, the VV&UQ methods differ greatly between the application areas. In general, a major challenge is the aggregation of uncertainties from calibration and validation experiments to the actual model predictions under new, untested conditions. This is especially relevant due to high extrapolation uncertainties, if the experimental conditions differ strongly from the prediction conditions, or if the output quantities required for prediction cannot be measured during the experiments. In this paper, both the heterogeneous VV&UQ landscape and the challenge of aggregation will be addressed with a novel modular and unified framework to enable credible decision making based on simulation models. This paper contains a comprehensive survey of over 200 literature sources from many application areas and embeds them into the unified framework. In addition, this paper analyzes and compares the VV&UQ methods and the application areas in order to identify strengths and weaknesses and to derive further research directions. The framework thus combines a variety of VV&UQ methods, so that different engineering areas can benefit from new methods and combinations. Finally, this paper presents a procedure to select a suitable method from the framework for the desired application.


Introduction
Simulation is getting more and more important in the development and testing process across many engineering fields. It has a long-standing history in numerical simulation areas such as finite element methods (FEM) and computational fluid dynamics (CFD). It is rooted in the development process of complex systems such as aircrafts, trains and vehicles in order to save time and costs and develop high-quality products [39]. Simulation has found its way into the testing process and regulatory approval, for example in the research field of safeguarding autonomous vehicles [149,181].
Since simulation plays a key role in many engineering fields, accurate models of reality become a decisive factor for the credibility and trustworthiness of simulation, especially in safety-critical areas. Model-based activities such as model verification and validation (V&V) as well as uncertainty quantification (UQ), together referred to as VV&UQ, assess errors and uncertainties inherent in any simulation model. An excellent introduction into VV&UQ can be found in [19,31,129].
A large number of publications dealing with aspects of VV&UQ have been published in recent years. The following six heterogeneous examples represent main approaches in their respective fields. They will be shortly introduced here and explained later in more detail with an analysis of their strength and weaknesses. Oberkampf and Roy [133] present the so-called Probability Bound Analysis (PBA), which uses Frequentist statistics and imprecise probabilities [22] to bound model predictions. Sankararaman

3
and Mahadevan [160] analogously describe a Bayesian approach. Both VV&UQ frameworks quantify and aggregate all sources of uncertainties and are frequently used in numerical simulations. Crespo et al. [35] propose Interval Predictor Models (IPMs) that directly predict intervalvalued quantities to bound all future experiments within them. Hills [84] propose a meta-model approach to correct the erroneous model predictions through a data-driven model. In contrast to numerical simulations, complex systems usually rely on conventional deterministic V&V methods. For example, model validation in automotive and railway vehicle simulations is often based exclusively on a tolerance approach. In this approach, the deviation between simulation and experiment is determined and compared with an allowed tolerance. Eek et al. [56] describe an output uncertainty approach by making simplifications to PBA to enable an application to complex aircraft simulations.
In addition to the heterogeneous VV&UQ landscape across many engineering fields, there are still general challenges that need to be solved by the entire research community. Roy [154], Schroeder and Mullins [164], Mullins et al. [128] agree that one of the main current challenges is how to aggregate model uncertainties. The uncertainties are often identified in calibration and validation experiments, but the actual model predictions will take place under untested application scenarios, which may differ significantly from the experiments. In addition to such a challenging extrapolation in the scenario space, there are application fields such as nuclear reactor safety or spacecrafts, which require an extrapolation in the system hierarchy. They may only offer experiments at component-level due to safety or costs, but the entire model will be used for predictions at systemlevel. Furthermore, complex systems are often dynamic and show time-variant behavior, which poses difficulties for aggregation.
There are survey papers that provide an overview about certain aspects of VV&UQ. For example, Funfschilling and Perrin [66] summarize UQ methods in rail vehicle simulations and Durst et al. [47], Sargent and Balci [161] give an historical review of V&V methods. However, none of these papers gives an overview about multiple engineering fields, focuses on the aggregation of errors and uncertainties and develops a novel framework to unite the research activities across the different fields.
The main contributions of this publication are a 1. novel modular and unified framework for aggregation of errors and uncertainties shown in Fig. 1 with 2. formalization in a unified mathematical notation, 3. comprehensive survey and literature review about VV&UQ methods across many application fields with 4. integration into our unified framework, and 5. detailed comparison of main approaches and different application fields to derive future research directions.
In the following sections we describe our framework in Fig. 1 step-by-step. Section 2 begins with the domains of the framework, which reflect a model-based process including typical activities, errors and uncertainties. Section 3 describes continuously the columns of the framework with unique interface descriptions for each block. Section 4 zooms into the simulation model block of the framework to show various manifestations including deterministic, nondeterministic, hierarchical, dynamical and formal models. Section 5 extends the manifestations of the simulation model block to the entire framework with its manifestations. Section 6 focuses on aggregation methods of errors and uncertainties, which currently pose a great challenge. Section 7 gives a comprehensive survey with many references and examples across several engineering fields. Section 8 discusses and compares the strength and weaknesses of the different aggregation methods and draws conclusions. Finally, the conclusion in Sect. 9 summarizes the most important research findings. In summary, we introduce a new framework that includes the majority of the individual approaches as a subset. Therefore, in Sects. 2 and 3 we will successively describe the framework first and then use it as a taxonomy in Sects. 4 and 6 to modularize main VV&UQ approaches and to classify them within it. This will enable a variety of new combinations in the future. In case approaches skip parts of the framework or in exceptional cases deviate from it, this will be mentioned in the text. Sections 7 and 9 contain examples, method comparisons and conclusions across different application fields. The paper thus offers comprehensive information on VV&UQ, including the latest developments, in one central location.

Model-Based Process
The framework in Fig. 1 represents a model-based testing process. The simulation model must be developed in advance by creating a conceptual model, deriving a mathematical model with equations from it, and implementing them as a computer code to obtain a computational model for simulation [133, p. 14]. Then, model verification assesses the numerical solution of the computer code, model calibration estimates the model parameters, model validation assesses the model-form by comparison with physical tests and finally the "validated" model can be used for prediction under new application conditions. The framework represents each of the four model-based activities with an own domain to reflect the model-based process from top to bottom. The activities address various types and sources of errors and uncertainties that are unavoidable in a computer simulation. This section introduces terminology and notation, types and sources of error and uncertainties as well as model-based activities with the corresponding domains. For more detailed information, the interested reader is referred to [133]. Some of the following explanations are based on this comprehensive introduction book. Model-based framework. In principle, the process runs from 1 to 4. However, there may be loops, especially from the validation domain to the calibration domain, if the parameters need to be adjusted again. The inferred errors of the three stacked blocks can be merged in the error integration. The back-most of the three stacked system blocks, which refers to the verification domain, is actually not a system, but a mathematical model. (Color figure online)

Terminology
Oberkampf and Roy [133, Chap. 2.1-2.2] collect definitions for model-based activities across different communities and extend them. In the following we list the relevant ones for our framework.
-Model verification is "the process of determining that a model implementation accurately represents the developer's conceptual description of the model".
-Code verification is "the process of determining that the numerical algorithms are correctly implemented in the computer code and of identifying errors in the software". -Solution verification is "the process of determining the correctness of the input data, the numerical accuracy of the solution obtained, and the correctness of the output data for a particular simulation".
-Model calibration is "the process of adjusting physical modeling parameters in the computational model to improve agreement with experimental data". -Model validation is "the process of determining the degree to which a model is an accurate representation of the real world from the perspective of the intended uses of the model". -Model prediction deals with "interpolating or extrapolating the model beyond the specific conditions tested in the validation domain to the conditions of the intended use of the model".
In accordance with [133, 135, 179, p. 88], we call the input conditions of a simulation or experiment a scenario. We summarize all conditions in a domain, sometimes also referred to as regime or space. This results in the application domain [133,Fig. 2.10] with application scenarios for model prediction and in the validation, calibration and verification domain with scenarios, respectively. We refer to the outputs of simulation and experiment as responses or, for emphasis, as response quantities of interest (QoIs) [135]. Various terms are used to ref lect a deviation. Oberkampf and Roy [133] use the terms error, difference, bias and mismatch. Kennedy and O'Hagan [104] use the terms residual, inadequacy and discrepancy. These terms are all aimed at assessing the accuracy [133] or fidelity [15] of the simulation. We usually stick to the term error.
In addition, VV&UQ approaches should also assess the robustness of the model's fidelity given uncertain parameters [15,141,150].

Types of Errors and Uncertainties
An error reflects the deviation to the true value. If errors can not be quantified exactly, uncertainties arise. They can be either aleatory, epistemic or mixed. Epistemic uncertainties arise from lack of knowledge and can be estimated or even reduced by gaining knowledge. Aleatory uncertainties refer to stochastic effects and variability and can only be quantified.
Deterministic, precisely known quantities without uncertainty are described by point values. Aleatory uncertainties can be described by probability distributions in form of probability density functions (PDFs) or cumulative distribution functions (CDFs), epistemic uncertainties by intervals and mixed uncertainties by a family of distributions or imprecise probabilities. Probability boxes (p-boxes) are a mixture of intervals and probabilities by adding a width to a CDF.

Mathematical Notation
We denote a point value as x, as usual a PDF as f(x) and a CDF as F(x), an interval as I(x) and a p-box as B(x). We usually omit the random variable X in f X (x) or F X (x) for the sake of consistency and readability. Strictly speaking, multiple repetitions of an experiment yield an empirical distribution function (EDF) for each quantity. The EDF has steps compared to the smooth true distribution function. This also applies to the simulation, but there the steps are very fine due to many samples.
Left and right limits x and x bound the interval in set-builder notation. In analogy, CDFs form the left limit F(x) and right limit F(x) to bound the p-box [22,Eq. (12)] We visualize the different mathematical structures in Fig. 2.
A point value can also be interpreted as a degenerate interval or as a degenerate probability distribution and all of them as a degenerate p-box. We introduce this uniform notation with x, I(x), F(x) and B(x) so that the equations of the framework will be generically applicable. An overview about all quantities can be found directly in the framework in Fig. 1 for the deterministic manifestation (x) as representative and in the following figures for the remaining manifestations. Since identical quantities will occur in different domains, we will use a symbol in the upper index of a quantity for distinction ( x v , x a ). A collection of the quantities for all input scenarios is denoted by the data tuple D(x). (1)

Sources of Errors and Uncertainties
Simulations and experiments involve many errors e compared to nature's true behavior [7,Eq. (1-5-6)] We denote the inputs as x, the model parameters as , the step size as h, the responses as y, the real system as g s and the simulation model as g m . Regarding the errors, e y,obs stands for the observation errors of the responses, e m for model-form errors, e x for input errors, e for parameter errors and e h for discretization errors on behalf of all kinds of numerical errors.
Equation (3) focuses on experimental errors, whereas (4) focuses on model errors. The responses of the system y s = g s (x) and model y m = g m (x, , h) are known and result in a total error e. However, the individual sources of errors are unknown and might reinforce or compensate each other depending on the specific input conditions. The latter can cause a misleading trustworthiness in simulation. Therefore, it is crucial to separately quantify all sources of errors during the different model-based activities.
If the errors cannot be determined exactly, corresponding uncertainties must be considered. However, this requires a probabilistic representation. Assuming there are uncertain inputs and parameters and a total error e resulting from (3) and (4). Then a variance decomposition based on the law of total variance [48,Eq. (3,4)] yields two summands. The first one contains the variance of the simulation, the second that of the total error. Input uncertainty quantification and propagation aims at quantifying the first summand, while model validation mainly aims at quantifying the second summand. Since both individually under-approximate the total prediction uncertainty, all uncertainties must be aggregated [48].

Model-Based Activities
Typical model-based activities focus on individual sources of errors and uncertainties in order to quantify them accurately. Table 1 summarizes the corresponding relationships. The sources of uncertainties have different uncertainty types and occur in different domains. Model verification, calibration, validation and prediction are represented each with an own domain. The other activities occur in multiple domains or transfer errors and uncertainties between domains.

Model Verification
Model verification should definitely be carried out by the tool manufacturer before model calibration, validation and prediction. Solution verification identifies numerical errors such as the discretization error [

Experiments
Experiments always contain a certain natural variability and aleatory uncertainty. Measurement systems are required to observe nature's true behavior, but introduce a bias error and aleatory uncertainty itself. The response quantities are observed during experiments for comparison with simulation. Regarding the observation of input quantities for resimulation, Mullins et al. [126] distinguish fully, partially and un-characterized experiments. The former reports the inputs as precisely known point values, the latter reports at most the entire domain. Partially characterized experiments are placed in between and contain point values, probabilities or intervals. Observation errors such as e y,obs should be reduced as much as possible because they interfere with the quantification of input, parameter and response quantities and other sources of errors. Regarding the quantification of measurement uncertainties, the interested reader is referred to traditional standards [98] and more sophisticated methods based on statistical Design of Experiments (DoE) [147].

Input Uncertainty Quantification
Deterministic simulations often assume fully characterized experiments or neglect the resulting input and parameter errors e x and e . However, this assumption does not hold for partially and un-characterized experiments and also not for model predictions in the application domain, if the conditions are unknown in advance. Therefore, non-deterministic simulations rigorously quantify the input and parameter uncertainties in each domain and propagate them by the .
simulation model to obtain the resulting output uncertainties [48]. We dedicate Sect. 4.1.3 to UQ of non-deterministic simulations.

Model Calibration
In model calibration and validation, (3) and (4) must be compared with each other. Model calibration infers the parameters from the comparison so that the deviations are minimized: The framework in Fig. 1 shows an inverse (orange) arrow pointing back to the model parameters to emphasize that inverse methods are used for calibration. In a broader sense, direct measurements for individual parameters can also be performed for parameter identification in the "calibration" domain. This should even be preferred before inverse calibration methods are used [133, p. 45].

Model Validation
A possible combination with model calibration makes it very important that an independent authority assesses the model error using a separate validation data set [133, p. 49]. The model validation of deterministic simulations quantifies the remaining total error e v . The model validation of non-deterministic simulations considers input and parameter uncertainties to better isolate the epistemic model-form uncertainty in I(e v ).

Model Prediction
Model prediction will occur in the application domain under conditions, which the real system will encounter after approval. The great advantage of the model-based process is that the required system characteristics can be extensively evaluated with the "validated" simulation model before approval and without having to perform further real tests in the application domain. For this reason, the system in the application domain is shown in gray and dashed in the framework. However, this is only possible, if the model prediction is accompanied by an aggregation of all errors and uncertainties. (12) arg min (g m (x, ) − g s (x)).

Interpolation and Extrapolation Uncertainties
If the input conditions of a validation scenario x v vary from a future application scenario x a , that introduces interpolation and extrapolation uncertainties. Figure 3 illustrates this with an example. In general, different inter-domain constellations are conceivable. The validation and application domain may be congruent, one may be a subset of the other, there may be an intersection, or both may even be completely separate [133, Fig. 2.10]; [49, Fig. 2]. The latter causes a large extrapolation uncertainty and should be avoided whenever possible.

Aggregation of Errors and Uncertainties
The big challenge is that the sources of errors and uncertainties affect each other and should be quantified separately [59]. To enable the aggregation of all sources of errors and uncertainties for model prediction, connections from each domain to the application domain are required. A direct connection can be seen in the vertical error pipeline of the framework. In addition, there are also indirect connections to the application domain. Model verification can adjust the model-form and calibration the model parameters [160]. Both are indicated by the inverse arrows in the framework. Since the simulation model is part of each domain, this affects all subsequent domains in the modelbased process including the actual application domain. We dedicate the separate Sect. 6 to the aggregation of errors and uncertainties.

Framework Blocks
This section describes each block of the framework. The sub-sections are ordered column-wise from left to right, since many blocks occur several times in different domains. In case of overlaps between the verification, calibration and validation domain, we use the validation domain as a representative example in the remaining paper to avoid unnecessary repetitions. For each block, we add mappings from inputs to outputs, since clear interface descriptions are a key requirement for a modular framework. We will start, as generically as possible, with the deterministic manifestation of the framework including errors and will extend it in the following main sections to non-deterministic models with uncertainties and to hierarchical, dynamical and formal models.

Scenarios
Within the different domains, a test scenario x is input to the system and model and influences their behavior. We aggregate all N D scenarios of a domain in a data tuple to preserve the order for assignment to the corresponding scenario results later. The N D scenarios are composed of N D,d distinct ones to cover the scenario space and N D,r repetitive ones to consider the variability. The six orange validation scenarios and 18 blue application scenarios in Fig. 3 are examples for the former type, whereas the latter means to repeat the same orange validation scenario. There is one scenario tuple for each domain D(x n ) , D(x c ) , D(x v ) and D(x a ) . The amount of scenarios N D varies between the different domains. The number of application scenarios is often significantly larger than the number of calibration and validation scenarios to reduce the physical testing effort and legitimize the model-based process.

System and Model
Both the real system, denoted s, and the corresponding simulation model, denoted m, with model parameters encounter an input scenario x and show a certain response behavior y. Generally speaking, both perform a mapping g from inputs x to responses y: Strictly speaking, the experimentalist can just control a subset x con of all relevant scenario inputs x, but the uncontrolled ones x unc also affect the system behavior. This results in a natural variability in the repetition of real experiments. Besides, since a model is just a simplified abstraction of the real system, the modeler just considers a subset x mod of the scenario quantities that actually affect the system behavior. The complementary quantities x unm are either unknown or neglected by the modeler: In the calibration and validation domain, the physical tests are usually carried out first. During these tests, the input variables x v mod that the simulation model requires should be measured in addition to the response variables y v s . This enables a re-simulation under similar conditions as in the physical tests. In the application domain, the selected scenarios are directly forward to the simulation tool and an actual model prediction is performed.

Application Assessment
Many application fields perform a post-processing of model and system results after the actual simulations and tests and before decision making. The framework reflects this in the application assessment column.

Application Assessment in the Application Domain
The assessment depends on the respective application and can vary greatly. Typical examples are filtering of signals or extraction of characteristic values to obtain the actual QoIs. There are also applications that do not perform post-processing at all. Nevertheless, we want to offer it as an option in the framework. We therefore refrain from introducing a new symbol for the assessment and stick to the established y. This means that the assessment results are contained in the QoI y.

Application Assessment in the Validation Domain
We use the application assessment not only for the simulation model in the actual application domain, but also for model and system within the other domains. This is consistent with the nature of model validation, since there is no generic model validity, but always a validity with respect to a use case. In model validation, checking the identical QoI that will later be used for decision-making in the application domain is the best basis for trustworthiness of simulation [92, 180].

Macroscopic Application Assessment
Typically, the responses are only assessed for each individual scenario x independently. We call this a microscopic assessment and omit the term microscopic in text, figure and equations for simplicity, since it is the default case. Still, there are some applications that also make a macroscopic statement about all scenarios D(x a ) in the application domain together [99]. The decomposition into a microscopic and (18) g s ∶ (x con , x unc ) ↦ y s , x = x con ∪ x unc (19) macroscopic assessment has a big advantage, especially if the the macroscopic assessment is based on all estimated binary microscopic decisions D(d a ) (Sect. 3.5). Then model validation can intervene at this interface and ensure that all binary decisions are correct, so that the macroscopic statements do not need to be validated directly. For example, a macroscopic assessment can weight all binary microscopic decisions with the importance of the corresponding scenarios and aggregate them to an overall risk measure.

Error Pipeline
In model calibration and validation, the simulation result y m is compared with the experimental result y s as a reference.
If the system result y s is unknown, the simulation result y m and the error e must be known to determine it: The two very simple equations are visualized in Fig. 4, as they represent the basic building block of the error pipeline in the framework. Since the error can only be calculated in the (calibration or) validation domain, but the system result is unknown in the application domain, we need to separate both equations: It gets obvious that both errors appear in different domains and an interpolation and extrapolation from x v to x a is required. Thus, we extend the main principle in Fig. 4 with   the two framework blocks error learning and inference. The former tries to learn a data-driven behavior model of the errors within the validation domain so that the latter can perform a prediction in the application domain. The greater the difference between the domains and thus the greater the extrapolation uncertainties, the more important the two blocks are. In addition, we replace both subtractions with a validation metric (error calculation) and an error integration to get more flexibility.

Metric
Since the quantities can have different mathematical representations such as time signals or probabilities, the simple subtraction of point values will not be sufficient as metric in Sect. 5.2. Generally speaking, a calibration or validation metric takes the responses from system and model and calculates an error. The verification metric uses the mathematical model or very accurate computational models as reference instead of the system:

Error Learning
From an abstract point of view we propose to decompose the modeling task in a combination of a physics-based modeling of the system behavior with a data-driven modeling of the error [133, p. 657]; [104]. This is more robust than the pure modeling of the system behavior. The errors typically fluctuate to a small extent around the nominal physics-based simulation model as stable baseline. The decomposition is especially beneficial if independent authorities perform the model validation and the intellectual property (IP) of the nominal model must be protected. The error learning intends to model the errors across the validation domain. It takes the data tuple D(x v , e v ) of scenario and error pairs as training data set to learn a data-driven model g va inf .

Error Inference
After learning the data-driven error model in the validation domain, it can be applied for inference (prediction) of the error in the application domain. With ê va we emphasize that this is an estimate of the true error by extrapolation from the validation to the application domain.

Error Integration
In the application domain, the responses from model prediction are available as well as the inferred errors from model verification, calibration and validation. The uncertainty integration aggregates all of them to estimate the actual behavior of the real system in the application domain. In Sect. 6 we will distinguish different integration techniques.

Decision Making
In decision making, results are compared with thresholds t to obtain Boolean values that indicate whether the results are good enough. The decision making in the validation domain compares an error with a permissible tolerance (threshold, model accuracy requirement) t v e . The comparison can be done as shown with the error and preliminary requirements in the validation domain or with the extrapolated error and final requirements in the application domain [133, p. 478]. The tolerance comparison offers the possibility to integrate expert knowledge into the model-based process. However, only using the tolerance approach without aggregation of errors and uncertainties in the application domain is very dangerous, since the tolerances are subjective and might insinuate misleading trustworthiness. The binary microscopic validation decisions can be combined to a macroscopic statement. For example, Viehof [185, Chap. 4.4.6.1] calculates two validation scores based on relative and absolute frequencies of the binary results.
The decision making in the application domain compares the assessment results with thresholds specified by standards, regulations or internal management. It is very important to mention that many references use the simulation results y a m for decision making without any model validation [149]. However, the estimated system behavior ŷ a s considering errors and uncertainties should be used instead. Macroscopic application decision making works analogously.

Manifestations of Simulation Models
So far, we have presented the framework as generically as possible and provided clear interface descriptions to ensure the desired modularity. In this section we will introduce different manifestations of the simulation model block in the framework.

(Non-)Deterministic Models
At first we will distinguish deterministic point predictions from non-deterministic ones according to Fig. 2. The term non-deterministic includes probabilistic predictions, interval predictions or a mixture of both.

Deterministic Models
Deterministic simulations are point predictions that calculate a result for a single scenario without uncertainties [48]. The framework can be directly applied to the deterministic manifestation. Generally speaking, in a deterministic simulation with multiple inputs, parameters and outputs, the quantities can be real values of a certain dimension ℝ N . The deterministic inputs x are assumed fixed and precisely known. If physical tests are repeated several times to capture the natural variability, the observed input conditions are usually averaged and the mean value is used for re-simulation. If the model-form is strongly non-linear, the propagation of a mean value by the simulation model can cause significant deviations compared to propagating all values and taking the mean afterwards [133, p. 492].

Interval Predictor Models
Since it is unrealistic to get a deterministic point prediction that exactly matches the system behavior under various conditions, set-valued predictions try to enclose the system behavior. Interval Predictor Models (IPM) [ IPMs were introduced in [29] and were extensively developed by Crespo et al. [35]. They represent the parameter set [35,Eq. (5)] as an interval itself or as a hyper-rectangular set for multiple parameters. They show how IPMs can be constructed if both IPM boundaries can be described by polynomial or radial basis functions. An example can be seen in Fig. 5. In [112]  In [32] they introduce IPMs for dynamic systems and in [34] they compare IPMs as meta-modeling technique with linear regression, Gaussian processes and Bayesian credible intervals. In [33,37] they extend IPMs to Random Predictor Models (RPMs) by replacing the interval-valued map with a random-variable-valued map to obtain probabilities within the IPM boundaries. In addition, Sadeghi et al. [158] describe how to propagate mixed uncertainties in the form of p-boxes by IPMs. In [157] they combine IPMs with a Frequentist probabilistic framework.

Non-deterministic Models
A non-deterministic simulation model maps uncertainties instead of points, but the interface quantities of the mapping are preserved compared to deterministic simulations. As shown in Fig. 6, several steps have to be performed inside the framework block of a non-deterministic simulation model to determine the uncertainties due to inputs. In advance, the inputs and model parameters that have uncertainties are identified. The uncertainties are classified as aleatory, epistemic or mixed and quantified. Since the simulation model occurs in all domains, care must be taken to quantify the input uncertainties for the conditions of the different domains. Subsequently, the input uncertainties are propagated by the deterministic simulation model [155]. To solve the forward uncertainty propagation for uncertain parameters and inputs, the integral [160, Eq. (8)] must be solved. Since this is often not possible, sampling methods are used for uncertainty propagation, so that the deterministic simulation model can be executed several times for the individual samples. Finally, all sample results are aggregated to quantify the output uncertainty due to inputs [160].
Frequentists often represent aleatory uncertainty as CDFs, epistemic as intervals and mixed as p-boxes. They treat the epistemic and aleatory uncertainties differently within the sampling block. Roy and Balch [155] propose a nested sampling with epistemic samples in the outer loop and for each of them aleatory samples in the inner loop. The order can also be reversed. The mathematical structure is preserved during propagation, if all input uncertainties are exclusively epistemic or aleatory. If different types of input uncertainties occur, the output uncertainty is always a p-box. Bayesians represent all uncertainties with probability distributions. The posterior distributions represent epistemic uncertainties and can be reduced with further calibration data [128]. Mahadevan [124] describes up to three loops for the distribution type in the outer loop, the distribution parameters in the middle loop and the actual quantity in the inner loop. He proposes to flatten the nested three-loop sampling with an auxiliary variable to get a single-loop sampling for aleatory and epistemic uncertainties. We avoid visualizing a certain number of loops in Fig. 6, since this depends on the selected approach. Besides sampling, there are more sophisticated uncertainty propagation methods such as Polynomial Chaos Expansion (PCE) [87] or Taylor model approximations [57]. Recent trends in UQ are summarized in [58].

Hierarchical Models
So far, we have considered the system as a single entity. However, the internal system architecture might have a hierarchy with sub-systems and components. Mahadevan [124] distinguishes four different architectures. Besides a single-component system, a multi-level system is organized in a hierarchical fashion, a time-varying system includes a sequential processing of the components and a multi-physics system has simultaneous interactions between the components. Figure 7 shows an example of a multi-level simulation model.
There are certain applications, where calibration and validation experiments can just be performed on component-level, but the prediction in the application domain takes place on system-level [56]. This requires an extrapolation in the system hierarchy and must be taken into account by the aggregation in Sect. 6.6.

Dynamic Models
Most of the VV&UQ literature addresses static systems with time-invariant quantities or stationary conditions [8,89]. However, there are also dynamic systems with time varying signals. We denote a time signal with a bold symbol In case of multiple outputs, it is a matrix Y ∈ ℝ N y ×N k consisting of an output vector per time step. Nevertheless, we adhere to the vector representation to ensure readability without loss of validity. We represent the dynamic system and model as a vector-valued (or matrix-valued) map to keep the interface consistent. Internally, a dynamic system depends on past inputs and system states to calculate the next output. This is usually described as a differential equation or in state-space form [89].
In a model-based testing process, the dynamic quantities are often transformed to a simplified static form according to Fig. 8 to enable the applicability of the classic VV&UQ methods [89]. This is a restriction compared to a generalized dynamic system, but is in line with many application fields. Regarding the input quantities, test scenarios are often parameterized [149] with typical signal characteristics like a sine wave. Regarding the response quantities, key performance indicators (KPIs) are often extracted from the data for clear assessment and decision making in the application domain. Here, a KPI function extracts one characteristic value out of the response time signal such as the rise time or an overshoot value [92]. Even if multiple KPIs are calculated per time signal, the result is still very low-dimensional. An alternative approach is to use data reduction techniques to automatically extract low-dimensional features. For example in VV&UQ, principal component analysis (PCA) [4,193], possibly combined with singular value decomposition (PCA-SVD) [138], ensemble empirical mode decomposition (EEMD) [190], wavelet decomposition [96] or Karhunen-Loève (KL) expansion [189] have already been used.

Formal Models
Formal methods are used in computer science to obtain a proof of correctness of safety critical systems. The formal models of cyber-physical systems (CPS) are often represented as hybrid automata [151] or differential inclusions composed of a differential equation with set-based uncertainty for non-deterministic dynamic models [6]. They can integrate uncertainty via a so-called Minkowski addition into measurement and disturbance sets, into additional inputs, parameters or initial states [81]. Reachability analysis is a formal technique that can deal with set-based computations to determine the set of states a system can reach from given initial states and possible inputs and parameters [5]. Unlike sampling such as Monte Carlo, it can thus provide formal guarantees. However, similar to classic model validation, the so-called conformance of the formal model to the actual system must still be tested to ensure the transfer of the formal properties. The corresponding conformance testing approach of reachability analysis is to check behavior inclusion by ensuring that all observed data lies within the non-deterministic boundaries [6], similar to IPMs.

Manifestations of Scenarios, Metrics, Decision Making and Extrapolation
In the previous section we have extended the simulation model block to different manifestations. This also applies to the real system and the assessment. In this section we extend it to the resulting manifestations of the entire framework. We will address scenarios, metrics, decision making, error learning and inference for different manifestations where appropriate. We again choose the validation domain as representative example. The manifestations of the error aggregation including the error integration block will be addressed separately in the following main section.

Scenario Design
In the literature there is a variety of techniques to select test scenarios. The survey paper [149] gives an overview about scenario selection methods for autonomous vehicles. It includes conventional techniques such as combinatorial testing or statistical design of experiments (DoE), but also sophisticated techniques such as reinforcement learning or rapidly exploring random trees that can generate entire trajectories to test dynamic and formal models. Furthermore, there are partitioning techniques to determine an optimal data split [125,177]. Mullins et al. [127] select calibration and validation scenarios so that costs and prediction uncertainties are minimized. Böde et al. [27] select validation and application scenarios based on a similar concept.

Validation Metrics
We distinguish different types of validation metrics according to Table 2 and give some examples. The inputs of the validation metric can either be point values for deterministic simulations or probability distributions (, intervals or p-boxes) for non-deterministic simulations. The output of the validation metric can be a Boolean value, a probability or a real value in the unit of the response quantity. We devote the following three sub-sections to these three output types. For dynamic systems, there are different time series validation metrics. Feature-based time series validation metrics convert both input time signals from system and model to single scalar output values. Full time series validation metrics calculate a complete output time signal. We devote the last three sub-sections to time series validation metrics and to closeness notions for formal models.

Boolean Validation Metrics
Boolean results can be achieved by comparison with model accuracy requirements. This can be a simple tolerance comparison [92] or a statistical hypothesis test (HT) [185]. However, Oberkampf and Roy [133, p. 69] advise against integrating the tolerances into the validation metric, and instead advise to keep them separate, as shown in the validation decision making block of our framework.

Bayesian Validation Metrics
A Bayesian HT [146] fits well with the Bayesian approach, since the latter favors uncertainties as unit-less probability measure. and a-priori knowledge P(H 0 ) and P(H 1 ) , the Bayes factor B F as ratio of likelihoods will be updated based on validation data D v . Then the Bayes factor can be used to estimate the probability that the model is correct regarding the validation data [146,Sect. 2.2] Since the Bayesian HT requires subjective priors, is harder to interpret and since the Bayes factor is relative, Rebba and Mahadevan [146] also introduce an absolute so-called model reliability metric. The latter is the probability that the difference between experiment and simulation mean is smaller than a tolerance interval. They describe how to determine the reliability metric under aleatory uncertainty. Sankararaman and Mahadevan [159] extend the reliability metric to take aleatory and epistemic uncertainties into account by using the entire distributions.

Frequentist Validation Metrics
Frequentists favor errors and uncertainties in the unit of the response quantity. In the simplest deterministic case, an absolute deviation might already be sufficient. Balci [20] summarize many subjective validation techniques such as face validation as well as objective validation metrics. In the probabilistic case, a wide variety of distributional metrics are available in mathematics that quantify the distance between two probability distributions. Gardner et al. [70] compare metrics and divergences such as the Kullback-Leibler divergence, the Hellinger distance or the Kolmogorov distance. Bi et al. [26] compare the Euclidian, Mahalanobis and Bhattacharyya  [186] propose the modified area validation metric (MAVM) as an asymmetric extension by distinguishing the area on the left and right side of the p-box and incorporate the number of experimental repetitions. Tanaka [175] also propose an asymmetric area validation metric (TAVM) by normalizing both distributions first and add tolerances for decision making. Wang et al. [188] define an interval area metric (IAM) that calculates a best-and worst-case model-form uncertainty and thus two possible intervals. An overview about validation metrics and a comparison of the classical HT, the Bayesian HT and the area metric among others is given in [119,123,126].

Feature-Based Time Series Validation Metrics
There is a wide variety of metrics available in mathematics to quantify the distance between two vectors. An overview about time series metrics can be found in [103,162]. It starts with basics such as vector norms, average errors or correlations coefficients. Some metrics such as the Sprague and Geers metric separately quantify the phase information from the magnitude and combine both afterwards to avoid high metric values due to a shift of the correct signal-form. Other metrics typically based on dynamic time warping (DTW) address this issue in a direct calculation. Frequency metrics transform the time signal into the frequency domain before quantifying the distance [96].
To enable probabilistic techniques, KPIs are usually extracted from time signals or reduction techniques are applied as presented in Sect. 4.3. For example, Viehof [185] extracts characteristic values from vehicle dynamic signals or applies window functions to enable the application of hypothesis tests. Wang et al. [189] use a Karhunen-Loève (KL) expansion to apply the area metric to dynamic systems. Jiang and Mahadevan [96] perform a wavelet decomposition followed by Bayesian hypothesis testing.

Full Time Series Validation Metrics
Ao et al. [8] highlight disadvantages of the feature-based metrics such as the loss of time information in general and (46) different principal components for simulation and experiment with possibly high numbers in PCA techniques. Instead, they extend the reliability metric to quantify the discrepancy over time. They propose and analyze three time domain metrics: instantaneous, first-passage and accumulated reliability. The first one performs the evaluation at each time step, the second over a time duration and the third aggregates the evidence over time.

Formal Closeness Notions
There are formal methods that try to falsify a deterministic formal model by means of optimization [1]. They use closeness notions similar to time series validation metrics. For example, a ( , )-closeness between two time series is used in [2,10,23] and a Skorokhod metric in [41].

Validation Decision Making
After quantifying deviations between simulation and experiment with validation metrics, these can be compared with tolerance limits based on expert knowledge. As mentioned, there is an overlap with binary validation metrics such as the hypothesis test that include the tolerances.

Time-Series Validation Decision Making
The tolerance comparison can be easily extended from scalar values to time signals. For example, the vehicle dynamics standard [93] lays a tolerance band around the deterministic time signal of the simulation and checks whether all experimental repetitions fall within the band to obtain a binary result for model validity.

Formal Validation Decision Making
In robust control of dynamic systems, it is often favorable to have formal guarantees (certificates) about the validity of the model. Prajna [142] argues that absolute model validity can never be proven with finite resources, but only model invalidity. He uses barrier certificates, similar to Lyapunov functions in stability analysis, to separate possible model trajectories from experimental data. If a barrier can be found, the model's parameter set and the data are proven inconsistent. Harirchi et al. [80] use model invalidation for guaranteed model-based fault detection. Many other references in the control community relax the hard invalidation for a wide variety of different system representations and consider probabilistic certificates for the model (in)validity instead [136,171]. For example, Halder and Bhattacharya [75] address UQ of dynamic systems, compare the simulation and experimental PDF with a Wasserstein metric and calculate a probabilistically robust validation certificate (PRVC) in each time step based on a required tolerance level. Karydis et al. [102] calculate the probability of violating a confidence region as tolerance value and use it with randomization techniques to expand stochastic models with uncertain parameters that capture the experiments.
There are different notions of conformance to check behavioral inclusion of the experiments within the nondeterministic simulation bounds determined by reachability analysis. Trace conformance [165,173] is a strong relation that ensures that all state transitions are preserved. Reachset conformance [122,151] is a weaker relation for safety that just checks the set of traces instead of each individual one. An overview about conformance notions can be found in [105].

Extrapolation in Scenario Domain
A validation metric quantifies a deviation for a single validation scenario. Typically, multiple validation scenarios are executed across the entire validation domain. They differ from the application scenarios and cause interpolation and extrapolation uncertainties as visualized initially in Strictly speaking, only the point-by-point validation with regression technique fits perfectly into the error learning and inference blocks of our framework. The ensemble validation is placed before the validation metric by already aggregating a tuple of scenarios (double arrow instead of single one) before applying the actual metric. The macroscopic validation metric can be interpreted as a degenerate case of the error learning, since it is not a function of the input scenarios anymore.

Ensemble Validation
If validation data is sparse, u-pooling [133,Chap. 12.8.3] is a technique that performs a transformation from the physical to a probability space. Suppose it is impossible to repeat experiments because the environment cannot be fully controlled in applications such as the safeguarding of automated vehicles with dynamic traffic objects. Then, u-pooling makes it possible to aggregate all experimental data from different scenarios to one point by switching to the probability space where the physical units do not matter anymore. This yields again a CDF for experiment and simulation so that the area validation metric can be used. However, the metric comparison is performed in the probability space. Therefore, a back-transformation is required afterwards to obtain the original physical units. He [82] extend u-pooling to p-boxes and Xi et al. [193] to dynamic responses by applying PCA in advance.

Error Learning and Inference
Oberkampf and Roy [133, p. 657] learn an error model in the validation domain via polynomial regression. They additionally calculate external prediction intervals to also consider the uncertainty of the regression model itself. A prediction interval is greater than a classic confidence interval [133, p. 657]. Kennedy and O'Hagan [104] learn an error term in the calibration domain with a Gaussian process. Farajpour and Atamturktur [59], Shinn et al. [167] use polynomial meta-models, Rutherford [156] a bilinear regression technique and Li et al. [116] a spatial random process. Crespo et al. [36] apply IPMs to learn error models with bounds instead of directly modeling the system behavior. Similar to IPMs, Feeley et al. [60] also aim to identify validated parameter sub-spaces. Romero [152] apply predictor-corrector techniques for extrapolation. Atamturktur et al. [12] extend the KOH equation by an additional application-specific extrapolation term. Hemez et al. [83] refers to this as predictive maturity index (PTI) and argues that it should be based on goodness-of-fit, the amount of calibration, the domain of applicability, etc. Therefore, Atamturktur et al. [14] define a metric that quantifies coverage of scenarios in the validation domain.

Meta-Model Approach
Hills [84], Hamilton and Hills [77,78] propose a meta-model approach to connect the validation domain with the application domain. It differs from the error model in our framework, which models the errors between simulation and experiment as a function of the input conditions in the validation domain. Instead, they model the response behavior of the simulation model in the application domain as a function of its response behavior in the validation domain. Both the error and metamodel will be used for aggregation of errors and uncertainties in model prediction. They define their simulation models [84, Eq. (5, 6)] relative to defined nominal conditions. With a sensitivity analysis they determine the important inputs and parameters around the nominal conditions to reduce the data-driven model order. They assume having a constellation with a small application domain as a neighborhood around a nominal application condition and N experimental neighborhoods around N nominal validation conditions. Generating N D samples for x and within each neighborhood yields a vector of the relative behaviors with the weighting vector ∈ ℝ N . The learning of the meta-model can either be performed with a partial least square approach [182] or an objective function method [84, Eq. (9)] with a factor 0 ≤ a ≤ 1 . It reflects the trade-off between accuracy and robustness to variance in model parameters and measurement uncertainty.

Macroscopic Validation Metrics
Mullins et al. [126] aggregate the results e v of point-by-point comparisons using the reliability metric into an overall macroscopic validation metric [126, Eq.

Manifestations of Error and Uncertainty Aggregation
It is one of the main current challenges in VV&UQ, how to aggregate errors and uncertainties for model prediction [128,154,164]. We present general aggregation techniques first.
In the next three sub-sections we focus on the aggregation from the verification, calibration and validation domain to the application domain. We dedicate separate sub-sections to the aggregation in the scenario and parameter domain, to the system hierarchy domain from component-to system-level and to the time domain for dynamic systems.

Aggregation Techniques
Mullins et al. [128] distinguish three "classes of roll-up approaches: 1. Apply bias correction to the model predictions based on errors observed in the validation assessment 2. Modify parameter uncertainty to add conservatism to prediction 3. Apply domain specific model form corrections to governing equations".
We extend this classification of aggregation (roll-up) approaches significantly through a hierarchical structure according to Table 3. We distinguish the two aggregation techniques bias correction and uncertainty expansion. A visual example of both aggregation techniques can be found in Fig. 9. The uncertainty expansion example refers to Sect. 6.4.5. If error estimates are available, they can be used for bias correction to bring the model responses ŷ a s closer to the true value y a s by correcting the inferred error: (52) y a s = y a m −ê va . Otherwise, uncertainties should be aggregated and integrated by expanding the model response with the uncertainty so that the true value lies within it: Both techniques can be applied at multiple stages: either directly to the model responses as described so far or internally to the model parameters or model-form. Generally, numerical uncertainties can be aggregated from the verification domain, parametric uncertainties from the calibration domain or model-form uncertainties from the validation domain. For all these source domains, the target is always the application domain, either directly or internally via the domains in between. The direct aggregation uses the vertical error (uncertainty) pipeline in the framework consisting of metrics, error learning, inference and integration. The integration includes the actual implementation of the aggregation technique within the application domain. The internal aggregation uses the inverse (orange) arrows in the framework to apply the technique before reaching the application domain.

Aggregation from Verification Domain
The aggregation from the verification to the application domain aims to account for the numerical error or uncertainty in the model prediction.

Bias Correction of Model-Form
Sankararaman and Mahadevan [160] correct the bias of deterministic numerical errors such as discretization errors. They add an additional term to the original model-form so that the corrected model can be already applied in the subsequent calibration, validation and application domain. A correction of the model-form with an additional term can be interpreted as ensuring corrected model responses. Whereas Sankararaman and Mahadevan [160] correct deterministic errors, they sample from stochastic errors such as surrogate model uncertainty to add conservatism. Surrogate models are often used instead of complex high-fidelity simulation models to speed up computation.

PBA-Based Uncertainty Expansion of Model Responses
Oberkampf and Roy [133,Chap. 8.5] quantify the deterministic discretization error and convert it to a numerical uncertainty according to (11). They use a safety factor in the conversion to get one conservative value, instead of learning a model of the numerical uncertainty in the scenario domain. In Probability Bound Analysis (PBA) they use the numerical uncertainty in the application domain to shift the left and right boundary of the simulation p-box outwards [155, Fig. 12]: However, they are not considering the numerical uncertainty in the validation domain so that it influences the determination of model-form uncertainty. The uncertainty expansion can be seen together with model-form uncertainty to obtain the whole p-box B(ŷ a s ) in Fig. 9 and generally in Sect. 6.4.5.

Aggregation from Calibration Domain
In classic model calibration, point values are estimated for selected model parameters. Especially with Bayesian approaches, it is also possible to integrate uncertainty into the model parameters to add conservatism.

Point Estimation of Model Parameters
In the simplest case, the parameter determination is understood as point estimation. Then a least square estimation [124] can be used to determine the parameters so that the errors are minimized  over the number N c D,d of all distinct calibration scenarios and its replicates N c D,r . The errors are described here as squared deviation between simulation and experiment (calibration metric). Assuming that the errors follow an unbiased normal distribution N(0, ) , a likelihood function [160, Eq. (6)] can also be constructed. In Maximum Likelihood Estimation (MLE), the parameters are determined so that the likelihood function is maximized These methods do not consider parameter uncertainties. In some cases, however, separate confidence intervals can be calculated for the parameters [124].

Bayesian Uncertainty Expansion of Model Parameters
The Bayesian approach describes the model parameters as probability distributions to reflect uncertainties and to add conservatism. Bayes' Theorem can be used to calculate the posterior PDF [160, Eq. (5)] based on the prior PDF f ( ) and the likelihood function L( ) = P(D c | ) according to (57). We abbreviate the data pair D c = D(x c s , y c s ) for readability, since it will occur several times. Since it is often hardly possible to solve the integral directly, sampling methods like Markov Chain Monte Carlo are used to determine the posterior PDF and Bayesian credible intervals for conversion to intervals [160].

Bayesian Bias Correction of Model Responses
For calibration, the error could previously be understood as a observation error e obs with unbiased normal distribution. If the predictions are biased, Kennedy and O'Hagan [104] introduce a separate error model term g ca inf (x) (often called model discrepancy term or model inadequacy function (x) ) to account for it [104,Eq. (5)]: In the KOH framework, named after both authors, they learn the error model g ca inf (x) with a Gaussian process dependent on the input conditions. They simultaneously calibrate the model parameters and the hyper-parameters of the Gaussian process and estimate the remaining model error. It can be used afterwards for a bias correction of the model responses in the application domain. However, this is risky, since it means to trust the data-driven regression model more than the original physics-based model [84, p. 15]. The error term enables no natural bias correction, since it reflects a modelform error in the calibration domain [128]. Instead, the error term is often omitted in calibration and instead the modelform error is determined in the validation domain with a separate validation data set using validation metrics [160]. This enables an easier aggregation to the application domain. Furthermore, the KOH framework often leads to an identifiability problem between the model parameters and the discrepancy term [11]. A six-step procedure is wrapped around the KOH framework in [21]. The selection of model discrepancy priors for Bayesian updating is described in [120,121].

IPM-Based Uncertainty Expansion of Model Parameters
Crespo et al. [35] use data-driven IPMs to incorporate the uncertainties into intervals so that the IPM boundaries cover all system behaviors and no error is left in the calibration domain by definition. Therefore, the IPM approach skips the entire verification and validation domain of the framework. After the internal integration of uncertainties into the IPM parameters using inverse calibration methods (orange arrow in the framework), they directly use the IPM for prediction in the application domain without any further bias correction or uncertainty expansion.

Aggregation from Validation Domain
The aggregation from the validation to the application domain seeks to correct model-form errors or to expand responses with model-form uncertainty.

Bias Correction of Model-Form
The VV&UQ approaches rely on the correctness of physics in the equations. Hills [84] distinguishes four cases regarding the implementation of physics. Mullins et al. [128] propose an additional approach that seeks to address the source of errors in the model-form rather than integrating uncertainty in model parameters or responses. The correction of the model-form depends strongly on the application and requires knowledge of the underlying physics.

Meta-Model-Based Bias Correction of Model Responses
Hills [84] uses the meta-model from Sect. 5.4.3 to perform a bias correction in the application domain [84,Eq. (32)]: He uses the meta-model to aggregate both the errors between simulation and experiment and the uncertainties. He considers parametric and observation uncertainties via sampling and bootstrapping.

Bayesian Uncertainty Expansion of Model Responses
In the calibration domain in Sect. 6

Bayesian Uncertainty Expansion of Model Parameters
Similar to the last sub-section, Mullins et al. [127] also perform an uncertainty expansion with the Bayesian approach, but incorporate the uncertainties into the parameters instead of the responses to obtain an updated posterior distribution [127,Eq. (12)] They use the macroscopic metric based on reliability according to (51) as a weighting factor of the prior distribution.
To illustrate this approach in our framework, an additional arrow from the (degenerated) error model back to parameters would be necessary. We did not explicitly draw this due to readability reasons.

PBA-Based Uncertainty Expansion of Model Responses
Oberkampf and Roy [133, p. 657] use polynomial regression to learn a model for the epistemic model-form uncertainty I(ê va ) = [e va , e va ] and include the uncertainty of the regression itself via a prediction interval. They perform an uncertainty expansion with imprecise probabilities to aggregate .
all uncertainties within PBA. They obtain a final p-box [155,Fig. 12] by expanding the left side of the input uncertainty p-box to the left with the estimated model-form uncertainty and the numerical uncertainty and by expanding the right side, respectively. The uncertainty expansion is shown in Fig. 9.
For application decision making based on the final p-box of the response QoI, it is possible to extract an interval of cumulative probabilities at the ordinate for a given response value or to extract a response interval at the abscissa for a given cumulative probability [30, Fig. 2] or even for a given interval of cumulative probabilities [39]. Suppose a probabilistic regularity exists, such as the probability of falling below a threshold must be less than one percent: P(ŷ a s < t a y ) < 0.01 . Then the point of intersection with the upper CDF boundary F(ŷ a s ) can be read at the given threshold value. This represents the cumulative probability and can be compared directly against the one percent. It works in reverse with exceedance probabilities [30,64].

Generalization of the Error Pipeline
In general, there could be further manifestations of the error pipeline of our framework as shown in Fig. 10. It is possible to learn the deterministic error e v or the epistemic modelform uncertainty I(e v ) , which in turn yields a point ê va or an interval estimate I(ê va ) . Preferably, the uncertainty of the regression itself should also be considered, e.g. by means of a prediction interval. As a result, in the non-deterministic case the interval estimate becomes wider both on the positive and negative side or a conversion to an interval estimate is made around the error in the deterministic case. Usually the point estimate of the error is combined with a point estimate from the deterministic simulation by bias correction and the interval estimate of the model-form uncertainty is combined with the non-deterministic simulation by uncertainty expansion. However, bias correction can also be applied to nondeterministic simulations by shifting them, and uncertainty expansion can also be applied to point values by expanding them to intervals. However, according to (6), care must be taken that no relevant input and model-form uncertainties are neglected if they are not quantified separately. The bias correction always preserves the mathematical structure, while the uncertainty expansion yields an interval or p-box. Strictly speaking, expanding a deterministic simulation with an interval estimate resulting from a deterministic metric with a prediction interval is actually a combination of uncertainty expansion with bias correction, since the interval is centered around the error: An error estimate might also be converted to an uncertainty: symmetrically as with the area metric, including a safety factor as in (11) or one-sided from the nominal simulation to the corrected value:

Aggregation in Scenario and Parameter Domain
We devoted the last three sections to the aggregation from the verification, calibration and validation domain. Selected approaches within these sections used techniques from Sect. 5.4 to take into account the extrapolation of errors and uncertainties in the scenario and parameter domain. Oberkampf and Roy [133] used polynomial error learning and inference combined with external prediction intervals in  (65) to obtain the final p-box of PBA. Hills [84] considered uncertainties of the partial least squares regression within the meta-model approach. Mullins et al. [127] incorporated uncertainty into model parameters in (64) considering the scenario domain in form of a macroscopic validation metric. However, many approaches do not address extrapolation uncertainties. Oberkampf and Roy [133] treat model parameters and scenario inputs equally in PBA of CFD simulations. In such applications all equations can then be applied to parameters in the same way as described for the scenario inputs. However, in simulations of complex dynamic systems usually the scalar parameters and the discrete system configurations are distinguished from the continuous scenario domain. In the automotive sector, for example, it is important to validate different configurations of a vehicle series without conducting separate experiments for each of them [185]. Therefore we separated scenario inputs x and parameters in our framework, but these can be merged if desired.
Denham et al. [40] explicitly address the case, where experimental data is only available for a nominal configuration, but decision making focuses on a modified configuration. They consider an correction term from calibration and uncertainty bounds for both models. On the one hand, they describe a method to propagate the uncertainties from the nominal to the modified configuration by assuming that no additional calibration and update of the uncertainty bounds is required. On the other hand, they describe a method by interpreting the correction as an additional uncertainty.

Aggregation in System Hierarchy Domain
Roy [154] state that the extrapolation in the system hierarchy domain is even harder than in the scenario domain. Schroeder and Mullins [164] refer to the previous sub-section as interpolation and extrapolation type 1 in the input space. They refer to the current section as extrapolation type 2 in the output space. Simultaneous occurrence of both types is also possible. We give an overview about approaches that address this challenge.

Bayesian Network Approach
Sankararaman and Mahadevan [160] extend the Bayesian network to hierarchical systems. They focus on multi-level systems on the one hand and on reduced-order systems to replace a complex system for certain tests on the other. They state that they can convert multi-physics systems with simultaneous connections to the former type. They use the component outputs as linking variables for the multi-level type and the common parameters as linking variables for the reduced-order type. The Bayesian updating, Bayesian hypothesis testing and Bayesian network integration work similar as shown previously. Li and Mahadevan [115] weight the relevance of the heterogeneous component data with a sensitivity analysis for aggregation on system-level. Neal et al. [130] formulates an optimization to select the most valuable calibration and validation data on component-level that yields the highest confidence on system-level.
Kwag et al. [111] define an overlapping coefficient (OC) between the simulation and experiment PDF, extrapolate it with a Bayesian network from component-to system-level and compare the system-level OC with a tolerance value.

Bayesian Re-Calibration Approach
Babuška et al. [17] present a re-calibration approach, that addresses the case where experimental quantities deviate from unobserved QoIs in prediction. In the first step, they infer unknown model parameters based on a calibration dataset of observed quantities. In the second step, they update the model parameters with Bayesian updating based on an independent dataset of observed quantities. In the third step, they use both the calibrated model and the updated model to predict the unobserved QoIs and check whether they lie within a rejection criterion. For the latter, they calculate the horizontal distance between both models' CDF and compare it with a tolerance value. In addition, they use bootstrapping or cross-validation to tackle insufficient data. Other approaches build on this re-calibration approach [137]. Oliver et al. [135] focus on physics-based composite models of high-fidelity models combined with low-fidelity embedded models. They calibrate the model parameters based on Bayesian updating and perform a validation decision making based on credible intervals. For prediction of unobserved QoIs, they build inadequacy models at the source of errors in the embedded models and perform a sensitivity analysis to analyze influences.

Meta-Model Approach
The meta-model approach [84] from Sect. 5.4.3 can be used not just in the scenario, but also the system hierarchy domain. The simulation models in (47) and (48) and their response quantities can vary between multiple components and the system.

Output Uncertainty Approach
Eek et al. [53,54] are interested in the conceptual design of complex aircraft systems in an early development stage, where component-level data is available, but hardly any system data. At first, they verify, calibrate and validate the component models. In validation, they perform several experimental repetitions, re-simulate each repetition and calculate an error histogram for each component. They use intervals as uncertainty bounds on the error distribution. This component output uncertainty includes input and model-form uncertainties, but no strict separation as in PBA as stated by Eek et al. [56]. They assume constant input uncertainties across the input space [52]. Then, Eek et al. [53] use this component output uncertainty as input uncertainty on system-level and propagate it by the entire model. They solve this as an optimization problem with both interval boundaries assuming there is an approximately linear relationship. If one input is time-variant, they perform four additional simulations instead of two for both boundaries. They also perform a model verification on system-level. However, Eek et al. [56] state that it is not a complete VV&UQ framework. Model validation with system-level data is necessary to quantify the entire model-form uncertainty.

Multi-physics Coupling
Avramova and Ivanov [16] give an early overview about multi-physics coupling. van Buren et al. [184] focus on UQ to inform a covariance matrix from component-level to the coupled system-level. Stevens et al. [169] assume separateeffect experiments on component-level and integral-effect experiments on system level and use the component data to perform a bias correction of the system results. Stevens [170] additionally offers an inverse method to infer missing component knowledge empirically from system data. Stevens and Atamturktur [168] review recent literature from the last decade regarding strongly coupled models.

Aggregation in Time Domain
There are mainly two kinds of approaches that quantify the model error of dynamic models [89,90]. The first type seeks to transform the dynamic responses to static ones as shown in Sect. 4.3. These approaches can build on all static VV&UQ methods presented so far, but are not applicable to generalized dynamic systems. The second type addresses this and predicts the model error in each time step. Since the first type uses the methods presented so far, we focus in this section on the second type.

Bayesian Bias Correction of Model States
Hu et al. [89,90] focus on error quantification of discretetime state-space models. They intend to create a single error model that performs a bias correction of the hidden state variables in each time step. The challenge is that the static methods, which compare the responses to construct the error model, do not work, since the error depends on all previous input values. Furthermore, the response error cannot infer the error of the hidden state variables. They show two solution approaches. An estimation-modeling method first estimates the hidden states inversely from observations and then learns a non-parametric error model of the state variables. A modeling-estimation method first constructs a parametric error model for the errors of state variables and then estimates the error model parameters from observations. For both approaches they formulate the state-space model as a dynamic Bayesian network (DBN). They conclude that the second approach is preferred because of better accuracy, if appropriate parametric basis functions can be selected.
Subramanian and Mahadevan [174] combine a probabilistic Bayesian state estimation to update the states and estimate the error in known scenarios with a deterministic neural network for the transfer to untested scenarios in the application domain. In the first step, they estimate the error from validation data in general by particle filtering-based state estimators. If the model parameters are uncertain, a combined state and parameter estimation would be necessary. In the second step, they identify internal model-form errors in component models. In the third step, they learn a neural network in nonlinear auto-regressive exogenous (NARX) configuration as predictive error model. Either they learn an ensemble of networks that relates the updated Bayesian system state signals with the input signals or they learn one network that relates the observed system response with the original system state signal. The neural network and the simulation model generate data in untested scenarios so that the Bayesian state estimation can update the response. The trick is that they estimate the model-form errors in the equations rather than the response errors. Green [73] proposes to use Bayesian tracking algorithms and Gaussian processes in NARX configuration to predict the model error of dynamic systems. Wilkinson et al. [192] use particle filters to predict the error in the next time step.

Formal Uncertainty Expansion of Model-Form
Conformance Testing of formal models mainly consists of three steps [151]. The first step is to define a notion of conformance such as trace conformance or reachset conformance as described in Sect. 5.3.2. The second step means to establish a sound conformance check. This is formulated as an inverse optimization problem to get the set of model parameters that captivates by the tightest bounds on the calibration data. It includes heavy inverse calculations with cascading loops [81]. The third step defines the calibration scenarios to be tested.

Application Examples
After presenting our framework step by step in the last chapters, we now give a comprehensive overview of the literature in various technical areas. We present application examples of the methodology, classify the literature in our framework and present conclusions so that the different application areas can be developed further in a targeted manner. We select numerical simulations because they include the latest research on non-deterministic aggregation methods, and complex system simulations [47] from the automotive, railway and aircraft domain because they still rely heavily on conventional deterministic methods, but can benefit greatly from new methods.

Numerical Simulations
The explanations in the previous sections were very much based on advanced VV&UQ methods from numerical simulation fields such as FEM and CFD. We will therefore only give a brief summary of Probability Bound Analysis (PBA) and the Bayesian network approach, as both were spread over several sections. A comprehensive comparison between PBA and subjective probability approaches can be found in [65]. We supplement additional literature and frameworks, give an example and list application fields where those methods were already applied.

Probability Bound Analysis
Oberkampf and Roy [133] use PBA in CFD simulations to separately quantify input, numerical and model-form uncertainties as described in particular in Sect. 5.2.3, 6.2.2 and 6.4.5. Regarding the origins of PBA, the interested reader is referred to Ferson [61,62,64]. PBA is a complete non-deterministic VV&UQ framework. Regarding the classification into our framework, it addresses the verification, validation and application domain. It does not include model calibration, but only direct parameter estimation. The aggregation is done exclusively by uncertainty expansion of model responses.

Bayesian Network Approach
Sankararaman and Mahadevan [160] have developed the Bayesian network approach very far as described in Sect. 6.2.1, 6.3.2, 5.2.2 and 6.4.3, even with extensions to dynamic systems in Sect. 6.7.1 and hierarchical systems in Sect. 6.6.1. It addresses all domains and blocks of our framework, except the error learning and inference blocks for extrapolation. It uses bias correction for numerical errors and otherwise focuses on uncertainty expansion.

V&V 20 Standard
The standard V&V 20 [7] deals with V&V of CFD and thermal simulations. It distinguishes four cases regarding the observation of QoIs: direct observation, calculation from other observed variables with a simple equation either with or without joint error source and finally a model to relate the QoIs to observed variables. The standard describes extensively how to determine numerical, input and experimental uncertainties. These can then be used in each of the four cases to deduce the validation uncertainty of the model [7,]: An overview of six V&V standards for different numerical applications can be found in [166]. Tanaka [175] present the V&V plus UQ and Prediction (V2UP) procedure, which describes five steps for the assessment of thermal simulations, based on existing methods such as V&V20 and the AVM metric.

Numerical Application Fields
These VV&UQ methods were applied in many numerical simulations. For example, Choudhary et al. [30] applied PBA to a Sandia V&V challenge problem including 450 liquid tanks. The application decision making states that the quantity von Mises stress does not exceed the yield stress by a probability of 10 −3 . They quantified and propagated input uncertainties for scenario inputs such as the liquid height and parameters such as the radius and material of the tank. In model verification they performed code-to-code comparisons with Richardson extrapolation. In model validation they applied u-pooling and the MAVM metric, since only sparse data of four tanks was available and not for the actual QoI. After the aggregation of all uncertainties, they estimated the maximum failure probability as 0.0034 and concluded that the tanks cannot be considered safe.
Further applications can be found in Reynolds-averaged Navier-Stokes equations [194], in manufacturing [44,176], in civil engineering [13,109,118], in wind energy [183], in watershed modeling [197], in naval engineering [50], in power electronics [144,145] and in nuclear reactor safety [18,140]. In contrast to automotive system simulations in the following sub-section, probabilistic approaches and metrics have already been applied several times in automotive FEM crash simulations [195,196,199,200].

Automotive System Simulations
Today, cars are in a transition to automated vehicles (AVs). AVs consist of multiple modules including environment sensors, a software stack (perception, planning and control) and vehicle dynamics and interact with other traffic participants. Regarding model validation, there is a rich history in classic vehicle dynamics and a few recent papers addressing AVs. Most of these papers present interesting validation metrics. Only in vehicle dynamics, tolerances are given for validation decision making. If we classify the automotive literature in our framework, it focuses mainly on the uppermost part. However, the aggregation of errors and uncertainties to the application decision making is missing. Future literature should therefore focus more on non-deterministic methods and aggregation in order to be able to make statements about the specific validity of the models in the application case.

Deterministic Simulations
Vehicle dynamics has a rich history in model validation with several standards and regulations. For assessment they describe test maneuvers such as sinusoidal or step steering inputs and define KPIs such as response times or overshoot values [91,180].  [163] apply an overall error, Barons and Pearson correlation coefficients as three validation metrics to validate LIDAR sensor models. They compare multi-dimensional points clouds as raw sensor data as well as occupancy grids as processed sensor data in a parking lot scenario. Zec et al. [198] validate fused radar and camera sensor models. They use a log-likelihood and a root mean squared error (RMSE) for comparison of time signals of processed traffic objects. In addition, they convert the signals to histograms and perform a distributional comparison with a Jensen-Shannon divergence. Nentwig et al. [131] create a camera model and apply a classifier to the real and synthetic camera images to generate object hypothesis and bounding boxes around the objects. They compare the size of the bounding boxes and the final binary hypothesis with a confusion matrix. Gaidon et al. [69] present the very popular computer vision dataset Virtual KITTI. They also use algorithms on real and synthetic images and eight different metrics to assess the realto-virtual performance gap. Abbas et al. [3] similarly assess image datasets of their driving simulator and additionally compare the visual complexity regarding color and spatial information. Similar to those camera approaches, Jasinski 1 3 [95] validates radar sensor models. Goodin et al. [71] qualitatively validate LIDAR sensor models compared to an analytical model, laboratory and field tests. Holder et al. [85], Rosenberger et al. [153] systematically derive requirements for LIDAR and radar sensor models from the developer's point of view, but do not perform actual model validation.
Model validation of the entire closed-loop AV is only very rarely addressed, although the safety assessment of AVs is mainly based on simulations in the literature [149]. A Model-in-the-Loop and a Vehicle-in-the-Loop simulation were validated against proving ground tests in multiple longitudinal scenarios in [148]. Graphical comparisons and validation metrics such as the correlation coefficient were used at different stages of the AV pipeline. The influence of ground truth measurements compared to environment sensor data on the model validation of the entire AV is analyzed in [74,132,187]. They use multiple box plots to visualize how simulation errors flow through the AV pipeline. They compare quantities such as the AV velocity, trajectories and an overall stochastic risk measure. Johnson et al. [97] apply a formal analysis based on correct-by-construction controller design and validate it by comparing the percentage of collision-free runs to a full-scale AV. Aramrattana et al. [9] evaluate the influence of modeling errors and uncertainties on the controller performance.

Non-deterministic Simulations
Viehof [185] performs several experimental repetitions of vehicle dynamics maneuvers and re-simulates each one to obtain two PDFs for comparison. For highest requirements, he uses a statistical t-test to accept the model if its PDF lies within the experimental one. For lower requirements, he checks whether the simulation PDF lies within conventional tolerance values around the experimental mean. Introducing probabilistic simulations into automotive vehicle dynamics is an important contribution. However, there are some aspects to be considered in the experimental comparison [133, p. 490]. Assuming a perfect model-form with true input uncertainties, the simulation PDF should have exactly the same width as the experimental PDF and not a smaller one. This is especially relevant since currently neither parametric uncertainties are taken into account nor extensive MC sampling. This under-approximates the true input uncertainties and leads to small simulation PDFs and erroneously to valid model hypothesis, even if the model-form might be inaccurate. In addition, the binary results from the hypothesis test cannot be used for aggregation of uncertainties to the application domain. These challenges were addressed in [39] by applying PBA to a longitudinal consumption simulation and by quantifying all sources of uncertainty, just omitting the extrapolation uncertainty to the application domain.
Hartung et al. [81] apply the conformance testing approach of checking behavioral inclusion presented in Sect. 6.7.2. They use a non-deterministic dynamics model with controller of the actual AV and non-deterministic models for the other traffic participants. With reachability analysis they can prove online during driving that both trajectory sets do not intersect and the AV behavior is safe. Therefore, they only require the mentioned sub-models, but no model of the entire AV. To the best of our knowledge, there is as yet no probabilistic model validation of the entire AV for offline safety assessment.
Whereas an individual AV is assessed in microscopic simulations, several agents interact in macroscopic traffic simulations. An overview about the calibration of traffic models and calibration metrics is given in [38,86,178]. Detering et al. [42] propose a measurement concept to determine parameter uncertainties for statistical MC simulations. Rao and Owen [143] perform an error analysis with autoregressive integrated moving average (ARIMA) models and Zheng et al. [201] validate statistical models using extreme value theory. Ferson and Sentz [63] introduce a concept how to extend PBA to agent-based simulations considering aleatory and epistemic uncertainty.

Railway System Simulation
In the railway field, different projects and research groups address deterministic validation methods on the one hand and UQ methods on the other hand. Both worlds must be combined to obtain a complete VV&UQ framework [117]. Similar to automotive vehicle dynamics, there are many KPIs, metrics and tolerances defined, but the overall aggregation part of our framework is missing.

Deterministic Simulations
The project DynoTrain led to several publications regarding virtual rail vehicle acceptance and resulted in the standard [43]. Polach and Böttcher [139] summarize stationary tests in the first stage and dynamic on-track tests in the second stage. Typically KPIs such as quasi-static, maximum or root mean square (RMS) values are defined. The mean and standard deviation of each quantity must not exceed certain tolerance values. They distinguish different types of tolerances such as relative, constant or decreasing ones. Götz and Polach [72] analyze the influence of certain input conditions on the validation results and give a short overview about validation metrics. Bezin et al. [25] summarize how to deal with cases such as a train running in new conditions or a slightly modified train after acceptance.

Non-deterministic Simulations
Funfschilling et al. [67] quantify uncertainty due to inputs and parameters in rail vehicle simulations. They show how the variability of the input conditions such as the track geometry can be modeled and how to propagate it by the simulation model. In [68] they perform a sensitivity analysis of the inputs and in [66] they present a survey on uncertainty quantification methods in rail vehicle dynamics. Lestoille et al. [114], Lestoille [113] create a stochastic model considering uncertainties and use advanced stochastic methods such as polynomial chaos expansion to calibrate it.
Bogojević and Lučanin [28] define a probabilistic validation metric comparing two CDFs by an f-test weighted with a mean value difference. Kraft et al. [107] use statistical indicators such as mean and standard deviation of validation errors as well as a so-called least-square misfit function. The latter quantifies the distance between the error time signals in least-square fashion and can be plotted as a CDF to compare different vehicles, responses or running conditions. In [108] they analyze the effect of measurement uncertainty on the validation results.

Aircraft System Simulation
In general, it is very difficult to apply probabilistic model validation to very complex system simulations with hundreds of parameters. To counteract this, the aircraft community developed simplified VV&UQ methods that focus in particular on the aggregation in the system hierarchy.

Deterministic Simulations
Hällqvist et al. [76] use field measurement data to validate the entire aircraft system. They automatically extract steadystate conditions in the data and compute relative errors. They also compute a coverage measure in the input space to find the model's domain of validity. They manually extract transient measures in the data such as overshoot values or rise times to quantify the model's dynamic validity.
Eek [51], Eek et al. [55] focus on the credibility assessment of entire simulators including a meta-model with credibility information and techniques for visualization.

Non-deterministic Simulations
Eek et al. [56] tried to apply PBA to aircraft simulation models, but found that the number of components and parameters of the aircraft is too complex for this. Instead they developed the output uncertainty approach presented in Sect. 6.6.4 that uses component validation data on system-level without strict separation of the uncertainty sources. Kammer et al. [101] focus on spacecraft models that can never be validated entirely. They also integrate component uncertainties on system-level.
Dorobantu et al. [45] combines the Theil Inequality Coefficient (TIC) as robust control metric with statistical model validation. They prove the relation for linear aircraft systems and use Monte Carlo simulations for non-linear systems. If the TIC of the MC simulations bounds the flight data TIC, the non-deterministic simulation can be used as an overapproximative representative. In [46] they generalize the TIC to a gap metric that takes time series input and response data instead of just response data.

Evaluation of VV&UQ Methods
In this section, we derive evaluation criteria and apply them for the analysis of important VV&UQ approaches presented in this paper.

Evaluation Criteria
We have derived twelve evaluation criteria based on expert judgment to assess and compare the capabilities of VV&UQ approaches. In order to ensure the traceability in this analysis process, we introduce a generic rating system in Table 4. The order of the following criteria is related to the model-based process.

Dynamics
For generalized dynamic systems according to Sect. 6.7.2, an approach shall consider different types of representations such as differential equations [174] or discrete state-space  [89]. If a dynamic system can be simplified to a static one, the classic approaches can be applied instead.

Hierarchy
For hierarchical systems according to Sect. 4.2, different types of architectures [124] exist, which an aggregation approach should consider.

Physics
Most of the current VV&UQ approaches rely on the correctness of the physical equations in the simulation model. Thus, it is helpful to preserve the physical principles, but to add extrapolative power to the correction [128], instead of only extrapolating the error structure [106].

V&V Process
It is not sufficient to only calibrate parameters. Model calibration should be accompanied by a preceding model verification and a subsequent model validation.

IP Protection
In certain use cases such as a cooperation between system manufacturers, suppliers, tool manufacturers and technical services, the organizations must respect intellectual property (IP) so that the model equations or even the model parameters are unknown. In these cases, VV&UQ approaches that can deal with black-box models are inevitable.

Computing
The VV&UQ approaches differ in their computational complexity during forward simulation, often heavy inverse parameter updates or learning and inference of error models.
In the end, it should be feasible to execute the approaches within reasonable time.

Uncertainty Types
Since epistemic (E) and aleatory (A) uncertainties are two inherently different types of uncertainty, aggregation approaches that can represent both are preferred over approaches that can only represent one type and eventually convert the other with some workaround.

Uncertainty Sources
Since there are several sources of errors and uncertainties that can potentially compensate each other, it is preferred to separately quantify each of them.

Uncertainty Expansion
Uncertainty expansion is an aggregation technique that adds conservatism to the simulation model. It is important since each simulation has uncertainties, but the uncertainty bounds should be as tight as possible compared to the true experimental value.

Bias Correction
Bias correction is an aggregation technique that uses the gained knowledge from the model-based activities to correct the simulation model. Since it is smart to use this knowledge, but also risky, it should be combined with a prediction interval (PI) that quantifies the inherent prediction uncertainty of the correction model.

Extrapolation
The more the application domain differs from the other domains, the more decisive inter-and extrapolation uncertainties become. Thus, it is important to take these uncertainties into account, at best combined with the inherent prediction uncertainty of the extrapolation model.

Guarantees
The reliability of the final statements is crucial for credible decision-making. Interval-based VV&UQ approaches can provide absolute guarantees, probabilistic approaches statistical guarantees and deterministic approaches without uncertainties no guarantees.

Strength and Weaknesses
We compare six presented VV&UQ approaches: The classification into our framework can be found in the respective sections. We choose PBA and the Bayesian approach because both are commonly used VV&UQ frameworks, IPMs and the meta-model approach because both deal with the sophisticated aggregation of errors and uncertainties, the output uncertainty approach because it represents a simplification of PBA for complex aircraft, and finally the tolerance approach because it represents a conventional, deterministic and commonly used baseline. The ratings can be seen in the Kiviat diagrams in Figs. 11 and 12. The approaches are divided into two diagrams for visualization purposes only. We have used our best judgment in choosing the rating that best represents the overall approach. Individual papers may differ from this. As shown in the diagrams, all approaches have strength and weaknesses. PBA quantifies all sources of errors and uncertainties separately and can naturally represent all types of uncertainties, but is sometimes very conservative due to strong uncertainty expansion. The Bayesian approach has already been developed very far including extension to dynamical and hierarchical systems. It is based on subjective probabilities with priors, cannot naturally represent epistemic and aleatory uncertainties, currently lacks extrapolation uncertainty and includes heavy inverse calculations. IPMs impress with their tight bounds, but have no separate calibration and validation. Conformance testing via behavioral inclusion [81] is similar to IPMs and can also deal with dynamical systems. The meta-model approach captivates by the connection between the validation and application responses for extrapolation and by a bias correction with prediction uncertainty, but requires a linear dependency in the meta-model and a small application domain with many validation domains. The output uncertainty approach also does not quantify all sources of uncertainty, but is therefore applicable even to very complex hierarchical aircraft systems. The tolerance approach is very easy, fast and flexible in its extension to dynamical and hierarchical systems, but lacks the consideration of uncertainties and the entire aggregation.

Identification of Research Gaps
Since all aggregation approaches have strengths and weaknesses and none of them is superior with respect to all evaluation criteria, further research is needed. Each individual approach can be improved to eliminate weaknesses and extended to dynamic and hierarchical systems. Furthermore, a combination of different methods could be interesting. With bias correction and uncertainty expansion we have presented two aggregation techniques. The former exploits the error estimation, but might be risky, since usually some uncertainties are neglected and even the estimation itself includes an uncertainty. The latter reflects all sources of uncertainties, but might be overly conservative due to uncertainty inflation [63]. The compromise between a risky bias correction and a conservative uncertainty expansion could be solved by a correction in combination with tight uncertainty bounds. Then the existing knowledge is exploited, but all remaining uncertainties of model inadequacy and the correction itself are still considered.
In general, the user needs to analyze his application in order to select the most appropriate approach and highlight the relevant weaknesses and assumptions. Therefore it is crucial that VV&UQ is accompanied by overall maturity assessment procedures, such as the Predictive Capability Maturity Model (PCMM), the Phenomenon Identification Ranking Table (PIRT) or the gap-analysis, to increase the credibility of simulation for decision-making. The interested reader is referred to [24,88,100,134].

Method of Manufactured Universes
So far, we have highlighted the strengths and weaknesses of the different VV&UQ approaches, allowing the user to make a pre-selection of the most appropriate approaches at an abstract level. In addition, we want to provide the user with a concrete procedure for analyzing and comparing in detail the effects of selected VV&UQ methods on his application. Therefore, we introduce the Method of Manufactured Universes (MMU) from Stripling et al. [172]. In contrast to the previous references, it is not a VV&UQ method to validate simulation models, but a method to validate the VV&UQ approaches themselves. It is also not a replacement for closing the identified research gaps, but rather a supportive procedure for selecting an appropriate approach from the available options.
The idea behind MMU is to replace the reality with a manufactured universe in which the true values are known and infinite simulations can be easily drawn. In reality, the true values of nature are typically unknown and experiments are very costly, whereas in the manufactured universe the true values are known and infinite simulations can be easily drawn. Thus, two simulation models are compared: the nominal model and the MMU model representing reality as a reference. The user should manufacture the universe so that the physics, model errors and experimental uncertainties are represented in the best possible way. This ensures a good transferability of the conclusions from the manufactured universe to reality. Stripling et al. [172] advise to use a high-fidelity model for MMU to get close to reality and show an example from a particle-transport universe.
Whiting et al. [191] use MMU to compare the following four approaches in a CFD universe: PBA with area metric [133], PBA with modified area metric [186], Bayesian KOH framework [104] and the V&V 20 standard [7]. They define two overall evaluation scores referred to as conservativeness and tightness. They conclude that if few data are available, PBA with modified area metric performs significantly well, whereas for many data the Bayesian approach becomes attractive.
The MMU method can be extended and embedded in our framework from Fig. 1 by replacing the system blocks of each domain with the MMU model and thus also "reviving" the dotted system block and the pipeline of the true values in the application domain. The principle is illustrated by the excerpt in Fig. 13 and highlighted in orange. Therefore, the results of an VV&UQ approach can be compared with the corresponding true value at the different evaluation stages (blocks and arrows) along the processing pipeline: In summary, MMU allows to inject modeling errors so that each VV&UQ approach can aggregate them for model prediction in untested application scenarios. Knowing the true values enables a detailed comparison and helps to select a suitable approach.

Conclusion
For credible decision-making based on simulation models, VV&UQ activities including an aggregation of all errors and uncertainties are crucial. However, the aggregation is challenging, since the errors and uncertainties are deeply interwoven. Therefore we introduced a novel modular and unified framework, which is based on the current state of science.
Since the interfaces of all framework blocks are kept consistent, the framework can be used in different manifestations such as deterministic, probabilistic, interval, multi-level, dynamic and formal simulations. We gave a comprehensive survey of the VV&UQ literature across numerical simulations as well as automotive, railway and aircraft system simulations and integrated the approaches into our framework. Based on twelve evaluation criteria, we derived strength and weaknesses of the different approaches. Combined with the method of manufactured universes, those can support the user to select a suitable approach for his application. The application fields show a very heterogeneous landscape. VV&UQ methods are comparatively advanced in numerical simulations, while complex systems present some challenges and often rely on conventional deterministic approaches for feasibility. With our framework we have created a novel, modular and uniform basis, which was a key enabler to integrate several VV&UQ methods. With our comprehensive cross-field survey, we brought approaches from several engineering fields together. This gives the possibility for new combinations and exchanges between the engineering fields, which promotes the improvement and development of new methods. Complex system simulations in particular can benefit from the transfer of new methods from numerical simulation, for example from Bayesian approaches or Probability Bound Analysis.
Through our analysis, we have found that no individual VV&UQ approach excels with regard to all criteria. The identified weaknesses directly show the direction for individual improvements in the future. In addition, we emphasize that a combination of approaches, such as a risky bias correction of model predictions with a conservative expansion of uncertainties in untested application scenarios, can solve the trade-off and compensate weaknesses. We are convinced that both individual improvements and combinations are important to fill the research gaps.