2.1 Introduction
This position report aims to support future standardisation efforts on Good Simulation Practice. Good practice standards are usually a summary of best practices, collected empirically and consolidated through a consensus process among practitioners. As such, they are the least theoretical artefact that can be expected in regulatory science. Thus, some explanation is warranted as to why we decided to add a chapter on the theoretical foundations supporting the concepts presented in the following chapters.
As already mentioned, regulatory best practices emerge through consensus among practitioners. This implies that such practitioners are culturally relatively homogeneous and share the same vocabulary. Even more importantly, they share a common epistemology, the principles by which humans establish new knowledge, in this case, knowledge on the safety and efficacy of new medical products. This is one of the reasons why the regulatory assessment of medicinal products and that of medical devices remain separate, despite the increasing frequency of combination products; each class of products has its own vocabulary, expertise, and epistemology.
Nevertheless, there are also commonalities. For example, regulatory science as a whole is formulated as purely empirical, where experimental evidence and real-world observations are considered the only sources of reliable information. Introducing modelling and simulation into the regulatory process raises several epistemological challenges. The primary one is that CM&S evidence is predicted, not observed. Such predictions can be based on well-accepted theories that have resisted extensive falsifiability efforts, on theories that are still debated, or even on purely phenomenological patterns extracted from a large volume of observational data. It is quite clear that a predictive model and a controlled experiment are different ways to investigate physical reality, but how they differ is debatable. Even more complex is the definition of a formal process to establish the truth content of a model’s prediction (what we call here “credibility”).
Last but not least, the introduction of computational modelling and simulation forces the panels of experts that develop good simulation practice by consensus to include entirely new expertise, such as applied mathematics, computer science, software engineering, and a whole territory of engineering science sometimes referred to as Modelling and Simulation in Engineering. But this creates a group of experts with different backgrounds, terminologies, and even epistemologies. This is why the discussion around the regulatory acceptance of in silico methodologies is so complex: the involved experts struggle to communicate and collaborate effectively.
There is no easy solution to this problem. People with different expertise and backgrounds will have to talk to each other and try to understand each other's points of view. But in such a complex debate, we believe it is essential to have some theoretical foundations to which we can resort. Thus, contrary to all other chapters, this one does not directly contribute to the regulatory science debate on GSP. As such, it might not be of particular utility to regulators, although it may serve as an indirect nexus between regulatory science and CM&S science. However, we believe it is a necessary element of such a document, one that might prove useful in some of the complex discussions that the consensus process will inevitably impose.
2.2 What is a Model in Science?
“A model is an invention, not a discovery” (Massoud et al., 1998). The Stanford Encyclopaedia of Philosophy devotes an entire chapter to the non-trivial question in the heading (Frigg & Hartmann, 2020). For the purposes of this chapter, a useful definition is: “Models are finalised cognitive constructs of finite complexity that idealise an infinitely complex portion of reality through idealisations that contribute to the achievement of a knowledge on that portion of reality that is reliable, verifiable, objective, and shareable” (Viceconti, 2011). Models are a way that we humans think about the world. In science, models idealise a quantum of reality:

- To memorise and logically manipulate quanta of reality (Descriptive models);
- To combine our beliefs on different quanta of reality in a coherent and non-contradictory way toward the progressive construction of a shared vision of the world (Integrative models);
- To establish causal and quantitative relationships between quanta of reality (Predictive models).
Predictive models are used in science primarily for two purposes:

- as tools used in the development and testing of new theories;
- as tools for problem-solving.
In this second use, we define the credibility of a predictive model as its ability to predict causal and quantitative relationships between quantities in the natural phenomenon being modelled, as measured experimentally. Thus, the first foundational aspect of a model’s credibility is the complex relationship that predictive models have with controlled experiments.
2.3 A Short Reflection on the Theoretical Limits of Models and Experiments
Nature is infinitely complex, and its mere observation, while useful to formulate explanatory hypotheses of why a certain phenomenon occurs, is not sufficient to test whether such hypotheses are true. To attempt the falsification of an explanatory hypothesis, we need more than real-world observations: we need a controlled experiment, or experiment for short. In an experiment, we intentionally perturb the system under investigation and observe how it responds to this perturbation. By controlling some of the variables that describe the system's state and observing how other state variables change, we can reject all hypotheses that are inconsistent with the results; the hypothesis that resists all our falsification attempts is tentatively assumed to be true.
Controlled experiments are extremely challenging in the life sciences because of the complexity and entanglement of living organisms. The most realistic experiment is the one where we merely observe the system, but even in that case, because of the observer effect, the simple act of observing the system perturbs it; thus, a hundred percent (100%) realism can never be achieved. As soon as we perturb the system of interest, what we observe is not the system per se but an experimental model of that system. In other words, even an observational study is a model of reality. As soon as we investigate reality with a model (which, we believe, is always the case), the key question is the “Degree of Analogy” between the model and the reality being modelled: how closely does the model capture the functional aspects of the reality that we are trying to understand? The model might look completely different, but if it works like the portion of reality under investigation, it is a good model.
A major advantage of experimental models is that their Degree of Analogy with the reality they model can be inferred from how they were built. Every experimental model contains a fraction of physical reality. The bigger this fraction, the higher the Degree of Analogy of the experimental model.
Too frequently in medicine, we confuse analogy with homology: biological systems are homologous if they have evolved from the same origin or from a common ancestor, regardless of their function. As such, we consider mice experimental models of humans because both are terrestrial vertebrates with common ancestors. But for a specific physiological function, a mouse might be farther from a human than a fruit fly is.
However, there is unquestionably a relationship between analogy and homology. The closer our experimental model is to the reality we want to investigate, the more likely the model will have a strong analogy with that reality. Therefore, even if the choice is made because of homology and not analogy, a randomised clinical trial of a new drug is in general more analogous to the reality of the use of that drug in clinical practice than an animal study on the efficacy of that drug, which in turn is more analogous than an in vitro experiment in cell culture. This might not always be the case, but it frequently is.
Thus, we can infer the Degree of Analogy that an experimental model has with the portion of reality we are investigating by analysing how the experiment was built. The more controlled the experiment, the heavier the perturbation we impose on physical reality, and the lower the Degree of Analogy. Thus, experimental models trade off their controllability against their Degree of Analogy, which can be inferred from how the experiment was built.
It should be noted here that the controllability of an experiment in the context of the life sciences is not limited only by the trade-off with the Degree of Analogy. Living organisms are very complex and highly entangled, which means that perturbing one specific aspect will often impact other aspects, sometimes in fairly unpredictable ways. To this, we need to add all the ethical limits of animal and human experimentation. Sometimes the optimal experimental design is not possible for ethical reasons.
There is another way to build models of reality. As introduced above, models can be defined as “finalised cognitive constructs of finite complexity that idealise an infinitely complex portion of reality through idealisations that contribute to the achievement of knowledge on that portion of reality that is objective, shareable, reliable and verifiable” (Viceconti, 2011). If we accept this definition, then models can be built not only by perturbing/manipulating the physical reality we want to investigate (experimental models) but also by any other type of idealisation process. Here, we are interested in “in silico” models built through computational modelling and simulation of specific idealisation processes.
The idealisation processes we use to build in silico models can differ greatly. For example, statistical inference models are built through inductive reasoning framed in a frequentist or Bayesian theory of probability; biophysical mechanistic models are built by deductive reasoning starting from tentative knowledge that has resisted extensive attempts of falsifiability (laws of physics). While these differences will become vital in other chapters, here it will suffice to recognise that in silico models are built through some idealisation process.
We notice two significant differences when comparing in silico and experimental models. The first is that the Degree of Analogy an in silico model has with the reality under investigation cannot be inferred by how the model was developed. Since there is no grounding with the physical reality typical of experimental models, the degree of analogy must be demonstrated for each in silico model.
This is a major shortcoming of in silico models, which would almost always make us prefer experimental models if not for another important difference: the controllability of in silico models is entirely independent of their Degree of Analogy. This means that we could, in principle, consider using in silico models to reduce, refine, and replace experimental models when it is possible to demonstrate their Degree of Analogy with the reality being modelled, and when that Degree of Analogy is higher than that offered by experimental models with similar levels of control. The second motivation for using in silico models to reduce, refine, and replace experimental models is that, for the same Degree of Analogy and the same level of controllability, in silico models can provide the required answer faster and/or at a lower cost. A third motivation comes from the observation that, even for experimental studies within the currently accepted ethical boundaries, every animal and human experiment has an ethical cost that should be minimised as much as possible.
We can infer the Degree of Analogy of experimental models simply from how they are built; all we need to do is quantify their validity and reliability. On the contrary, with in silico models, we must demonstrate that the model has the necessary Degree of Analogy for each Context of Use before we can use it to reduce, refine, or replace experimental models.
2.4 Models for Hypothesis Testing, Models for Problem-Solving
In the previous section, we introduced experimental models as a necessity of the scientific method, which requires that each hypothesis born out of the observation of a natural phenomenon be relentlessly challenged with controlled experiments designed to falsify it. This is the classic use of models in fundamental science, where the goal is to increase our knowledge of the world around us. But there is another use for models, whether experimental or in silico: problem-solving. In his famous book “All Life is Problem Solving” (Popper, 1994), Karl Popper insists on using tentative scientific knowledge to solve problems affecting human life, including healthcare.
All our reflections in this position report relate to the use of models for problem-solving, and in particular to a specific class of problem: determining, before its widespread use, whether a new medical product is sufficiently safe and effective to justify its marketing authorisation.
While in knowledge discovery, the focus is on the falsifiability of hypotheses, in problemsolving, we assume that the knowledge used to build our predictive models (if any) is tentatively true. However, this does not automatically imply that the model predictions will be accurate; several factors may introduce errors in the prediction, which we will detail in the next section. Therefore, it is necessary to systematically assess the Degree of Analogy before a predictive model is used in a missioncritical context (e.g., a predictive model of a medical device or medicine that is intended to save a patient’s life).
Another related dichotomy, frequently used to separate statistical models from machine learning models, is that between inference and prediction. Inference aims to generalise to an entire population the properties observed in a sample of that population; the purpose of inference models is representational in nature. Prediction, on the other hand, aims to forecast unobserved data, such as future behaviour (e.g., in a business context, predictive modelling uses known results to create, process, and validate a model that may be used to forecast future outcomes in a specific context of use); the purpose of predictive models is thus forecasting in nature. While inference is backed by a robust mathematical theory (probability theory) and, in particular, by the Law of Large Numbers, which has resisted extensive falsifiability attempts, this theory does not necessarily apply to data-driven predictive models, which makes the evaluation of the Degree of Analogy for data-driven models epistemologically challenging.
2.5 Assessing the Degree of Analogy of a Model: Evidence by Induction
The predictive accuracy of a model can be estimated by comparing its predictions to the results of a matching controlled experiment. Matching here means that the model should be informed with a set of inputs that quantify the independent variables of the controlled experiment, i.e., the quantities we control in the experiment. By doing so, we associate the model with that specific experiment. Thus, the predictive accuracy (for that particular set of inputs) is the degree of agreement between the values of the dependent variables measured in the controlled experiment and the same values as predicted by the model. This activity is usually called experimental validation of a predictive model. It should be noted that in classic validation studies, the errors affecting the measurement methods used in the experiment are expected to be negligible compared to those affecting the model’s prediction; this allows the assumption that the measured value is “true” and that the difference between prediction and measurement is due to the errors affecting the model. When this is not the case, comparing the model to the experiment becomes much more complex.
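As an illustration, the comparison between a model's predictions and the measurements of a matching controlled experiment can be sketched as follows; the function name and all numerical values are hypothetical, chosen purely for illustration:

```python
import numpy as np

def predictive_error(predicted, measured):
    """Point-wise predictive error for one validation experiment.

    Classic validation assumes the measurement error is negligible compared
    to the model's error, so the measured values are treated as "true".
    """
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    abs_err = np.abs(predicted - measured)   # absolute error per quantity
    rel_err = abs_err / np.abs(measured)     # relative error per quantity
    return abs_err, rel_err

# The model is informed with the experiment's controlled (independent)
# variables; its outputs are compared to the measured dependent variables.
pred = [102.0, 98.5, 110.2]    # model predictions (hypothetical)
meas = [100.0, 100.0, 108.0]   # experimental measurements (hypothetical)
abs_err, rel_err = predictive_error(pred, meas)
```

Note that the resulting accuracy estimate holds only for this particular set of inputs.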
There is a major issue with this approach: it is inductive in nature. By validating the model with one experiment, we estimate its predictive accuracy for those input values. This only allows us to say that the model has a certain accuracy when used to predict the specific condition described by those input values. A priori, nothing can be said about the model's accuracy for other input values. Of course, another validation experiment can be performed, followed by calculating the model's predictive accuracy for a second set of inputs; still, this will extend our validity statements only to this second condition. We can perform many validation experiments and try to build, by induction, a general validity for our model, or we can look at the nature of the predictive error the model exhibits and search for patterns and regularities.
The analysis of how the prediction error is composed is more commonly used in the validation of mechanistic, knowledge-based predictive models, whereas validation by induction is typical for data-driven predictive models. The separation of the predictive error into its numerical, epistemic, and aleatoric components is the central motivation for so-called Verification, Validation, and Uncertainty Quantification (‘VVUQ’) (Viceconti et al., 2020b).
2.6 The Theoretical Framing of VVUQ
VVUQ developed within the engineering sciences as an empirical practice without clear theoretical foundations. This may sound surprising, but historically even the most important numerical methods in engineering, such as finite element analysis, were first developed empirically and only later found a theoretical framing, in this case as a special case of the Galerkin method. Like all practices, the meaning of VVUQ may vary among practitioners. Moreover, VVUQ is frequently used in engineering science without asking why this process should inform us about the credibility of knowledge-based predictive models better than any other approach.
However, for the purpose of this chapter, it is important to make explicit the theoretical framing that supports the use of VVUQ. This is because, as we will see, this approach relies on several assumptions, which might not always be true when the evaluated model predicts complex living processes. Here, we provide a summary; full details can be found in Viceconti et al. (2020b).
There are three possible sources of predictive error in a knowledge-based model:

- the numerical error we commit by solving the model’s equations numerically;
- the epistemic error we commit due to our incomplete, idealised, or partially fallacious knowledge of the phenomenon being modelled;
- the aleatoric error due to the propagation of the measurement errors that affect all the model’s inputs.
If we compare a model’s prediction to the result of a controlled experiment, we will observe a difference caused by all these errors. The VVUQ process aims to separate these three components of the predictive error. If the model is solved appropriately, then we expect the numerical error to be negligible compared to the other two. We expect the aleatoric error to be comparable to the measurement errors that affect the model’s inputs. If this is not the case, the model might have mathematical or numerical instabilities. In other words, we want to be reassured that the epistemic error is the predominant component of the predictive error.
Verification activities aim to quantify the numerical error. At the risk of oversimplifying, verification tests the model with special input values for which the epistemic and aleatoric errors are exactly null or asymptotically convergent to null. While verification is performed only for these special input values, because numerical errors are generally independent of, or only weakly dependent on, the inputs, we assume that the numerical errors found with those special input values will remain roughly the same for any other input value.
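A minimal sketch of such a verification test, using a toy model with a known analytical solution (the exponential decay equation, chosen here as an illustrative assumption): since the exact answer is known and no measured inputs are involved, the epistemic and aleatoric errors are null by construction, so any discrepancy is purely numerical.

```python
import math

def solve_decay_euler(u0, rate, t_end, n_steps):
    """Explicit Euler integration of du/dt = -rate * u, a toy 'model'."""
    dt = t_end / n_steps
    u = u0
    for _ in range(n_steps):
        u = u + dt * (-rate * u)
    return u

# The exact solution u(t) = u0 * exp(-rate * t) provides the reference value.
exact = 1.0 * math.exp(-1.0)

# Refining the discretisation should shrink the numerical error at the
# method's theoretical rate (first order: halving dt roughly halves the error).
err_coarse = abs(solve_decay_euler(1.0, 1.0, 1.0, 100) - exact)
err_fine = abs(solve_decay_euler(1.0, 1.0, 1.0, 200) - exact)
```

Observing the theoretical convergence rate is the typical acceptance criterion in such tests.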
Uncertainty quantification explores how the experimental errors affecting the model inputs propagate within the model and affect the predicted values. The input values are perturbed according to the probability distribution of the experimental error affecting them, and the variance induced in the predicted outputs is recorded. Uncertainty quantification directly estimates the aleatoric error for a specific set of input values. It is usually assumed that the way the error due to the inputs’ uncertainties propagates into the model’s predictions is independent of the specific values of the inputs. In other words, what is usually tested is how the variance of the inputs around a single set of average input values propagates.
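A Monte Carlo sketch of this input-perturbation procedure, using a deliberately simple stand-in model (the function and all numbers are our own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

def model(x):
    """Stand-in deterministic model y = f(x); any simulator could sit here."""
    return 3.0 * x + x ** 2

# Perturb the input around its nominal value according to the probability
# distribution of its measurement error, then record the spread induced in
# the predicted output: a direct estimate of the aleatoric error for this
# specific nominal input value.
x_nominal = 2.0
x_sigma = 0.1    # assumed standard deviation of the input's measurement error
samples = rng.normal(x_nominal, x_sigma, size=20_000)
outputs = model(samples)
aleatoric_spread = float(outputs.std())
```

For this smooth model the output spread stays close to |f'(x_nominal)| * x_sigma = 0.7, as first-order error propagation predicts; a much larger spread would hint at an instability.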
Validation activities rely on two assumptions. First, the numerical errors are negligible compared to the other two sources of error. Second, the aleatoric error is normally distributed around a null mean. If both assumptions hold, the aleatoric errors will cancel out when we calculate the predictive error as an average (e.g., a root mean square average) over multiple experiments, and the numerical errors will contribute negligibly, leaving the average predictive error as a good estimate of the epistemic error.
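This cancellation argument can be illustrated with a small numerical sketch, in which a hypothetical model has a constant epistemic bias and each validation experiment adds zero-mean aleatoric noise (both values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical validation campaign: each of 10,000 simulated experiments
# observes a predictive error equal to a fixed epistemic bias plus
# zero-mean, normally distributed aleatoric noise.
epistemic_bias = 0.5
aleatoric_noise = rng.normal(0.0, 2.0, size=10_000)
observed_errors = epistemic_bias + aleatoric_noise

# Averaging over many experiments cancels the aleatoric component,
# leaving a good estimate of the epistemic error.
epistemic_estimate = float(observed_errors.mean())
```

A single experiment, by contrast, would confound the bias with noise whose standard deviation is four times larger.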
The last step in the VVUQ process is the so-called applicability analysis. While we tend to assume that numerical and aleatoric errors do not depend on the specific values of the inputs, such an assumption cannot in general be made for the epistemic error. On the contrary, it is expected that any idealisation holds within limits of validity, and as we get closer to those limits, the epistemic error will increase. There are various approaches to evaluating the applicability of a model, but most rely on one fundamental assumption: if two input sets are similar, the two output sets will also be similar. If the model shows similar epistemic errors for all tested inputs, we can expect the epistemic error to be similar for all other input values within the range of values tested during the validation.
Additionally, the reliability of the epistemic error estimate obtained during the validation activities decreases the further the model inputs drift outside the range of values tested in the validation. Another issue to consider, as mentioned above, is that every mechanistic model relies on theories, and every theory has limits of validity. The model’s predictive accuracy can degrade considerably as the inputs reach these limits.
2.7 Levels of Credibility Testing
The combination of VVUQ and applicability analysis extends the concept of a model’s credibility to combinations of input values that have not been experimentally validated. However, assessing whether a predictive model is credible enough for a specific context of use has two additional aspects: the level of credibility at which we test the model, and the minimum predictive accuracy below which we must reject the use of the model. The level of credibility testing is not an attribute of the model; rather, it is the expectation of the model's predictive accuracy, which we define by choosing against which reference we calculate the model's predictive accuracy. Three possible levels are shown in Fig. 2.1, for the specific case of a model that predicts a single output set (as opposed to models that predict entire distributions of possible values):

- At the lowest level of credibility testing (L1), models aim to predict a value within the range of values observed experimentally over a population. Here, the predictive accuracy is measured in terms of the probability that the predicted value for each Quantity of Interest (‘QoI’) is a member of the population of values measured experimentally.
- The second level of credibility testing (L2) expects the model to accurately predict some central property (e.g., the average) of the distribution of values observed experimentally over the population. Here, the predictive accuracy is quantified by measuring, for each QoI, the distance between the predicted value and the average of the values measured experimentally.
- Lastly, the highest level of credibility testing (L3) expects the model to accurately predict the observed value for each member of the population. Here, the predictive accuracy is calculated as a p-norm of the vector of differences between the predicted and measured values for each member of the population. A 2-norm (root mean square error) is commonly used; a more restrictive infinity-norm, where the error measure is the maximum error found among all members of the population, may also be used.
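The three accuracy measures can be sketched as follows for a model predicting one QoI over a population; the function name and the toy data are our own illustrative choices:

```python
import numpy as np

def credibility_metrics(predicted, measured):
    """Illustrative accuracy measures for the three levels of credibility testing."""
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)

    # L1: is each predicted value within the experimentally observed range?
    l1_within = (predicted >= measured.min()) & (predicted <= measured.max())

    # L2: distance between predicted and measured central values (here, means).
    l2_dist = abs(predicted.mean() - measured.mean())

    # L3: p-norms of the per-member error vector.
    errors = predicted - measured
    l3_rmse = float(np.sqrt(np.mean(errors ** 2)))  # 2-norm (root mean square error)
    l3_max = float(np.max(np.abs(errors)))          # infinity-norm (worst case)
    return l1_within, l2_dist, l3_rmse, l3_max

l1, l2, rmse, worst = credibility_metrics([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
```

On these toy data the model passes L2 exactly (identical means) while failing L1 for two of the three members, showing that the levels probe genuinely different expectations.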
While this taxonomy of the level of credibility testing is not considered in any current regulatory document, we recommend it be considered in future guidelines and standards.
2.8 The Conundrum of Validating Data-Driven Models
Model credibility frameworks based on VVUQ plus applicability analysis were developed with mechanistic models in mind, i.e., models built starting from a causal explanation of the phenomenon of interest. By considering epistemic errors, VVUQ-based credibility accepts that the prior knowledge used to build the model might be inaccurate, but such knowledge is always present. In most cases, it is expressed in mathematical forms whose properties summarise it. For example, all theories expressed with differential equations implicitly assume that the variation of the quantities of interest over space and/or time occurs smoothly; this, in turn, derives from essential physical knowledge on the conservation of mass, momentum, and energy. In fact, many of the implicit assumptions underlying the use of VVUQ to assess a model’s credibility, listed in the previous sections, are valid only under such conditions.
But this raises an important question: can credibility assessment based on VVUQ plus applicability analysis also be used for models that are not built from prior knowledge (hereinafter referred to as ‘data-driven models’)? The short answer is no; here, we provide some theoretical justifications for this conclusion.
In probability theory, the Central Limit Theorem (‘CLT’) tells us that, when we sample a population, the distribution of the sample mean converges to a normal distribution as the sample size grows. The Law of Large Numbers (‘LLN’) states that, with enough samples, the estimates of certain properties of the probability distribution, such as the average or the variance, asymptotically converge to the true values for that population. This guarantee of asymptotic convergence makes it possible to infer the properties of a distribution from a large but finite number of samples.
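A quick numerical illustration of this asymptotic convergence (the distribution and the sample sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)

# Law of Large Numbers: the sample mean of draws from a uniform
# distribution on [0, 1] converges to the true mean 0.5 as the
# sample size grows.
true_mean = 0.5
errors = []
for n in (10, 1_000, 100_000):
    sample = rng.uniform(0.0, 1.0, size=n)
    errors.append(abs(float(sample.mean()) - true_mean))
```

With 100,000 samples the estimation error is typically well below 0.01; it is this guarantee, not available for the prediction error of data-driven models, that the rest of the section examines.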
Let us now consider the use of a statistical model as a predictor. Here, using statistical inference, we test the hypothesis that the value of the dependent variable Y can be predicted given the values of a set of independent variables [X_{1}, …, X_{n}], so that Y = f(X_{1}, …, X_{n}). For simplicity of treatment, we assume that the variables X_{i} can be quantified without any uncertainty. By inferring the relations and correlations between the X_{i} and Y, one can build an estimate of f(), called f’(), which can be used to predict Y for combinations of the X_{i} that have not yet been observed experimentally. If the LLN holds, a finite number of observations [X_{i}, Y_{i}] suffices to build f’().
But if we now want to quantify the predictive accuracy by comparing the predicted value Y’(X) = f’(X) to that observed experimentally (Y) for a finite number of input sets X_{i}, does the LLN still apply? Given a large enough set of validation experiments in which we observe P(Y | X_{i}), is there a theoretical foundation to assume that the estimate of the average prediction error e’ = ave(Y’(X_{i}) − Y_{i}) tends to the true value e that would be obtained if we could validate the predictor with an infinite number of experiments? In other words, does the estimate of the average prediction error tend asymptotically to the true average prediction error?
When estimating, we learn about the characteristics of a population by taking a sample and measuring those characteristics. The fact that we have only a sample brings about variability (uncertainty), normally described by a probability distribution whose parameters are related to the characteristic of interest. Usually, the more information we have about the characteristic (the larger the sample size), the greater the accuracy (estimating the correct value of the characteristic) and precision (decreasing the uncertainty) of the estimation. If some very mild conditions apply, we can assume the variability in the estimators follows a normal distribution:

\( f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}} \)

where x is the measured quantity, \(\mu\) is the mean, and \(\sigma\) is the standard deviation.
Now we must consider an extra source of uncertainty if the objective is to make predictions; indeed, this changes the nature of the statistical problem. The formal problem is to analyse P(Y | X), with Y the characteristic of interest and X all the relevant data available. Of course, Y and X are formally related by a set of parameters relevant to X and Y; for simplicity, formalise them as g(X | A) and g(Y | B), with B = B(A) an invertible function. So, one can learn about A—and hence B—from X, and that knowledge is described by f(B | X)—which can be Gaussian and with increasing precision and accuracy as above. We can then use this knowledge to inform g(Y | B), but there is still a source of uncertainty in g(Y | B) that cannot be reduced further, even if B is known exactly; moreover, the shape of g is not guaranteed to be normal at all.
There are alternative ways to include the information about B in g(Y | B). But in any case, the law of iterated expectations provides a formal interpretation: the expected value of Y | X can be calculated as the expected value of the expected value of Y | X, B. The more we learn about B from X, the better the estimation of the mean of Y; thus, larger sample sizes yield more accurate predictions. However, this is not necessarily the case for the precision of the prediction. By the law of total variance, the variance of Y | X can be calculated as the expected value of the variance of Y | X, B plus the variance of the expected value of Y | X, B. The second term decreases with the sample size, but the first does not, and it depends on the distribution of Y | B. Therefore, the validation of a predictive model must take both sources into account.
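The two decompositions described above can be written compactly; this is a standard statement of the laws of iterated expectations and of total variance, given here in the chapter's conditional notation:

```latex
\mathbb{E}[Y \mid X] \;=\; \mathbb{E}_{B \mid X}\big[\, \mathbb{E}[Y \mid X, B] \,\big]

\operatorname{Var}(Y \mid X) \;=\;
  \underbrace{\mathbb{E}_{B \mid X}\big[\operatorname{Var}(Y \mid X, B)\big]}_{\text{irreducible, set by } g(Y \mid B)}
  \;+\;
  \underbrace{\operatorname{Var}_{B \mid X}\big(\mathbb{E}[Y \mid X, B]\big)}_{\text{shrinks with sample size}}
```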
Let us assume we are interested in predicting a quantity Y, which depends on a set of values X, and that f() is a predictive model providing an estimate of Y, which we call Y’. The premise of model credibility assessment based on VVUQ is that the model f() is mechanistically defined, so we know that Y = f(X), and any variable outside the set X has little or no effect on Y (or its effect is mediated through X).
An important implication of all this is that the smoothness of the prediction error is not guaranteed for data-driven models as it is for mechanistic models. In mechanistic models, we can assume that the error e = Y − Y’ depends only on X, so if we test f() for X_{1} and X_{2}, where X_{1} ≈ X_{2}, the prediction errors will be similar, e_{1} ≈ e_{2}. This also means that if e(X_{1}) is the prediction error for X_{1}, and e(X_{2}) is the prediction error for X_{2}, then e(X_{i}) will be close to e_{1} and e_{2} whenever X_{i} is close to X_{1} and X_{2}. In other words, if the model is validated for a range of X_{i}, it can safely be assumed that the error will be similar for any other X close to the X_{i}. But this cannot be said for data-driven models, such as Machine Learning (ML) models, because there is no guarantee that Y is a function only of X. We cannot, as we do with mechanistic models, test the ML model for a finite number of cases and assume that the average accuracy will not change significantly if more cases are tested. This is pure induction: by testing the ML model against ten experiments, we can only say what the error was for those ten cases; the next case could show a totally different error, even for an X_{i} close to the ten already tested.
This poses two significant problems when the VVUQ approach is used to assess the credibility of data-driven models. The first is that, while in mechanistic models the variance of Y can be mainly explained by the variance of X, in data-driven models this is not assured. As explained above, X may include variables that have little effect on Y. Also, we have no guarantee that all variables affecting Y are included in X. This uncertainty is the primary cause of the so-called 'concept drift', which sometimes causes a data-driven model to perform much worse than it did on the training set when the test is done against an independent validation set.
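The effect of a hidden variable missing from X can be sketched numerically. In this invented example, Y truly depends on X and on an unrecorded variable B; a data-driven model fitted on X alone performs well in-distribution but degrades sharply when the distribution of B shifts in the validation population:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setting: Y depends on X and on a hidden variable B that
# is NOT recorded in the dataset.
def make_data(n, b_mean):
    x = rng.uniform(0.0, 1.0, n)
    b = rng.normal(b_mean, 0.1, n)
    y = 2.0 * x + b          # Y = f(X, B), but only X is observed
    return x, y

# Train a purely data-driven model (a least-squares line) with B ~ N(0, 0.1)
x_tr, y_tr = make_data(1000, b_mean=0.0)
slope, intercept = np.polyfit(x_tr, y_tr, 1)

def predict(x):
    return slope * x + intercept

# In-distribution test set: small error
x_te, y_te = make_data(1000, b_mean=0.0)
rmse_in = np.sqrt(np.mean((predict(x_te) - y_te) ** 2))

# Shifted population: B's distribution changes, X's does not
x_dr, y_dr = make_data(1000, b_mean=0.5)
rmse_drift = np.sqrt(np.mean((predict(x_dr) - y_dr) ** 2))

print(rmse_in, rmse_drift)  # rmse_drift is several times larger
```

Nothing in the recorded inputs warns of the degradation: the X values in the shifted set are indistinguishable from those seen in training.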
The second is the lack of smoothness in the prediction error. As explained in Sect. 2.4, the applicability analysis presumes that the model's prediction error varies smoothly as the inputs of the model are varied. This makes it possible to assume that if the model is used with inputs "near" the values for which the model has been validated against experimental results, the error affecting the prediction will be similar to that quantified in the validation. However, no such assumption can be made for data-driven models.
The “Artificial Intelligence and Machine Learning (AI/ML) Software as a Medical Device Action Plan” published by the FDA in January 2021 explicitly refers to introducing a so-called Predetermined Change Control Plan in the US regulatory system. Thus, a total product lifecycle (‘TPLC’) regulatory approach to AI/ML-based SaMD is designed considering the iterative, adaptive, and autonomous nature of AI/ML technologies. Essentially, the idea is that the validation of data-driven models is a continuous process where we continuously extend the test set, re-evaluate the model's predictive accuracy, and then regenerate the model using this test set as an extended training set.
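The continuous validation cycle just described can be sketched as a loop. This is a minimal illustration, not any specific regulatory procedure: the model, the data generator, and the acceptance threshold are all invented stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)

def fit(x, y):
    # Stand-in for "regenerate the model" (here, a least-squares line)
    return np.polyfit(x, y, 1)

def evaluate(model, x, y):
    # Predictive accuracy as root-mean-square error on held-out cases
    pred = np.polyval(model, x)
    return float(np.sqrt(np.mean((pred - y) ** 2)))

def new_cases(n):
    # Stand-in for newly collected real-world cases
    x = rng.uniform(0.0, 1.0, n)
    return x, 2.0 * x + rng.normal(0.0, 0.1, n)

threshold = 0.2                  # predefined acceptance criterion
x_all, y_all = new_cases(100)    # initial training set
model = fit(x_all, y_all)

for _ in range(5):               # each cycle of the lifecycle:
    x_new, y_new = new_cases(50)              # 1. extend the test set
    rmse = evaluate(model, x_new, y_new)      # 2. re-evaluate on new data
    assert rmse < threshold                   #    (failure triggers review)
    x_all = np.concatenate([x_all, x_new])    # 3. fold the test data into
    y_all = np.concatenate([y_all, y_new])    #    the training set and
    model = fit(x_all, y_all)                 #    regenerate the model
```

The key point is step 2: the accuracy claim is never extrapolated beyond the cases actually tested, which is exactly the inductive regime available to data-driven models.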
Our reflections suggest that this approach is not only possible but should be the only acceptable one. In light of this discussion, the idea of a “frozen” data-driven model validated using VVUQ procedures developed for mechanistic models seems unwise. This, however, conflicts with the obvious need for regulatory approval processes to base their decisions on predictions made with a “frozen” model.
2.9 Conclusions
Some conclusions can be drawn to inform the rest of this position report.
The human mind can investigate reality only through cognitive artefacts we call models. Whether we use a mathematical model or a controlled experiment (including observational studies), we always deal with models of reality; ultimately, what matters is the Degree of Analogy that the model has with the reality being modelled. The main advantage of experimental methods over in silico methods is that the Degree of Analogy of experimental models can easily be inferred from their design, whereas the Degree of Analogy of in silico methods must be assessed on a case-by-case basis.
When a model is used to make predictions in the context of problem-solving, the Degree of Analogy with the reality being modelled becomes the credibility of the model’s predictions. In general, credibility can be assessed only by induction, so if we quantified the predictive accuracy of our model against 100 experimental measurements, we could only state the credibility of the model under those 100 experimental conditions. The number of experimental conditions for which the predictive accuracy would need to be tested, called the “solution space”, is infinite. However, under certain assumptions, we can analyse how the various components of the predictive error (numerical, aleatoric, and epistemic) vary over the solution space using a process known as VVUQ plus applicability analysis. This makes it possible to estimate the predictive accuracy over the entire solution space based on a finite number of validation experiments.
The assumptions that make the VVUQ process possible are usually valid only if the tested model is built with some degree of prior knowledge (e.g., a mechanistic model). This is not true for data-driven models, which can only be tested by induction.
2.10 Essential Good Simulation Practice Recommendations

The human mind can only understand reality through models. Models are finalised cognitive constructs of finite complexity that idealise an infinitely complex portion of reality. Their usefulness is measured by their ability to capture the functional aspects of interest of the portion of reality that we are investigating. This measure is called the Degree of Analogy.

In each portion of reality, the functional aspects of interest can be observed experimentally or predicted through inductive or deductive reasoning. All these methods of investigation are models. However, the Degree of Analogy of experimental models can be directly inferred, whereas that of predictive models must be demonstrated by comparisons with controlled experiments. In other words, experiments are not necessarily more trustworthy than predictions, but their trustworthiness is easier to assess.

Predictive models can be divided into predominantly data-driven models and predominantly mechanistic models. In predominantly mechanistic models, the Degree of Analogy can be established by decomposing the predictive errors into numerical, aleatoric, and epistemic errors through a process known as Verification, Validation, and Uncertainty Quantification. But in predominantly data-driven models, the Degree of Analogy can only be estimated by induction, using a total product lifecycle regulatory approach.
References
Frigg, R., Hartmann, S. (2020). Models in Science. The Stanford Encyclopedia of Philosophy (Spring 2020 Edition). https://plato.stanford.edu/archives/spr2020/entries/models-science/
Massoud, T. F., Hademenos, G. J., Young, W. L., Gao, E., PileSpellman, J., & Viñuela, F. (1998). Principles and philosophy of modeling in biomedical research. The FASEB Journal, 12, 275–285. https://doi.org/10.1096/fasebj.12.3.275
Popper, K. R. (1994). All life is problem solving. London: Routledge.
Viceconti, M., Juárez, M. A., Curreli, C., Pennisi, M., Russo, G., & Pappalardo, F. (2020). Credibility of In silico trial technologies—a theoretical framing. IEEE Journal of Biomedical and Health Informatics, 24, 4–13. https://doi.org/10.1109/JBHI.2019.2949888
Viceconti, M. (2011). A tentative taxonomy for predictive models in relation to their falsifiability. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 369, 4149–4161. https://doi.org/10.1098/rsta.2011.0227
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2024 The Author(s)
Viceconti, M. et al. (2024). Theoretical Foundations of Good Simulation Practice. In: Viceconti, M., Emili, L. (eds) Toward Good Simulation Practice. Synthesis Lectures on Biomedical Engineering. Springer, Cham. https://doi.org/10.1007/978-3-031-48284-7_2
Print ISBN: 978-3-031-48283-0
Online ISBN: 978-3-031-48284-7