1 Introduction

Machine Learning (ML) methods in general (Hastie et al., 2009) and Deep Neural Networks (DNNs) in particular (Goodfellow et al., 2016; LeCun et al., 2015) have achieved tremendous successes in the past decade. For example, a classifier based on the ResNet architecture (He et al., 2016) reached human-level accuracy in the ILSVRC2015 challenge (Russakovsky et al., 2015). Furthermore, Neural Networks based on the idea of transformers (Vaswani et al., 2017) have recently spurred breakthroughs in the field of Natural Language Processing (NLP), enabling high-quality machine translation. Answers generated by Large Language Models (LLMs) such as GPT-3 (Brown et al., 2020) have achieved an impressive level of similarity to those generated by humans. There is by now convincing evidence that the best ML/DNN algorithms can effectively learn highly sophisticated tasks. However, some important questions remain open concerning the reliability of DNN algorithms.

First, we cannot completely exclude the possibility that successful DNN algorithms are over-fitting the collections of datasets that are used to train and test them (Recht et al., 2018). In fact, because of the difficulty of collecting good-quality labelled data, a small number of very popular datasets (in particular CIFAR-10, Krizhevsky et al. (2014), ImageNet, Russakovsky et al. (2015) and a few others) serve as the benchmark for the majority of the research work on DNNs. This challenges one of the key statistical assumptions of all ML methods, namely that the parameters are set independently of the test data. In principle, nothing about the test data should be known in the design and training phase. In practice, train-test contamination can occur in subtle ways, even if we do not directly use test data for training (Kapoor & Narayanan, 2023). Strictly speaking, if some data have been used in an article that we read before designing our new DNN architecture, those data should not be used to test our new architecture, because they might already have influenced our design choices indirectly. In practice, it can be hard to adhere to this requirement rigorously.

Secondly, successful applications of ML methods are much more likely to be published than unsuccessful ones. While publication biases (Dickersin et al., 1987) affect all types of scientific research, one might expect a stronger impact on ML research, because ML models are less likely to bring valuable insight when they fail. In fact, we have less compelling reasons to believe that they must succeed, which is related to our lack of a full understanding of the mechanism that allows a DNN model to learn some features and ignore others.

Furthermore, assessing the confidence levels of ML predictions is notoriously difficult, especially in the case of DNNs. The best evidence of these difficulties is provided by so-called adversarial examples (Szegedy et al., 2014): images that are misclassified with very high confidence by the DNN classifier, although they are indistinguishable, to a human, from other images that are correctly classified. Adversarial examples are much more pervasive in DNNs than initially expected (Carlini & Wagner, 2017). The problem is not the occurrence of misclassification itself, but the very high confidence (easily \(>99\%\)) quoted by the classifier. This is clearly not a reliable error estimate.
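As a concrete illustration of how such overconfident misclassifications can be produced, the following sketch applies the fast gradient sign method to a generic PyTorch classifier and reports the softmax "confidence" on the perturbed input. The model, input tensor and label are placeholders; this is only a minimal sketch of the standard technique under those assumptions, not the construction used in the works cited above.

```python
import torch
import torch.nn.functional as F

def fgsm_confidence(model, x, label, eps=0.01):
    """Craft a small adversarial perturbation (fast gradient sign method) and
    report the softmax 'confidence' the classifier assigns to the perturbed input."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    x_adv = (x + eps * x.grad.sign()).detach()   # visually imperceptible for small eps
    probs = F.softmax(model(x_adv), dim=-1)
    conf, pred = probs.max(dim=-1)
    return conf, pred                            # often > 0.99 on a wrong class
```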

Finally, the presence of social biases in some datasets widely used to train ML algorithms (Buolamwini & Gebru, 2018) is a matter of concern. Reliable error estimates could help significantly in detecting, at an early stage, those predictions that rely on too limited statistics. This would be an important step toward the goal of deploying responsible AI (European Commission, 2023; High-Level Expert Group on AI, 2019) at scale.

ML methods in general, and DNN in particular, are being used successfully in many contexts where assessing their expected errors is not necessary. This applies, for example, to all those cases where the ML method improves the search efficiency of candidate solutions that can be subsequently verified by other means (Duede, 2023). In these cases, ML methods are used in the context of discovery, and not in the context of justification. Examples cover a vast range of applications, spanning material discovery, drug discovery, predictive maintenance, fraud detection (Baesens et al., 2015), code suggestion (Brown et al., 2020), protein folding (Jumper et al., 2021) and many others. In those cases, assessing the reliability is the responsibility of the independent check, while the ML method merely improves the efficiency of the overall process. However, there are also applications where independent checks are not practical [e.g. for safety-critical real-time systems (Buttazzo, 2022)] or cases where it is crucial to estimate the amount of missed solutions (e.g. compliance applications). In those cases, assessing the reliability of the ML method is very important.

This paper addresses the reliability of DNN methods from a fundamental epistemological point of view. It is very important to combine this perspective with a purely statistical one, because even traditional science does not offer an absolute (assumption-free) guarantee that its predictions have specific probabilities. So, we must understand to what extent DNN models rest on grounds similar to those of more traditional scientific disciplines and to what extent they might suffer from fundamentally different problems.

This paper focuses on models based on DNNs, because these have been responsible for the most impressive successes in recent years and because they pose the most interesting challenges from an epistemological point of view. However, we will occasionally also consider other ML methods, such as Logistic Regression (LR) and Random Forest (RF) models, as they share some features of DNN models and some features of traditional scientific models. For the sake of concreteness, we focus on supervised ML models for binary classification. However, everything discussed in this article could be easily extended to other models of supervised ML. Unsupervised ML methods have quite different purposes and are beyond the scope of this paper.

The topic of this article falls at the intersection of the theory of complexity (Franklin & Porter, 2020; Hutter, 2007; Li & Vitányi, 2019; Zenil, 2020a), DNNs (Goodfellow et al., 2016; LeCun et al., 2015), responsible AI (Eitel-Porter, 2021; MacIntyre et al., 2021) and the epistemology of ML, which has attracted considerable attention recently [see e.g. Desai et al. (2022), Leonelli (2020)]. As opposed to many works on this topic (but similar to Kasirzadeh et al. (2023), Zenil & Bringsjord (2020) among others), this article stresses the important peculiarities of DNNs within ML methods. The present approach has some similarities with the one adopted in Watson and Floridi (2021). Some of the most important differences are: (a) the present focus is on reliability, which naturally leads to considering global, rather than local, interpretability; (b) for the same reason, the relevance defined in Watson and Floridi (2021) is less applicable and is not considered here; (c) finally, Watson and Floridi (2021) consider a subjective notion of simplicity, while we focus on the hardcore complexity that no human can reduce, regardless of language and individual skills. This article investigates the objective foundations of reliability assessment (i.e. based on well-defined assumptions). Hence, it is largely unrelated to the literature that investigates the sociological basis for the trust in ML models, or analogies between ML and human behaviours [see e.g., Clark and Khosrowi (2022), Duede (2022), Tamir and Shech (2023)]. Further comparisons with the literature are provided in the main text.

2 Assessing the reliability of model predictions in science and ML

To assess the reliability of any model we must be able to estimate the uncertainty of its predictions, in some precise and useful probabilistic sense. To start, it is convenient to distinguish between statistical and systematic uncertainties. Traditionally, statistical uncertainties are defined as those error sources which have a known statistical distribution (Bohm & Zech, 2017). These uncertainties can be safely analysed via statistical methods. Systematic uncertainties are all the others: they may stem from systematic distortions in the measurement devices, from imbalances in the data selection, from inaccurate models, from approximation errors, or from inaccurate parameter fitting. As emphasised in Bohm and Zech (2017), even random noise with unknown statistical properties must be classified as a systematic effect. Statistical uncertainties can be very difficult to estimate in practice, but the process is conceptually clear. They are known unknowns, as they are expected by the model. Systematic uncertainties are unknown unknowns, and they can be estimated only by enriching the model assumptions. This raises the question: which assumptions are acceptable when assessing our model assumptions?

This question is crucial and problematic not only for ML models but also for traditional scientific (TS) models, and it has been the focus of extensive analysis in the philosophy of science. It is therefore important to compare ML models to TS models in this respect, to understand to what extent ML introduces novel issues beyond the existing ones. It turns out that studying the reliability of ML models forces us to reconsider classic philosophical questions that are too often regarded as void of practical consequences.

Besides DNN and TS models, in the present comparison we also consider Logistic Regression (LR) models and Random Forest (RF) models, because they are very popular ML models whose properties often interpolate between DNN models and TS models in an interesting and instructive way.

2.1 Sources of errors

It is useful to distinguish four different sources of errors and see how they affect ML and TS models in different ways. Errors may stem (a) from data measurements, including the labels of training data; (b) from the model, which may not faithfully describe the actual phenomena in scope; (c) from any approximation that may be applied to the model to derive further predictions; and (d) from fitting the parameters that are left free by the model.

Source (a) is the same for all models under consideration (ML or TS). Note that all measurements are theory-laden (Duhem, 1954; Hanson, 1965). But whenever two models are employed to study the same phenomena, we can also assume that the theory behind these measurements is the same. Source (c) applies only when the original model is replaced by an approximation to enable further analytic or numerical derivation. ML models are already in a form that can be used for numerical applications, and therefore source (c) applies only to some TS models.

Sources (b) and (d) are the most interesting ones when comparing TS and ML models. Model errors (b) can be reduced by extending the basic (ideal) model with further elements designed to model the deviations from the basic model. However, this enrichment comes at the expense of more complex assumptions and more parameters to be fitted [generating more uncertainties of type (d)].

Source (d) refers to errors in the determination of the optimal parameters, even when the model might provide a correct representation of the phenomena. This may happen because of limited data, noisy data [source (a)], an inaccurate minimisation algorithm or an inexact parameter fitting process. Error source (d) has a statistical component, which is propagated from noisy data (a) and can be analysed statistically [along the lines of e.g. Marshall and Blencoe (2005) and references therein]. However, source (d) also has systematic components, due to the complexity of determining the optimal parameters and also due to the unknown statistical properties of the data. Moreover, increasing the number of parameters requires, in general, an exponential increase in the amount of data required for their determination. In conclusion, estimating (d) also requires making suitable assumptions.
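To make the statistical component of source (d) concrete, the following sketch propagates data noise into parameter uncertainty with a simple nonparametric bootstrap. The straight-line fit and the noise level are hypothetical choices made only for illustration, not an analysis taken from the references above.

```python
import numpy as np

def bootstrap_fit(x, y, fit_fn, n_boot=1000, seed=0):
    """Propagate data noise (source a) into parameter spread (statistical part of d)
    by refitting the model on resampled datasets."""
    rng = np.random.default_rng(seed)
    params = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))
        params.append(fit_fn(x[idx], y[idx]))
    params = np.asarray(params)
    return params.mean(axis=0), params.std(axis=0)

# Hypothetical example: noisy straight line, spread of slope and intercept.
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + np.random.default_rng(1).normal(0.0, 0.1, size=x.size)
mean, spread = bootstrap_fit(x, y, lambda a, b: np.polyfit(a, b, 1))
print(mean, spread)   # the systematic components of (d) are not captured here
```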

TS and ML models do not show fundamental qualitative differences from the point of view of the error sources above [except for the less important source (c)]. But they do display quantitative differences: ML models tend to be very flexible and characterised by significantly more parameters than TS models. Traditionally, fitting too many parameters was considered hopeless, because of the curse of dimensionality (Giraud, 2021). But the impressive predictive success of DNN models, despite their having vastly more parameters than training data points, shows that we cannot dismiss them based on naive parameter counting. The question of this paper is whether we can also ensure the reliability of models with so many parameters. We will see that this is a very different question; it is addressed in Sects. 3 and 4. But, before coming to this, we comment, in Sect. 2.2, on the radical idea that the success of DNNs might even represent a new kind of “theory-free” science.

2.2 The illusion of predictions without assumptions

A common fallacy is to hope that we can produce error estimates which do not depend on any assumption, or, equivalently, produce estimates that take into account all possible models. In the context of ML, this fallacy becomes even more tempting because ML models can be very flexible, and, for some DNN models, a universal approximation theorem applies (Hornik et al., 1989; Leshno et al., 1993). The idea of “theory-free” science does enjoy some support (Anderson, 2008) and it is not ruled out by others (Desai et al., 2022).

The simple reason why this idea is fallacious is the well-known (but often overlooked) underdetermination of the theory by the data (Duhem, 1954; Stanford, 2021; van Orman Quine, 1975). No matter how much data we have collected, there are always infinitely many different models that fit those data within any given approximation. Moreover, for every conceivable prediction, there exist infinitely many models that still fit all the previous data and also deliver the desired prediction. Data alone cannot enforce any conclusion.

Even ML models that enjoy the universal approximation property offer no exception to the underdetermination thesis. On the contrary, they provide more evidence for it, because we can include any desired prediction among the data that we want to describe, and the universal approximation property will ensure the existence of parameters that also reproduce the desired prediction within the desired approximation. The predictions of the ML model hence depend not only on the data but also on the architecture of the ML model and, in the absence of evidence to the contrary, also on the algorithmic and initialisation details.
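The following sketch illustrates this point on a hypothetical one-dimensional dataset: two networks of the same family fit the observed data comparably well, yet one of them is trained to also deliver an arbitrarily chosen prediction at an unobserved point. The architecture, sizes and target function are illustrative assumptions only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
y = np.sin(3.0 * X).ravel()                      # the "past data"

x_new = np.array([[2.0]])                        # an unobserved input
y_desired = 5.0                                  # an arbitrary "desired prediction"

m1 = MLPRegressor(hidden_layer_sizes=(200,), max_iter=5000, random_state=0).fit(X, y)
m2 = MLPRegressor(hidden_layer_sizes=(200,), max_iter=5000, random_state=0).fit(
    np.vstack([X, x_new]), np.append(y, y_desired))

# Both models describe the past data, but they disagree wildly at x_new.
print(m1.predict(x_new), m2.predict(x_new))
```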

Another frequent fallacy is the idea that we should only accept assumptions that have been tested. But how much and under which conditions should they be tested? Newton’s laws were tested extensively under all conceivable circumstances for over two centuries before it was discovered that they do not always hold. There is no way to fully test our assumptions (or to define their scope of applicability) in a future-proof way. They may fail in contexts that we cannot currently foresee.

In conclusion, any error estimate is necessarily based on assumptions that cannot be fully justified on an empirical basis. They might be narrower or broader than the assumptions of the model itself, but they must be acknowledged because the error estimates crucially depend on them.

3 Current approaches to assess the reliability of DNN predictions

After the general considerations of the previous section, let us discuss how the reliability of DNN predictions is currently assessed in the literature.

Until recently, it was very common to quote the normalised exponential function (or softmax) (Bishop, 2006) as a measure of confidence in a DNN prediction. However, it has been shown very convincingly that the softmax output lacks the basic features of a measure of confidence [see e.g. Guo et al. (2017)]. In particular, it tends to be increasingly overconfident for increasingly out-of-distribution samples (Hein et al., 2019). Ad-hoc attempts to correct this pathology lead to arbitrary results.
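A minimal numerical illustration of why the softmax output is not, by itself, a calibrated confidence: scaling the logits by a constant leaves the predicted class unchanged but drives the quoted "confidence" arbitrarily close to 1. The logit values below are made up for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])     # hypothetical network outputs
for scale in (1, 5, 20):
    p = softmax(scale * logits)
    print(scale, p.argmax(), round(p.max(), 4))
# The argmax (the decision) never changes, but the 'confidence' approaches 1.
```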

The issue of proper uncertainty estimation has received growing attention in recent years, and numerous review articles are available today (Abdar et al., 2021; Gawlikowski et al., 2021; Loftus et al., 2022). A popular view regards the Bayesian approach as the most appropriate framework, in principle (Wang & Yeung, 2016). According to this view, the main drawback of a full Bayesian estimate is its prohibitive cost, which leads to a very active search for approximations that offer the best trade-off between accuracy and computational efficiency (Blundell et al., 2015; Gal & Ghahramani, 2016; Hartmann & Richter, 2023; Jospin et al., 2020; MacKay, 1992; Sensoy et al., 2018; Titterington, 2004). Besides the Bayesian framework, the other main approaches rely either on ensemble methods (Lakshminarayanan et al., 2017; Michelucci & Venturini, 2021; Tavazza et al., 2021; Wen et al., 2020) or on data augmentation methods (Shorten & Khoshgoftaar, 2019; Wen et al., 2021). An alternative is to train the DNN to specifically identify outliers or uncertain predictions [see e.g. Malinin and Gales (2018), Raghu et al. (2019)]. Let us consider them separately in the following.

3.1 Bayesian error estimates

Consider the Bayesian approach first. The computational cost is not the only difficulty. It is well known that Bayesian error estimates depend on the assumption of prior distributions (or priors) for the parameters whose impact should be estimated, and the choice of priors is largely arbitrary (Gelman et al., 2013). In the limit of infinite statistics, the Doob and Bernstein–von Mises theorems (Van der Vaart, 2000) state that the posterior distributions converge to prior-independent limits. In simple words: “the data overwhelms the priors”.

However, such an argument is based on the unrealistic idea that the model is fixed, while we can rather easily collect more data. This is not what usually happens in real applications. In many cases, instead, collecting valuable data can be very hard, while producing new models is often much easier. This is the case in many TS domains, where planning a new experiment might take decades, while new models are published every day. It is also the case in most ML applications, where collecting labelled data can be extremely labour intensive, while new ML models are produced at the rate of millions per second by the ML algorithm. Under these conditions, there is no guarantee that the dependency on the priors will be sufficiently suppressed by the time the model is used.
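A toy Beta-Binomial calculation makes the point quantitatively: with abundant data the posterior mean is essentially prior-independent, but at realistic sample sizes two legitimate priors can still yield noticeably different estimates. The sample sizes and prior parameters below are arbitrary choices for illustration.

```python
def posterior_mean(successes, trials, a, b):
    """Mean of the Beta(a + successes, b + trials - successes) posterior."""
    return (a + successes) / (a + b + trials)

for trials, successes in [(10, 7), (100_000, 70_000)]:
    flat = posterior_mean(successes, trials, a=1, b=1)        # uniform prior
    informative = posterior_mean(successes, trials, a=50, b=5)
    print(trials, round(flat, 3), round(informative, 3))
# 10 trials:      0.667 vs 0.877  -> the prior still dominates
# 100000 trials:  0.700 vs 0.700  -> the data overwhelms the priors
```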

Increasing the number of parameters might look like a way to eliminate the need to change the model. But the Doob and Bernstein–von Mises theorems do not apply when the number of parameters is of the same order as the size of the dataset or larger (Johnstone, 2010; Sims, 2010). Extensions exist for high-dimensional problems (Banerjee et al., 2021) and even infinite-dimensional (aka nonparametric) ones (Ghosal & Van der Vaart, 2017). However, those extensions rely on the assumption that most parameters are very close to zero, which is achieved via LASSO priors or equivalent suppression mechanisms (Banerjee et al., 2021). Note that this is a much stronger assumption than requiring that all the relevant parameters live in some low-dimensional subspace of a high-dimensional model. A parameter suppression mechanism imposes a sparse realisation of the given parametrisation. As a result, even in the asymptotic limit of infinite statistics, the posteriors depend on the choice of the parametrisation, similar to the case of low-dimensional models. In other words, nonparametric models are not a way to eliminate model dependency.

The limitations of nonparametric models confirm the conclusions of Sect. 2.2 about the impossibility of estimating errors that take into account all possible models. The Bayesian approach is an excellent framework to discuss the probability of a model within a given class (say C) of models, but it cannot justify the selection of one class C of models over other options.

In principle, the class C should include a model that describes well all past and future data (what is sometimes called a true model). However, we cannot know which are the true models, of course, and we have to resort to other criteria.

The ambiguity in the selection of the class C applies to both TS and DNN models. However, in practice, scientists tend to agree on which other models should be included to assess the errors of a given TS model. Ideally, they should include all the models that might provide legitimate alternative descriptions of the relevant phenomena, because they offer different compromises between accuracy (for different phenomena) and non-empirical epistemic virtues. These are sometimes referred to as state-of-the-art models, but the class is not well defined unless we identify the appropriate non-empirical epistemic virtues. We will come back to this point in Sect. 4.2, where we consider a definition of state-of-the-art based on the simplicity of the assumptions (see also Appendix B). We will also see that identifying a similar class for DNN models is significantly more challenging.

3.2 Frequentist error estimates

The frequentist approach (Lakshminarayanan et al., 2017; Michelucci & Venturini, 2021; Wen et al., 2020) is not better equipped to answer the questions above (Sims, 2010). In fact, the frequentist error assessment relies on the same assumptions that characterise the model that has produced the predictions (and, possibly, additional assumptions). By construction, the frequentist approach is not designed to discuss the probability of the selected model. Frequentist error estimates do have a value, but only under the assumption that the model is valid for some parameters in the neighbourhood of the value selected by maximal likelihood.
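For concreteness, the following sketch shows the core of the (frequentist-flavoured) deep-ensemble recipe in PyTorch: several independently trained networks are queried, and the spread of their softmax outputs is used as a proxy for predictive uncertainty. It presupposes a list of already trained models and is only a minimal outline of the approach cited above.

```python
import torch

def ensemble_predict(models, x):
    """Average the softmax outputs of independently trained networks and use
    their disagreement as a proxy for predictive uncertainty."""
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(x), dim=-1) for m in models])
    mean = probs.mean(dim=0)   # ensemble prediction
    std = probs.std(dim=0)     # meaningful only under the shared model assumptions
    return mean, std
```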

This limitation is especially problematic for a model that displays high sensitivity to details that are not understood: any small change to the parameters, as happens, e.g., at every new training step, may bring the system to a configuration for which the previous error analysis is no longer relevant.

3.3 DNN-based error estimates

The spectacular and inexplicable success of DNNs represents a strong temptation to try to use DNN models to answer just about any difficult question, provided that we have at least some labelled examples to train the model. The fact that we do not understand why DNNs work, or when they do, makes almost any new application an interesting experiment in itself.

In this sense, using another DNN to detect outliers (Malinin & Gales, 2018) or uncertain predictions (Raghu et al., 2019) represents an interesting further application of DNN, but it does not provide an error estimate that relies less on DNN model assumptions. On the contrary, we must rely on two DNN models, in this case.

On the other hand, the same criticism may be directed equally well against traditional science: we rely on the laws of physics to build the experimental devices that we use to test the laws of physics themselves. And we rely on other laws of physics to determine the accuracy of our measurements. Why is this problematic for DNNs, if it is not for traditional science? Or should it be? Again, the analysis of DNNs forces a reappraisal of classic epistemological questions concerning traditional science. We do this in Sect. 4, but, before that, we consider one last potential approach.

3.4 Predictions are NOT all you need

Can’t we just use the success rate on the test dataset as an estimate of the expected prediction error? Intuitively, one expects that novel predictions on unseen data do provide a meaningful test. But, from a statistical point of view, the success rate corresponds to the simplest frequentist statistical estimator, with the limitations already discussed in Sect. 3.2. From a philosophical point of view, the idea that successful tests confirm a model is fraught with riddles (Crupi, 2021; Huber, 2014). For example, we cannot be sure that the future data will follow the same distribution as the past data. One might argue that this corresponds to a change of domain of application and the model cannot be blamed for that. But we cannot define the scope of applicability of any model in a way that is future-proof: the model may fail in circumstances that we currently cannot even imagine. All the attempts to quantify the degree to which the evidence confirms a model have shown the need to rely on extra assumptions which are themselves hard to justify (Huber, 2014). For these reasons, predictions alone, no matter how impressive, are never sufficient to warrant trust in a model, and likelihood, by itself, is never sufficient to determine model selection. Other non-empirical aspects are also important. But which ones?
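To see what the test success rate can and cannot deliver, the sketch below computes a standard Wilson score interval for a hypothetical test accuracy. The interval is a purely frequentist statement, meaningful only under the assumption that future inputs are drawn i.i.d. from the same distribution as the test set; the counts are invented for the example.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """95% Wilson score interval for a test-set success rate."""
    p = correct / total
    denom = 1.0 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Hypothetical result: 9420 correct predictions out of 10000 test images.
print(wilson_interval(9_420, 10_000))   # roughly (0.937, 0.946)
```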

These problems also apply to TS models, but they are more evident for DNN models, whose persuasive power relies relatively more on their empirical success than on their theoretical virtues. Although DNN successes are impressive, it is impossible to derive any quantitative level of confirmation from them without additional assumptions. For example, published results certainly suffer from positive selection. Kaggle competitions (Goldbloom, 2015) are very interesting also from a philosophical point of view because they provide a controlled framework for assessment. However, they too cannot be used directly to confirm either general DNN architectures or specific DNN models, because submissions are strongly biased in terms of the approaches used and the competencies of the participants. The fact that some successful DNN architectures keep being successful over time is certainly convincing evidence that they can learn something valuable. However, also in those cases, we cannot exclude that the DNNs have actually learned unwanted features that happened to have a strong spurious correlation with the labels [see, e.g., Xiao et al. (2021) for an example of this phenomenon]. This suggests that DNN predictions are only valuable when they stem from a valuable model. What characterises a valuable model, both in TS and in ML? This question is addressed in more depth in Sect. 4.

4 Assumptions, simplicity and interpretability

The previous discussions started from different perspectives which all led to the same conclusion: the reliability of a model necessarily depends on the reliability of all its model assumptions, which are never self-justifying. Empirical evidence alone is never enough, neither in ML nor in TS. In Sect. 4.1 we review which classes of assumptions characterise TS models, DNN models and other ML models. While we cannot assess the reliability of these assumptions, in Sect. 4.2 we propose a measure of their simplicity as the most relevant surrogate used in scientific practice (often implicitly). In Sect. 4.3 we identify a close relation between the simplicity of the assumptions and the concept of interpretability (now intensively discussed within the field of responsible AI).

Any model, whether TS or ML, is essentially characterised by all its assumptions and by its accuracy with respect to the existing empirical data. In the following, the expression “all assumptions” refers to all those that are necessary to reproduce the outcome of the model, including the comparison with the empirical evidence and the expression of the input data. Normally, the training data themselves are not part of the model assumptions: they contribute to selecting the model parameters (which are part of the assumptions) and they contribute to the empirical accuracy of the model, just like the experiments that contribute to the development of TS models. However, including the training data among the assumptions might lead to simpler formulations of some DNN models characterised by a huge number of parameters. Hence, this possibility is also considered.

4.1 Assumptions

TS models rely on multiple categories of assumptions: besides the specific assumptions of the model itself, they rely also on multiple scientific disciplines that are not the main focus of the specific TS model but are essential to constrain it. These disciplines are referred to as background science and they include a variety of domains, from logic and mathematics to basic physics, chemistry and the modelling of any relevant experimental device. TS models typically also include assumptions that restrict their own domain of applicability. In particular, TS models assume suitable values of the model parameters. The parameters inherited from background science must be consistent with a much broader spectrum of empirical evidence from very different domains. In this sense, traditional science adopts a kind of divide et impera strategy for parameter fitting.

On the other hand, DNN models require very few assumptions from background science (besides modelling the data collection). They are very generic and they are defined, ideally, only by the architecture (hyper-)parameters and the domain in which they are trained. However, this is not precise enough to determine their predictions, not even statistically. Predictions are certainly determined by the entire set of DNN parameters, which are typically on the order of billions in modern DNNs. Ideally, the specific values of all those parameters should not be essential to determine a prediction, which should depend only on the training domain. However, the training domain is difficult to define and the outcome may also depend on subtle details of the training process. In fact, weight initialisation and pre-training techniques are key design choices when training and deploying a DNN model (Glorot & Bengio, 2010; Narkhede et al., 2022). Adversarial examples (Szegedy et al., 2014) are also evidence of high sensitivity to details. If we had evidence that the training of a DNN model depends only on some simple rules and is independent of the specific initial values of the weights (within some clearly defined bounds and approximations), we would need to include only these rules as part of the assumptions of the DNN model. Instead, the lack of understanding of the training dynamics enforces the inclusion of the full specification of the DNN (initial) parameters and training data as part of the assumptions.

It is interesting to compare the DNN training process to other scientific applications of Markov Chain Monte Carlo (MCMC). For many TS models, there is no proof of convergence of the MCMC algorithm to a definite outcome. However, best-practice rules and diagnostic tools have been developed that enable a rather accurate formulation of the conditions that must be fulfilled for the algorithm to converge to a stable outcome [see e.g. Roy (2020)]. Formal proofs of independence from details, if available, are very desirable, because no further assumption is needed in that case. But assuming a few semi-empirical rules is also satisfactory, if they are de facto accurate. Despite considerable effort, researchers have not been able to identify any simple set of rules that makes the outcome of DNN models independent of any further detail. Unfortunately, research on this topic is very difficult, because it requires testing different initial conditions, which is computationally very expensive. Moreover, this difficulty might actually be an intrinsic price to pay for the great flexibility of DNN models [see also Hartmann and Richter (2023)]. High sensitivity to details is the key aspect of the so-called black-box problem (Desai et al., 2022) of some ML methods: lack of understanding is a serious limitation to the extent that we cannot tell which details actually matter for a conclusion.
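As an example of the kind of diagnostic rule alluded to above, the following sketch computes the classic Gelman-Rubin statistic for a set of MCMC chains: values close to 1 are taken as (necessary but not sufficient) evidence that the outcome no longer depends on the starting points. This is a generic textbook diagnostic, not a procedure specific to DNN training; the chains below are synthetic.

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin R-hat for an array of shape (n_chains, n_samples).
    Values close to 1 suggest the chains have forgotten their initial states."""
    chains = np.asarray(chains, dtype=float)
    n_chains, n_samples = chains.shape
    within = chains.var(axis=1, ddof=1).mean()
    between = n_samples * chains.mean(axis=1).var(ddof=1)
    var_plus = (n_samples - 1) / n_samples * within + between / n_samples
    return np.sqrt(var_plus / within)

# Hypothetical chains started from different initial conditions.
rng = np.random.default_rng(0)
chains = rng.normal(0.0, 1.0, size=(4, 2000))
print(gelman_rubin(chains))   # close to 1 for well-mixed chains
```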

Furthermore, TS models strive to be consistent when applied to different data and different domains. In particular, it is typically ensured that the values of parameters with identical meaning remain consistent, within expected uncertainties, across different applications. Excessively large deviations are interpreted as failures of the model. DNN parameters, on the contrary, are usually not regarded as something that should be consistent across applications. Although DNN training often starts from DNN models that were pre-trained on other datasets, further training is always performed for new applications, and no constraint is usually imposed to keep the parameters close to the original values. That means that different assumptions are made by DNN models for different domains of application.

A further key difference between TS and DNN models is the specification of the domain of application. TS models typically define their domain of applicability in terms of the features that play a role in the model, and these features are typically measurable. If the domain is defined in terms of measurable features, it becomes possible to suspend a prediction for out-of-domain data. For example, domain restrictions usually enter the detailed specifications of the experimental set-ups that are required to collect valid data. Normally, the domain of applicability can be formulated in terms of a limited set of additional assumptions. This does not ensure that some overlooked features may not play an unexpected role and compromise the accuracy of the model, but this scenario rarely happens for state-of-the-art TS models.

In the case of DNN models, a measurable specification of the domain of application is much more difficult, and it is practically never provided. One key idea of DNNs is that the relevant features do not need to be specified explicitly. But this complicates the formulation of the domain of applicability. At least the experimental set-up for valid data collection must be specified with sufficient precision to ensure the relevance of the training dataset. For DNN, however, it is more difficult to define the domain by describing the experimental set-up, because, again, we have a very limited understanding of which features are learned during the DNN training process. Hence, the risk of omitting relevant prescriptions is much higher than for typical TS models. If, on the other hand, we omit the specification of the domain, we should test the performance of the DNN on any possible phenomena, even those totally different from the scope in which the DNN was actually trained.

It is worth including LR models in this comparison because they represent an interesting alternative that lies somewhere in the middle ground between TS and DNN models. LR models are defined unambiguously by their features. Because the optimal regression coefficients are uniquely defined by the Maximum Likelihood principle, no coefficient needs to be included among the assumptions of the model. Domain restrictions are also specified via the relevant features.
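A minimal sketch of this point, with made-up data: once the two measurable features are fixed, the (unpenalised) maximum-likelihood coefficients are essentially unique, so refitting does not depend on initialisation details. The use of scikit-learn with penalty=None is an illustrative assumption (it requires a recent version of the library).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary-classification data with two measurable features.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=500) > 0).astype(int)

# With the features fixed, the maximum-likelihood fit is essentially unique:
# the features, not arbitrary training details, define the model assumptions.
model = LogisticRegression(penalty=None).fit(X, y)
print(model.coef_, model.intercept_)
```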

Another interesting comparison is with RF models: they also display significant sensitivity to some details (initial conditions and algorithmic hyperparameters), but they are based on limited and well-defined features. This enables, in the first place, a clear definition of the domain of applicability. It also makes it easier to identify conditions of robustness, for some applications and contexts. A deep analysis of the stability of RF models is beyond the scope of this article. However, it should be noted that their sensitivity to details seems closer to that of traditional MCMC applications than to that of DNNs.

4.2 Simplicity of the assumptions

The previous section shows that a key difference between TS models and DNN models is that the former use relatively few assumptions and few parameters for a wide range of phenomena. A classic scientific-philosophical tradition singles out precisely this aspect as the main non-empirical epistemic value that scientific models should try to achieve, in addition to empirical accuracy [see e.g., Barnett (1950), Chaitin (1975), Derkse (1992), Galilei (1962, p. 397), Kemeny (1953), Lavoisier (1862, p. 623), Lewis (1973), Mach (1882), Newton (1964, p. 398), Nolan (1997), Poincaré (1902), Scorzato (2013), Swinburne (1997), van Orman Quine (1963), Walsh (1979), Weyl (1932), Zenil (2020b), also reviewed e.g. in Baker (2004), Fitzpatrick (2014), Zellner et al. (2001)]. The first goal of this section is to clarify the meaning of “few assumptions”.

An apparent problem with this view is that the number (and the length) of the assumptions is catastrophically language-dependent: it can always be made trivial (hence meaningless) by a suitable choice of language. However, it was observed in Scorzato (2013) that requiring the inclusion of a basis of measurable concepts among the assumptions enforces a minimal achievable complexity (see details in Appendices A and B). The original argument of Scorzato (2013) is applicable to a wide class of TS models, but not immediately to DNN models. It is shown in Appendix A that the mere possibility of the existence of adversarial examples enables the extension of the argument to DNNs as well. This enables the reference to the epistemic complexity of a model in a way that is precise and language-independent.

How does the epistemic complexity of a model impact its reliability? This question does not have a simple answer. Certainly, we cannot dismiss DNN models on the grounds of their high complexity, because we mostly do not have simpler TS models that cover the same topics, and complexity-based model selection applies only to empirically equivalent models (see Appendix B for more details). However, a high complexity affects reliability in at least three indirect ways. The first one is interpretability. If it is challenging for any human to even identify all the assumptions of the model, or to review them, it becomes difficult to even formulate precisely the question: what is it whose reliability we are investigating? In this sense, a manageable epistemic complexity should be a precondition for a plausible assessment of reliability. Section 4.3 looks in greater detail into the relation between the epistemic complexity introduced here and the concept of interpretability that plays a prominent role in the current discussion about responsible AI.

The second relation between simplicity and reliability comes from the fact that simplicity enables the definition of the state-of-the-art models (see Appendix B), namely those models that represent all the possible compromises between simplicity and the many dimensions of accuracy. The state-of-the-art, in turn, represents a non-arbitrary class of models that offers a reference for a Bayesian framework for error estimation. The high epistemic complexity of DNN models makes it very difficult to identify the state-of-the-art. The computational cost of the Bayesian approach applied to DNNs has been widely acknowledged (Abdar et al., 2021; Gawlikowski et al., 2021; Loftus et al., 2022), but it is important to emphasise that this also makes it difficult to define the Bayesian errors in the first place.

The third indirect relation between simplicity and reliability is via (scientific) progress. It was already observed in Popper (1959) that simpler models offer a more direct path toward scientific progress. The present framework offers further support for this idea: simpler models offer fewer opportunities for small changes. Hence, they point more clearly to what must be changed, or they point to a fundamental deficiency of the model. Complex models might be more flexible and survive more empirical challenges, but that is not an advantage if we are looking for a better model. As a matter of fact, TS models are the result of a long history of small and large unambiguous improvements, where different models are unified under a single one (unification), where the parameters of some empirical model are derived from a more fundamental one (reduction), or where radically new models supersede the existing ones via revolutionary changes. This history of progress is certainly part of their perceived reliability. On the other hand, DNN models are too complex to allow the identification of this kind of unambiguous progress: every fine-tuning of parameters or architecture adjustment might bring some improvements, but it is difficult to tell what might be lost. This topic is discussed further in Sect. 5.

4.3 Interpretability

Recently, the notions of interpretability, explainability, explicability, transparency and related concepts have attracted considerable attention. Together, they build the core taxonomy of the topic of responsible AI (Arrieta et al., 2020). There is still considerable debate and obscurity around the precise definition of each of these terms, which have also started to acquire a legal connotation. The need for clarification has been analysed, from the philosophical point of view, by Beisbart and Räz (2022). In fact, the concept of interpretation has a long history in the philosophical literature [see, e.g., Lutz (2023) as a starting point]. It is not the goal of this paper to try to clarify the relations between these concepts.

Instead, we focus here only on what is relevant for reliability. To this end, it is worth noting that one of the most quoted definitions of interpretability reads “the degree to which a human can understand the cause of a decision” (Miller, 2019; Lipton, 2018). Here a “decision” translates to what we previously called a “model prediction”. What are the causes of model predictions? They are exactly the full set of model assumptions discussed in the previous section. In fact, they must include any information necessary to reproduce the model’s predictions. Anything less would be an incomplete cause; anything more would be superfluous.

It is essential to appreciate that what matters is to understand the causes (i.e. the model assumptions) and not the entire internal mechanism that leads, step by step, from a given input to the corresponding model output. Most of our best scientific theories do not allow such “understanding”, which is often encoded in millions of operations executed by computers. Not even the best world experts would be able to reproduce most of the model output without the support of those computers. But understanding these details is unnecessary, because all these operations are entirely determined by a few equations, which are the true causes, i.e. the model assumptions. Of course, this is true also because we know that little details, such as the specific seed of the random number generator used in the simulation code, do not matter. Requiring a full understanding of the detailed mechanism is an act of desperation in cases where we do not know which details actually matter, but it is not a sensible requirement when we do know that.

The view described above is consistent with the one defended by de Regt and Dieks (2005), who criticise the causal-mechanical conception of explanatory understanding for reasons similar to those given here. According to de Regt and Dieks (2005), experts are often able to connect some changes in model assumptions to some changes in model outcomes without performing calculations, because they extrapolate from other results that they know. But, for most modern scientific models, even the best experts cannot predict most of the outcomes without the help of external computational tools.

Another often-cited definition reads: “providing either an understanding of the model mechanisms and predictions, a visualisation of the model’s discrimination rules, or hints on what could perturb the model” (Arrieta et al., 2020). As mentioned above, “understanding the model mechanisms” (or visualising them) is too strong a requirement, one that would unfairly penalise most modern science. However, this definition also emphasises an important aspect: control of “what can perturb the model”. As discussed in the previous section, if a set of simple rules ensures the stability of the output (not necessarily a unique exact output, but at least a unique statistical distribution which does not depend on other details), then those rules are sufficient causes/assumptions for the model’s predictions. If not, more details must be included among the causes/assumptions.

Let us examine one more definition, which reads: “a method is interpretable if a user can correctly and efficiently predict the method’s results” (Kim et al., 2016). This definition tries to remove the ambiguity of the word “understanding”, which is used elsewhere, but without success: is the user allowed to utilise a tool to reproduce the model’s results? If she is, then she can just use another, identical, DNN model, and every DNN model becomes perfectly interpretable. If she is not, then most modern science is not interpretable, according to this definition, because no human being can go very far without any instrument. This definition is not ambiguous in the context of image recognition, where it was introduced (Kim et al., 2016), because it assumes that human vision is the tool available to the user. But it cannot be extended to a general context, as suggested in Molnar (2020). However, this attempt contains a very interesting aspect: the concept of understanding must be clarified; we must understand enough causes (i.e. model assumptions) to reproduce the outcome, and the outcome must be “consistent”, i.e. robust to changing details. This is achieved if we know all the model assumptions, if these do not require the inclusion of too many details, and if the output is computable for the relevant inputs (i.e. the input is legitimate for the model). So this last definition also supports an identification of the concept of interpretability with a measure of the simplicity of all the assumptions.

In conclusion, we can attribute a precise meaning to the concept of “interpretability”. In this way, “interpretability” does not depend on the vague notion of “understanding”, does not depend arbitrarily on the choice of language, and does not depend on the individual skills of the persons interacting with the model.

Moreover, interpretability is not only a desirable property for non-experts, it is a precondition for a plausible assessment of reliability. More precisely, lower interpretability makes the assessment of reliability less plausible. This claim can be justified as follows. Assessing the reliability of any model is done in two steps. The first step consists in identifying a class (C) of models that should be used to assess the reliability of the given target model (see Sect. 3.1). The second step consists in computing the confidence ranges (e.g. via the Bayesian priors and the Bayesian formula). The first step is the focus of this paper. In TS, scientists tend to agree, de facto, on which class C should be used (which we can call the state-of-the-art models, for which Scorzato (2016) proposes a characterisation; see Definition 3 in Appendix B). Whether we adopt Definition 3 or not, identifying the class C requires examining non-empirical epistemic properties of the corresponding models. If we do adopt Definition 3, the state-of-the-art becomes broader when more complex models are involved, because it is increasingly more difficult to check whether any of the many variations of a complex model should or should not be included in the class C. In fact, complex models leave larger opportunities for alternative models that are not more complex. This leads to larger confidence ranges and larger uncertainties on the confidence ranges themselves. Even if we did not adopt Definition 3, we should still evaluate some (unclear) non-empirical properties of the models, which must also include reviewing their assumptions. The lack of precise requirements cannot simplify the assessment of more complex assumptions. This forces, again, larger uncertainties on the confidence ranges. In both cases, more complex (i.e. less interpretable) assumptions lead to less plausible assessments of reliability.

5 Conclusions and way forward

DNNs are being used effectively in many contexts where it is not necessary to assess their reliability precisely. This is the case when DNN models generate candidate solutions that are eventually checked via independent tests (Jumper et al., 2021), or when they generate data that do not need to be sampled with a strictly defined distribution, because the conclusion has no strict statistical value or because the distribution is corrected afterwards (Albergo et al., 2019).

However, when assessing the expected errors is necessary, it is important to understand how traditional science achieves that. The reliability of traditional science does not depend only on a statistical analysis of the uncertainties. It depends also on the fact that scientific models rely on few assumptions that remain the same for a very large range of phenomena. This is accomplished by building minimal models for different domains of phenomena. Of course, these different domains are interdependent and they must be combined to describe more complex phenomena. However, this divide et impera strategy turns out to be quite efficient. Moreover, TS models strive to employ as much analytical understanding as possible (enabled by simpler models), alongside empirical testing, to ensure that empirical successes are not ephemeral. In other words, this strategy enables progress towards models that are gradually more and more accurate and/or more and more interpretable (i.e. simpler).

DNNs often display impressive predictive power, which can only be possible because, somehow, the training process finds a DNN configuration that is sensitive to the relevant features of the data and insensitive to the irrelevant ones. However, while the existence of such configurations (and the ability of the algorithm to find them) is sufficient for their powerful predictivity, it is not sufficient for their reliability. To define what we mean by reliability, we must clarify the class of possible alternative models (see Sect. 3.1). TS models can typically refer to the class of models that belong to the state-of-the-art or some approximation of it. Identifying the state-of-the-art is much more difficult for DNN models because of their high epistemic complexity, which makes it difficult not only to compute the Bayesian errors, but also to define them.

This paper has proposed to identify the interpretability of a model with its inverse epistemic complexity. In this sense, interpretability is not only a desirable property for non-expert users: it is a precondition for a plausible assessment of the reliability itself. Identifying the assumptions is also necessary for scientific progress (Scorzato, 2016), because we need to know which assumptions we may want to replace.

Impressive predictive power in new domains, combined with weak foundations and a difficult synthesis with background science, can be a signal of an ongoing (and incomplete) scientific revolution, as was the case for the birth of Quantum Mechanics. What could be the way forward? Predicting the retirement of the scientific method itself (towards something like a theory-free science) appears misguided, according to the present analysis.

One natural hope is to better understand the conditions that ensure the robustness of DNN predictions, hence enabling an interpretable formulation of the assumptions. However, this might be inherently impossible, because the high flexibility of the DNN might be intrinsically coupled with their high sensitivity to details (Hartmann & Richter, 2023).

One interesting alternative approach is to use DNNs as a tool to suggest features for other models, along the lines of Huang and LeCun (2006) and Notley and Magdon-Ismail (2018). However, it is important to ensure that the features obtained through this process are directly measurable and not simply the output of the last DNN layer; otherwise, the DNN component cannot be removed from the formal counting of the assumptions needed to produce the results. The features extracted in this way would ease the connection with background science (imposing valuable constraints) and enable the building of simpler models, possibly unrelated to DNN models.

Another interesting approach is to focus on those models where the entire possible input space is potentially completely available for training. This is the case, for instance, when reading printed characters (from a limited choice of fonts), or scanning objects that belong to a finite list of possibilities (possibly rotated and translated in space). To cover the full input space, it is crucial to identify and implement all possible symmetry transformations, which include not only spatial transformations but also background and noise transformations. Even when all these steps are put in place, it is still difficult to check that the entire possible input space has been covered. Adversarial examples may still lurk, as long as the DNN is not exactly invariant under the transformation of redundant parameters. Ideally, to ensure robustness, one should also reduce the number of DNN parameters to the actual degrees of freedom of the input space. This is not easy, but it might be possible along the lines of Liu et al. (2015).
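A sketch of what "implementing the symmetry transformations" could look like in practice for an image task, using torchvision transforms; the specific transformations and their ranges are hypothetical and would have to be derived from the known invariances of the actual data-collection set-up.

```python
import torchvision.transforms as T

# Hypothetical invariances of the input space: rotations, shifts, mild
# illumination changes and blur that leave the true label unchanged.
symmetries = T.Compose([
    T.RandomRotation(degrees=180),
    T.RandomAffine(degrees=0, translate=(0.2, 0.2)),
    T.ColorJitter(brightness=0.3, contrast=0.3),
    T.GaussianBlur(kernel_size=3),
])
# Applying `symmetries` to every training image samples the orbit of each
# example under these transformations; it does not, by itself, prove coverage.
```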

Another promising avenue of research is to study those limits that can be computed exactly, such as the limit of infinitely large width of fully connected DNNs (Jacot et al., 2018). Such analytical results are essential to test the behaviour of a DNN where we know exactly how it should behave. So, their first epistemological advantage is to increase the empirical accuracy of the DNNs (or rule them out). Furthermore, these limit cases could potentially also suggest how to realise simpler ML methods that are as powerful as DNNs under specific circumstances. Another possibility is to use exactly solved limit cases as a starting point to set up DNNs that are small deviations from those limit cases, which might potentially enable a simpler formulation (with fewer assumptions) of the DNN itself.