1 Introduction

Scientists increasingly use machine learning (ML) in their daily work. This development is not limited to natural sciences like material science (Schmidt et al., 2019) or the geosciences (Reichstein et al., 2019), but extends to social sciences such as educational science (Luan & Tsai, 2021) and archaeology (Bickler, 2021).

When building predictive models for problems with complex data structures, ML outcompetes classical statistical models in both performance and convenience. Impressive recent examples of successful prediction models in science include the automated particle tracking at CERN (Farrell et al., 2018) and DeepMind’s AlphaFold, which has made substantial progress in predicting protein structures from amino acid sequences (Senior et al., 2020). In such examples, some see a paradigm shift towards theory-free science that “lets the data speak” (Anderson, 2008; Kitchin, 2014; Mayer-Schönberger & Cukier, 2013; Spinney, 2022). However, purely prediction-driven research has its limits: In a survey of more than 1,600 scientists, almost 70% expressed the fear that the use of ML in science could lead to a reliance on pattern recognition without understanding (Van Noorden & Perkel, 2023). This is in line with the philosophy of science literature, which does recognize the importance of predictions (Douglas, 2009; Luk, 2017), but also emphasizes other goals such as explaining and understanding phenomena (Longino, 2018; Salmon, 1979; Shmueli, 2010; Toulmin, 1961).

The reason why understanding phenomena with ML is difficult is that, unlike traditional scientific models, ML models do not provide a cognitively accessible representation of the underlying causal mechanism (Boge, 2022; Hooker & Hooker, 2017; Molnar & Freiesleben, 2024). The link between the ML model and phenomenon is unclear, leading to the so-called opacity problem (Sullivan, 2022). Interpretable machine learning (IML, also called XAI, for eXplainable artificial intelligence) aims to tackle the opacity problem by analyzing individual model elements or inspecting specific model properties (Molnar, 2020). However, it often remains unclear how IML can help to address problems in science (Roscher et al., 2020) as IML methods are often designed with other purposes and stakeholders in mind, such as guiding engineers in model construction or offering decision support for end users (Zednik, 2021).

In spite of these challenges, scientists increasingly use IML for inferring which features are most predictive of crop yield (Shahhosseini et al., 2020; Zhang et al., 2019), personality traits (Stachl et al., 2020), or seasonal precipitation (Gibson et al., 2021), among others. Although researchers are aware that their IML analyses remain just model descriptions, it is often implied that the explanations, associations, or effects they uncover extend to the corresponding real-world properties. Unfortunately, drawing inferences with current IML raises epistemological issues (Molnar et al., 2022). It is often unclear what target quantity IML methods estimate (Lundberg et al., 2021): do they describe properties of the model or of the phenomenon (Chen et al., 2020; Hooker et al., 2021)? Moreover, theories to quantify the uncertainty of interpretations are underdeveloped (Molnar et al., 2020; Watson, 2022).

1.1 Contributions

In this paper, we present an account of scientific inference with IML inspired by ideas from philosophy of science and statistical inference. We focus on supervised learning on independent and identically distributed (i.i.d.) data relating predictors \(\varvec{X}\) and a target variable \(Y\). Our key contributions are:

  • We argue that ML cannot profit from the traditional approach to scientific inference via model elements because individual ML model parameters do not represent meaningful phenomenon properties. We observe that current IML methods generally do not offer an alternative route: while some do interpret the model as a whole beyond its elements, they are designed to support model audit and not scientific inference.

  • We identify the criteria that IML methods need to fulfill so that they provide access to the properties of the conditional probability distribution \(\mathbb {P}(Y\,|\,\varvec{X})\). We provide a constructive approach to derive new IML methods for scientific inference (we call them property descriptors), and identify for which estimands existing IML methods can be appropriated as property descriptors. We discuss how property descriptions can be estimated with finite data and quantify the resulting epistemic uncertainty.

1.2 Roadmap

This paper is addressed to an interdisciplinary audience, where various communities may find some sections of special interest:

  • Philosophers of science may start with our discussion in Sect. 3 on traditional scientific modeling and why its notion of representation is no longer applicable to ML models (see Sects. 4 and 5). Additionally, Sect. 11 offers insights on what causal understanding can be derived from ML model interpretations.

  • Researchers in IML may want to skip ahead to Sect. 9, where we offer a concise guide to selecting or developing IML methods apt for scientific inference. In preparation, Sects. 6 and 7 motivate our approach and Sect. 8 reviews some necessary mathematical background. We suggest future avenues for IML research in Sect. 12.

  • Finally, scientists will find in Table 2 a list of published IML methods that address a variety of practical inference questions, with Sect. 12 discussing limitations to consider in applications.

1.3 Terminology

For the purposes of our discussion below, a phenomenon is a real-world process whose aspects of interest can be described by random variables; these aspects possess distinct properties. Observations of the phenomenon are assimilated to draws from the unknown joint distribution of the random variables to form a dataset, or just data. Scientific inference describes the process of rationally inducing knowledge about a phenomenon from such observational data (via ML, or other types of models), providing a basis for scientific explanations. In this work, when we talk about ML, we focus exclusively on the supervised learning setting. A supervised ML model is a mathematical function with some free parameters that a learning algorithm optimizes (“trains”) based on existing labeled data in order to accurately predict future or withheld observations, i.e. to generalize beyond the initial data. While these predictions are often called ‘inferences’ in the deep learning literature, we here employ inference in its original sense in statistics: investigating unobserved variables and parameters to draw conclusions from data. In this sense, inference goes beyond prediction; it concerns the data structure and its uncertainties. When we talk about IML methods in this paper, we include all approaches that analyze trained ML models and their behavior. In contrast to Rudin (2019), we do not limit the scope of IML to inherently interpretable models. These brief conceptual remarks are meant to reduce ambiguity in our usage; we lay no claim as to their universality.

2 Related Work

Whether and how ML models, and specifically IML, can help obtain knowledge about the world is a debated topic in philosophy of science, statistics, causal inference, and within the IML community.

Philosophy of Science

According to Bailer-Jones and Bailer-Jones (2002) and Bokulich (2011) ML models are only suitable for prediction, since their parameters are merely instrumental and lack independent meaning. Conversely, Sullivan (2022) contends that nothing prevents us from gaining real-world knowledge with ML models as long as the link uncertainty—the connection between the phenomenon and the model—can be assessed. While Cichy and Kaiser (2019) and Zednik and Boelsen (2022) claim that IML can help in learning about the real world, they remain vague about how model and phenomenon are connected. We agree with Watson (2022) that only IML methods that evaluate the model on realistic data can allow insight into the phenomenon, but clarify that without further assumptions such methods only reveal associations learned by the ML model, not causal relationships in the world (Räz, 2022; Kawamleh, 2021). Our work makes precise that ML models can be described as epistemic representations of a certain phenomenon that allow us to perform valid inferences (Contessa, 2007) via interpretations.

Statistical Modeling and Machine Learning

Breiman (2001b) distinguishes between algorithmic (ML) and data (statistical) modeling. He illustrates, using a medical example, that interpreting high-performance ML models post-hoc can yield more accurate insights about the underlying phenomena compared to standard, inherently interpretable data models. Our paper provides an epistemic foundation for such post-hoc inferences. Shmueli (2010) distinguishes statistics and ML by their goals—prediction (ML) and explanation (statistics). Like Hooker and Mentch (2021), we argue against such a clear distinction and offer steps to integrate the two fields. To do so, we initially clarify what properties of a phenomenon can be addressed by IML in principle. Starting from theoretical estimands, we develop a guide for constructing IML methods for finite data, following the scheme in Lundberg et al. (2021). Finally, based on the approach by Molnar et al. (2023), we show how to obtain uncertainty estimates for IML-based inferences.

Causal Inference Using Machine Learning

ML is now widely used as a tool for estimating causal effects (Dandl, 2023; Knaus, 2022; Künzel et al., 2019). Double machine learning, for example, provides an ML-based unbiased and data-efficient estimator of the (conditional) average causal effect, assuming that all confounding variables are observed (Chernozhukov et al., 2018).Footnote 1 While these works focus on the construction of practical estimators for causal effects, we focus on the question of what inferences can be drawn from interpreting individual ML models and how to design property descriptors for such inferences.

The Targeted Learning framework (Van der Laan et al., 2011; Van der Laan & Rose, 2018) is closely related to our proposal and also to double ML (Díaz, 2020; Hines et al., 2022). Like us, they discourage the use of interpretable but misspecified parametric models for scientific inference. Instead, they suggest directly estimating relevant properties (they call them parameters) of the joint data distribution that are motivated by causal questions. They estimate relevant aspects of the joint distribution from which the parameters can be derived using the so-called super learner (a weighted ensemble of ML models), and debias their estimator with the targeted maximum likelihood update (Hines et al., 2022; Van Der Laan & Rubin, 2006). Compared to targeted learning, we ask more specifically what inferences we can draw from interpreting individual ML models and how to match such interpretations with parameters of traditional scientific models. Also, we primarily provide a theoretical framework for IML tools and how they must be designed to gain insights into the data, whereas the work of Van der Laan et al. (2011) is more practical, providing unbiased estimators and implementations for a variety of inference tasks (Van Der Laan & Rubin, 2006). We further discuss the relationship between property descriptors and targeted learning in Sect. 12.

Interpretable Machine Learning

IML as a field has been widely criticized for being ill-defined, mixing different goals (e.g. transparency and causality), conflating several notions (e.g. simulatability and decomposability), and lacking proper standards for evaluation (Doshi-Velez & Kim, 2017; Lipton, 2018). Some even argued against the central IML leitmotif of analyzing trained ML models post hoc in order to explain them (Rudin, 2019). In this paper, we show that if we focus solely on interpretations for scientific inference, we can clearly define what estimand post hoc IML methods estimate and how well they do so.

Using IML for scientific inference is not a completely new idea. It has been proposed for IML methods like Shapley values (Chen et al.,  2020), global feature importance (Fisher et al., 2019) and partial dependence plots (Molnar et al., 2023). Our framework generalizes these method-specific proposals to arbitrary IML methods and provides guidance on how to construct IML methods that enable inference.

3 The Traditional Approach to Scientific Inference Requires Model Elements That Meaningfully Represent

Models can be found everywhere in science, but what are they really? Like Bailer-Jones, we see a scientific model as an “interpretative description of a phenomenon that facilitates perceptual as well as intellectual access to that phenomenon” (Bailer-Jones 2003, p. 61). The way models provide access to phenomena is usually specified by some sort of representation (Frigg & Nguyen, 2021; Frigg & Hartmann, 2020). Indeed, models represent only some aspects of a phenomenon, not all of it—a good model is true to the aspects that are relevant to the model user (Bailer-Jones, 2003; Ritchey, 2012).

In scientific modeling, we noted a paradigm that many models implicitly follow—we call it the paradigm of elementwise representationality.

Definition

A model is elementwise representational (ER) if all model elements (variables, relations, and parameters) represent an element in the phenomenon (components, dependencies, properties).

Figure 1 illustrates how models relate to the phenomenon when they are ER (in the example, two-body Newtonian dynamics is used to model the Earth-Moon system): variables describe phenomenon components; mathematical relations between variables describe structural, causal or associational dependencies between components; parameters specify the mathematical relations and describe properties, like the strength, of the component dependencies. By distinguishing components, dependencies and properties, we closely follow inferentialist accounts of representation by Contessa (2007) and other philosophers like Achinstein (1968), Bailer-Jones (2003), Ducheyne (2012), Hughes (1997), Levy (2012), Stachowiak (1973). The upward arrows in Fig. 1 describe encoding into representations, i.e. the translation of a phenomenon observation into a model configuration; the downward arrows describe decoding, i.e. the translation of knowledge about the model into knowledge about the phenomenon.

Fig. 1
figure 1

Model and phenomenon sustain an encoding-decoding relationship. The main elements of a traditional, ER model are shown in encoding-decoding correspondence to the phenomenon elements they represent (Contessa, 2007; Stachowiak, 1973). Phenomenon and model elements are illustrated with a simple example of two bodies in gravitational interaction and its classical, Newtonian mechanistic description. This physical example was chosen to illustrate the ER paradigm; we make no claim that our property descriptors presented later will achieve similar representational power

ER does not mean that each model element represents independently of the rest of the model. Instead, ER is model-relative. When we specify the rest of the model, ER implies that each model element has a fixed meaning in terms of the phenomenon. We also want to emphasize that ER is an extreme case. There are indeed models in science where not every model element represents but some parts of the overall model do. A typical non-representational element used in scientific models is a noise term. Instead of representing a specific component or a property of the phenomenon, the noise can be a placeholder for unexplained aspects of the phenomenon (Edmonds, 2006).

ER is obtained through model construction; ER models are usually “hand-crafted” based on background knowledge and an underlying scientific theory. Variables are selected carefully and sparsely during model construction, and relations are constrained to a relation class with few free parameters. When ER models need to account for additional phenomenon aspects, they are gradually extended so that large parts of the “old” model are preserved in the more expressive “new” model (Schwarz et al., 2009). ER even eases this model extension process because model interventions are intelligible on the level of model elements. Usually, ER is explicitly enforced in modeling: if there is a model element devoid of meaning, researchers either try to interpret it or exclude it from the model.

ER is so remarkable because it gives models capabilities that go beyond prediction. ER has a crucial role in translating model knowledge into phenomenon knowledge (surrogative reasoning Contessa 2007; Hughes 1997; Swoyer 1991). Scientists can analyze model elements and draw immediate conclusions about the corresponding phenomenon element (Frigg & Nguyen, 2021). However, only those properties of the phenomenon that have a model counterpart can be analyzed with this approach. Fortunately, as described above, ER models can be extended to account for further relevant phenomenon elements identified by the scientist.

Example Associational ER Model: Simple Linear Regression

The mechanistic causal model from Fig. 1 is ideal for illustrating what constitutes an ER model. However, ML models are associational in nature. To better isolate the differences in the ways ER and ML models represent, we now focus on an associational ER model, in this case a simple linear model.

Suppose we want to study which factors influence students’ skills in math, specifically focusing on language mastery (Peng et al., 2020). We adopt grades as quantitative proxies for the respective skills, and find in Cortez and Silva (2008) a suitable dataset reporting Portuguese and math grades on a 0–20 scale, along with 30 other variables such as student age, parents’ education, etc. (see Appendix A for details). Here, the students’ characteristics like their math and Portuguese skills are the phenomenon components, genetic and environmental associations are the dependencies, and the strength or direction of these associations are instances of relevant properties.

We start with a linear regression with the Portuguese grade as the only predictor variable, denoted \(X_p\), and the math grade as the target, \(Y=\beta _0+\beta _1 X_p+\epsilon\), with \(\beta _0,\beta _1\in \mathbb {R}\). This linear relation is a reasonable, if crude, assumption for a first analysis in absence of prior insight.Footnote 2 Our model is ER except for the noise term, \(\epsilon \sim {\mathcal {N}}(0,\sigma ^2)\), which accounts for all variability in \(Y\) not due to \(X_p\) and thus is, by design, not ER. We train the model by finding the \({\hat{\beta }}_1,{\hat{\beta }}_0\) that minimize the mean-squared-error (MSE) of predictions on the training set:

$$\begin{aligned} {\hat{m}}_\textrm{LIN}(x_p)= {\hat{\beta }}_0 + {\hat{\beta }}_1 x_p. \end{aligned}$$
(1)

The fitted regression coefficient \({\hat{\beta }}_1=0.77\) has a 95% confidence interval (CI) of \([0.63<\beta _1<0.91]\),Footnote 3 and represents the association as the expected increase in math grade for a unit increase in Portuguese grade. The bias \({\hat{\beta }}_0\) is also straightforward to interpret.Footnote 4 Thus, from \({\hat{\beta }}_1\) and its CI, we might conclude with some confidence that language and math skills are positively and strongly associated. This conclusion is contingent on the model being ER, but crucially, also on it capturing well the phenomenon. Although by optimizing the MSE we targeted an appropriate estimand, namely the conditional expectation \(\mathbb {E}_{Y\,|\,X_p}[Y\,|\,X_p]\), we estimated it with a rather crude model. Clearly, if the model is not highly predictive, it is ill-suited for reliable scientific inference. To improve performance, we can make the model more expressive by introducing additional variables, relations, or interaction parameters. As long as we preserve ER, we can draw scientific inferences directly from model elements. These inferences are only as valid as the modeling assumptions (e.g. target normality, homoscedasticity, or linearity).
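
To make the fitting procedure concrete, the following minimal Python sketch fits such a simple regression and reports the confidence interval for \(\beta _1\). The file and column names (student_grades.csv, portuguese_grade, math_grade) are placeholders and not those of the original Cortez and Silva (2008) data; any standard OLS routine would serve equally well.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names: one row per student with the final
# Portuguese grade (X_p) and math grade (Y), both on the 0-20 scale.
df = pd.read_csv("student_grades.csv")

# Ordinary least squares minimizes the training MSE, cf. Eq. (1).
fit = smf.ols("math_grade ~ portuguese_grade", data=df).fit()

print(fit.params["portuguese_grade"])                    # point estimate of beta_1
print(fit.conf_int(alpha=0.05).loc["portuguese_grade"])  # 95% confidence interval
```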

4 The Elements of ML Models Do Not Meaningfully Represent

ER models suit our image of science as an endeavor aimed at understanding. ER enables us to reason about the phenomenon, and in causal models it even allows us to reason about effects of model or even real-world interventions. However, when constructing ER models, we require background knowledge about which components are relevant, and we usually need to severely restrict the class of relations that can be considered in modeling a given phenomenon. These difficulties might lead scientists to either limit their investigations to simple phenomena or to settle for overly simplistic models for complex phenomena and, as Breiman (2001b) and Van der Laan et al. (2011) cautioned, possibly draw wrong conclusions.

ML models, on the other hand, excel for complex problems with an unbounded number of components that entertain ambiguous and entangled relationships, i.e. ML models are highly expressive (Gühring et al., 2022). Indeed, given enough data, effective prediction with ML requires less subject-domain background knowledge than traditional modeling approaches (Bailer-Jones & Bailer-Jones, 2002). While the definition of a prediction task and the encoding of features are still largely based on domain knowledge, the choice of model class, hyperparameters and learning algorithms is often guided by the data types and aims to promote efficient learning for the respective modality, e.g. by selecting architectures such as CNNs for images, LSTMs for sequences or GNNs for relational data.

The gain in generality and convenience with ML comes at a price—ML models are generally far from being ER. As also argued in Bailer-Jones and Bailer-Jones (2002), Bokulich (2011) and Boge (2022), ML models (e.g. paradigmatically artificial neural networks) contain model elements such as weights that have no obvious phenomenon counterpart.

Example ML Associational Model: Artificial Neural Network (ANN)

Suppose we want to improve on the predictive performance of our simple linear model (Eq. 1), and opt instead for a dense three-layer neural network that uses all available features to predict math grades. To train it, we minimize the MSE on training data, but now use gradient descent with an adaptive learning rate for lack of an analytical solution. The trained model can be described as a function parameterized by \(31\times 31\) weight matrices \({\hat{W}}_1\), \({\hat{W}}_2\), \({\hat{W}}_3\) and \(31\times 1\) bias vectors \(\hat{\varvec{b}}_1,\hat{\varvec{b}}_2,\hat{\varvec{b}}_3\) using component-wise ReLU activations \({\text {ReLU}}(\varvec{x}_j) := \max (\varvec{x}_j,0)\):

$$\begin{aligned} {\hat{m}}_{\textrm{ANN}}(\varvec{x})={\hat{W}}_3\,{\text {ReLU}}\Bigl ({\hat{W}}_2\,{\text {ReLU}} \bigl ({\hat{W}}_1 \varvec{x}+\hat{\varvec{b}}_1\bigr )+\hat{\varvec{b}}_2\Bigr )+\hat{\varvec{b}}_3. \end{aligned}$$
(2)

While this ANN achieves a test-set MSE of just 8.9 compared to 16.0 for the simple linear model,Footnote 5 it is now highly unclear what aspects of our phenomenon the ANN parameters correspond to. Its weights and biases are very hard or even impossible to interpret: any individual weight might have a positive, neutral, or negative effect on the target, dependent on all other model elements. Similarly, the design of the architecture aims to maximize predictive performance, rather than to reflect any actual or even hypothesized dependencies between components of the phenomenon.
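
For concreteness, here is a minimal PyTorch sketch of a dense ReLU network of this kind. It is a sketch under our own assumptions rather than the exact architecture and training setup used above; the random tensors are placeholders for the 31 standardized student features and the math grade, and the final layer maps to a single output.

```python
import torch
import torch.nn as nn

# Placeholder data standing in for the 31 student features and the math grade;
# replace with tensors built from the real dataset.
X_train = torch.randn(300, 31)
y_train = torch.randn(300)

# Dense three-layer ReLU network; the last layer maps to a single predicted grade.
model = nn.Sequential(
    nn.Linear(31, 31), nn.ReLU(),
    nn.Linear(31, 31), nn.ReLU(),
    nn.Linear(31, 1),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # adaptive learning rate
loss_fn = nn.MSELoss()

for epoch in range(500):  # plain full-batch gradient descent loop
    optimizer.zero_grad()
    loss = loss_fn(model(X_train).squeeze(-1), y_train)
    loss.backward()
    optimizer.step()
```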

5 But Do ML Model Elements Really Not Represent?

One may believe that we went too fast here and argue that individual elements of ML models do have a natural phenomenon counterpart, one that only surfaces after extensive scrutiny. The underlying intuition is that human representations are near-optimal for performing prediction tasks and will eventually be rediscovered by ML algorithms. We find this unlikely: ER is not enforced in most state-of-the-art models (Leavitt & Morcos, 2020) and, even worse, some widely-employed ANN training techniques, such as dropout, purposefully discourage ER in order to gain robustness (Srivastava et al., 2014). High-capacity ML models like ANNs are indeed designed for distributed representation (Buckner & Garson, 2019; McClelland et al., 1987).

Still, it has been claimed that model elements represent high-level constructs constituted from low-level phenomenon components that are often called concepts (Buckner, 2018; Bau et al., 2017; Olah et al., 2020; Räz, 2023). The idea is that similar to the hierarchical schema we use to understand nature, with lower level components such as atoms combining to form higher level entities such as molecules, cells, and organisms, representations in deep nets evolve through layers from pixels to shapes to objects. If this were the case, model elements or aggregates of such elements could be reconnected to the respective phenomenon entities; ER would be restored by the representations of coarse-grained phenomenon components.

While empirical research on neural networks finds that some model elements are weakly associated with human concepts (Bau et al., 2017; Bills et al., 2023; Kim et al., 2018; Mu & Andreas, 2020; Olah et al., 2017; Voss et al., 2021), often these elements are neither the only associated elements nor exclusively associated with one concept, as shown in Fig. 2 (Bau et al., 2017; Donnelly & Roegiest, 2019; Olah et al., 2020). Moreover, intervening on these model elements generally does not have the expected effect on the prediction—the elements do not share the causal role of the “represented” concepts, even in prediction (Donnelly & Roegiest, 2019; Gale et al., 2020; Zhou et al., 2018). It is therefore questionable in what sense, or even whether, individual elements of ML models represent (Freiesleben, 2023). Moreover, this line of research that tries to identify concepts in model elements predominantly focuses on images, where nested concepts are arguably easy to identify for humans.

Fig. 2
figure 2

ML models are generally not ER. Three input images synthesized to maximally activate a given unit in a neural network (Olah et al., 2020) illustrate how “concepts” as different as cat faces, fronts of cars, or cat legs all elicit strong responses, suggesting neural network elements generally do not represent unique concepts (Mu & Andreas, 2020; Nguyen et al., 2016)

In sum, current ML algorithms do not enforce ER—hence, trained ML models rely on distributed representations and cannot be reduced to logical concept machines. An associative connection between a model element and a phenomenon concept should not be confused with their equivalence. While research on the representational correlates of model elements may indeed seem fascinating, analyzing single model elements appears to be a hopeless enterprise, at least if the goal is to support scientific inference.

6 IML Analyzes the Model as a Whole, but Does It Allow for Scientific Inference?

Let us accept that ML models are indeed not ER. Can we still exploit their predictive power for scientific inference? We think this is indeed possible. Our approach is to regard the model as representational of phenomenon aspects only as a whole—we call this holistic representationality (HR). The idea behind HR is not new; it underlies, for example, modern causal inference (Van der Laan et al., 2011). HR has its roots in what Heckman (2000) calls Marschak’s Maxim – it is possible to directly describe an aspect of the data distribution without first identifying its individual components through a parametric model. ML models represent one paradigmatic case of HR models (Starmans, 2011).

The current research program in IML takes an HR perspective on ML models: Many IML methods analyze the entire trained ML model post-hoc just as an input-output mapping (Scholbeck et al., 2019), sometimes leveraging additional useful properties such as differentiability (Alqaraawi et al., 2020).

Initial definitions of, for example, global feature effects (Friedman et al., 1991), feature importance (Breiman, 2001a), local feature contribution (Štrumbelj & Kononenko, 2014) or model behavior (Ribeiro et al., 2016) have been presented. However, in recent years, many researchers have pointed out that these methods lead to counterintuitive results, and offered alternative definitions (Alqaraawi et al., 2020; Apley & Zhu, 2020; Janzing et al., 2020; Goldstein et al., 2015; König et al., 2021; Molnar et al., 2023; Strobl et al.,  2008; Slack et al., 2020).

We believe that these controversies stem from a lack of clarity about the goal of model interpretation. Are we interested in model properties to learn about the model (model audit) or do we want to use them as a gateway to learn about the underlying phenomenon (scientific inference)? These two goals must not be conflated.

The auditor examines model properties e.g. for debugging, to check if the model satisfies legal or ethical norms, or to improve understanding of the model by intervening on it (Raji et al., 2020). Another function that an audit can have is to assess the alignment between model properties and our (causal) background knowledge, such as certain monotonicity constraints (Tan et al., 2017), which is particularly important when ML is used in high-stakes decision making. In all those use-cases, auditors even take interest in model properties that have no corresponding phenomenon counterpart, such as single model elements or the behavior of the model when applied on unrealistic combinations of features. The scientist, on the other hand, wants to learn about model properties only in so far as they can be interpreted in terms of the phenomenon.

This does not imply that model audit is irrelevant for scientists. In fact, a careful model audit that shows what the model does and how reliable it is appears as a prerequisite for trustworthy scientific inference.

7 How to Think of Scientific Inference with HR Models?

We just argued that ML models are generally not ER and therefore do not allow for scientific inference in the traditional way. HR offers a viable alternative; however, the IML research community currently conflates different goals of model interpretation. Our plan in the following sections is to show how an HR perspective enables scientific inference using IML methods. Particularly, we show which IML methods qualify as property descriptors, i.e. methods that quantify properties of the phenomenon, not just the model. Figure 3 describes our conceptual move: instead of matching phenomenon properties with model parameters as in ER models, we match them with external descriptions of the whole model.

In what follows, we will focus on scientific inference with trained ML models, as these constitute a paradigmatic and highly relevant category of HR models, even though our theory of property descriptors is generally applicable to any HR model as long as we know what it holistically represents.

Fig. 3
figure 3

Property descriptions distill phenomenon properties from HR models. Instead of explicitly encoding phenomenon properties as parameters like for ER models, HR models (e.g. ML models) encode phenomenon properties in the whole model. We propose that these encoded properties can be read out with property descriptions external to the model. Property descriptors can take on the inferential role of parameters, for example of coefficients in linear models

In scientific inference, we start from a question concerning a phenomenon and some relevant data about that phenomenon. Even though in simple cases fitting properly constructed ER models can provide interpretable answers, for complex phenomena, a lack of model capacity often results in poor answers. In contrast, while recent ML models are opaque, they have the required representational capacity, and multiple IML methods already exist to probe them in various ways. What is missing, and what we are proposing here, is a way to map the initial question to a relevant IML method so that we can perform scientific inference even with models just aimed at predictive performance.

The key missing ingredient for scientific inference is to link the phenomenon and the model. We propose to draw this link using statistical learning theory, which characterizes what optimal ML models can holistically represent. If the question can be in principle answered with an ideal model, we expect an approximate model to be able to answer it approximately. The problem is now reduced to qualifying what an approximate model is, and quantifying the error induced by the approximation. The same schema applies to using limited data as opposed to infinite data: we are fine with an approximate answer, so long as we can quantify the uncertainty.

8 What Aspects of Phenomena Do ML Models Holistically Represent?

Which aspects of a phenomenon ML models can represent under ideal circumstances depends on the data setting, the learning paradigm, and the loss function. We focus here on supervised learning from independent and identically distributed (i.i.d.) samples. In this widely useful setting, statistical learning theory provides us with optimal predictors (Hastie et al., 2009, pp. 18–22) as a rigorous tool to address model representation.

Using the notation \({\varvec{X}{:}{=}(X_1,\ldots ,X_n)}\) for the input variables and \(Y\) for the target variable with ranges \(\mathcal {X}\) and \(\mathcal {Y}\), we now formalize what characterizes the optimal prediction model and how to train such models from labeled data.

Optimal Predictors

An optimal predictor \(m(\varvec{x})\) predicts realizations of the target \(Y\) from realizations of the input \(\varvec{X}\) with minimal expected prediction error i.e. \(m=\underset{{\hat{m}}\in \mathcal {M}}{\mathop {\textrm{arg}\,\textrm{min}}}\,\textrm{EPE}_{Y|\varvec{X}}({\hat{m}})\), with

$$\begin{aligned} \textrm{EPE}_{Y|\varvec{X}}({\hat{m}}){:}{=}\int _{\mathcal {Y}} L(y,{\hat{m}}(\varvec{x}))\ \mathbb {P}_{Y | \varvec{X}}(y\,|\,\varvec{x})\, \textrm{d} y, \end{aligned}$$

where \(L: \mathcal {Y}\times \mathcal {Y} \rightarrow \mathbb {R}^+\) is the loss function, evaluated at the observed target and the model prediction as \(L(y,{\hat{m}}(\varvec{x}))\), and \({\hat{m}}\) is a model in the class \(\mathcal {M}\) of mappings from \(\mathcal {X}\) to \(\mathcal {Y}\). Table 1 recapitulates the optimal predictors for some standard loss functions.

Table 1 The optimal predictors for standard loss functions reflect aspects of \(\mathbb {P}(Y\,|\,\varvec{X})\)
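
As a toy numerical check of the kind of correspondence summarized in Table 1, the snippet below verifies that, among constant predictors, the risk-minimizing constant under squared loss is the mean and under absolute loss the median; the skewed synthetic target is an assumption purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=100_000)  # skewed synthetic target

# Risk-minimizing constants under two standard loss functions.
best_l2 = minimize_scalar(lambda c: np.mean((y - c) ** 2), bounds=(0, 20), method="bounded").x
best_l1 = minimize_scalar(lambda c: np.mean(np.abs(y - c)), bounds=(0, 20), method="bounded").x

print(best_l2, y.mean())       # squared loss  -> mean (conditional expectation)
print(best_l1, np.median(y))   # absolute loss -> median (conditional median)
```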

Supervised Learning from Finite Data

In supervised learning, we seek to approximate an optimal predictor m by a model \({\hat{m}}\) based on a specific finite dataset \({\mathcal {D}{:}{=}\{ (\varvec{x}^{(1)},y^{(1)}),\dotsc ,(\varvec{x}^{(k)},y^{(k)})\}}\). This ‘training’ is carried out by a learning algorithm \(I:\Delta \rightarrow \mathcal {M}\), which maps the class \(\Delta\) of datasets of i.i.d. drawsFootnote 6, \((\varvec{x}^{(i)},y^{(i)})\sim (\varvec{X},Y)\), to a class of models \(\mathcal {M}\). Instead of the EPE itself, the learning algorithm minimizes the empirical risk on training data and is then evaluated based on the empirical risk on test data (i.e. data not contained in the training data), which constitutes a finite-data estimate of the EPE (Hastie et al., 2009).
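
A minimal sketch of this finite-data setting follows, using synthetic placeholder data and a random forest as a stand-in for the learning algorithm I; the empirical risk on held-out test data then serves as the finite-data estimate of the EPE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic placeholder data standing in for i.i.d. draws from P(X, Y).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

m_hat = RandomForestRegressor(random_state=0).fit(X_train, y_train)  # m_hat = I(D)
test_risk = mean_squared_error(y_test, m_hat.predict(X_test))        # empirical risk on test data
print(test_risk)                                                     # finite-data estimate of the EPE
```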

9 Scientific Inference with ML in Four Steps

We have just outlined how (optimal) predictive models holistically represent aspects of the conditional distribution of the data, \(\mathbb {P}(Y\,|\,\varvec{X})\) (see Table 1). In this section, we finally introduce property descriptors as the tools that allow us to investigate these aspects by describing their relevant properties. Property descriptors exploit the ML model to study general associations in our data, providing insight beyond raw prediction. We provide a four-step plan to construct such descriptors, illustrated in Fig. 4.

Fig. 4
figure 4

An epistemic foundation for scientific inference with IML. Steps 1 and 2 connect the property descriptors theoretically with an underlying estimand of scientific interest. Steps 3 and 4 show how to practically draw inferences and quantify their uncertainty. See text for symbol definitions

We assume the following scenario: interested in answering a scientific question about a specific phenomenon, a researcher seeks to exploit an informative labeled dataset about the phenomenon, \({\mathcal {D}}\) together with a highly predictive (ML) supervised model trained on it, \({\hat{m}}\).

Our strategy encompasses the following key steps: 1. outlining the solution in an idealized context with an optimal predictor m together with prior probabilistic knowledge K, rather than a real model and real data, 2. demonstrating the feasibility of approximating this ideal under certain assumptions, 3. estimating the solution from finite data and a trained model, and 4. quantifying the uncertainty inherent in these approximations.

Step 1: Formalize Scientific Question As an Estimand

The first step in our approach is to formalize the scientific question, Q, as a statistical estimand (Lundberg et al., 2021). This estimand represents a probabilistic query on \(\varvec{X}\) and \(Y\).

Example: The Link Between Language and Math Skills

An educator is interested in how math skills relate to language skills. She has access to a relevant labeled dataset, consisting of student grades in Portuguese and math, represented by the random variables \(X_p\) and \(Y\) respectively. The question of what is the expected grade in math for any given Portuguese grade can be formalized as an estimand, the conditional expectation, \(Q:=\mathbb {E}_{Y\,|\,X_p}[Y\,|\,X_p]\).

Step 2: Identify the Estimand with a Property Descriptor

Having Q, we now ask if it can be derived from an optimal model m (e.g. one of those in Table 1). Clearly, if Q cannot be derived from m, using the actually available approximate \({\hat{m}}\) will not be viable either. Even m alone will often not suffice. Since m represents aspects of \(\mathbb {P}(Y\,|\,\varvec{X})\), we also require probabilistic knowledge K about \(\mathbb {P}(\varvec{X})\) and sometimes even of \(\mathbb {P}(Y)\) to derive relevant inferences from m. Note that K is generally not available but must again be estimated from data. Following the reasoning with the ideal model, we consider whether the problem could be resolved assuming we have additional probabilistic knowledge K.

We deem an estimand identifiable with respect to available knowledge K if it can be derived from the optimal predictor m and K. To establish identifiability means to provide a constructive transformation of m and K into Q, ideally, the one with the most parsimonious requirements on K, since in the real world K is obtained by collecting data or positing assumptions. We call this constructive transformation a property descriptor, and demand that it be a continuous function \(g_K\) w.r.t. metrics \(d_{\mathcal {M}}\) and \(d_{\mathcal {Q}}\):Footnote 7

$$\begin{aligned} g_K: \mathcal {M}&\rightarrow \mathcal {Q}\quad \text { with } \quad g_K(m)=Q. \end{aligned}$$

The output space \(\mathcal {Q}\) remains deliberately unspecified; depending on the particular scientific question, we might want \(\mathcal {Q}\) to be a set of real numbers, vectors, functions, probability distributions, etc.

Example: Property Descriptor for a Multivariable Model

Consider a multivariable model trained to minimize the MSE loss, such as our ANN (Eq. 2), which predicts math grades from all available features. The model approximates the conditional expectation \(m(\varvec{X})=\mathbb {E}_{Y\,|\,\varvec{X}}[Y\,|\,\varvec{X}]\).Footnote 8 This is an optimal predictor, but in contrast to our desired estimand Q, it uses the Portuguese grade \(X_p\) as well as other features, denoted \(\varvec{X}_{-p}\). We can obtain Q from m by marginalizing over \(\varvec{X}_{-p}\), i.e. our Q is identifiable if we have the necessary \(\mathbb {P}(\varvec{X}_{-p}\,|\,X_p):=K\) to compute the expectationFootnote 9

$$\begin{aligned} Q&{:}{=}\,\mathbb {E}_{Y\,|\,X_p}[Y\,|\,X_p]\\&=\mathbb {E}_{\varvec{X}_{-p}\,|\,X_p}[\mathbb {E}_{Y\,|\,\varvec{X}}[Y\,|\,\varvec{X}]\,|\,X_p] \qquad \text {(by the tower rule)}\\&=\mathbb {E}_{\varvec{X}_{-p}\,|\,X_p}[m(\varvec{X})\,|\,X_p]\\&=g_K(m). \end{aligned}$$

Here \(g_K\) is our property descriptor that takes m and K into Q. It is generally defined for an arbitrary model \({\hat{m}}\in \mathcal {M}\) as

$$\begin{aligned} g_K({\hat{m}})(x_p){:}{=}\mathbb {E}_{\varvec{X}_{-p}\,|\,X_p}[{\hat{m}}(\varvec{X})\,|\,X_p{=}x_p]. \end{aligned}$$
(3)

Note that \(g_K({\hat{m}})\) is continuous on \(\mathcal {M}\) and in fact corresponds to the well-known conditional partial dependence plot (cPDP; also known as the M-plot, Apley & Zhu, 2020).

Step 3: Estimate Property Descriptions from a Trained Model and Data

In reality, we rarely have optimal predictors; they are theoretical constructs. Instead, we have trained ML models that approximate these theoretical constructs. We call the application of a property descriptor to our concrete ML model, \(g_K({\hat{m}})\), a property description. The continuity of property descriptors, which we assumed above, guarantees that we obtain an approximately correct estimate whenever our ML model is close to the optimal model.

Similarly, we do not have access to an ideal K. Instead, we have to estimate it using data and inductive modeling assumptions (e.g. Rothfuss et al., 2019). Ultimately, our model and property descriptions may be evaluated using not just the training and test dataset \(\mathcal {D}\) (see Sect. 8), but also relevant available unlabeled data or artificially generated data. We call this bundle the evaluation data \(\mathcal {D}^*\supseteq \mathcal {D}\).

A way to estimate property descriptions with access only to the ML model and evaluation data is provided by property description estimators, which we assume to be unbiased estimators of \(g_K\), i.e. the function \({\hat{g}}_{\mathcal {D}^*}:\mathcal {M}\rightarrow \mathcal {Q}\) fulfills

$$\begin{aligned} \mathbb {E}_{D^*} [{\hat{g}}_{D^*}({\hat{m}})]=g_K({\hat{m}}) \quad \text {for all }{\hat{m}}\in \mathcal {M}. \end{aligned}$$

Example: Practical Estimates of Property Descriptions

Our evaluation dataset \(\mathcal {D}^*\) comprises the initial training and test dataset \(\mathcal {D}\) as well as artificial instances created by the following manipulation: we make six full copies of the data, and jitter the Portuguese grade by \(1,-1,2,-2,3\) or \(-3\) respectively in each of them. This augmentation strategy reflects our background knowledge that the Portuguese grade is a noisy measurement: student performance varies from day to day and teachers grade inconsistently. Let the students with jittered Portuguese grade i be \(\mathcal {D}^*_{|x_p=i}{:}{=}(x\in \mathcal {D}^* \,|\,x_p=i)\); then, we can calculate the property description estimator at i as the following conditional mean (an unbiased estimator of the conditional expectation):

$$\begin{aligned} {\hat{g}}_{\mathcal {D}^*}({\hat{m}})(i){:}{=}\frac{1}{|\mathcal {D}^*_{|x_p{=}i}|}\sum \limits _{x\in \mathcal {D}^*_{|x_p=i}}{\hat{m}}(x). \end{aligned}$$
(4)

The estimated answer to our initial question is plotted in Figure 5a. Math grades appear to depend on Portuguese grades only in the range \(x_p\in (8\text {--}17)\). However, Figure 5b shows that we have very sparse data in some regions (e.g. very few students scored below \(x_p=8\)), a fact that we must take into account before confirming this first impression.
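
The estimator in Eq. (4) can be sketched in a few lines of Python. The model object (assumed to expose a scikit-learn-style predict method), the data frame, and the column names are illustrative assumptions; the function builds the jitter-augmented evaluation data \(\mathcal {D}^*\) described above and averages the model predictions per Portuguese grade.

```python
import pandas as pd

def cpdp_estimate(model, df, feature="portuguese_grade", target="math_grade",
                  jitters=(1, -1, 2, -2, 3, -3)):
    """Estimate the cPDP of Eq. (4) on the jitter-augmented evaluation data D*."""
    copies = [df]
    for delta in jitters:
        jittered = df.copy()
        jittered[feature] = jittered[feature] + delta   # jitter the Portuguese grade
        copies.append(jittered)
    d_star = pd.concat(copies, ignore_index=True)       # evaluation data D*

    preds = model.predict(d_star.drop(columns=target))  # model predictions for every row
    # Average predictions per Portuguese grade value i.
    return pd.Series(preds, index=d_star[feature]).groupby(level=0).mean()
```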

Fig. 5
figure 5

a Shows the cPDP estimate of \(\mathbb {E}_{Y\,|\,X_p}[Y\,|\,X_p]\) via (4). Note that the grade jittering strategy described in the text allows us to evaluate e.g. \({\hat{m}}(x_p=3)\) even though we have no data for that value. b Shows the histogram of grades in Portuguese in the original dataset (Cortez & Silva, 2008)

Step 4: Quantify the Uncertainties in Property Descriptions

We have shown how we can estimate Q using an approximate ML model paired with a suitable evaluation dataset. But how good is our estimate?

In estimating Q using the ML model \({\hat{m}}\) instead of the optimal model m, we introduce a model error, \(\textrm{ME}[{\hat{m}}]:=d_{\mathcal {Q}}\bigl (g_K(m),g_K({\hat{m}})\bigr )\). Moreover, by using the evaluation data \(\mathcal {D}^*\) instead of knowledge K, we also introduce an estimation error, \(\textrm{EE}[\mathcal {D}^*]:=d_\mathcal {Q}\bigl (g_K({\hat{m}}),{\hat{g}}_{\mathcal {D}^*}({\hat{m}})\bigr )\).

In theory, the model error and the estimation error can be cleanly separated. In practice, however, they are often statistically dependent because the training and the evaluation data overlap. Fortunately, there exist various sample splitting approaches that allow us to circumvent or minimize this dependence (James et al., 2023). Generally, neither the model error nor the estimation error can be computed perfectly; this would require access to the optimal model m and infinitely many data instances. Nevertheless, we can quantify them in expectation.

An intuitive approach to quantifying the expected errors is to decompose them into bias and variance contributions. For the bias-variance decomposition, we assume the metric \(d_\mathcal {Q}\) to be the squared error.Footnote 10 Considering the dataset that we entered into the learning algorithm as a random variable \(D\), we can decompose the expected model error as follows

$$\begin{aligned} \mathbb {E}_{D}[\textrm{ME}[{\hat{m}}]]=\underbrace{(g_K(m)-\mathbb {E}_{D} [g_K({\hat{m}})])^2}_{\textrm{Bias}^2}\;+\;\underbrace{\mathbb {V}_{D}[g_K({\hat{m}})]}_{\textrm{Variance}}, \end{aligned}$$

where \({\hat{m}}{:}{=}I(\mathcal {D})\) is the trained model (output of a machine learning algorithm I for dataset \(\mathcal {D}\), Sect.  8). Similarly, considering the evaluation data as a random variable \(D^*\), we can also decompose the expected estimation error as follows

$$\begin{aligned} \mathbb {E}_{D^*}[\textrm{EE}[D^*]]=\underbrace{(g_K({\hat{m}}) -\mathbb {E}_{D^*}[{\hat{g}}_{D^*}({\hat{m}})])^2}_{\textrm{Bias}^2}\; +\;\underbrace{\mathbb {V}_{D^*}[{\hat{g}}_{D^*}({\hat{m}})]}_{\textrm{Variance}} =\mathbb {V}_{D^*}[{\hat{g}}_{D^*}({\hat{m}})]. \end{aligned}$$

In this case, the bias term vanishes because the property description estimator is by definition unbiased w.r.t. the property descriptor.

There are indeed different approaches to quantify the uncertainties of property descriptions. The standard frequentist approach is to estimate the variance in above equations using refitted models and property descriptions with resampled data (Molnar et al., 2023). But there are also Bayesian approaches where uncertainty is directly baked into the prediction model: BART by Chipman et al. (2012), for example, uses a sum-of-trees to perform Bayesian inference and directly provides uncertainties for inferential quantities like marginal effects. Similarly, Gaussian processes provide predictive error-bars (Rasmussen & Nickisch, 2010), which offer a natural confidence measure for property descriptions (Moosbauer et al., 2021). Finally, Bayesian posteriors can even be obtained for neural network architectures (Gal & Ghahramani, 2016; Van Amersfoort et al., 2020) and leveraged to estimate the uncertainty of property descriptions like counterfactuals (Höltgen et al., 2021; Schut et al., 2021).

Example: Uncertainty in Property Descriptions

We will certainly obtain different cPDPs (Fig. 5a) if we use different models with similar performance, or different subsets of evaluation data, so how much can we then rely on these cPDPs?

The estimates of the variances of the cPDP by Molnar et al. (2023) allow us to calculate pointwise confidence intervals (Fig. 6). We can calculate a confidence interval at significance \(\alpha\) that only incorporates the estimation uncertainty due to finite data as follows:

$$\begin{aligned} \textrm{CI}_{\textrm{EE}[D^*]}{:}{=}{\hat{g}}_{\mathcal {D}^*}({\hat{m}})(i)\pm t_{1-\frac{\alpha }{2}} \sqrt{\hat{\mathbb {V}}_{D^*}[{\hat{g}}_{D^*}({\hat{m}})(i)]}, \end{aligned}$$
(5)

where \(t_{1-\alpha /2}\) is the critical value of the t-statistic. We can also obtain a confidence interval that incorporates both model and estimation uncertainty:

$$\begin{aligned} \textrm{CI}_{\textrm{ME}[{\hat{m}}]\wedge \textrm{EE}[D^*]}{:}{=}{\hat{g}}_{\mathcal {D}^*}({\hat{m}})(i)\pm t_{1-\frac{\alpha }{2}} \sqrt{\hat{\mathbb {V}}_{D,D^*}[{\hat{g}}_{D^*}({\hat{m}})(i)]}. \end{aligned}$$
(6)

For this combined confidence interval to be valid, the estimation of the property descriptions must be unbiased. While unbiasedness of the property description estimator and the ML algorithmFootnote 11 is sometimes sufficient to prove the unbiasedness of property descriptors (Molnar et al., 2023), there is often a tension between the bias-variance trade-off made to obtain a good global model fit and the one that yields the best estimate of the more localized property descriptions (Van der Laan et al., 2011). Fortunately, there exist various debiasing strategies using influence functions (Hines et al., 2022) like the targeted maximum likelihood update (Van Der Laan & Rubin, 2006) or orthogonalization (Chernozhukov et al., 2018).

Figure 6 shows that for students with Portuguese grades between 8 and 17, we can be very confident in our model and the relationship it identifies between math and Portuguese grade.Footnote 12 However, for Portuguese grades outside this range, the true value might be far off from our estimated value, as we can see from the width of the confidence intervals. For these grade ranges, gathering more data may reduce our uncertainty.
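
As a rough sketch of the frequentist resampling approach behind these intervals, the function below refits the model and recomputes the cPDP on bootstrap resamples and forms a pointwise t-based band, analogous to Eq. (6). The wrapper function learn (standing in for the full training pipeline, i.e. the algorithm I) and cpdp_estimate from the sketch above are assumptions; dedicated implementations such as Molnar et al. (2023) use more careful resampling and debiasing schemes.

```python
import pandas as pd
from scipy import stats

def cpdp_confidence_band(df, learn, n_boot=30, alpha=0.05):
    """Pointwise t-based band for the cPDP, reflecting model and estimation variance."""
    curves = []
    for b in range(n_boot):
        d_b = df.sample(frac=1.0, replace=True, random_state=b)  # resampled dataset
        curves.append(cpdp_estimate(learn(d_b), d_b))            # refit model, re-estimate cPDP
    curves = pd.concat(curves, axis=1)                           # rows: grades, columns: bootstrap runs

    mean = curves.mean(axis=1)
    sd = curves.std(axis=1, ddof=1)                              # plug-in estimate of the variance term
    t = stats.t.ppf(1 - alpha / 2, df=n_boot - 1)                # critical value of the t-statistic
    return mean - t * sd, mean + t * sd
```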

Fig. 6
figure 6

a Shows the cPDP (dots) and its estimation error due to Monte-Carlo integration (dashed lines, Eq. 5). b Shows the cPDP with both estimation and model error (Eq. 6). Confidence bands cover the true expected math grade in 95% of all cases. These plots jointly suggest that most of the uncertainty is due to model error

Synopsis of the Steps

We provide in Fig. 7 an overview of the functions and spaces involved in IML for scientific inference. We started from a phenomenon and formalized a scientific question—our estimand Q. Using a learning algorithm I on a dataset \(\mathcal {D}\) representative of the phenomenon, we trained an ML model \({\hat{m}}\) that approximates the optimal model m. We then set out to estimate Q from \({\hat{m}}\). We defined a property descriptor \(g_K\), that is, a function that computes Q from m given K and, correspondingly, approximates Q from \({\hat{m}}\) given K. Because \(g_K\) requires probabilistic knowledge about \(\mathbb {P}(\varvec{X},Y)\), which can only be obtained from data, we introduced a property description estimator \({\hat{g}}_{\mathcal {D}^*}\)—a function estimating Q from the trained model and a finite evaluation dataset \(\mathcal {D}^*\), without requiring K. Finally, we showed how the expected error due to our approximate modeling and finite-data estimation can be quantified with respective confidence intervals \(\textrm{CI}_{\textrm{ME}[{\hat{m}}]}\) and \(\textrm{CI}_{\textrm{EE}[D^*]}\).

Fig. 7
figure 7

From datasets to inferences via ML models. Mappings (arrows) connect datasets with models and descriptions (points within sets represented as ovals; confidence intervals in shades of green). Practical IML descriptions \({\hat{g}}_{{\mathcal {D}}^\star }({\hat{m}})\) (bottom arrow) approximate with quantifiable uncertainty an estimand Q by using a model \({\hat{m}}\) fit on \({\mathcal {D}}\) together with an evaluation dataset \({\mathcal {D}}^*\)

10 Some IML Methods Already Allow Scientific Inference

Many estimands are relevant across a wide variety of scientific domains. The goal of practical IML research for inference should be to define practical property descriptors for such estimands and provide accessible implementations of these descriptors, including quantification of uncertainty. To find out about scientifically relevant estimands, IML researchers, statisticians, and scientists must interact closely.

Table 2 Global and local questions with their matching estimands and property descriptors

In Table 2 we present a few examples of practical inference questions that can be addressed by existing IML methods, i.e., these methods can operate as property descriptors already. We distinguish between global and local questions about the phenomenon: global questions concern general associations (e.g. between math and Portuguese grade), while local questions concern associations in the local neighborhood of an instance (e.g. between study time and math grade for a specific student) and appear with a gray background. The last column identifies current IML methods that provide approximate answers, albeit often without uncertainty quantification. To draw scientific inferences, we ultimately need versions of IML methods that account for the dependencies in the data (Hooker et al., 2021).

Not only has the IML literature worked on property descriptors; the fairness literature has also discussed measures that can be described as property descriptors, e.g. statistical parity (see Verma and Rubin (2018) for an overview). Similarly, robustness measures under distribution shifts (see Freiesleben and Grote (2023) for an overview) as well as methods that examine system stability in physics-informed neural networks (Chen et al., 2018; Raissi et al., 2019) can be seen as property descriptors.

Example: Illustrating the Methods from Table 2 on the Grading Example

To illustrate how the IML methods from Table 2 can help to address general inferential questions, we will introduce them along our grading example. We will begin with a discussion of global questions before moving on to local questions.

cFI

We have seen in the pedagogical example above that the association between language and math skills can be inferred using the cPDP. Another question is whether language skill provides information about math skill that is not contained in other student features (e.g., study time, absences, and parents’ educational background). This is a common question among education scientists (Peng, Lin, Ünal et al., 2020) and can be formalized by asking if language skill \(X_p\) is independent of math skill \(Y\), conditional on the remaining features \(\varvec{X}_{-p}\):

$$\begin{aligned} H_0: X_p \perp Y | \varvec{X}_{-p}. \end{aligned}$$

To test conditional independence, conditional feature importance (cFI) can be used (Ewald et al., 2024; König et al., 2021; Strobl et al., 2008; Watson & Wright, 2021). cFI compares the performance of the model before and after perturbing the feature of interest while preserving the dependencies with the remaining features. If the Portuguese grade is conditionally independent of the math grade, all relevant information in the Portuguese grade can be reconstructed from the remaining features so that the performance is not affected by the perturbation. Thus, if the cFI is nonzero, the Portuguese grade must be conditionally dependent on the math grade given the remaining features. To account for the uncertainties, we rely on Molnar et al. (2023).
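
A rough sketch of this idea in Python is given below. The conditional perturbation is approximated by a crude “predicted value plus shuffled residual” sampler, which assumes roughly additive noise; dedicated cFI implementations (e.g. Watson & Wright, 2021; Ewald et al., 2024) use more careful conditional sampling schemes. The model object, the held-out data, and the column names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def conditional_feature_importance(model, X_test, y_test, feature, seed=0):
    """Loss increase after perturbing `feature` while roughly preserving P(X_p | X_-p)."""
    rng = np.random.default_rng(seed)
    base_loss = mean_squared_error(y_test, model.predict(X_test))

    # Crude conditional sampler: predict the feature from the remaining features
    # and add shuffled residuals to keep the dependence structure approximately intact.
    others = X_test.drop(columns=feature)
    sampler = RandomForestRegressor(random_state=seed).fit(others, X_test[feature])
    residuals = X_test[feature].to_numpy() - sampler.predict(others)

    X_perturbed = X_test.copy()
    X_perturbed[feature] = sampler.predict(others) + rng.permutation(residuals)
    return mean_squared_error(y_test, model.predict(X_perturbed)) - base_loss
```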

In Figure 8, we applied cFI to our ML model \({\hat{m}}\) and computed the respective confidence interval (quantifying both model and estimation uncertainty). The importance of the Portuguese grade for the math grade is significant according to the 90% confidence intervals, as zero is not contained in the interval. On this basis, the scientist may reject \(H_0\).

Fig. 8
figure 8

Conditional feature importance of the Portuguese grade with 90% confidence interval. The confidence interval, which takes into account the model and estimation uncertainty, does not contain the value zero. On this basis, a researcher may reject the null hypothesis that the Portuguese grade is conditionally independent of the math grade given the remaining features

SAGE

Inferential questions similar to those addressed with the cFI can be tackled with Shapley additive global importance values (SAGE, Covert et al., 2020). In contrast to the cFI, however, SAGE averages conditional importance relative to each subset of the features, often referred to as coalitions. Even if a feature has a cFI of zero, the corresponding SAGE values can be positive because the feature has positive importance in at least one of the coalitions (Ewald et al., 2024).

SAGE values are based on so-called surplus contributions of a feature of interest p (say the Portuguese grade) with respect to a coalition S (e.g., the number of absences and neighborhood). More specifically, surplus contributions quantify how model performance changes when the model, which only has access to features in the coalition S, additionally gets access to p.Footnote 13

Notably, the surplus of a feature depends on the coalition S: Adding a dependent feature to the coalition decreases the surplus, e.g. the surplus of the current Portuguese grade is lower if we already know the Portuguese grade from last term. Conversely, adding a collaborating feature can increase a feature’s surplus, e.g., the effect of the mother being unemployed may be larger if the father is also unemployed. By averaging over the surplus of a feature across all possible coalitions (weighted by the number of possible feature orderings in which the coalition precedes the feature of interest), SAGE provides a broad insight into the importance of a feature. The idea behind SAGE is based on the Shapley value (Covert et al., 2020), a concept from cooperative game theory (Shapley et al., 1953).
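
As an illustration, the following sketch estimates a single feature’s SAGE value by Monte Carlo sampling of feature orderings. It assumes a fitted regressor `model`, a pandas feature matrix `X`, and targets `y`; the “model restricted to a coalition” is approximated by imputing out-of-coalition features with values drawn from other rows (marginal imputation), a crude stand-in for the conditional expectations used by Covert et al. (2020).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def sage_value(model, X, y, feature, n_perms=50, seed=0):
    """Monte Carlo estimate of one feature's SAGE value.

    Out-of-coalition features are imputed by marginal sampling (values
    drawn from random rows), a crude stand-in for conditional imputation.
    """
    rng = np.random.default_rng(seed)
    features = list(X.columns)
    n = len(X)

    def coalition_loss(coalition):
        X_imp = X.copy()
        out = [f for f in features if f not in coalition]
        if out:  # replace unknown features with values from random rows
            X_imp[out] = X[out].to_numpy()[rng.integers(0, n, size=n)]
        return mean_squared_error(y, model.predict(X_imp))

    surpluses = []
    for _ in range(n_perms):
        order = list(rng.permutation(features))
        preceding = order[:order.index(feature)]  # coalition S preceding the feature
        surpluses.append(coalition_loss(preceding) -
                         coalition_loss(preceding + [feature]))
    return float(np.mean(surpluses))

# usage (assuming model, X, y exist):
# sage_portuguese = sage_value(model, X, y, "portuguese")
```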

PRIM

What defines the optimal math student? Education scientists have clear ideas about general indicators of student success in math, such as parents’ social and economic status, students’ habits and motivation, and cultural factors (Kuh et al., 2006). On this basis, they may construct a hypothetically optimal math student: highly educated parents who work in STEM professions, high study time and frustration tolerance, and a cultural environment that values education. One approach to testing scientists’ intuitions about the optimal math student is to compare this student’s expected success with the expected success of the optimal student(s) according to the data distribution. The patient rule induction method (PRIM; Friedman & Fisher, 1999) estimates the student(s) with optimal conditions according to the data distribution, thus allowing scientists to test their hypotheses. Note, however, that evaluating such a test can be difficult: the optimal math student(s) according to scientists may lie outside the data distribution, making their expected success difficult to estimate. In general, the estimation of high-dimensional feature vectors rather than scalars may be statistically less stable, leading to greater uncertainty.
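
A minimal sketch of PRIM’s peeling phase illustrates how such a (near-)optimal region of feature space can be estimated from data. It assumes a numeric feature matrix `X` (numpy array) and outcomes `y`; the pasting phase and categorical features of the original method are omitted.

```python
import numpy as np

def prim_peel(X, y, alpha=0.05, min_support=0.1):
    """Simplified PRIM peeling: find a box with high mean outcome.

    At each step, try peeling the lowest or highest alpha-quantile of every
    feature and keep the peel that most increases the mean of y inside the
    box. Stops when no peel helps or too few points remain.
    """
    box = {j: (-np.inf, np.inf) for j in range(X.shape[1])}
    inside = np.ones(len(y), dtype=bool)

    while inside.mean() > min_support:
        best = None
        for j in range(X.shape[1]):
            lo_cut = np.quantile(X[inside, j], alpha)
            hi_cut = np.quantile(X[inside, j], 1 - alpha)
            for new_inside, bound in [
                (inside & (X[:, j] > lo_cut), ("lower", lo_cut)),
                (inside & (X[:, j] < hi_cut), ("upper", hi_cut)),
            ]:
                if new_inside.sum() == 0:
                    continue
                gain = y[new_inside].mean()
                if best is None or gain > best[0]:
                    best = (gain, j, bound, new_inside)
        if best is None or best[0] <= y[inside].mean():
            break  # no peel improves the mean outcome inside the box
        _, j, (side, cut), inside = best
        lo, hi = box[j]
        box[j] = (cut, hi) if side == "lower" else (lo, cut)
    return box, inside
```

The mean of `y` over the returned subgroup estimates the expected success of the (near-)optimal students according to the data, which can then be compared with the expected success of the hypothesized optimal student.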

ICE

Education scientists may wish to infer how study time statistically influences the expected success of individual students in math (Rohrer & Pashler, 2007). One way to formalize this question is via the conditional expectation of the math grade given a set of input features in which study time is varied. This is exactly the quantity that individual conditional expectation curves (ICE; Goldstein et al., 2015) estimate. However, caution is advised, as varying study time may break dependencies with other features, forcing the model to extrapolate (Hooker et al., 2021). On the basis of ICE curves, education scientists may hypothesize that, for a subset of students, namely efficient learners, the effect of increasing study time on performance saturates (Rohrer & Pashler, 2007); a hypothesis that can then be tested in a different study.
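
Computing ICE curves is straightforward; the sketch below assumes a fitted regressor `model` and a pandas feature matrix `X` with a hypothetical column `"study_time"`.

```python
import numpy as np

def ice_curves(model, X, feature, grid=None):
    """Individual conditional expectation curves for one feature.

    Returns a grid and an array of shape (n_instances, n_grid): row i is the
    model's prediction for instance i as `feature` is varied over the grid
    while all other features are held fixed.
    """
    if grid is None:
        grid = np.linspace(X[feature].min(), X[feature].max(), num=20)
    curves = np.empty((len(X), len(grid)))
    for k, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[feature] = value
        curves[:, k] = model.predict(X_mod)
    return grid, curves

# usage: grid, curves = ice_curves(model, X, "study_time")
# each row of `curves` is one student's ICE curve; averaging the rows yields the (marginal) PDP
```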

cSV and ICI

Education scientists want to understand the reasons behind the low expected success of some specific students in math (Saha et al., 2024). One way to approach this question is to investigate how knowing certain features (e.g., study time, language grades, or absences) affects the expected math grade for specific low-performing students. One property descriptor that allows us to approach this question is the conditional Shapley value (cSV; Aas et al., 2021).

Like SAGE values, cSVs are Shapley value methods based on averaging the surplus contribution of a feature across all possible coalitions S. In contrast to SAGE surplus contributions, however, cSV surplus contributions do not quantify the surplus in prediction performance across all students, but the change in the predicted value for a specific student. In our example, the cSV of the Portuguese grade for an individual x tells us how the expected value of the student’s math grade changes when we get access to the Portuguese grade \(x_p\), averaged over all possible coalitions \(x_S\) of features that we already know.

Individual conditional importance (ICI, Casalicchio et al., 2019) also estimates the importance of individual features for a given student, but it quantifies the effect not on the prediction itself but on the accuracy of the prediction (in terms of its loss).
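
The same permutation-sampling machinery as for SAGE yields a local Shapley value once the value of a coalition is defined as the prediction for a single student rather than the dataset-wide loss (replacing the prediction by its loss contribution gives the ICI perspective). The sketch below uses marginal imputation for unknown features; cSV proper imputes them conditionally on the known features (Aas et al., 2021). A fitted `model`, a pandas feature matrix `X`, and the student’s row `x_row` are assumed.

```python
import numpy as np

def local_shapley(model, X, x_row, feature, n_perms=100, seed=0):
    """Monte Carlo Shapley value of `feature` for a single instance.

    The value of a coalition is the model prediction for the instance with
    unknown (out-of-coalition) features taken from a random background row;
    cSV proper would draw them conditionally on the known features.
    """
    rng = np.random.default_rng(seed)
    features = list(X.columns)

    contribs = []
    for _ in range(n_perms):
        order = list(rng.permutation(features))
        preceding = order[:order.index(feature)]
        background = X.sample(n=1, random_state=int(rng.integers(1 << 31)))

        def value(coalition):
            x_imp = background.copy()
            if coalition:  # fix known features to the student's actual values
                x_imp[coalition] = x_row[coalition].to_numpy().reshape(1, -1)
            return float(model.predict(x_imp)[0])

        contribs.append(value(preceding + [feature]) - value(preceding))
    return float(np.mean(contribs))
```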

Counterfactuals

Education scientists are interested in identifying feasible conditions that increase the expected success in math of individual students (Saha et al., 2024). For a given student, this question can be framed as a search for similar alternative student features with higher expected math grades. This is the target estimand of plausible counterfactual explanations (Dandl et al., 2020). Analyzing the difference between the original instance and the alternative student features allows education scientists to identify features that are locally associated with higher student success. For example, plausible counterfactuals for a specific student with a bad math grade, 6 absences, and a mediocre Portuguese grade suggest that reducing the number of absences to 3 would have yielded a better expected math grade. Counterfactuals only reflect associations in the data and should not be interpreted causally (see Sect. 11); however, based on counterfactuals, scientists can derive hypotheses about causally relevant factors, which may then be tested in an experimental study. Note also that there can be (infinitely) many such counterfactual instances, and selecting among them requires domain knowledge (Mothilal et al., 2020).
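
As an illustration, the following naive random-search sketch finds counterfactual candidates for a single student; Dandl et al. (2020) instead use multi-objective optimization with explicit plausibility objectives. It assumes a fitted regressor `model`, a pandas feature matrix `X` of numeric features, the student’s row `x_row`, and a desired math grade `target`.

```python
import numpy as np
import pandas as pd

def counterfactual_search(model, x_row, X, target, n_samples=5000, scale=0.1, seed=0):
    """Naive random search for counterfactual candidates.

    Perturbs the instance with Gaussian noise (scaled per feature), keeps
    candidates whose predicted grade reaches `target`, and ranks them by a
    scaled L1 distance to the original instance. Plausibility is only
    crudely enforced by clipping to the observed feature ranges.
    """
    rng = np.random.default_rng(seed)
    x = x_row.to_numpy(dtype=float)
    spread = X.to_numpy(dtype=float).std(axis=0)
    spread = np.where(spread == 0, 1.0, spread)  # avoid division by zero

    candidates = x + rng.normal(0.0, scale * spread, size=(n_samples, len(x)))
    candidates = np.clip(candidates, X.min().to_numpy(), X.max().to_numpy())
    candidates = pd.DataFrame(candidates, columns=X.columns)

    achieved = candidates[model.predict(candidates) >= target]
    if achieved.empty:
        return achieved
    distances = (np.abs(achieved.to_numpy() - x) / spread).sum(axis=1)
    return achieved.iloc[np.argsort(distances)[:5]]  # five closest counterfactuals
```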

The Delicate Line Between Exploration and Inference

Some of the examples just presented, particularly those involving local descriptors, are concerned with exploring data properties rather than drawing concrete inferences using statistical tests. This is a delicate distinction, according to which inference requires a constrained set of hypotheses to be tested, whereas exploration describes the search for hypotheses (Tredennick et al., 2021).Footnote 14 Our notion of inference—investigating unobserved variables and parameters to draw conclusions from data—is broader and encompasses data exploration. We motivated local property descriptors with these rather exploratory questions because the data support for local estimands is very limited, which leads to greater uncertainty and, consequently, to inferential tests with low statistical power. For this reason, local descriptors are rarely used in practice to test concrete hypotheses.

Disagreement Between Descriptors

There is growing concern in the IML community about the fact that different IML methods disagree. For example, disagreement has been demonstrated for common feature attribution techniques such as Shapley values and LIME, but also for commonly used saliency maps (Adebayo et al., 2018; Ghorbani et al., 2019; Krishna et al., 2022). While at first glance this may suggest that these methods are of limited use in science, descriptors can only diverge meaningfully if they address different inferential questions. If two property descriptors have the same estimand, they must by definition (given enough data) converge to the same property descriptions.

11 Property Descriptors Do Not Generally Provide Causal Inferences

With property descriptors, we can access a wealth of properties of the observational joint distribution that answer various scientific questions. While the observational joint probability distribution is indeed interesting, it remains on rung 1 of the so-called ladder of causation—the associational level (Pearl & Mackenzie, 2018). What scientists are often much more interested in is answering causal questions, such as what the average treatment effect is (rung 2; Imbens & Rubin, 2015) or even counterfactual, what-if questions (rung 3; Salmon, 1998; Woodward & Ross, 2021). In our example, we may be interested not only in how strongly students’ language and math skills are associated (rung 1), but also in how much the provision of tutoring in Portuguese affects students’ math skills (rung 2), or whether a specific student (who is not a native Portuguese speaker) would have done better in math had she received Portuguese tutoring at a young age (rung 3).

Supervised ML models only represent aspects of the observational distribution (rung 1) and, therefore, do not directly provide causal insights (Pearl, 2019). Consequently, property descriptions do not provide causal insights either. Many IML works that discuss causality (Janzing et al., 2020; Schwab & Karlen, 2019; Wang et al., 2021) are only concerned with causal effects on model predictions, which do not necessarily translate into a causal insight into the phenomenon (König et al., 2023)—they address only model audit.

Machine learning methods can still be used to gain causal insights into natural phenomena, but only if additional causal assumptions are made. For example, if the so-called backdoor criterion is fulfilled in the causal graph, we can identify the average causal effect from observational data using the backdoor adjustment formula (Pearl, 2009). Or, formulated in terms of the potential outcomes (PO) frameworkFootnote 15 by Rubin (1974): if conditional exchangeability is fulfilled, we can use the adjustment formula. Given that the causal effect is identified, there are various ML-based approaches to estimate it, such as the T-learner, the S-learner, and doubly robust methods (see Künzel et al. (2019), Knaus (2022), and Dandl (2023) for an overview). Prominent examples are double ML (Chernozhukov et al., 2018; Fink et al., 2023) and targeted learning (Van der Laan et al., 2011), which provide unbiased estimates of various identifiable causal estimands. Note, however, that to arrive at the necessary causal knowledge, we require interventional data and/or have to make strong, untestable assumptions (Rubin, 1974; Holland, 1986; Meek, 2013; Spirtes et al., 2000). Further, even if a causal estimand is identifiable and can therefore be estimated with ML, estimation from finite data may be challenging (Künzel et al., 2019).
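
As a minimal illustration of such ML-based estimators, the sketch below implements a T-learner for the average effect of a binary treatment (say, a hypothetical tutoring indicator `t`); it is valid only under conditional exchangeability and overlap, assumptions the data alone cannot verify.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def t_learner_ate(X, t, y):
    """T-learner estimate of the average treatment effect.

    Fits separate outcome models for treated and control units and averages
    the difference of their predictions. Only valid if treatment assignment
    is unconfounded given X (conditional exchangeability) and overlap holds.
    """
    model_treated = RandomForestRegressor(random_state=0).fit(X[t == 1], y[t == 1])
    model_control = RandomForestRegressor(random_state=0).fit(X[t == 0], y[t == 0])
    return float(np.mean(model_treated.predict(X) - model_control.predict(X)))

# usage (assuming X, y exist and t indicates e.g. Portuguese tutoring):
# ate = t_learner_ate(X, t, y)
```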

In certain simplified scenarios, IML methods applied to associational ML models can be helpful for causal inference. Firstly, if all predictor variables are causally independent and the features cause the prediction target, the causal interpretation of the model implies a causal interpretation of the phenomenon. Secondly, associational models in combination with IML can help estimate causal effects even in the absence of causal independence, if these effects are, in principle, identifiable from observational data. For example, the partial dependence plot coincides with the so-called adjustment formula; it therefore identifies a causal effect if the backdoor criterion (conditional exchangeability) is met and the model optimally predicts the conditional expectation (Zhao & Hastie, 2021), as spelled out below. Thirdly, when there is access to observational and interventional data during training, ML models trained with invariant risk minimization predict accurately in interventional environments (Arjovsky et al., 2019; Peters et al., 2016; Pfister et al., 2021). For such intervention-stable models, IML methods that provide insight into the effect of interventions on the prediction also describe causal effects on the underlying real-world components (König et al., 2023).
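
To spell out the second point: evaluated at a value x of the feature of interest, the partial dependence function averages predictions over the observed values of the remaining features, which is exactly the backdoor adjustment formula,

$$\begin{aligned} \text {PD}_p(x) = \frac{1}{n}\sum _{i=1}^{n} {\hat{m}}\big (x, \varvec{x}_{-p}^{(i)}\big ) \approx \sum _{\varvec{z}} P(\varvec{z})\, E[Y \mid X_p = x, \varvec{Z} = \varvec{z}] = E[Y \mid do(X_p = x)], \end{aligned}$$

where the last equality holds under the assumptions that \(\varvec{Z} = \varvec{X}_{-p}\) satisfies the backdoor criterion and that \({\hat{m}}\) approximates the conditional expectation \(E[Y \mid X_p, \varvec{Z}]\) (Zhao & Hastie, 2021).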

While supervised learning learns from a fixed dataset, reinforcement learning (RL) systems are designed to act and can, therefore, assess the effect of their interventions. As such, RL models can be designed to provide causal interpretations (Bareinboim et al., 2015; Gasse et al., 2021; Zhang & Bareinboim, 2017).

Finally, another way in which ML supports causal inference is by facilitating practical scientific inference based on potentially complex, but still ER, mechanistic models that are frequently implemented as numerical simulators. Indeed, simulators can represent complex causal dynamics in an ER fashion, but often at the price of an intractable likelihood and, thus, expensive or even intractable inference. A variety of new methods for likelihood-free inference (Cranmer et al., 2020) allow us to estimate a full posterior distribution over ER parameters for increasingly complex models using ML.

12 Discussion

ER models enable straightforward scientific inference because their elements are meaningful: they directly represent elements of the underlying phenomenon. While ML models are generally not ER, property descriptors can offer an indirect route to scientific inference, provided whole-model properties have a corresponding phenomenon counterpart. We have shown how phenomenon representation can be accessed through optimal predictors and described how to practically construct property descriptors in four steps: the first two steps clarify how domain questions and property descriptors can be theoretically connected, step three shows how to practically estimate property descriptions with ML models and data, and step four evaluates how much the estimated property descriptions may deviate from the ground truth. We highlighted current IML methods that can already be seen as property descriptors and identified which inference questions they answer.

Scope for HR Modeling in Science

ML models represent just an extreme case in which almost no element of the model is representational. There is a continuum between ER models and HR-only models—ranging from full ER models like Newton’s gravitational laws, over intermediate statistical models containing higher-order interaction terms, to full-blown HR models like deep neural networks. Our main message is: the four-step approach can be used to extend inference to any non-ER model, whether ML or not.

Why should scientists use HR models at all? Rudin (2019) argued that whenever there is an HR model with high predictive accuracy, there is also an ER model that achieves similar performance. She backs up her argument with several examples, e.g. loan risk prediction (Rudin & Radin, 2019), recidivism prediction (Zeng et al., 2017), finding patterns in EEG data (Li et al., 2017), and even image classification (Chen et al., 2019), where (relatively) interpretable models achieve performance comparable to less interpretable HR models. She therefore recommends favoring ER models in high-stakes settings.

In science, the stakes are quite high and interpretability is highly valued. Therefore, we believe that highly predictive ER models, when available, should be favored in science over HR models. The problem is that while such powerful ER models may exist, they can be hard to find, especially in model classes of high complexity (Nearing et al., 2021). While reducing the complexity of the model class can help to find high-performing ER models in noisy environments (Semenova et al., 2024), this requires substantial domain knowledge about the data-generating process, which is often not available. Choosing simplified models with little predictive performance just for the sake of inherent interpretability is not a viable path (Breiman, 2001b). Interpreting models that poorly approximate the phenomenon will lead to unreliable conclusions (Cox, 2006; Good & Hardin, 2012).Footnote 16

However, there could be a reasonable compromise between ER and HR modeling, where some parts of the model are ER, while other parts, where fewer parametric assumptions are justified, are filled with powerful HR models. This approach is commonly taken at the intersection of causal modeling and ML, where the causal graph is ER whereas the functional dependencies are modeled with HR models (Peters et al., 2017). Similarly, using flexible ML models while enforcing constraints such as additivity in the training process can lead to partially ER models (Hothorn et al., 2010; Van der Laan et al., 2011). In physics, we may want to model a phenomenon with a classical ER model, namely ordinary differential equations, without constraining the functional form (Dai & Li, 2022), or we may want to enforce a preference for certain functions (e.g. exponential or sine) without limiting the possible dependencies (Martius & Lampert, 2016).

Limitations of Our Framework

How Useful are Property Descriptors for Scientific Inference in Practice?

We have shown that, under certain assumptions, property descriptors provide insight into the phenomenon. However, we have not shown that property descriptors are the best approach to gain this insight. The cPDP, for example, could also be estimated directly from the data without interposing an ML model and a property descriptor. Does it make sense to take the detour via the ML model and property descriptors instead of directly estimating the quantity of interest, e.g. using targeted learning (Van der Laan et al., 2011)?

One case is when access to the training data or to the computational resources needed to fit a comparable model is limited, as is common with proprietary models like ChatGPT or other sophisticated models like AlphaFold (Senior et al., 2020) or ResNet (He et al., 2016). Due to the data and expertise that went into these models, they are ideal candidates for mining for insights with IML tools. However, computing interpretations for such models can itself be computationally very expensive. Another use case for property descriptors is when scientists have set out to obtain predictions but want to gain additional insights from their model at low cost.

Irrespective of whether scientists should use property descriptors to make concrete scientific inferences, ample evidence in the published scientific literature shows that scientists use IML tools and draw inferences based on these interpretations (Gibson et al., 2021; Roscher et al., 2020; Shahhosseini et al., 2020). Our paper can help clarify what inferences scientists can draw from interpretations and which IML tools to use.

To make a fair comparison between the inferential qualities of targeted learning and property descriptors, a systematic study would be needed. Under what conditions do estimates based on property descriptors (e.g. conditional feature importance; Strobl et al., 2008) differ from targeted learning approaches (e.g. LOCO; Lei et al., 2018; Verdinelli & Wasserman, 2024)? First works on this question (Hooker et al., 2021; Verdinelli & Wasserman, 2024) indicate that targeted learning may be better suited for standard inferential tasks such as estimating feature importance.

How to Obtain Realistic Data?

Many IML methods (e.g. Shapley values, LIME) rely on probing the ML model on modified instances (Scholbeck et al., 2019). Such artificial “data” may be useful to audit the model, even though they may never occur in the real world. However, if we want to learn about the world, artificial data must credibly supplement observations. Like others (Hooker & Mentch, 2021; Hooker et al., 2021; Mentch & Hooker, 2016), we recommend respecting the dependency structure in the data if we strive to draw valid scientific inferences with IML.

However, obtaining realistic data is hard. Strategies such as our grade-jitter strategy are useful, but they require expert domain knowledge of the dependency structure in the data. Conditional density estimation techniques (e.g. probabilistic circuits; Choi et al., 2020) or generative models (e.g. generative adversarial networks, normalizing flows, variational autoencoders, etc.) provide paths to generating realistic data without presupposing expert knowledge. Unfortunately, they are often computationally intensive. Also, current IML software implementations often only offer marginal versions of IML methods, which are unsuited as property descriptors. We urge the IML research community to provide efficient implementations of conditional samplers and to integrate them into IML packages.
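
As one pragmatic middle ground, the following sketch implements a nearest-neighbour conditional sampler: for each instance, the feature of interest is redrawn from the instances most similar on the remaining features. It assumes a numeric pandas feature matrix `X`; it is far cruder than probabilistic circuits or deep generative models, but requires neither expert knowledge nor heavy computation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def knn_conditional_sample(X, feature, k=20, seed=0):
    """Sample each row's `feature` value from its k nearest neighbours
    in the space of the remaining features.

    A crude conditional sampler: it roughly preserves the dependence of the
    feature on the remaining features without a parametric model.
    """
    rng = np.random.default_rng(seed)
    rest = StandardScaler().fit_transform(X.drop(columns=[feature]))
    nn = NearestNeighbors(n_neighbors=k).fit(rest)
    _, idx = nn.kneighbors(rest)  # idx: (n, k) neighbour indices (incl. the row itself)
    chosen = idx[np.arange(len(X)), rng.integers(0, k, size=len(X))]
    X_new = X.copy()
    X_new[feature] = X[feature].to_numpy()[chosen]
    return X_new

# usage: X_tilde = knn_conditional_sample(X, "portuguese")
# X_tilde can then serve as the conditionally perturbed data in, e.g., a cFI computation
```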

Can We Use Property Descriptors to Encode Background Knowledge?

We showed how to use property descriptors to extract knowledge from models. However, the converse direction of incorporating knowledge into models is also central for scientific progress (Dwivedi et al., 2021; Nearing et al., 2021; Razavi, 2021). There are already approaches that allow enforcing monotonicity constraints (Chipman et al., 2022) or sparsity in the training process (Martius & Lampert, 2016). But property descriptors can also be used to constrain the set of allowable models.Footnote 17 For example, we could promote specific property descriptions during training by modifying the loss function to penalize models that deviate from them, as sketched below. Such strategies are indeed common in the fairness literature, where loss functions are designed to optimize for certain fairness metrics, which can be seen as property descriptors (see Pessach & Shmueli, 2022, for an overview).
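
As a minimal illustration of this idea, the sketch below trains a linear model whose loss is augmented with a statistical-parity penalty, i.e. the squared gap between the mean predictions of two groups. The group indicator `group` and the trade-off weight `lam` are hypothetical, and the same pattern applies to other differentiable property descriptions.

```python
import numpy as np

def train_with_parity_penalty(X, y, group, lam=1.0, lr=0.01, n_steps=2000):
    """Linear model trained with an added statistical-parity penalty.

    The penalty is the squared gap between the mean predictions in the two
    groups; `lam` trades prediction error against the desired property
    description. X is assumed numeric (ideally standardized).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    g = np.asarray(group).astype(bool)
    w = np.zeros(X.shape[1])

    for _ in range(n_steps):
        pred = X @ w
        grad_mse = 2 * X.T @ (pred - y) / len(y)            # gradient of mean squared error
        gap = pred[g].mean() - pred[~g].mean()              # statistical-parity gap
        grad_gap = X[g].mean(axis=0) - X[~g].mean(axis=0)   # gradient of the gap w.r.t. w
        w -= lr * (grad_mse + lam * 2 * gap * grad_gap)
    return w
```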

What About Non-Tabular Data?

For some data types, such as images, audio, or video data, it is extremely difficult to formalize the estimand only in terms of low-level features such as pixels or audio frequencies. To follow our approach, we need a translation between the high-level concepts (e.g. objects in images or words in audio) that scientists use to formulate their questions and the low-level features (e.g. pixels or audio frequencies) that the model works with. Such translations are notoriously difficult to find; proposals either rely on labeled data to learn such representations (Jia et al., 2013; Koh et al., 2020; Zaeem & Komeili, 2021) or discuss constraints to learn them via unsupervised learning (Bengio et al., 2013; Schölkopf et al., 2021). We think that such a translation between low-level features and high-level concepts is one of the most pressing problems of IML research.

13 Conclusion

Traditional scientific models were designed to satisfy elementwise representationality, allowing scientists to learn about Nature by direct inspection of model elements. While ML models trained for prediction do not satisfy elementwise representationality, they do offer a unique ability to represent complex phenomena by digesting enormous amounts of noisy, multivariate and even multimodal observations. We have shown that it is still possible to learn about the phenomenon using them: all we need to do is to interrogate the model with suitable property descriptors. Our approach provides philosophers, IML researchers and scientists with a novel philosophical perspective on scientific representation with ML models and a valuable methodology for gaining insight into phenomena from such models using interpretation methods.