1 Introduction

The research on and application of artificial intelligence (AI) have triggered a comprehensive scientific, economic, social and political discussion. Here we argue that statistics, as an interdisciplinary scientific field, plays a substantial role, both for the theoretical and practical understanding of AI and for its further development.

Contrary to the public perception, AI is not a new phenomenon. AI was already discussed in 1956 at the Dartmouth Conference (Moor 2006; Solomonoff 1985), and the first data-driven algorithms such as the Perceptron (Rosenblatt 1958), backpropagation (Kelley 1960) and the so-called ‘Lernmatrix’, an early neural system (Steinbuch 1961; Hilberg 1995), were developed in the 1950s and 1960s. The Lighthill Report of 1973 passed a predominantly negative judgment on AI research in Great Britain and led to an almost complete halt of financial support for AI research (the so-called first AI winter). The following phase of predominantly knowledge-based development ended in 1987 with the so-called second AI winter, a period of reduced public interest in and funding of AI. Nonetheless, in 1988, Judea Pearl published his book ‘Probabilistic Reasoning in Intelligent Systems’, for which he received the Turing Award in 2011 (Pearl 1988). From the beginning of the 1990s, AI has been developing again with major breakthroughs like support vector machines (Cortes and Vapnik 1995), random forests (Breiman 2001), Bayesian methods (Zhu et al. 2017), boosting and bagging (Freund and Schapire 1997; Breiman 1996), deep learning (Schmidhuber 2015) and extreme learning machines (Huang et al. 2006).

Today, AI plays an increasingly important role in many areas of life. International organizations and national governments have recently positioned themselves on AI or introduced new regulatory frameworks. Examples are, among others, the AI strategy of the German government (Bundesregierung 2018), the 2019 statement of the Data Ethics Commission (Data Ethics Commission of the Federal Government, Federal Ministry of the Interior, Building and Community 2019) and the report of the Nuffield Foundation in the UK (Nuffield Foundation 2019). Similarly, the European Commission recently published a white paper on AI (European Commission 2020b). Furthermore, regulatory authorities such as the US Food and Drug Administration (FDA) are now also dealing with AI topics and their evaluation. In 2018, for example, the electrocardiogram function of the Apple Watch was the first AI application to be approved by the FDA (MedTechIntelligence 2018).

There is no unique and comprehensive definition of artificial intelligence. Two concepts are commonly used distinguishing weak and strong AI. Searle (1980) defined them as follows: ‘According to weak AI, the principal value of the computer in the study of the mind is that it gives us a very powerful tool. [...] But according to strong AI, the computer is not merely a tool in the study of the mind; rather, the appropriately programmed computer really is a mind [...]’. Thus, strong AI essentially describes a form of machine intelligence that is equal to human intelligence or even improves upon it, while weak AI (sometimes also referred to as narrow AI) is limited to tractable applications in specific domains. Following this definition, we will focus on weak AI in this paper in the sense that we consider self-learning systems that solve specific application problems based on methods from mathematics, statistics and computer science. Consequently, we will focus on the data-driven aspects of AI in this paper. In addition, there are many areas in AI that deal with the processing of and drawing inference from symbolic data (Bock and Diday 2000; Billard and Diday 2006). In contrast to standard data tables, symbolic data may consist of, e.g. lists, intervals, etc. Thus, special methods for data aggregation and analysis are necessary, which will not be discussed here.

As with AI, there is neither a single definition of machine learning (ML) nor a uniform assignment of methods to this field in the literature and in practice. Often, ML is considered a subset of AI approaches. Based on Simon’s definition from 1983 (Simon 1983), learning describes changes of a system such that a similar task can be performed more effectively or efficiently the next time. Bishop (2006) describes machine learning as the ‘automatic discovery of regularities in data through the use of computer algorithms [...]’. Following these concepts, we use AI in a very general sense in this paper, whereas ML refers to more specific (statistical) algorithms.

Often the terms AI and ML are mentioned along with Big Data (Gudivada et al. 2015) or Data Science, sometimes even used interchangeably. However, neither are AI methods necessary to solve Big Data problems, nor are methods from AI only applicable to Big Data. Data Science, on the other hand, is usually considered as an intersection of computer science, statistics and the respective scientific discipline. Therefore, it is not bound to certain methods or certain data conditions.

This paper aims at contributing to the current discussion about AI by highlighting the relevance of statistical methodology in the context of AI development and application. Statistics can make important contributions to a more successful and secure use of AI systems, for example with regard to

  1. Design (Sect. 3): bias reduction; validation; representativity; selection of variables

  2. Assessment of data quality (Sect. 4): standards for the quality of diagnostic tests and audits; dealing with missing values

  3. Differentiation between causality and associations (Sect. 5): consideration of covariate effects; answering causal questions; simulation of interventions

  4. Assessment of certainty or uncertainty in results (Sect. 6): increasing interpretability; mathematical validity proofs or theoretical properties in certain AI contexts; providing stochastic simulation designs; accurate analysis of the quality criteria of algorithms in the AI context

The remainder of the paper is organized as follows: First, we present an overview of AI applications and methods in Sect. 2. We continue by expanding on points 1–4 in Sects. 3, 4, 5 and 6. We conclude with Sect. 7, where we also discuss the increased need for teaching and further education targeting AI-related literacy (particularly with respect to the underlying statistical concepts) at all educational levels.

2 Applications and methods of AI

Important categories of AI approaches are supervised learning, unsupervised learning and reinforcement learning (Sutton and Barto 2018). In supervised learning, AI systems learn from training data with known output such as true class labels or responses. Thus, the aim is to learn some function \(g: X \rightarrow Y\) describing the relationship between an \(n \times p\) matrix of given features \({\mathbf {X}} \subset X\) and the vector of labels \({\mathbf {Y}}= (y_1, \dots , y_n)' \subset Y\). Here, n denotes the number of observations, p is the number of features and X and Y describe the input and output space, respectively. Examples include, among others, support-vector machines, linear and logistic regression or decision trees. In contrast, unsupervised learning extracts patterns from unlabeled data, i.e. without the \(y_i\)s in the notation above. The most well-known examples include principal component analysis and clustering. Finally, reinforcement learning originates from robotics and describes the situation where an ‘agent’ (i.e. an autonomous entity with the ability to act and direct its activity towards achieving goals) learns through trial-and-error search. Markov decision processes from probability theory play an important role here (Sutton and Barto 2018). The input data to an AI algorithm can be measured values such as stock market prices, audio signals, climate data or texts, but may also describe very complex relationships, such as chess games. In the following, we provide some specific examples of AI applications.
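
To make this distinction concrete, the following minimal Python sketch (using the widely available scikit-learn library; the data are simulated and purely illustrative) fits a supervised classifier, i.e. an estimate of \(g: X \rightarrow Y\), on labeled data and, in contrast, applies an unsupervised clustering method to the features alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Simulated data: n = 500 observations, p = 4 features, binary labels y
n, p = 500, 4
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Supervised learning: estimate g: X -> Y from labeled training data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# Unsupervised learning: extract structure from the features alone (no labels used)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))
```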

2.1 Applications of AI

AI has made remarkable progress in various fields of application. These include automated face recognition, automated speech recognition and translation (Barrachina et al. 2009), object tracking in film material, autonomous driving, and strategy games such as chess or Go, where computer programs now beat the best human players (Koch 2016; Silver et al. 2018).

Especially for tasks in speech recognition as well as text analysis and translation, Hidden Markov models from statistics are used and further developed with great success (Juang and Rabiner 1991; Kozielski et al. 2013) because they are capable of representing grammars. Nowadays, automatic language translation systems can even translate languages such as Chinese into languages of the European language family in real time and are used, for example, by the EU (European Commission 2020a). Another growing area for AI applications is medicine. Here, AI is used, e.g., to improve the early detection of diseases, for more accurate diagnoses, or to predict acute events (Burt et al. 2018; Chen et al. 2018), see also Friedrich et al. (2021) for a recent overview. Directions for future developments include personalized medicine aiming at tailoring treatments to patient subgroups (strata) or even individual patients (Hamburg and Collins 2010; Blasiak et al. 2020; Schork 2019). Furthermore, official statistics uses AI methods for classification as well as for recognition, estimation and/or imputation of relevant characteristic values of statistical units (Beck et al. 2018; Ramosaj and Pauly 2019b; Ramosaj et al. 2020; UNECE 2020; Thurow et al. 2021). In economics and econometrics, AI methods are also applied and further developed, for example, to draw conclusions about macroeconomic developments from large amounts of data on individual consumer behavior (McCracken and Ng 2016; Ng 2018).

Despite these positive developments that also dominate the public debate, some caution is advisable. There are a number of reports about the limits of AI, e.g., in the case of a fatal accident involving an autonomously driving vehicle (Wired.com 2019). Due to the potentially serious consequences of false positive or false negative decisions in AI applications, careful consideration of these systems is required (AInow 2020). This is especially true in applications such as video surveillance of public spaces. For instance, a pilot study conducted by the German Federal Police at the Südkreuz suburban railway station in Berlin has shown that automated facial recognition systems for identification of violent offenders currently have false acceptance rates of 0.67% (test phase 1) and 0.34% (test phase 2) on average (Bundespolizeipräsidium Potsdam 2018). This means that almost one in 150 (or one in 294) passers-by is falsely classified as a violent offender. In medicine, wrong decisions can also have drastic and negative effects, such as an unnecessary surgery and chemotherapy in the case of wrong cancer diagnoses. Corresponding test procedures for assessing such diagnostic tests for medicine are currently being developed by regulators such as the US FDA (FDA 2019).
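
To illustrate why even seemingly small false acceptance rates warrant careful consideration, the following back-of-the-envelope calculation translates such rates into an expected number of false alarms and a positive predictive value. The daily passenger count, the prevalence of offenders and the detection rate are purely hypothetical assumptions made for this illustration and are not taken from the pilot study.

```python
# Hypothetical illustration of the base-rate problem in automated screening;
# passenger count, offender prevalence and detection rate are assumed values.
far = 0.0067                 # false acceptance rate reported for test phase 1
sensitivity = 0.80           # assumed detection rate
passengers_per_day = 90_000  # assumed daily passers-by
offenders_per_day = 10       # assumed true matches among them

false_alarms = far * (passengers_per_day - offenders_per_day)
true_alarms = sensitivity * offenders_per_day
ppv = true_alarms / (true_alarms + false_alarms)

print(f"expected false alarms per day: {false_alarms:.0f}")
print(f"positive predictive value:     {ppv:.3f}")
```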

2.2 Methods for AI and the role of statistics

Even though many of the contributions to AI systems originate from computer science, statistics has played an important role throughout. Early examples occurred in the context of realizing the relationship between backpropagation and nonlinear least squares methods, see, e.g., Warner and Misra (1996). Important ML methods such as random forests (Breiman 2001) or support vector machines (Cortes and Vapnik 1995) were developed by statisticians. Others, like radial basis function networks (Chen et al. 1991), can also be considered and studied as nonlinear regression models in statistics. Recent developments such as extreme learning machines or broad learning systems (Chen and Liu 2018) have close links to multiple multivariate and ridge regression, i.e. to statistical methods. The theoretical validity of machine learning methods, e.g., through consistency statements and generalization bounds (Györfi et al. 2002; Vapnik 1998), also requires substantial knowledge of mathematical statistics and probability theory.

To capture the role and relevance of statistics, we consider the entire process of establishing an AI application. As illustrated in Fig. 1, various steps are necessary to examine a research question empirically. For more details on these steps see, e.g., Weihs and Ickstadt (2018). Starting with the precise formulation of the research question the process then runs through a study design stage (including sample size planning and bias control) to the mathematical formulation (e.g. as an optimization problem) and the numerical analysis. Finally, the results must be interpreted. AI often focuses on the step of data analysis while the other stages receive less attention or are even ignored. This may result in critical issues and possibly misleading interpretations, such as sampling bias or the application of inappropriate analysis tools requiring assumptions not met by the chosen design.

Fig. 1 Flow chart of study planning, design, analysis and interpretation

3 Statistical approaches for study design and validation

The design of a study and of the data to be considered is the basis for the validity of the conclusions. Unfortunately, AI applications often use data that were collected for a different purpose (so-called secondary data, observational studies). The collection and compilation of secondary data is in general not driven by a specific research question. Instead, such data are collected for other reasons, e.g. accounting or storage. A typical case is the scientific use of routine data. For example, the AI models in a recent study about predicting medical events (such as hospitalization) are based on medical billing data (Lin et al. 2019). Another typical case concerns convenience samples, that is, samples that are not randomly drawn but instead depend on ‘availability’. Well-known examples are online questionnaires, which only reach those people who visit the corresponding homepage and take the time to answer the questions. The concept of knowledge discovery in databases (Fayyad et al. 1996) very clearly reflects the assumption that data are regarded as a given basis from which information and knowledge can be extracted by AI procedures. This is contrary to the traditional empirical research process, in which empirically testable research questions are derived from theoretical questions by conceptualization and operationalization. Importantly, the resulting measurement variables are then collected for this specific purpose.

3.1 Validation

Statistics distinguishes between two types of validity (Shadish et al. 2002):

  1. Internal validity is the ability to attribute a change in an outcome of a study to the investigated causes. In clinical research, e.g., this type of validity is ensured through randomization in controlled trials. Internal validity in the context of AI and ML can also refer to avoiding systematic bias (such as systematically underestimated risks).

  2. External validity is the ability to transfer the observed effects and relationships to larger or different populations, environments, situations, etc. In the social sciences (e.g. in the context of opinion research), an attempt to achieve this type of validity is survey sampling, which comprises sampling methods that aim at representative samples in the sense of Gabler and Häder (2018), see also Kruskal and Mosteller (1979a, 1979b, 1979c, 1980).

These validation aspects are important, but different traditions exist for AI algorithms and statistics: While ML has a longstanding benchmarking tradition and often uses many datasets for evaluation, statistics tends to rely on theory and simulations augmented by one or two convincing data examples. Here, statistics makes use of probabilistic models in order to reflect a diversity of real life situations. In addition to mathematical validity proofs and theoretical investigations, detailed simulation studies are carried out to evaluate the methods’ limits (by exceeding the assumptions made) and finite sample properties in situations where certain properties can only be proven asymptotically. This statistical perspective provides useful insights.

Concepts and guidelines for designing, structuring and reporting simulation studies have a longstanding tradition in medical statistics, see for example Burton et al. (2006), Friede et al. (2010), Benda et al. (2010), Morris et al. (2019).
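
As a minimal illustration of the kind of simulation study referred to above (a toy setup, not drawn from the cited guidelines), the following sketch estimates the empirical coverage of a nominal 95% confidence interval for a mean when the data-generating distribution is skewed and the sample size is small:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_sim, true_mean = 20, 5000, 1.0   # small samples from a skewed distribution

covered = 0
for _ in range(n_sim):
    x = rng.exponential(scale=true_mean, size=n)
    m, se = x.mean(), x.std(ddof=1) / np.sqrt(n)
    tcrit = stats.t.ppf(0.975, df=n - 1)
    covered += (m - tcrit * se <= true_mean <= m + tcrit * se)

print(f"empirical coverage of the nominal 95% t-interval: {covered / n_sim:.3f}")
```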

A particular challenge for the validation of AI systems is posed by ever faster development cycles, which require continuous investigation. This is further aggravated for development processes of mobile apps or online learning systems such as recommender systems in online shopping portals. Here, development is a dynamic, de facto never-ending process, which therefore requires continuous validation.

Another important factor for the validity and reliability of a study is the sample size (Meng 2018). For high-dimensional models, an additional factor is ‘sparsity’: In many applications, the input data for AI techniques are very high-dimensional, i.e. a large number of variables p (also called features) are observed with diverse ranges of possible values. In addition, non-linear relationships with complex interactions are often considered for prediction. It is well known that high-dimensional data are difficult to handle if sample sizes are small, i.e. if \(p \gg n\). However, even with sample sizes in the order of millions, the curse of dimensionality arises (Bellman 1957): data are thin and sparse in a high-dimensional space, and often only a few variables are related to the outcome. Therefore, learning the structure of the high-dimensional space from these thin data typically requires an enormous amount of training data. Through statistical models and corresponding mathematical approximations or numerical simulations, statisticians can assess the potentials and limits of an AI application for a given number of cases or estimate the necessary number of cases in the planning stage of a study. This is not routine work; instead, it requires advanced statistical training, competence and experience.
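
A small simulation (purely illustrative) makes the curse of dimensionality tangible: as the number of features p grows, the pairwise distances between uniformly drawn points concentrate, so that the nearest and the farthest neighbor of a point become almost indistinguishable and the data become ‘thin’ everywhere.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500  # fixed number of observations

for p in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n, p))
    # Euclidean distances from the first point to all other points
    d = np.linalg.norm(X[1:] - X[0], axis=1)
    print(f"p = {p:4d}: ratio of nearest to farthest distance = {d.min() / d.max():.3f}")
```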

Thus, statistics can help in collecting and processing data for subsequent use in AI pipelines. Basic statistical techniques that are relevant for this aspect include, for example, the modeling of the data generating process, restrictions on data sets (Rubin 2008), and factorial design of experiments, which is a controlled variation of factors highlighting their respective influence. In addition, the various phases in the development of a diagnostic test are well known in statistics (Pepe 2003), with (external) validation on independent data playing a crucial role. In many AI applications, however, the final evaluation phase on external data is never reached, since the initial algorithms have been replaced in the meantime. Also, statistical measures of quality such as sensitivity, specificity, ROC curves and calibration are used in the evaluation of AI methods. And finally, statistics can help in the assessment of uncertainty (Sect. 6).
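
As an illustration of this evaluation step, the following sketch (simulated data; scikit-learn used for the metrics) computes several of these statistical quality measures for a binary classifier. It is a generic example, not a recipe tied to any specific AI system.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score, brier_score_loss

X, y = make_classification(n_samples=2000, n_features=10, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]
pred = (prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("area under the ROC curve:", roc_auc_score(y_te, prob))
print("Brier score (related to calibration):", brier_score_loss(y_te, prob))
```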

3.2 Representativity

The naive expectation that sufficiently large data automatically leads to representativity is wrong (Meng 2018; Meng and Xie 2014). A prominent example is Google Flu (Lazer et al. 2014), where flu outbreaks were predicted on the basis of search queries: it turned out that the actual prevalence of the flu was overestimated considerably. Another example is Microsoft’s chatbot Tay (Davis 2016; Wolf et al. 2017), which was designed to mimic the speech pattern of a 19-year-old American girl and to learn from interactions with human users on Twitter: after only a short time, the bot posted offensive and insulting tweets, forcing Microsoft to shut down the service just 16 hours after it started. And yet another example is the recently published Apple Heart Study (Perez et al. 2019), which examined the ability of Apple Watch to detect atrial fibrillation: there were more than 400,000 participants, but the average age was 41 years, which is particularly problematic in view of atrial fibrillation occurring almost exclusively in people over 65 years of age.

3.3 Bias

If data are not collected carefully, spurious correlations and bias can distort the conclusions. Many forms of bias exist, such as selection, attrition, performance and detection bias. While bias in the context of statistics usually refers to the deviation between the estimated and the true value of a parameter, there are also other concepts such as cognitive biases or, as Ntoutsi et al. (2020) put it, ‘the inclination or prejudice of a decision made by an AI system which is for or against one person or group, especially in a way considered to be unfair’. A classic example of such distortion is Simpson’s paradox (Simpson 1951), which describes a reversal of group-specific trends when subgroups are disregarded, see Fig. 2 and the simulation sketch below. Further examples are biases inflicted by how the data are collected, such as length time bias (Porta 2016) or prejudices introduced by AI, see Ntoutsi et al. (2020) for a recent overview of this topic.
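
The following simulation (hypothetical data) reproduces the pattern of Fig. 2: within each of two groups the regression slope is positive, but pooling the groups without accounting for group membership yields a negative slope.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200

# Two groups with positive within-group trends but shifted locations
x1 = rng.normal(loc=0.0, size=n)
y1 = x1 + 5 + rng.normal(scale=0.5, size=n)
x2 = rng.normal(loc=4.0, size=n)
y2 = x2 - 5 + rng.normal(scale=0.5, size=n)

slope_group1 = np.polyfit(x1, y1, 1)[0]
slope_group2 = np.polyfit(x2, y2, 1)[0]
slope_pooled = np.polyfit(np.concatenate([x1, x2]), np.concatenate([y1, y2]), 1)[0]

print(f"slope in group 1:  {slope_group1:+.2f}")   # positive
print(f"slope in group 2:  {slope_group2:+.2f}")   # positive
print(f"slope when pooled: {slope_pooled:+.2f}")   # negative
```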

Statistics provides methods and principles for minimizing bias. Examples include the assessment of the risk of bias in medicine (Higgins et al. 2011), stratification, marginal analyses, consideration of interactions, meta-analyses, and techniques specifically designed for data collection such as (partial) randomization, (partial) blinding, and methods of so-called optimal designs (Karlin and Studden 1966). Statistics also provides designs that allow for the verification of internal and external validity (Bartels et al. 2018; Braver and Smith 1996; Roe and Just 2009).

Fig. 2 Simpson’s paradox for continuous data: a positive trend is visible for both groups individually (red and blue), but a negative trend (dashed line) appears when the data are pooled across groups (Wikipedia 2020) (color figure online)

3.4 Model stability and reproducibility

Whether there is interest in a model for prediction or in a descriptive model, model stability, i.e. the robustness of the model towards small changes in the input values, plays an important role. Variable selection methods are used to derive descriptive models, and model complexity has an important influence on the choice of the methods. In a recent review of variable selection procedures, Heinze et al. (2018) emphasize the important role of stability investigations. This issue is also mentioned in Sauerbrei et al. (2020) as one of the main target parameters for the comparison of variable selection strategies. Statistical concepts for assessing stability, such as stability selection, have been introduced by Meinshausen and Bühlmann (2010) as well as in the context of random forests (Breiman 2001).
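
A minimal sketch of the basic idea of stability selection in the spirit of Meinshausen and Bühlmann (2010) is given below; it is deliberately simplified, and the subsample size, penalty and selection threshold are illustrative choices only.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]          # only three variables are truly relevant
y = X @ beta + rng.normal(size=n)

n_subsamples = 100
freq = np.zeros(p)
for _ in range(n_subsamples):
    idx = rng.choice(n, size=n // 2, replace=False)   # random subsample of half the data
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    freq += (coef != 0)

selection_frequency = freq / n_subsamples
print("variables selected in at least 80% of subsamples:",
      np.where(selection_frequency >= 0.8)[0])
```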

Conscientiously sticking to the principles mentioned above and adhering to a previously defined study design also counteracts the so-called replication crisis (Pashler and Wagenmakers 2012). In this methodological crisis, which has been ongoing since the beginning of the 2010s, it has become clear that many studies, especially in medicine and the social sciences, are difficult or impossible to reproduce. Since reproducibility of experimental results is an essential part of scientific methodology (Staddon 2017), an inability to replicate the studies of others can have grave consequences for many fields of science. The replication crisis has been particularly widely discussed in psychology and medicine, where a number of efforts have been made to re-investigate previous findings in order to determine their reliability (Begley and Ellis 2012; Makel et al. 2012). A related issue is transparency. While this is an important concept in any empirical analysis, it has especially become an issue discussed in the context of AI applications, see for example Flake and Fried (2020), Haibe-Kains et al. (2020), Simons et al. (2017).

4 Statistics for the assessment of data quality

‘Data is the new oil of the global economy.’ According to, e.g., the New York Times (New York Times 2018) or the Economist (The Economist 2017), this credo echoes incessantly through start-up conferences and founder forums. The metaphor is popular, but it is also flawed. First of all, data in this context corresponds to crude oil, which needs further refining before it can be used. In addition, the resource crude oil is limited: ‘For a start, while oil is a finite resource, data is effectively infinitely durable and reusable’ [Bernard Marr in Forbes (2018)]. All the more important is a responsible approach to data preprocessing (Fig. 3).

Fig. 3 Data relevancy and quality are equivalent components of a fit-for-purpose real-world data set. Figure according to Duke-Margolis (2018)

Ensuring data quality is of great importance in all analyses, according to the popular slogan ‘Garbage in, garbage out.’ As already mentioned in the previous section, we mainly use secondary data in the context of AI. In AI, the process of operationalization is often replaced by the ETL process: ‘Extract, Transform, Load’ (Theodorou et al. 2017). Relevant measurements are to be extracted from the data lake(s), then transformed and finally loaded into the (automated) analysis procedures. Many AI procedures are thereby expected to be able to distill relevant influencing variables from high-dimensional data.

The success of this procedure fundamentally depends on the quality of the data. In line with Karr et al. (2006), data quality is defined here as the ability of data to be used quickly, economically and effectively for decision-making and evaluation. In this sense, data quality is a multi-dimensional concept that goes far beyond measurement accuracy and includes aspects such as relevance, completeness, availability, timeliness, meta-information, documentation and, above all, context-dependent expertise (Duke-Margolis 2018, 2019). In official statistics, relevance, accuracy and reliability, timeliness and punctuality, coherence and comparability, and accessibility and clarity are defined as dimensions of data quality (European Statistical System 2019).

Increasing automation of data collection, e.g., through sensor technology, may increase measurement accuracy in a cost-effective and simple way. Whether this will achieve the expected improvement in data quality remains to be checked in each individual application. Missing values are a common problem of data analyses. In statistics, a variety of methods have been developed to deal with these, including imputation procedures, or methods of data enhancement (Rubin 1976; Seaman and White 2013; Van Buuren 2018). The AI approach of ubiquitous data collection allows the existence of redundant data, which can be used in a preprocessing step with appropriate context knowledge to complete incomplete data sets. However, this requires a corresponding integration of context knowledge into the data extraction process.

The data-hungry decision-making processes of AI and statistics are subject to a high risk with regard to relevance and timeliness, since they are implicitly based on the assumption that the patterns hidden in the data should perpetuate themselves in the future. In many applications, this leads to an undesirable entrenchment of existing stereotypes and resulting disadvantages, e.g., in the automatic granting of credit or the automatic selection of applicants. A specific example is given by the gender bias in Amazon’s AI recruiting tool (Dastin 2018).

In the triad ‘experiment - observational study - convenience sample (data lake)’, the field of AI, with regard to its data basis, is moving further and further away from the classical ideal of controlled experimental data collection to an exploration of given data based on pure associations. However, only controlled experimental designs guarantee an investigation of causal questions. This topic will be discussed in more detail in Sect. 5. Causality is crucial if the aim of the analysis is to explain relationships such as the function \(g: X \rightarrow Y\) linking the feature vector \(\{x_1,\dots , x_n\} \subset X\) to the outcome \(\{y_1, \dots , y_n\} \subset Y\). There are, however, other situations where one might not be primarily interested in causal conclusions. Good prediction, for example, can also be obtained by using variables that are not themselves causally related to the outcome but strongly correlated with some causal predictor instead.

Exploratory data analysis (Tukey 1962) provides a broad spectrum of tools to visualize the empirical distributions of the data and to derive corresponding key figures. This can be used in preprocessing to detect anomalies or to define ranges of typical values in order to correct input or measurement errors and to determine standard values. In combination with standardization in data storage, data errors in the measurement process can be detected and corrected at an early stage. This way, statistics helps to assess data quality with regard to systematic, standardized and complete recording. Survey methodology primarily focuses on data quality. The insights gained in statistical survey research to ensure data quality with regard to internal and external validity provide a profound foundation for corresponding developments in the context of AI. Furthermore, various procedures for imputing missing data are known in statistics, which can be used to complete the data depending on the existing context and expertise (Rubin 1976; Seaman and White 2013; Van Buuren 2018). Statisticians have dealt intensively with the treatment of missing values under different missingness mechanisms [non-response; missing not at random, missing at random, missing completely at random (Rubin 1976; Molenberghs et al. 2014)], as well as with selection bias and measurement error (Keogh et al. 2020; Shaw et al. 2020).
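
As a simplified illustration of these ideas (the missingness mechanism and imputation model are chosen for demonstration only), the following sketch generates data that are missing at random, imputes them with a model-based procedure from scikit-learn, and compares the resulting estimate of a mean with a complete-case analysis:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(6)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # x2 is correlated with x1

# Missing at random: the probability that x2 is missing depends on the observed x1
p_missing = 1 / (1 + np.exp(-2 * x1))
x2_obs = np.where(rng.uniform(size=n) < p_missing, np.nan, x2)

complete_case_mean = x2_obs[~np.isnan(x2_obs)].mean()
imputed = IterativeImputer(random_state=0).fit_transform(np.column_stack([x1, x2_obs]))

print(f"true mean of x2:             {x2.mean():+.3f}")
print(f"complete-case mean (biased): {complete_case_mean:+.3f}")
print(f"mean after imputation:       {imputed[:, 1].mean():+.3f}")
```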

Another point worth mentioning is parameter tuning, i.e. the determination of so-called hyperparameters, which control the learning behavior of ML algorithms: comprehensive parameter tuning of methods in the AI context often requires very large amounts of data. For smaller data volumes it is almost impossible to use such procedures. However, certain model-based (statistical) methods can still be used in this case (Richter et al. 2019).
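
The following sketch illustrates standard cross-validated hyperparameter tuning with scikit-learn (the data and the tuning grid are illustrative); model-based alternatives in the spirit of Richter et al. (2019) replace the exhaustive grid by a statistical surrogate model of the tuning surface.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=7)

# Hyperparameters controlling the learning behavior of the random forest
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=7),
    param_grid, cv=5, scoring="accuracy",
)
search.fit(X, y)
print("best hyperparameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```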

5 Distinguishing between causality and association

Only a few decades ago, the greatest challenge of AI research was to program machines to associate a potential cause with a set of observable feature values, e.g. through Bayesian networks (Pearl 1988). The rapid development of AI in recent years (both in terms of the theory and methodology of statistical learning processes and the computing power of computers) has led to a multitude of algorithms and methods that have now mastered this task. Examples are deep learning methods, which are used in robotics (Levine et al. 2018) and autonomous driving (Teichmann et al. 2018), as well as in computer-aided detection and diagnostic systems [e.g., for breast cancer diagnosis (Burt et al. 2018)], drug discovery in pharmaceutical research (Chen et al. 2018) and agriculture (Kamilaris and Prenafeta-Boldú 2018). With their often high predictive power, AI methods can uncover structures and relationships in large volumes of data based on associations. Due to their excellent performance on large data sets, AI methods are also frequently used in medicine to analyze register and observational data that have not been collected within the strict framework of a randomized study design (Sect. 3). However, the discovery of correlations and associations (especially in this context) is not equivalent to establishing causal claims.

An important step in the further development of AI is therefore to replace associational argumentation with causal argumentation. Pearl (2010) describes the difference as follows: ‘An associational concept is any relationship that can be defined in terms of a joint distribution of observed variables, and a causal concept is any relationship that cannot be defined from the distribution alone.’

Even the formal definition of a causal effect is not trivial. The fields of statistics and clinical epidemiology, for example, use the Bradford Hill criteria (Hill 1965) and the counterfactual framework introduced by Rubin (1974). The central problem in observational data is covariate effects, which, in contrast to the randomized controlled trial, are not excluded by design and whose (non-)consideration leads to distorted estimates of causal effects. In this context, a distinction must be made between confounders, colliders and mediators (Pearl 2009). Confounders are unobserved or unconsidered variables that influence both the exposure and the outcome, see Fig. 4a. This can distort the estimated exposure effect if the data are analyzed naively. Fisher already identified this problem in his book ‘The Design of Experiments’, published in 1935. A formal definition was developed in the field of epidemiology in the 1980s (Greenland and Robins 1986). Later, graphical criteria such as the back-door criterion (Greenland et al. 1999; Pearl 1993) were developed to define the term confounding.

In statistics, the problem of confounding is taken into account either in the design (e.g., randomized study, stratification, etc.) or evaluation [propensity score methods (Cochran and Rubin 1973), marginal structural models (Robins et al. 2000), graphical models (Didelez 2007)]. In this context, it is interesting to note that randomized studies (which have a long tradition in the medical field) have recently been increasingly used in econometric studies (Athey and Imbens 2017; Duflo et al. 2007; Kohavi et al. 2020). In the case of observational data, econometrics has made many methodological contributions to the identification of treatment effects, e.g., via the potential outcome approach (Rosenbaum 2017, 2002, 2010; Rubin 1974, 2006) as well as the work on policy evaluation (Heckman 2001).

Fig. 4 Covariate effects in observational data, according to Catalogue of bias collaboration (2019)

In contrast to confounders, colliders and mediators lead to distorted estimates of causal effects precisely when they are taken into account during estimation. Whereas colliders represent common consequences of treatment and outcome (Fig. 4b), mediators are variables that represent part of the causal mechanism by which the treatment affects the outcome (Fig. 4c). Especially in the case of longitudinal data, it is therefore necessary to differentiate, in a theoretically informed manner, which relationships the covariates in the observed data have with the treatment and outcome variables, in order to avoid bias in the causal effect estimates caused by (not) taking them into account.
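
The following simulation (hypothetical linear relationships) illustrates both situations: adjusting for a confounder removes the bias in the estimated treatment effect, whereas adjusting for a collider introduces a bias that is absent in the unadjusted analysis.

```python
import numpy as np

rng = np.random.default_rng(8)
n, true_effect = 100_000, 1.0

def ols_slope(y, covariates):
    """Least-squares coefficient of the first covariate (intercept included)."""
    design = np.column_stack([covariates, np.ones(len(y))])
    return np.linalg.lstsq(design, y, rcond=None)[0][0]

# (a) Confounder C influences both treatment T and outcome Y
C = rng.normal(size=n)
T = 0.8 * C + rng.normal(size=n)
Y = true_effect * T + 1.5 * C + rng.normal(size=n)
print("confounder: unadjusted", round(ols_slope(Y, T), 2),
      "| adjusted for C", round(ols_slope(Y, np.column_stack([T, C])), 2))

# (b) Collider K is a common consequence of treatment T and outcome Y
T = rng.normal(size=n)
Y = true_effect * T + rng.normal(size=n)
K = T + Y + rng.normal(size=n)
print("collider:   unadjusted", round(ols_slope(Y, T), 2),
      "| adjusted for K", round(ols_slope(Y, np.column_stack([T, K])), 2))
```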

By integrating appropriate statistical theories and methods into AI, it will be possible to answer causal questions and simulate interventions. In medicine, e.g., questions such as ‘What would be the effect of a general smoking ban on the German health care system?’ could then be investigated and reliable statements could be made, even without randomized studies, which would not be feasible here. Pearl’s idea goes beyond the use of ML methods in causal analyses (which are used, for example, in connection with targeted learning (Van der Laan and Rose 2011) or causal random forests (Athey and Imbens 2015)). His vision is rather to integrate the causal framework (Pearl 2010) described by him with ML algorithms to enable machines to draw causal conclusions and simulate interventions.

The integration of statistical methods to detect causality in AI also contributes to increasing its transparency and thus the acceptance of AI methods, since a reference to probabilities or statistical correlations in the context of an explanation is not as effective as a reference to causes and causal effects (Miller 2019).

6 Statistical approaches for evaluating uncertainty and interpretability

Uncertainty quantification is often neglected in AI applications. One reason may be the misconception discussed above that ‘Big Data’ automatically leads to exact results, making uncertainty quantification redundant. Another key reason is the complexity of the methods, which hampers the construction of statistically valid uncertainty assessments. However, most statisticians would agree that any comprehensive data analysis should contain methods to quantify the uncertainty of estimates and predictions. Its importance is also stressed by the American statistician David B. Dunson, who writes: ‘it is crucial to not over-state the results and appropriately characterize the (often immense) uncertainty to avoid flooding the scientific literature with false findings.’ (Dunson 2018).

In fact, in order to achieve the main goal of highly accurate predictions, assumptions about underlying distributions and functional relationships are deliberately dropped in AI applications. On the one hand, this allows for a greater flexibility of the procedures. On the other hand, however, this also complicates an accurate quantification of uncertainty, e.g., to specify valid prediction and confidence regions for target variables and parameters of interest. As Bühlmann and colleagues put it: ‘The statistical theory serves as guard against cheating with data: you cannot beat the uncertainty principle.’ (Bühlmann and van de Geer 2018). In recent years, proposals for uncertainty quantification in AI methods have already been developed by invoking Bayesian approximations, bootstrapping, jackknifing and other cross-validation techniques, Gaussian processes, Monte Carlo dropout etc., see, e.g., Gal and Ghahramani (2016), Garnelo et al. (2018), Osband et al. (2016), Srivastava et al. (2014), Wager et al. (2014). However, their theoretical validity (e.g., that a prediction interval actually covers future values 95% of the time) has either not been demonstrated yet or has only been proven under very restrictive or at least partially unrealistic assumptions.
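
As a simple illustration of resampling-based uncertainty quantification (a sketch only; as noted above, rigorous coverage guarantees for such intervals around complex learners remain an open issue), the following code constructs a bootstrap percentile interval for the prediction of a random forest at a single query point. Note that this interval reflects the estimation uncertainty of the fitted function, not the noise of a future observation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(9)
n, p, n_boot = 500, 5, 100

X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)
x_new = np.zeros((1, p))   # query point for which an interval is required

# Bootstrap the entire fitting procedure and collect the predictions at x_new
preds = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)   # resample observations with replacement
    forest = RandomForestRegressor(n_estimators=100, random_state=b)
    preds[b] = forest.fit(X[idx], y[idx]).predict(x_new)[0]

lower, upper = np.percentile(preds, [2.5, 97.5])
print(f"bootstrap 95% percentile interval at x_new: [{lower:.2f}, {upper:.2f}]")
```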

In contrast, algorithmic methods could be embedded in statistical models. While potentially less flexible, they permit a better quantification of the underlying uncertainty by specifying valid prediction and confidence intervals or allow for a better interpretation of the results. We give two examples: In time-to-event analyses mathematically valid simultaneous confidence bands for cumulative incidence functions can be constructed by combinations of nonparametric estimators of Kaplan-Meier or Aalen-Johansen-type and algorithmic resampling (Bluhmki et al. 2018; Dobler et al. 2017). Similarly, in the context of time series prediction, hybrid combinations of artificial neural networks with ARIMA models or within hierarchical structures allow for better explainability (Aburto and Weber 2007; Wickramasuriya et al. 2019).

Moreover, the estimated parameters of many AI approaches (such as deep learning) are difficult to interpret. Pioneering work from computer science on this topic is, for example, Valiant (1984, 2013), for which Leslie Valiant was awarded the Turing Award in 2010. Further research is nevertheless needed to improve interpretability. This also includes uncertainty quantification of patterns identified by an AI method, which relies heavily on statistical techniques. A tempting approach to achieve more interpretable AI methods is the use of auxiliary models. These are comparatively simple statistical models which, after adaptation of a deep learning approach, describe the most important patterns represented by the AI method and can potentially also be used to quantify uncertainty (Molnar 2019; Peltola 2018; Ribeiro et al. 2016a, b). In fact, as in computational and statistical learning theory (Györfi et al. 2002; Kearns and Vazirani 1994; Vapnik 1998), statistical methods and AI learning approaches can (and should) complement each other. Another important aspect is model complexity, which can, e.g., be captured by entropies (such as VC dimensions) or compression barriers (Langford 2005). These concepts as well as different forms of regularization (Tibshirani 1996; Wager et al. 2013; Zaremba et al. 2014), i.e. the restriction of the parameter space, make it possible to recognize or even correct overfitting of a learning procedure. Here, the application of complexity-reducing concepts can be seen as a direct implementation of the lex parsimoniae principle and often increases the interpretability of the resulting models (Ross et al. 2017; Tibshirani 1997). In fact, regularization and complexity-reducing concepts are an integral part of many AI methods. However, they are also basic principles of modern statistics, which were already proposed before their introduction to AI; examples are given in connection with empirical Bayesian or shrinkage methods (Röver and Friede 2020). Beyond this, AI and statistics have numerous concepts in common, which invites an exchange of methods between the two fields.
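
A brief sketch of the regularization principle on simulated high-dimensional data (the penalty grid is illustrative): increasing the ridge penalty restricts the parameter space and typically reduces overfitting, visible as a shrinking gap between training and test error.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(10)
n, p = 100, 80                        # few observations, many features
X = rng.normal(size=(n, p))
beta = rng.normal(size=p) * (rng.uniform(size=p) < 0.1)   # sparse true coefficients
y = X @ beta + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for alpha in [1e-4, 1.0, 10.0]:       # strength of the ridge penalty
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    err_tr = mean_squared_error(y_tr, model.predict(X_tr))
    err_te = mean_squared_error(y_te, model.predict(X_te))
    print(f"alpha = {alpha:6.1e}: train MSE = {err_tr:.2f}, test MSE = {err_te:.2f}")
```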

Furthermore, uncertainty aspects also apply to quality criteria (e.g., accuracy, sensitivity and specificity) of AI algorithms. The corresponding estimators are also random but their uncertainty is usually not quantified at all.

Statistics can help to increase the validity and interpretability of AI methods by providing contributions to the quantification of uncertainty. To achieve this, we can assume specific probabilistic and statistical models or dependency structures which allow comprehensive mathematical investigations (Athey et al. 2019; Bartlett et al. 2004; Devroye et al. 2013; Györfi et al. 2002; Scornet et al. 2015; Wager and Athey 2018; Ramosaj and Pauly 2019a), e.g., by investigating robustness properties, proving asymptotic consistency or (finite) error bounds. On the other hand this also includes the elaboration of (stochastic) simulation designs (Morris et al. 2019) and the specification of easy to interpret auxiliary statistical models. Finally, it allows for a detailed analysis of quality criteria of AI algorithms.

7 Conclusion and discussion

AI has been a growing research area for years, and its development will probably continue in the coming decades. In addition to ethical and legal problems, there are still many open questions regarding the collection and processing of data. Statistical methods must be considered an integral part of AI systems, from the formulation of the research questions, through the development of the research design and the analysis, up to the interpretation of the results. Particularly in the field of methodological development, statistics can, e.g., serve as a multiplier and strengthen the scientific exchange by establishing broad and strongly interconnected networks between users and developers.

In the context of clinical trials, statistics also provides guidelines for important aspects of trial design, data analysis and reporting. Many of these guidelines are currently being extended for AI applications, e.g. the TRIPOD statement (Collins et al. 2015; Collins and Moons 2019) or the CONSORT and SPIRIT guidelines (Liu et al. 2020; Rivera et al. 2020). Moreover, initiatives such as STRATOS (STRengthening Analytical Thinking for Observational Studies, https://stratos-initiative.org/) aim to provide guidance for applied statisticians and other data analysts with varying levels of statistical education.

As a core element of AI, statistics is the natural partner for other disciplines in teaching, research and practice. Therefore, it is advisable to incorporate statistical aspects into AI teaching and to bridge the gap between the two disciplines. This begins with school education, where statistics and computer science should be integral elements of the curricula, and continues with higher education as well as professional development and training. By developing professional networks, participating methodologists can be brought together with users/experts to establish or maintain a continuous exchange between the disciplines. In addition to AI methods, these events should also cover the topics of data curation, management of data quality and data integration.

Statistics is a broad discipline that cuts across the sciences. Statisticians provide knowledge and experience of all aspects of data evaluation: starting with the research question, through design and analysis, to the interpretation. In particular, the following contributions of statistics to the field of artificial intelligence can be summarized:

  1. Methodological development: The development of AI systems and their theoretical underpinning has benefited greatly from research in computer science and statistics, and many procedures have been developed by statisticians. Recent advances such as extreme learning machines show that statistics also provides important contributions to the design of AI systems, for example, by improved learning algorithms based on penalized or robust estimation methods.

  2. Planning and design: Statistics can help to optimize data collection or preparation (sample size, sampling design, weighting, restriction of the data set, design of experiments, etc.) for subsequent evaluation with AI methods. Furthermore, the quality measures of statistics and their associated inference methods can help in the evaluation of AI models.

  3. Assessment of data quality and data collection: Exploratory data analysis provides a wide range of tools to visualize the empirical distribution of the data and to derive appropriate metrics, which can be used to detect anomalies or to define ranges of typical values, to correct input errors, to determine norm values and to impute missing values. In combination with standardization in data storage, errors in the measurement process can be detected and corrected at an early stage. With the help of model-based statistical methods, comprehensive parameter tuning is also possible, even for small data sets.

  4. Differentiation of causality and associations: In statistics, methods for dealing with covariate effects are known. Here, it is important to differentiate, in a theoretically informed manner, between the different relationships covariates can have to treatment and outcome in order to avoid bias in the estimation of causal effects. Pearl’s causal framework enables the analysis of causal effects and the simulation of interventions. The integration of causal methods into AI can also contribute to the transparency and acceptance of AI methods.

  5. Assessment of certainty or uncertainty in results: Statistics can help to enable or improve the quantification of uncertainty in and the interpretability of AI methods. By adopting specific statistical models, mathematical proofs of validity can also be provided. In addition, limitations of the methods can be explored through (stochastic) simulation designs.

  6. Conscientious implementation of points 2 to 5, including a previously defined evaluation plan, also counteracts the replication crisis (Pashler and Wagenmakers 2012) in many scientific disciplines. This aspect does not only hold for AI applications, but generally concerns all empirical studies.

  7. Education, advanced vocational training and public relations: With its specialized knowledge, statistics is the natural partner for other disciplines in teaching and training. Especially in the further development of methods of artificial intelligence, statistics can strengthen scientific exchange.

With respect to some points raised in this paper, a few comments are in order. First, as mentioned in the introduction, there is no unique definition of AI or ML in the literature, and distinguishing between the two is not easy. A broader consensus in the scientific community is necessary to facilitate common discussions. Second, as an anonymous referee commented, it might be helpful to distinguish between different frameworks concerning data and problems. The proposal is to distinguish between (a) problems and data with random or partly random aspects and (b) problems with a deterministic background such as graph-theoretical structures or optimum configurations. While the first is a natural field of application for statistics, the second may also benefit from statistical approaches, e.g. concerning robustness or sensitivity. A related issue concerns the fact that the evaluation of AI methods must be seen in the context of the corresponding application. In the life sciences and medicine, we often assume the existence of some underlying ‘ground truth’ which needs to be estimated. Thus, modeling concepts such as bias or accuracy can be used for evaluation. In other areas such as economics or marketing, the idea rather is to derive a somewhat ‘useful’ or ‘effective’ strategy from the data. In such a situation, statistics can still be used for evaluation, for example by making predictions and comparing their accuracy with the observed data.

Another important aspect concerns the combination of data and results obtained from different studies. In evidence based medicine, systematic reviews and meta-analyses play a key role in combining results from multiple studies to give a quantitative summary of the literature. In contrast, meta-analysis methods to combine results from AI applications have not been developed yet. Initiatives to enable the sharing of data and models in AI include federated learning and software frameworks such as DataSHIELD (DataSHIELD 2018; Gaye et al. 2014), which enables remote and nondisclosive analysis of sensitive data, see also Bonofiglio et al. (2020). Thus, both fields could profit from an exchange of methods in this context.

The objective of statistics related to AI must be to facilitate or enable the interpretation of data. As Pearl puts it: ‘Data alone are hardly a science, regardless how big they get and how skillfully they are manipulated’ (Pearl 2018). What is important is the knowledge gained that will enable future interventions.