Is there a role for statistics in artificial intelligence?

The research on and application of artificial intelligence (AI) has triggered a comprehensive scientific, economic, social and political discussion. Here we argue that statistics, as an interdisciplinary scientific field, plays a substantial role both for the theoretical and practical understanding of AI and for its future development. Statistics might even be considered a core element of AI. With its specialist knowledge of data evaluation, starting with the precise formulation of the research question and passing through a study design stage on to analysis and interpretation of the results, statistics is a natural partner for other disciplines in teaching, research and practice. This paper aims at contributing to the current discussion by highlighting the relevance of statistical methodology in the context of AI development. In particular, we discuss contributions of statistics to the field of artificial intelligence concerning methodological development, planning and design of studies, assessment of data quality and data collection, differentiation of causality and associations and assessment of uncertainty in results. Moreover, the paper also deals with the equally necessary and meaningful extension of curricula in schools and universities.


Introduction
The research on and application of artificial intelligence (AI) has triggered a comprehensive scientific, economic, social and political discussion. Here we argue that statistics, as an interdisciplinary scientific field, plays a substantial role, both for the theoretical and practical understanding of AI and for its further development.
Contrary to public perception, AI is not a new phenomenon. AI was already mentioned in 1956 at the Dartmouth Conference [86,132], and the first data-driven algorithms such as the Perceptron [113] and backpropagation [63] were developed in the 1950s and 1960s. The Lighthill Report in 1973 passed a predominantly negative judgment on AI research in Great Britain, and as a result financial support for AI research was almost completely withdrawn (the so-called first AI winter). The following phase of predominantly knowledge-based development ended in 1987 with the so-called second AI winter. In 1988, Judea Pearl published his book 'Probabilistic Reasoning in Intelligent Systems', for which he received the Turing Award in 2011 [94]. Since the beginning of the 1990s, AI has been developing again, with major breakthroughs such as Support Vector Machines [23], Random Forest [13], Bayesian Methods [155], Boosting and Bagging [43,12], Deep Learning [121] and Extreme Learning Machines [56].
Today, AI plays an increasingly important role in many areas of life. International organizations and national governments have currently positioned themselves or introduced new regulatory frameworks for AI. Examples are, among others, the AI strategy of the German government [16] and the statement of the Data Ethics Commission [25] from 2019. Similarly, the European Commission recently published a white paper on AI [35]. Furthermore, regulatory authorities such as the US Food and Drug Administration (FDA) are now also dealing with AI topics and their evaluation. In 2018, for example, the electrocardiogram function of the Apple Watch was the first AI application to be approved by the FDA [80].
There is no unique and comprehensive definition of artificial intelligence. Two concepts are commonly used distinguishing weak and strong AI. Searle (1980) defined them as follows: 'According to weak AI, the principal value of the computer in the study of the mind is that it gives us a very powerful tool. [...] But according to strong AI, the computer is not merely a tool in the study of the mind; rather, the appropriately programmed computer really is a mind [...]' [124]. In the following, we adopt the understanding of the German Government, which is the basis of Germany's AI strategy [16]. In this sense, (weak) AI 'is focused on solving specific application problems based on methods from mathematics and computer science, whereby the developed systems are capable of self-optimization' [16]. An important aspect thus is that AI systems are self-learning. Consequently, we will focus on the data-driven aspects of AI in this paper. In addition, there are many areas in AI that deal with the processing of and drawing inference from symbolic data [120], which will not be discussed here.
As for AI, there is neither a single definition nor a uniform assignment of methods to the field of machine learning (ML) in literature and practice. Based on Simon's definition from 1983 [130], learning describes changes of a system in such a way that a similar task can be performed more effectively or efficiently the next time it is performed.
Often the terms AI and ML are mentioned along with Big Data [50] or Data Science, sometimes even used interchangeably. However, neither are AI methods necessary to solve Big Data problems, nor are methods from AI only applicable to Big Data. Data Science, on the other hand, is usually considered as an intersection of computer science, statistics and the respective scientific discipline. Therefore, it is not bound to certain methods or certain data conditions.
This paper aims at contributing to the current discussion by highlighting the relevance of statistical methodology in the context of AI development and application. Statistics can make important contributions to a more successful and secure use of AI systems, for example with regard to
1. Design: bias reduction; validation; representativity
2. Assessment of data quality and data collection: standards for the quality of diagnostic tests and audits; dealing with missing values
3. Differentiation between causality and associations: consideration of covariate effects; answering causal questions; simulation of interventions
4. Assessment of certainty or uncertainty in results: increasing interpretability; mathematical validity proofs or theoretical properties in certain AI contexts; providing stochastic simulation designs; accurate analysis of the quality criteria of algorithms in the AI context
The remainder of the paper is organized as follows: First, we present an overview of AI applications and methods in Section 2. We continue by expanding on points 1-4 in Sections 3-6. Furthermore, we discuss the increased need for teaching and further education targeting the increase of AI-related literacy at all educational levels in Section 7. We conclude with Section 8.

Applications and Methods of AI
Depending on the specific AI task (e.g., prediction, explanation, classification), different approaches are used, ranging from regression to deep learning algorithms. Important categories of AI approaches are supervised learning, unsupervised learning and reinforcement learning [135]. Many AI systems learn from training data with predefined solutions such as true class labels or responses. This is called supervised learning, whereas unsupervised learning does not provide solutions. In contrast, reinforcement learning has no predefined data and learns from errors within an agent-environment system, where Markov decision processes from probability theory play an important role. The input data can be measured values, stock market prices, audio signals, climate data or texts, but may also describe very complex relationships, such as chess games. Table 1 gives an overview of examples of statistical methods and models used in AI systems. The models could be further classified as data-based and theory-based models. However, we have omitted this classification in the table.
Even though many of the contributions to AI systems originate from computer science, statistics has played an important role throughout. Early examples occurred in the context of realizing the relationship between backpropagation and nonlinear least squares methods, see, e.g., [148]. Machine learning (ML) plays a distinctive role within AI. For instance, important ML methods such as random forests [13] or support vector machines [23] were developed by statisticians. Others, like radial basis function networks [21], can also be considered and studied as nonlinear regression models. Recent developments such as extreme learning machines or broad learning systems [19] have close links to multiple multivariate and ridge regression, i.e., to statistical methods. The theoretical validity of machine learning methods, e.g., through consistency statements and generalization bounds [51,144], also requires substantial knowledge of mathematical statistics and probability theory.
To capture the role and relevance of statistics, we consider the entire process of establishing an AI application. As illustrated in Figure 1, various steps are necessary to examine a research question empirically. For more details on these steps see, e.g., [149]. Starting with the precise formulation of the research question the process then runs through a study design stage (including sample size planning and bias control) to the analysis. Finally, the results must be interpreted. AI often focuses on the step of data analysis while the other stages receive less attention or are even ignored. This may result in critical issues and possibly misleading interpretations, such as sampling bias or the application of inappropriate analysis tools requiring assumptions not met by the chosen design. In many applications, the input data for the AI techniques are very high-dimensional, i.e. numerous variables (also called features) are observed with diverse ranges of possible values. In addition, non-linear relationships with complex interactions are often considered for prediction. However, even with sample sizes in the order of millions, the problem of the curse of dimensionality arises [8], because data is thin and sparse in a high-dimensional space. Therefore, learning the structure of the high-dimensional space from these thin data typically requires an enormous amount of training data.
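The sparsity of data in high-dimensional spaces can be made concrete with a small calculation. The following Python sketch assumes, purely for illustration, that all features are uniformly distributed on the unit cube; the function name edge_length is hypothetical. It computes how long the edge of a sub-cube must be in order to contain a given fraction of the data.

```python
def edge_length(fraction, dim):
    """Edge length of a sub-cube of [0, 1]^dim that holds `fraction` of
    uniformly distributed points (illustrative assumption: uniform features)."""
    return fraction ** (1.0 / dim)


# Share of each feature's range needed to capture 1% of the data.
for dim in (1, 2, 10, 100):
    print(f"dim={dim:>3}: edge length = {edge_length(0.01, dim):.3f}")
```

Under this assumption, a neighbourhood holding 1% of the data covers 1% of the range in one dimension, but well over half of the range of every feature in ten dimensions, so that 'local' neighbourhoods are no longer local.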
AI has made remarkable progress in various fields of application. These include automated face recognition, automated speech recognition and translation [5], object tracking in film material, autonomous driving, and the field of strategy games such as chess or go, where computer programs now beat the best human players [64,129].
Especially for tasks in speech recognition as well as text analysis and translation, Hidden Markov models from statistics are used and further developed with great success [58,66] because they are capable of representing grammars. Nowadays, automatic language translation systems can even translate languages such as Chinese into languages of the European language family in real time and are used, for example, by the EU [34]. Another growing and promising area for AI applications is medicine. Here, AI is used, e.g., to improve the early detection of diseases, for more accurate diagnoses, or to predict acute events [17,20]. Official statistics uses AI methods for classification as well as for recognition, estimation and/or imputation of relevant characteristic values of statistical units [134,127,103,101]. In economics and econometrics, AI methods are also applied and further developed, for example, to draw conclusions about macroeconomic developments from large amounts of data on individual consumer behavior [79,89].
Table 1 Overview of statistical methods and models used in AI systems [33,44,57,77]
Purpose | ML approach | Statistical methods & models (examples) | Examples
Recognizing similarities in data | Unsupervised learning | Cluster analysis, factor analysis | Personalized medicine [90], customer analysis, development of psychometric tests [125]
Prediction of events/conditions | Supervised learning: regression ML systems | Data-driven model selection | Sales forecast, economic development [128], weather/climate forecast [40], forecast of migration movements [108]
Explanation of events/conditions | | | … [76,93], diagnosis and differential diagnosis [37,42,153]
Despite these positive developments that also dominate the public debate, some caution is advisable. There are a number of reports about the limits of AI, e.g., in the case of a fatal accident involving an autonomously driving vehicle [151]. Due to the potentially serious consequences of false positive or false negative decisions in AI applications, careful consideration of these systems is required [1]. This is especially true in applications such as video surveillance of public spaces. For instance, a pilot conducted by the German Federal Police at the Südkreuz suburban railway station in Berlin has shown that automated facial recognition systems for identification of violent offenders currently have false acceptance rates of 0.67% (test phase 1) and 0.34% (test phase 2) on average [15]. This means that almost one in 150 (or one in 294) passers-by is falsely classified as a violent offender. In medicine, wrong decisions can also have drastic and negative effects, such as an unnecessary surgery and chemotherapy in the case of wrong cancer diagnoses. Corresponding test procedures for medicine are currently being developed by regulators such as the US FDA [39].
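The arithmetic behind these figures, and its practical consequence, can be sketched in a few lines of Python. The two false acceptance rates are those reported above; the prevalence of wanted persons among passers-by and the detection rate (sensitivity) are hypothetical values chosen only for illustration, and the function names are likewise assumptions.

```python
def one_in(rate):
    """Convert an error rate into 'one in N' form."""
    return 1.0 / rate


def positive_predictive_value(prevalence, sensitivity, false_acceptance_rate):
    """Probability that a flagged person is a true match (Bayes' theorem)."""
    true_alarms = prevalence * sensitivity
    false_alarms = (1.0 - prevalence) * false_acceptance_rate
    return true_alarms / (true_alarms + false_alarms)


# False acceptance rates reported for the two test phases.
for phase, far in (("phase 1", 0.0067), ("phase 2", 0.0034)):
    print(f"{phase}: about one in {one_in(far):.0f} passers-by falsely flagged")

# ASSUMPTION (not from the pilot): 1 in 10,000 passers-by is actually wanted,
# and the system recognizes 80% of them.
ppv = positive_predictive_value(prevalence=1e-4, sensitivity=0.8,
                                false_acceptance_rate=0.0067)
print(f"share of alarms that are correct: {ppv:.1%}")
```

With these assumed values, only a little more than one alarm in a hundred would point to a person who is actually wanted, which illustrates why a seemingly small false acceptance rate can still be problematic when the event of interest is rare.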
Finally, ethical questions arise regarding the application of AI systems [25]. Apart from fundamental considerations (which decisions should machines make for us and which not?), even a principally socially accepted application with justifiable false decision rates can raise serious ethical questions. This is particularly true if the procedure in use discriminates against certain groups or minorities (e.g., the application of AI-based methods of predictive policing may induce racial profiling [10]) or if there is no sufficient causality.

Study design
The design of a study is the basis for the validity of the conclusions. However, AI applications often use data that were collected for a different purpose (so-called secondary data). A typical case is the scientific use of routine data. For example, the AI models in a recent study about predicting medical events (such as hospitalization) are based on medical billing data [78]. Another typical case is that of convenience samples, that is, samples that are not randomly drawn but instead depend on 'availability'. Well-known examples are online questionnaires, which only reach those people who visit the corresponding homepage and take the time to answer the questions. Statistics distinguishes between two types of validity [126]:
1. Internal validity is the ability to attribute a change in an outcome of a study to the investigated causes. In clinical research, e.g., this type of validity is ensured through randomization in controlled trials. Internal validity in the context of AI and ML can also refer to avoiding systematic bias (such as systematically underestimated risks).
2. External validity is the ability to transfer the observed effects and relationships to larger or different populations, environments, situations, etc. In the social sciences (e.g., in the context of opinion research), an attempt to achieve this type of validity is survey sampling, which comprises sampling methods that aim at representative samples in the sense of [45,67,68,69,70].
The naive expectation that sufficiently large data automatically leads to representativity is false [81,82]. A prominent example is Google Flu [73], where flu outbreaks were predicted on the basis of search queries: it turned out that the actual prevalence was overestimated considerably. Another example is Microsoft's chatbot Tay [26,152], which was designed to mimic the speech pattern of a 19-year-old American girl and to learn from interactions with human users on Twitter: after only a short time, the bot posted offensive and insulting tweets, forcing Microsoft to shut down the service just 16 hours after it started. And yet another example is the recently published Apple Heart Study [100], which examined the ability of Apple Watch to detect atrial fibrillation: there were more than 400,000 participants, but the average age was 41 years, which is particularly problematic in view of atrial fibrillation occurring almost exclusively in people over 65 years of age.
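Why sample size alone does not cure a lack of representativity can be illustrated with a short calculation. The following Python sketch uses entirely hypothetical numbers (two age strata with assumed population shares, sample shares and prevalences); it is not a reanalysis of any of the studies mentioned above. It compares the prevalence in the target population with the prevalence expected in a convenience sample in which older people are under-represented.

```python
import math

# ASSUMED strata (hypothetical numbers, not taken from any study):
# share of the target population, share of the convenience sample,
# and prevalence of the condition within the stratum.
strata = {
    "under_65": {"population": 0.85, "sample": 0.97, "prevalence": 0.01},
    "65_plus": {"population": 0.15, "sample": 0.03, "prevalence": 0.08},
}


def expected_prevalence(share_key):
    """Prevalence obtained when strata are mixed according to `share_key`."""
    return sum(s[share_key] * s["prevalence"] for s in strata.values())


def standard_error(p, n):
    """Binomial standard error of an estimated proportion."""
    return math.sqrt(p * (1.0 - p) / n)


true_prevalence = expected_prevalence("population")
sample_prevalence = expected_prevalence("sample")
bias = sample_prevalence - true_prevalence

for n in (1_000, 400_000):
    se = standard_error(sample_prevalence, n)
    print(f"n={n:>7}: estimate {sample_prevalence:.4f} +/- {1.96 * se:.4f}, "
          f"bias {bias:+.4f}")
```

In this sketch the bias is the same for every sample size, while the nominal margin of error shrinks as n grows. A very large non-representative sample therefore yields an estimate that is precise but wrong.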
If data collection is not accounted for, spurious correlations and bias (in its many forms, such as selection, attribution, performance, and detection bias) can falsify the conclusions. A classic example of such falsification is Simpson's paradox [131], which describes a reversal of trends when subgroups are disregarded, see Figure 2. Further examples are biases introduced by the way the data are collected, such as length-time bias.
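Simpson's paradox is easily reproduced with a handful of counts. The following Python sketch uses hypothetical success counts for a treated and a control arm in two severity subgroups; the numbers and names are invented for illustration only.

```python
# Hypothetical (successes, patients) per subgroup and study arm.
counts = {
    "mild": {"treated": (18, 20), "control": (64, 80)},
    "severe": {"treated": (40, 80), "control": (8, 20)},
}


def rate(successes, total):
    """Success rate within one cell."""
    return successes / total


def pooled_rate(arm):
    """Success rate of one arm when the subgroups are ignored."""
    successes = sum(group[arm][0] for group in counts.values())
    total = sum(group[arm][1] for group in counts.values())
    return successes / total


for name, group in counts.items():
    print(f"{name:>6}: treated {rate(*group['treated']):.2f}, "
          f"control {rate(*group['control']):.2f}")
print(f"pooled: treated {pooled_rate('treated'):.2f}, "
      f"control {pooled_rate('control'):.2f}")
```

Within each subgroup the treated arm has the higher success rate, yet the pooled comparison favours the control arm, because the treated patients in this constructed example are predominantly severe cases.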
Statistics provides methods and principles for minimizing bias. Examples include the assessment of the risk of bias in medicine [53], stratification, marginal analyses, consideration of interactions, and meta-analyses, and techniques specifically designed for data collection such as (partial) randomization, (partial) blinding, and methods of so-called optimal designs [60]. Statistics also provides designs that allow for the verification of internal and external validity [6,11,109].
Another important factor for the significance and reliability of a study is the sample size [81]. For high-dimensional models, an additional factor is 'sparsity' (see the introduction). Through statistical models and corresponding mathematical approximations or numerical simulations, statisticians can assess the potentials and limits of an AI application for a given number of cases or estimate the necessary number of cases in the planning stage of a study. This is not routine work; instead, it requires advanced statistical training, competence, and experience. Thus, statistics can help in collecting and processing data for subsequent use in AI pipelines. Basic statistical techniques that are relevant for this also include the modeling of data generating processes, restrictions on data sets, and factorial design of experiments, which is a controlled variation of factors that highlights their influence [119]. In addition, the various phases in the development of a diagnostic test are specified in statistics [18]. In many AI applications, however, the final evaluation phase on external data is never reached, since the initial algorithms have been replaced in the meantime. Also, statistical measures of quality such as sensitivity, specificity, and ROC curves are used in the evaluation of AI methods. And finally, statistics can help in the assessment of uncertainty (Section 6).
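The quality measures mentioned here can be computed directly from predicted scores and true labels. The following Python sketch is a minimal illustration on a small invented data set; the labels, scores, threshold and function names are assumptions. The area under the ROC curve is computed via its interpretation as the share of (positive, negative) pairs that the score ranks correctly.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold count as positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """Area under the ROC curve: share of (positive, negative) pairs in which
    the positive case receives the higher score; ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Invented example: 1 = diseased, 0 = healthy, with a classifier score each.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
area = auc(labels, scores)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}, AUC={area:.4f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes the ranking quality over all thresholds. All three are estimates from a finite sample and carry their own uncertainty.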

Assessment of data quality and data collection
'Data is the new oil of the global economy.' According to, e.g., the New York Times [88] or the Economist [137], this credo echoes incessantly through start-up conferences and founder forums. However, this metaphor is both popular and false. First of all, data in this context corresponds to crude oil, which needs further refining before it can be used. In addition, the resource crude oil is limited. 'For a start, while oil is a finite resource, data is effectively infinitely durable and reusable' (Bernard Marr in Forbes, [41]). All the more important is a responsible approach to data processing.  [30] Ensuring data quality is of great importance in all analyses, according to the popular slogan 'Garbage in, garbage out.' As already mentioned in the previous section, we mainly use secondary data in the context of AI. The collection and compilation of secondary data is in general not based on a specific question. Instead, it is collected for other purposes such as accounting or storage purposes. The concept of knowledge discovery in databases [38] very clearly reflects the assumption that data are regarded as a given basis from which information and knowledge can be extracted by AI procedures. This notion is contrary to the traditional empirical research process, in which empirically testable research questions are derived from theoretical questions by conceptualizing and operationalizing. The resulting measurement variables are then collected for this specific purpose. In the context of AI, the process of operationalization is replaced by the ETL process: 'Extract, Transform, Load' [138]. Relevant measurements are to be extracted from the data lake(s), then transformed and finally loaded into the (automated) analysis procedures. The AI procedures are thereby expected to be able to distill relevant influencing variables from high-dimensional data.
The success of this procedure fundamentally depends on the quality of the data. In line with Karr et al. (2006), data quality is defined here as the ability of data to be used quickly, economically and effectively for decision-making and evaluation [61]. In this sense, data quality is a multi-dimensional concept that goes far beyond measurement accuracy and includes aspects such as relevance, completeness, availability, timeliness, meta-information, documentation and, above all, context-dependent expertise [30,31]. In official statistics, relevance, accuracy and reliability, timeliness and punctuality, coherence and comparability, accessibility and clarity are defined as dimensions of data quality [36].
Increasing automation of data collection, e.g., through sensor technology, may increase measurement accuracy in a cost-effective and simple way. Whether this will achieve the expected improvement in data quality remains to be seen in each individual application. Missing values are a common problem of data analyses. In statistics, a variety of methods have been developed to deal with these, including imputation procedures, or methods of data enhancement [117,123,143]. The AI approach of ubiquitous data collection allows the existence of redundant data, which can be used with appropriate context knowledge to complete incomplete data sets. However, this requires a corresponding integration of context knowledge into the data extraction process.
The data-hungry decision-making processes of AI and statistics are subject to a high risk with regard to relevance and timeliness, since they are implicitly based on the assumption that the patterns hidden in the data should perpetuate themselves in the future. In many applications, this leads to an undesirable entrenchment of existing stereotypes and resulting disadvantages, e.g., in the automatic granting of credit or the automatic selection of applicants. A specific example is given by the gender bias in Amazon's AI recruiting tool [24].
In the triad 'experiment - observational study - convenience sample (data lake)', the field of AI, with regard to its data basis, is moving further and further away from the classical ideal of controlled experimental data collection to an exploration of given data based on pure associations. However, only controlled experimental designs guarantee an investigation of causal questions. This topic will be discussed in more detail in Section 5.
Exploratory data analysis provides a broad spectrum of tools to visualize the empirical distributions of the data and to derive corresponding key figures. This can be used in preprocessing to detect anomalies or to define ranges of typical values in order to correct input or measurement errors and to determine standard values. In combination with standardization in data storage, data errors in the measurement process can be detected and corrected at an early stage. This way, statistics helps to assess data quality with regard to systematic, standardized and complete recording. Survey methodology primarily focuses on data quality. The insights gained in statistical survey research to ensure data quality with regard to internal and external validity provide a profound foundation for corresponding developments in the context of AI. Furthermore, various procedures for imputing missing data are known in statistics, which can be used to complete the data depending on the existing context and expertise [117,123,143]. Statisticians have dealt intensively with the treatment of missing values under different missingness mechanisms (non-response, missing not at random, missing at random, missing completely at random [117,84]), as well as with selection bias and measurement error.
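The role of the missingness mechanism can be illustrated with a deliberately simple example. In the following Python sketch, all values are invented and hand-picked so that the arithmetic is exact: values are missing only in group B, i.e., missingness depends on an observed covariate (missing at random). The sketch compares the complete-case mean with a mean obtained after group-wise mean imputation, which serves here as a simple stand-in for the more refined imputation procedures cited above.

```python
from statistics import mean

# Invented data: outcome, observed group membership, and missingness indicator.
true_y = [10.0, 10.0, 12.0, 12.0, 20.0, 20.0, 22.0, 22.0]
group = ["A", "A", "A", "A", "B", "B", "B", "B"]
observed = [True, True, True, True, True, False, True, False]

# Complete-case analysis: simply drop records with a missing outcome.
complete_case_mean = mean(y for y, seen in zip(true_y, observed) if seen)

# Group-wise mean imputation: fill each gap with the observed mean of its group.
group_means = {
    g: mean(y for y, gg, seen in zip(true_y, group, observed)
            if seen and gg == g)
    for g in set(group)
}
imputed = [y if seen else group_means[g]
           for y, g, seen in zip(true_y, group, observed)]
imputed_mean = mean(imputed)

print(f"true mean          : {mean(true_y):.2f}")
print(f"complete-case mean : {complete_case_mean:.2f}")
print(f"group-imputed mean : {imputed_mean:.2f}")
```

In this constructed case, the complete-case mean is too low because group B is under-represented among the observed values, while imputation that uses the group information recovers the true mean. Note that single imputation of this kind understates the uncertainty of the result, and that no such correction is available without further assumptions when data are missing not at random.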
Another point worth mentioning is parameter tuning, i.e. the determination of so-called hyperparameters, which control the learning behavior of ML algorithms: comprehensive parameter tuning of methods in the AI context often requires very large amounts of data. For smaller data volumes it is almost impossible to use such procedures. However, certain model-based (statistical) methods can still be used in this case [106].

Distinction between causality and associations
Only a few decades ago, the greatest challenge of AI was to enable machines to associate a possible cause with a set of observable conditions. The rapid development of AI in recent years (both in terms of the theory and methodology of statistical learning processes and the computing power of computers) has led to a multitude of algorithms and methods that have now mastered this task. One example is deep learning, which is used in robotics [75] and autonomous driving [136], as well as in computer-aided detection and diagnostic systems (e.g., for breast cancer diagnosis [17]), drug discovery in pharmaceutical research [20] and agriculture [59]. With their often high predictive power, AI methods can uncover structures and relationships in large volumes of data based on associations. Due to the excellent performance of AI methods in large data sets, they are also frequently used in medicine to analyze register and observational data that have not been collected within the strict framework of a randomized study design (Section 3). However, the discovery of correlations and associations (especially in this context) is not equivalent to establishing causal claims.
An important step in the further development of AI is therefore to replace associational argumentation with causal argumentation. Pearl [97] describes the difference as follows: 'An associational concept is any relationship that can be defined in terms of a joint distribution of observed variables, and a causal concept is any relationship that cannot be defined from the distribution alone.'
Even the formal definition of a causal effect is not trivial. The fields of statistics and clinical epidemiology, for example, use the Bradford Hill criteria [54] and the counterfactual framework introduced by Rubin [116]. The central problem in observational data is posed by covariate effects, which, in contrast to the randomized controlled trial, are not excluded by design and whose (non-)consideration leads to distorted estimates of causal effects. In this context, a distinction must be made between confounders, colliders, and mediators [96]. Confounders are unobserved or unconsidered variables that influence both the exposure and the outcome, see Figure 4 (a). If exposure and outcome are naively correlated, this can distort the estimated effect of the exposure. Fisher identified this problem in his book 'The Design of Experiments' published in 1935. A formal definition was developed in the field of epidemiology in the 1980s [48]. Later, graphical criteria such as the Back-Door Criterion [49,95] were developed to define the term confounding.
In statistics, the problem of confounding is taken into account either in the design (e.g., randomized study, stratification, etc.) or evaluation (propensity score methods [22], marginal structural models [107], graphical models [28]). In this context, it is interesting to note that randomized studies (which have a long tradition in the medical field) have recently been increasingly used in econometric studies [3,29,65]. In the case of observational data, econometrics has made many methodological contributions to the identification of treatment effects, e.g., via the potential outcome approach [110,111,112,116,118] as well as the work on policy evaluation [52].
In contrast to confounders, colliders and mediators lead to distorted estimates of causal effects precisely when they are taken into account during estimation. Whereas colliders represent common consequences of treatment and outcome (Figure 4 (b)), mediators are variables that represent part of the causal mechanism by which the treatment affects the outcome (Figure 4 (c)). Especially in the case of longitudinal data, it is therefore necessary to differentiate in a theoretically informed manner which relationship covariates in the observed data have with the treatment and outcome, thus avoiding bias in the causal effect estimates by (not) having taken them into account.
Fig. 4 Covariate effects in observational data, according to [18]
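The opposite roles of confounders and colliders can be demonstrated in a small simulation. The following Python sketch rests on assumptions made only for illustration: linear structural equations with standard normal noise, a true causal effect of zero in both scenarios, and adjustment by residualizing exposure and outcome on the covariate (the Frisch-Waugh-Lovell approach to a partial regression slope). All names are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """Least-squares slope of y regressed on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(y, z):
    """Residuals of y after a simple linear regression on z."""
    b, mz, my = slope(z, y), mean(z), mean(y)
    return [yi - my - b * (zi - mz) for yi, zi in zip(y, z)]


def adjusted_slope(x, y, z):
    """Slope of y on x after adjusting both for z (Frisch-Waugh-Lovell)."""
    return slope(residuals(x, z), residuals(y, z))


rng = random.Random(0)  # fixed seed for reproducibility
n = 20_000

# Scenario (a): Z confounds X and Y; X has NO causal effect on Y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [2 * zi + rng.gauss(0, 1) for zi in z]
confounded_naive = slope(x, y)
confounded_adjusted = adjusted_slope(x, y, z)

# Scenario (b): C is a collider (common consequence); X and Y are independent.
x2 = [rng.gauss(0, 1) for _ in range(n)]
y2 = [rng.gauss(0, 1) for _ in range(n)]
c = [a + b + rng.gauss(0, 1) for a, b in zip(x2, y2)]
collider_naive = slope(x2, y2)
collider_adjusted = adjusted_slope(x2, y2, c)

print(f"confounder: naive {confounded_naive:+.2f}, adjusted {confounded_adjusted:+.2f}")
print(f"collider  : naive {collider_naive:+.2f}, adjusted {collider_adjusted:+.2f}")
```

Under these assumed equations, the naive slope in the confounder scenario is close to 1 although the true effect is zero, and adjustment removes the distortion. In the collider scenario the naive slope is close to zero, and it is the adjustment that creates a spurious negative association of about -0.5.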
By integrating appropriate statistical theories and methods into AI, it will be possible to answer causal questions and simulate interventions. In medicine, e.g., questions such as 'What would be the effect of a general smoking ban on the German health care system?' can then be investigated and reliable statements can be made, even without randomized studies, which would of course not be feasible for such a question. Pearl's idea goes beyond the use of ML methods in causal analyses (which are used, for example, in connection with targeted learning [71] or causal random forest [2]). His vision is rather to integrate the causal framework [97] described by him with ML algorithms to enable the machines to draw causal conclusions and simulate interventions.
The integration of causal methods in AI also contributes to increasing its transparency and thus the acceptance of AI methods, since a reference to probabilities or statistical correlations in the context of an explanation is not as effective as a reference to causes and causal effects [83].

Evaluating uncertainty, interpretability and validation
Uncertainty quantification is often neglected in AI applications. One reason may be the above-discussed misconception that 'Big Data' automatically leads to exact results, making uncertainty quantification redundant. Another key reason is the complexity of the methods, which hampers the construction of statistically valid uncertainty regions. However, most statisticians would agree that any comprehensive data analysis should contain methods to quantify the uncertainty of estimates and predictions. Its importance is also stressed by the American statistician David B. Dunson, who writes: 'it is crucial to not over-state the results and appropriately characterize the (often immense) uncertainty to avoid flooding the scientific literature with false findings.' [32] In fact, in order to achieve the main goal of highly accurate predictions, assumptions about underlying distributions and functional relationships are deliberately dropped in AI applications. On the one hand, this allows for a greater flexibility of the procedures. On the other hand, however, this also complicates an accurate quantification of uncertainty, e.g., to specify valid prediction and confidence regions for target variables and parameters of interest. As Bühlmann and colleagues put it: 'The statistical theory serves as guard against cheating with data: you cannot beat the uncertainty principle.' [14] In recent years, proposals for uncertainty quantification in AI methods have already been developed (by combinations with Bayesian approximations, bootstrapping, jackknifing and other cross-validation techniques, Gaussian processes, Monte Carlo dropout etc., see, e.g., [46,47,91,133,146]). However, their theoretical validity (e.g., that a prediction interval actually covers future values 95% of the time) has either not been demonstrated yet or has only been proven under very restrictive or at least partially unrealistic assumptions.
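As a minimal sketch of one of the resampling techniques mentioned above, the following Python code computes a bootstrap percentile interval for a simple statistic, here the mean absolute prediction error on a small set of invented errors. The data, the function names and the choice of 2000 resamples are assumptions made for illustration. The sketch concerns a simple statistic of independent observations; it does not address the open question of valid intervals for complex AI predictions.

```python
import random
from statistics import mean


def bootstrap_percentile_interval(data, stat, n_resamples=2000, level=0.95,
                                  seed=0):
    """Approximate percentile interval for `stat` by resampling `data`
    with replacement (assumes independent, identically distributed data)."""
    rng = random.Random(seed)
    n = len(data)
    replicates = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lower = replicates[int(alpha * n_resamples)]
    upper = replicates[int((1.0 - alpha) * n_resamples) - 1]
    return lower, upper


def mean_absolute_error(errors):
    return mean(abs(e) for e in errors)


# Invented prediction errors of some fitted model on twelve test cases.
errors = [2.1, -0.4, 1.3, 0.8, -1.7, 3.2, 0.5, -0.9, 1.9, 0.2, 2.6, -0.3]

low, high = bootstrap_percentile_interval(errors, mean_absolute_error)
print(f"MAE = {mean_absolute_error(errors):.2f}, "
      f"95% bootstrap interval = ({low:.2f}, {high:.2f})")
```

Reporting such an interval alongside the point estimate makes visible how little twelve test cases reveal about the true error of a method.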
In contrast, algorithmic methods could be embedded in statistical models. While potentially less flexible, they permit a better quantification of the underlying uncertainty by specifying valid prediction and confidence intervals. Thus, they allow for a better interpretation of the results.
In comparison, the estimated parameters of many AI approaches (such as deep learning) are difficult to interpret. Pioneering work from computer science on this topic is, for example, [141,142], for which Leslie Valiant was awarded the Turing Award in 2010. Further research is nevertheless needed to improve interpretability. This also includes uncertainty quantification of patterns identified by an AI method, which relies heavily on statistical techniques. A tempting approach to achieving more interpretable AI methods is the use of auxiliary models. These are comparatively simple statistical models which, after a deep learning approach has been fitted, describe the most important patterns represented by the AI method and can potentially also be used to quantify uncertainty [85,99,104,105]. In fact, as in computational and statistical learning theory [51,62,144], statistical methods and AI learning approaches can (and should) complement each other. Another important aspect is model complexity, which can, e.g., be captured by entropies (such as VC dimensions) or compression barriers [72]. These concepts, as well as different forms of regularization [139,147,154], i.e., the restriction of the parameter space, make it possible to recognize or even correct overfitting of a learning procedure. Here, the application of complexity-reducing concepts can be seen as a direct implementation of the Lex Parsimoniae principle and often increases the interpretability of the resulting models [114,140]. In fact, regularization and complexity-reducing concepts are an integral part of many AI methods. However, they are also basic principles of modern statistics, which were proposed before their introduction to AI. Examples are given in connection with empirical Bayesian or shrinkage methods [115]. In addition, AI and statistics have numerous concepts in common, which gives rise to an exchange of methods between the two fields.
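To make the link between regularization and overfitting concrete, the following is a minimal sketch of ridge (L2-penalized) regression, one of the shrinkage methods alluded to above. Everything in it is an illustrative assumption rather than part of the cited work: the data-generating function, the sample sizes, the polynomial degree and the penalty values are hypothetical, and the intercept is penalized too, purely for brevity. Only the Python standard library is used.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def features(x, degree):
    """Polynomial feature vector (1, x, x^2, ..., x^degree)."""
    return [x ** k for k in range(degree + 1)]


def ridge_fit(xs, ys, degree, lam):
    """Minimize the residual sum of squares plus lam * ||beta||^2."""
    rows = [features(x, degree) for x in xs]
    p = degree + 1
    gram = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    rhs = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(gram, rhs)


def mse(beta, xs, ys):
    """Mean squared prediction error of the fitted polynomial."""
    degree = len(beta) - 1
    preds = [sum(b * f for b, f in zip(beta, features(x, degree))) for x in xs]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(xs)


def sample(n, rng):
    """Hypothetical data: y = sin(3x) + Gaussian noise, x uniform on [-1, 1]."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [math.sin(3 * x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(1)
x_train, y_train = sample(20, rng)    # small training set, prone to overfitting
x_test, y_test = sample(500, rng)     # independent data for validation

results = []
for lam in (1e-6, 1e-3, 1e-1, 10.0):
    beta = ridge_fit(x_train, y_train, degree=8, lam=lam)
    norm = math.sqrt(sum(b * b for b in beta))
    results.append((lam, norm, mse(beta, x_train, y_train), mse(beta, x_test, y_test)))
    print(f"lambda={lam:g}  ||beta||={norm:.3f}  "
          f"train MSE={results[-1][2]:.4f}  test MSE={results[-1][3]:.4f}")
```

A larger penalty restricts the parameter space more strongly, so the coefficient norm shrinks and the training error rises. Comparing training and held-out error across penalties is one simple way to recognize overfitting; which penalty performs best on held-out data depends on the data at hand.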
Beyond interpretability and uncertainty quantification, the above-mentioned validation aspects are of immense importance. Often in AI, validation is only carried out on single, frequently used 'established' data sets. Thus, the stability or variability of the results cannot be reliably assessed due to the lack of generalizability. To tackle this problem we can again turn to statistical concepts: In order to reflect a diversity of real-life situations, statistics makes use of probabilistic models. In addition to mathematical validity proofs and theoretical investigations, detailed simulation studies are carried out to evaluate the methods' limits (by exceeding the assumptions made). This statistical perspective provides extremely useful insights. Furthermore, validation aspects also apply to quality criteria (e.g., accuracy, sensitivity and specificity) of AI algorithms. The corresponding estimators are also random, but their uncertainty is usually not quantified at all.
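As one possible illustration of quantifying the uncertainty of such quality criteria, the sketch below attaches a Wilson score interval to accuracy, sensitivity and specificity. The Wilson interval is chosen here only as one standard option for a binomial proportion; it is not prescribed by the text above. The confusion-matrix counts are hypothetical, and the interval assumes independent test cases.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, n, level=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    z = NormalDist().inv_cdf(0.5 + level / 2)  # standard normal quantile
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# Hypothetical confusion-matrix counts from a test set of 100 cases.
tp, fn, tn, fp = 45, 5, 40, 10

# Each criterion is a proportion: (number of "successes", number of trials).
criteria = {
    "accuracy": (tp + tn, tp + fn + tn + fp),
    "sensitivity": (tp, tp + fn),
    "specificity": (tn, tn + fp),
}

intervals = {name: wilson_interval(k, n) for name, (k, n) in criteria.items()}
for name, (k, n) in criteria.items():
    lo, hi = intervals[name]
    print(f"{name}: {k / n:.2f}  (95% interval {lo:.3f} to {hi:.3f}, n={n})")
```

With only 50 cases per class, the intervals for sensitivity and specificity are noticeably wider than the one for accuracy, which is exactly the kind of information a bare point estimate hides.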
A particular challenge is posed by the ever faster development cycles of AI systems, which call for ongoing investigation into how they can be adequately validated. The problem is aggravated in the development of mobile apps or online learning systems such as Amazon's recommender systems. Here, development is a dynamic, de facto never-ending process, which therefore requires continuous validation.
Statistics can help to increase the validity and interpretability of AI methods by contributing to the quantification of uncertainty. To achieve this, we can assume specific probabilistic and statistical models or dependency structures which allow comprehensive mathematical investigations [4,7,27,51,122,145,102], e.g., by investigating robustness properties, proving asymptotic consistency or (finite) error bounds. This also includes the preparation of (stochastic) simulation designs [87] and the specification of easy-to-interpret auxiliary statistical models. Finally, it allows for a detailed analysis of quality criteria of AI algorithms.
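A small stochastic simulation can show how the claimed coverage of a prediction interval might be checked empirically. The sketch below is an assumption-laden illustration, not a method taken from the cited references: the data-generating process is hypothetical, the predictor is a simple least-squares line, and the interval half-width is a quantile of held-out absolute residuals (the idea behind split conformal prediction, a technique swapped in here for brevity).

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def simulate(n, rng):
    """Hypothetical data-generating process: y = 2 + 3x + standard normal noise."""
    xs = [rng.uniform(0, 1) for _ in range(n)]
    ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


def empirical_coverage(seed=0, n_train=200, n_cal=200, n_test=2000, level=0.95):
    """Fraction of fresh observations that fall inside the nominal interval."""
    rng = random.Random(seed)
    a, b = fit_line(*simulate(n_train, rng))

    # Calibrate the half-width on residuals from data not used for fitting.
    x_cal, y_cal = simulate(n_cal, rng)
    abs_res = sorted(abs(y - (a + b * x)) for x, y in zip(x_cal, y_cal))
    k = min(n_cal - 1, math.ceil(level * (n_cal + 1)) - 1)  # adjusted quantile index
    half_width = abs_res[k]

    # Count how often prediction +/- half_width covers new observations.
    x_new, y_new = simulate(n_test, rng)
    hits = sum(abs(y - (a + b * x)) <= half_width for x, y in zip(x_new, y_new))
    return hits / n_test


coverage = empirical_coverage()
print(f"nominal level: 0.95, empirical coverage: {coverage:.3f}")
```

Repeating the experiment over many seeds, or deliberately violating the assumptions (for example, by drawing the new data from a shifted distribution), is how such a simulation design would probe the limits of a method.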

Education and training
AI has been a growing research area for years, and its development is far from complete. In addition to ethical and legal problems, it has been shown that there are still many open questions regarding the collection and processing of data. Therefore, statistical methods must be considered an integral part of AI systems, from the formulation of the research questions and the development of the research design, through the analysis, up to the interpretation of the results. Particularly in the field of methodological development, statistics can, e.g., serve as a multiplier and strengthen scientific exchange by establishing broad and strongly interconnected networks between users and developers. With its unique knowledge, statistics is a natural partner for other disciplines in teaching, research and practice.
School education: In modern societies, the systematic inclusion of AI and its underlying statistical framework, adapted to the cognitive abilities of the respective target populations, is essential at all levels of the educational system. Statistics and computer science should be integral elements of school curricula. This would ensure that AI can be taught in an age-appropriate way in the different types of schools (primary, secondary, and vocational schools). For this purpose, school-related projects and teacher training initiatives must be initiated and promoted under scientific supervision and based on the didactics of statistics and computer science. The goal is to make the complex content attractive and interesting for pupils by taking technical and social aspects into account. Pilot projects are, for example, the 'Project Data Science and Big Data in Schools' (PRODABI; www.prodabi.de) [9], the 'Introduction to Data Science' project (IDS; www.introdatascience.org), and the 'International Data Science in Schools Project' (IDSSP; www.idssp.org). The latter two international projects in particular play a pioneering role among school-related projects in stimulating AI literacy. To sum up, statistics should contribute its expertise to a greater extent to school curricula to ensure a broad and sustainable increase in statistical and computer literacy as a key resource for the future development of seminal technologies. Furthermore, teachers particularly qualified in statistics and its didactics should be systematically involved in digital teaching and training offers (e.g., e-learning).
Higher education: For the tertiary sector, it is important to provide all relevant disciplines with the scientific foundations for high-level basic and applied research in the context of AI. This applies both to Data Science subjects, such as mathematics, statistics and computer science, and to cross-sectional areas, such as medicine, engineering, social and economic sciences. This includes adapting staffing policies at the universities as well as funding programs, possibly new courses of study, doctoral programs, research associations and research programs. In this regard, we see a particular need for expanding the qualification of teachers in statistical methodology and didactics due to the growing demand.
Training and interdisciplinary exchange: Continuing professional development and training should be expanded at various levels. Here, advanced training programs on AI in various forms and formats are conceivable: Informatics and Engineering departments of various universities already offer workshops/summer schools on AI, for example the AI-DLDA at the Universita degli studi di Udine, Italy, https://www.ip4fvg.it/summer-school/ or the AI Summer School in Singapore, https://aisummerschool.aisingapore.org/. Professional development programs in this context include, for example, Coursera (https://www.coursera.org/), which offers online courses on various topics in collaboration with more than 200 leading universities and companies worldwide. Other possibilities include webinars, mentoring, laboratory visits, etc. Similar examples already exist for other areas of statistics such as Biometrics or Medical Informatics; see, for example, the certificates provided by the German Association for Medical Informatics, Biometry and Epidemiology (GMDS). It is important to expand existing offers and establish new ones with a special focus on the statistical contributions to AI. In particular, training should be offered both for methodologists such as computer scientists, statisticians, and mathematicians who are not yet working in the field of AI and for users such as clinicians, engineers, social scientists and economists.
By developing professional networks, participating methodologists can be brought together with users/experts to establish or maintain a continuous exchange between the disciplines. In addition to AI methods, these events should also cover the topics of data curation, management of data quality and data integration.

Conclusions
Statistics is a broad cross-scientific discipline. It provides knowledge and experience of all aspects of data evaluation: starting with the research question through design and analysis to the interpretation. As a core element of AI, it is the natural partner for other disciplines in teaching, research and practice. In particular, the following contributions of statistics to the field of artificial intelligence can be summarized:
1. Methodological development: The development of AI systems and their theoretical underpinning has benefited greatly from research in computer science and statistics, and many procedures have been developed by statisticians. Recent advances such as extreme learning machines show that statistics also provides important contributions to the design of AI systems, for example, by improved learning algorithms based on penalized or robust estimation methods.
2. Planning and design: Statistics can help to optimize data collection or preparation (sample size, sampling design, weighting, restriction of the data set, design of experiments, etc.) for subsequent evaluation with AI methods. Furthermore, the quality measures of statistics and their associated inference methods can help in the evaluation of AI models.
3. Assessment of data quality and data collection: Exploratory data analysis provides a wide range of tools to visualize the empirical distribution of the data and to derive appropriate metrics, which can be used to detect anomalies or to define ranges of typical values, to correct input errors, to determine norm values and to impute missing values. In combination with standardization in data storage, errors in the measurement process can be detected and corrected at an early stage. With the help of model-based statistical methods, comprehensive parameter tuning is also possible, even for small data sets.
4. Differentiation of causality and associations: In statistics, methods for dealing with covariate effects are known. Here, it is important to differentiate, in a theoretically informed way, between the different relationships covariates can have to treatment and outcome in order to avoid bias in the estimation of causal effects. Pearl's causal framework enables the analysis of causal effects and the simulation of interventions. The integration of causal methods into AI can also contribute to the transparency and acceptance of AI methods.
5. Assessment of certainty or uncertainty in results: Statistics can help to enable or improve the quantification of uncertainty in and the interpretability of AI methods. By adopting specific statistical models, mathematical proofs of validity can also be provided. In addition, limitations of the methods can be explored through (stochastic) simulation designs.
6. Conscientious implementation of points 2 to 5, including a previously defined evaluation plan, also counteracts the replication crisis [92] in many scientific disciplines. In this methodological crisis, which has been ongoing since the beginning of the 2010s, it has become clear that many studies, especially in medicine and the social sciences, are difficult or impossible to reproduce.
7. Education, advanced vocational training and public relations: With its specialized knowledge, statistics is the natural partner for other disciplines in teaching and training. Especially in the further development of methods of artificial intelligence, statistics can strengthen scientific exchange.
The objective of statistics related to AI must be to facilitate the interpretation of data. Data alone are hardly a science, no matter how great their volume or how subtly they are manipulated. What is important is the knowledge gained that will enable future interventions [98].

Funding
Not applicable

Conflict of interest
The authors declare that they have no conflict of interest.

Availability of data and material (data transparency)
Not applicable

Code availability (software application or custom code)
Not applicable