A modern retrospective on probabilistic numerics

This article attempts to place the emergence of probabilistic numerics as a mathematical–statistical research field within its historical context and to explore how its gradual development can be related both to applications and to a modern formal treatment. We highlight in particular the parallel contributions of Sul′\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$'$$\end{document}din and Larkin in the 1960s and how their pioneering early ideas have reached a degree of maturity in the intervening period, mediated by paradigms such as average-case analysis and information-based complexity. We provide a subjective assessment of the state of research in probabilistic numerics and highlight some difficulties to be addressed by future works.


Introduction
The field of probabilistic numerics (PN), loosely speaking, attempts to provide a statistical treatment of the errors and/or approximations that are made en route to the output of a deterministic numerical method, e.g. the approximation of an integral by quadrature, or the discretised solution of an ordinary or partial differential equation. This decade has seen a surge of activity in this field. In comparison with historical developments that can be traced back over more than a hundred years, the most recent developments are particularly interesting because they have been characterised by simultaneous input from multiple scientific disciplines: mathematics, statistics, machine learning, and computer science. The field has, therefore, advanced on a broad front, with contributions ranging from the building of overarching general theory to practical implementations in specific problems of interest. Over the same period of time, and because of increased interaction among researchers coming from different communities, the extent to which these developments were -or were not -presaged by twentieth-century researchers has also come to be better appreciated.
Thus, the time appears to be ripe for an update of the 2014 Tübingen Manifesto on probabilistic numerics [Hennig, 2014, Osborne, 2014d and the position paper  to take account of the developments between 2014 and 2019, an improved awareness of the history of this field, and a clearer sense of its future directions and potential.
In this article, we aim to summarise some of the history of probabilistic perspectives on numerics (Section 2), to place more recent developments into context (Section 3), and to articulate a vision for future research in, and use of, probabilistic numerics (Section 4).
1 Indeed, while an improper uniform prior distribution on R makes sense for each A k individually, no such countably additive uniform measure (an "infinite-dimensional Lebesgue measure") can exist on R ∞ for (A k ) ∞ k=0 [Sudakov, 1959]. That said, Poincaré does not impose any summability constraints on the h i either, so the covariance operator associated to his Gaussian prior may fail to be trace class.   [Kazan Federal University, reproduced with permission].
However, our focus here is on the development of probabilistic numerical methods for use on a computer. The limited nature of the earliest computers led authors to focus initially on the phenomenon of round-off error [Henrici, 1962, Hull and Swenson, 1966, von Neumann and Goldstine, 1947, whether of fixed-point or floating-point type, without any particular statistical inferential motivation; more recent contributions to the statistical study of round-off error include [Barlow and Bareiss, 1985, Chatelin and Brunet, 1990, Tienari, 1970. According to von Neumann and Goldstine, writing in 1947, "[round-off errors] are strictly very complicated but uniquely defined number theoretical functions [of the inputs], yet our ignorance of their true nature is such that we best treat them as random variables." [von Neumann andGoldstine, 1947, p. 1027].
Thus, von Neumann and Goldstine seem to have held a utilitarian view that probabilistic models in computation are useful shortcuts, simply easier to work with than the unwieldy deterministic truth. 2 Concerning the numerical solution of ordinary differential equations (ODEs), Henrici [Henrici, 1962[Henrici, , 1963] studied classical finite difference methods and derived expected values and covariance matrices for accumulated round-off error, under an assumption that individual round-off errors can be modelled as independent random variables. In particular, given posited means and covariance matrices of the individual errors, Henrici demonstrated how these moments can be propagated through the computation of a finite difference method. In contrast with more modern treatments, Henrici was concerned with the analysis of an established numerical method and did not attempt to statistically motivate the numerical method itself. Sul din (1959-1980) One of the earliest attempts to motivate a numerical algorithm from a statistical perspective was due to Al bert Valentinovich Sul din , working at Kazan State University in the USSR (now Kazan Federal University in the Russian Federation) [Norden et al., 1978, Zabotin et al., 1996. After first making contributions to the study of Lie algebras, towards the end of the 1950s Sul din turned his attention to computational and applied mathematics, and in particular to probabilistic and statistical methodology. His work in this direction led to the establishment of the Faculty of Computational Mathematics and Cybernetics (now Institute of Computational Mathematics and Information Technologies) in Kazan, of which he was the founding Dean.

The Parallel Contributions of
Sul din began by considering the problem of quadrature. Suppose that we wish to approximate the definite integral b a u(t) dt of a function u ∈ U := C 0 ([a, b]; R), the space of continuous real-valued functions on [a, b], under a statistical assumption that (u(t) − u(a)) t∈ [a,b] follows a standard Brownian motion (Wiener measure, µ W ). For this task we receive pointwise data about the integrand u in the form of the values of u at J ∈ N arbitrarily located nodes t 1 , . . . , t J ∈ [a, b], although for convenience we assume that a = t 1 < t 2 < · · · < t J = b.
In more statistical language, anticipating the terminology of Section 3.2, our observed data or information concerning the integrand u is y := (t j , u(t j )) J j=1 , which takes values in the space Y := ([a, b] × R) J . Since µ W is a Gaussian measure and both the integral and pointwise evaluations of u are linear functions of u, Sul din [Sul din, 1959[Sul din, , 1960[Sul din, , 1963b showed by direct calculation that the quadrature rule B : Y → R that minimises the mean squared error i.e. the definite integral of the piecewise linear interpolant of the observed data. This result was a precursor to a sub-field of numerical analysis that became known as average-case analysis; see Section 2.3. Sul din was aware of the connection between his methods and statistical regression [Sul din, 1963a] and conditional probability [Sul din, 1963c], although it is difficult to know whether he considered his work to be an expression of statistical inference as such. Indeed, since Sul din's methods were grounded in Hilbert space theory [Sul din, 1968, Sul din et al., 1969, the underlying mathematics (the linear conditioning of Gaussian measures on Hilbert spaces) is linear algebra which can be motivated without recourse to a probabilistic framework.
In any case, Sul din's contributions were something entirely novel. Up to this point, the role of statistics in numerical analysis was limited to providing insight into the performance of a traditional numerical method. The 1960s brought forth a new perspective, namely the statistically-motivated design of numerical methods. Indeed, "A.V. Sul din's 1969 habilitation thesis concerned the development of probabilistic methods for the solution of problems in computational mathematics. His synthesis of two branches of mathematics turned out to be quite fruitful, and deep connections were discovered between the robustness of approximation formulae and their precision. Building on the general concept of an enveloping Hilbert space, A.V. Sul din proved a projection theorem that enabled the solution of a number of approximation-theoretic problems. [Zabotin et al., 1996] However, Sul din was not alone in arriving at this point of view. On the other side of the Iron Curtain, between 1957, Frederick Michael ("Mike") Larkin (1936-1982 worked for the UK Atomic Energy Authority in its laboratories at Harwell and Culham (the latter as part of the Computing and Applied Mathematics Group), as well as working for two years at Rolls Royce, England. Following a parallel path to that of Sul din, over the next decade Larkin would further blend numerical analysis and statistical thinking [Kuelbs et al., 1972, Larkin, 1969, 1979b, arguably laying the foundations on which PN would be developed. At Culham, Larkin worked on building some of the first graphical calculators, called GHOST (short for graphical output system), and the GHOUL (graphical output language). It can be speculated that an intimate familiarity with the computational limitations of GHOST and GHOUL may have motivated Larkin to seek a richer description of the numerical error associated to their output.  Larkin (1936Larkin ( -1982 [Larkin et al., 1967, reproduced with permission]. The perspective developed by Larkin was fundamentally statistical and, in modern terminology, the probabilistic numerical methods he developed would be described as Bayesian 4 , which we discuss further in Section 3.2. Nevertheless, the pioneering nature of this research motivated Larkin to focus on specific numerical tasks, as opposed to establishing a unified framework. In particular, he considered in detail the problems of approximating a non-negative function [Larkin, 1969], quadrature [Larkin, 1972[Larkin, , 1974, and estimating the zeros of a complex function [Larkin, 1979b,a]. In the context of the earlier numerical integration example of Sul din, the alternative proposal of Larkin was to consider the Wiener measure as a prior, the information (t j , u(t j )) J j=1 as (noiseless) data, and to output the posterior marginal for the integral b a u(t) dt. That is, Larkin took the fundamental step of considering a distribution over the solution space of the numerical task to be the output of a computation -this is what we would now recognise as the defining property of a probabilistic numerical method : "Among other things, this permits, at least in principle, the derivation of joint probability density functions for [both observed and unobserved] functionals on the space and also allows us to evaluate confidence limits on the estimate of a required functional (in terms of given values of other functionals)." [Larkin, 1972] 5 Thus, in contrast to Sul din's description of the trapezoidal rule B tr from (2.2) as a frequentist point estimator obtained from minimising (2.1), which just happens to produce an unbiased estimator with variance 1 12 J−1 j=1 (t j+1 − t j ) 3 , the Larkin viewpoint is to see the normal distribution on R as the measure-valued output of a probabilistic quadrature rule, of which B tr (t j , z j ) J j=1 is a convenient point summary. Note also that the technical development in this pioneering work made fundamental contributions to the study of Gaussian measures on Hilbert spaces [Kuelbs et al., 1972, Larkin, 1972.
Larkin moved to Canada in 1969 to start work as a Consultant in Numerical Methods and Applied Mathematics within the Computing Centre and, subsequently in 1974, as Associate Professor in the Department of Computing and Information Science (now the School of Computing) at Queen's University in Kingston, Ontario. He received tenure in 1977 and was promoted to full professor in 1980.
"He worked in isolation at Queen's in that few graduate students and fewer faculty members were aware of the nature of his research contributions to the field. [. . . ] Michael pioneered the idea of using a probabilistic approach to give an alternative local approximation technique. In some cases this leads to the classical methods, but in many others leads to new algorithms that appear to have practical advantages over more classical methods. This work has finally begun to attract attention and I expect that the importance of his contribution will grow in time." [Queen's University at Kingston, 11 Feb. 1982] From our perspective, writing in 2019, it seems that Sul din and Larkin were working in parallel but were ahead of their time. Their probabilistic perspectives on approximation theory were similar, but limited to a Gaussian measure context. Naturally, given the linguistic barriers and nearly disjoint publication cultures of their time, it would not have been easy for Larkin and Sul din to be conversant with each other's work, though these barriers were not always as great as is sometimes thought [Hollings, 2016]. At least by 1972 [Larkin, 1972], Larkin was aware of and cited Sul din's work on minimal variance estimators for the values of linear functionals on Wiener space [Sul din, 1959[Sul din, , 1960, but apparently did not know of Sul din's 1969 habilitation thesis, which laid out a broader agenda for the role of probability in numerics. Conversely, Soviet authors writing in 1978 were aware of Sul din's influence on e.g. Ulf Grenander and Walter Freiberger at Brown University, but make no mention of Larkin [Norden et al., 1978]. Sul din, for his part, at least as judged by his publication record, seems to have turned his attention to topics such as industrial mathematics (perhaps an "easier sell" in the production-oriented USSR [Hollings, 2016]), mathematical biology, and of course the pressing concerns of faculty administration.
Finally, concerning the practicality of Sul din and Larkin's ideas, one has to bear in mind the limited computational resources available at even cutting-edge facilities in the 1960s: 6 probabilistic numerics was an idea ahead of its time, and the computational power needed to make it a reality simply did not exist. (1980)(1981)(1982)(1983)(1984)(1985)(1986)(1987)(1988)(1989)(1990) In the main, research contributions until 1990 continued to focus on deriving insight into traditional numerical methods through probabilistic analyses. In particular, the average-case analysis (ACA) of numerical methods received interest and built on the work of Kolmogorov [Kolmogorov, 1936] and Sard [Sard, 1963]. In ACA the performance of a numerical method is assessed in terms of its average error over an ensemble of numerical problems, with the ensemble being represented by a probability measure over the problem set; a prime example is univariate quadrature with the average quadratic loss (2.1) given earlier. Root-finding, optimisation, etc. can all be considered similarly, and we defer to e.g. [Ritter, 2000, Traub et al., 1983 for comprehensive treatments of this broad topic.

Optimal Numerical Methods are Bayes Rules
A traditional (deterministic) numerical method can also be regarded as a decision rule and the probability measure used in ACA can be used to instantiate the Bayesian decision-theoretic framework [Berger, 1985]. The average error is then recognised as the expected loss, also called the risk. The fact that ACA is mathematically equivalent to Bayesian decision theory (albeit limited to the case of an experiment that produces a deterministic dataset) was noted in Wahba, 1970a,b, Parzen, 1970] and also in [Larkin, 1970].
Armed with an optimality criterion for a numerical method, it is natural to ask about the existence and performance of method(s) that minimise it. Such methods are called average-case optimal in ACA and are recognised as Bayes rules or Bayes acts in the decision-theoretic context. A key result in this area is the insight of [Kadane and Wasilkowski, 1985] that ACA-optimal methods coincide with (non-randomised) Bayes rules when the measure used to define the average error is the Bayesian prior; for a further discussion of the relationships among these optimality criteria, including the Bayesian probabilistic numerical methods of Section 3.2, see [Cockayne et al., 2019a.
Many numerical methods come in parametric families, being parametrised by e.g. the number of quadrature nodes, a mesh size, or a convergence tolerance. For any "sensible" method, the error can be driven to zero by sending the parameter to infinity or zero as appropriate. If one is prepared to pay an infinite computational cost, then essentially any method can be optimal! Thus, when asking about the optimality of a numerical method, it is natural to consider the optimality of methods of a given computational cost or complexity.
With such concerns in mind, the field of information-based complexity (IBC) [Novak, 1988, Traub et al., 1983, Traub and Woźniakowsi, 1980 developed simultaneously with ACA, with the aim of relating the computational complexity and optimality properties of algorithms to the available information on the unknowns, e.g. the partial nature of the information and any associated observational costs and errors. For example, Smale [Smale, 1985, Theorem D] compared the accuracies (with respect to mean absolute error) for a given cost of the Riemann sum, trapezoidal, and Simpson quadrature rules 7 ; in the same paper, Smale also considered root-finding, optimisation via linear programming, and the solution of systems of linear equations.
The example of Bayesian quadrature was again discussed in detail by Diaconis [Diaconis, 1988], who repeated Sul din's observation that the posterior mean for b a u(t) dt under the Wiener measure prior is the trapezoidal method (2.2), which is an ACA-optimal numerical method. However, Diaconis posed a further question: can other classical numerical integration methods, or numerical methods for other tasks, be similarly recovered as Bayes rules in a decision-theoretic framework? For linear cubature methods, a positive and constructive answer was recently provided in [Karvonen et al., 2018b], but the question remains open in general.

Probabilistic Numerical Methods (1991-2009)
After a period in which probabilistic numerical methods were all but forgotten, research interest was again triggered by contributions from [Minka, 2000, O'Hagan, 1991, Rasmussen and Ghahramani, 2003 on numerical integration, each to a greater or lesser extent a rediscovery of earlier work due to Larkin [Larkin, 1972]. In each case the output of computation was considered to be a probability distribution over the quantity of interest.
The 1990s saw an expansion in the PN agenda, first with early work on an area that was to become Bayesian optimisation [Močkus, 1975[Močkus, , 1977[Močkus, , 1989 and then with an entirely novel contribution on the numerical solution of ODEs by Skilling [Skilling, 1992]. Skilling presented a Bayesian 8 perspective on the numerical solution of initial value problems of the form and considered, for example, how regularity assumptions on f should be reflected in correlation functions and the hypothesis space, how to choose a prior and likelihood, and potential sampling strategies. Despite this work's then-new explicit emphasis on its Bayesian statistical character, Skilling himself considered his contributions to be quite natural: "This paper arose from long exposure to Laplace/Cox/Jaynes probabilistic reasoning, combined with the University of Cambridge's desire that the author teach some (traditional) numerical analysis. The rest is common sense. [. . . ] Simply, Bayesian ideas are 'in the air'." [Skilling, 1992] 2.5 Modern Perspective (2010-) The last two decades have seen an explosion of interest in uncertainty quantification (UQ) for complex systems, with a great deal of research taking place in this area at the meeting point of applied mathematics, statistics, computational science, and application domains [Le Maître and Knio, 2010, Smith, 2014, Sullivan, 2015: "UQ studies all sources of error and uncertainty, including the following: systematic and stochastic measurement error; ignorance; limitations of theoretical models; limitations of numerical representations of those models; limitations of the accuracy and reliability of computations, approximations, and algorithms; and human error. A more precise definition is UQ is the end-to-end study of the reliability of scientific inferences." [U. S. Department of Energy, 2009, p. 135] Since 2010, perhaps stimulated by this activity in the UQ community, a perspective on PN has emerged that sees PN part of UQ (broadly understood) and should be performed with a view to propagating uncertainty in computational pipelines. This is discussed further in Sections 3.1 and 3.2.
A notable feature of PN research since 2010 is the way that it has advanced on a broad front. The topic of quadrature/cubature, in the tradition of Sul din and Larkin, continues to be well represented: see, e.g., [Briol et al., 2019, Gunter et al., 2014, Karvonen et al., 2018b, Osborne et al., 2012a,b, Särkkä et al., 2016, Xi et al., 2018 and [Ehler et al., 2019, Jagadeeswaran and Hickernell, 2018, Karvonen et al., 2018a, 2019. The Bayesian approach to global optimisation continues to be widely used [Chen et al., 2018, Snoek et al., 2012, whilst probabilistic perspectives on quasi-Newton methods [Hennig and Kiefel, 2013] and line search methods [Mahsereci and Hennig, 2015] have been put forward. In the context of numerical linear algebra, [Bartels and Hennig, 2016, Hennig, 2015 and [Bartels et al., 2019] have approached the solution of a large linear system of equations as a statistical learning task and developed probabilistic alternatives to the classical conjugate gradient method.
Research has been particularly active in the development and analysis of statistical methods for the solution of ordinary and partial differential equations (ODEs and PDEs). One line of research has sought to cast the solution of ODEs in the context of Bayesian filtering theory by building a Gaussian process (GP) regression model for the solution u of the initial value problem of the form (2.5). The observational data consists of the evaluations of the vector field f , interpreted as imperfect observations of the true time derivative u , since one evaluates f at the "wrong" points in space. In this context, the key result is the Bayesian optimality of evaluating f according to the classical Runge-Kutta (RK) scheme, so that the RK methods can be seen as point estimators of GP filtering schemes [Kersting and Hennig, 2016, Schober et al., 2014, 2018 and [Tronarp et al., 2019]. Related iterative probabilistic numerical methods for ODEs include [Abdulle and Garegnani, 2018, Chkrebtii et al., 2016, Conrad et al., 2017, Kersting et al., 2018, Teymur et al., 2018, 2016. The increased participation of mathematicians in the field has led to correspondingly deeper local and global convergence analysis of these methods in the sense of conventional numerical analysis, as in [Conrad et al., 2017, Kersting et al., 2018, Schober et al., 2018, Teymur et al., 2018 and [Lie et al., 2019]; statistical principles for time step adaptivity have also been discussed, e.g. [Chkrebtii and Campbell, 2019].
For PDEs, resent research includes [Chkrebtii et al., 2016, Cockayne et al., 2016, 2017, Owhadi, 2015, with these contributions making substantial use of reproducing kernel Hilbert space (RKHS) structure and Gaussian processes. Unsurprisingly, given the deep connections between linear algebra and numerical methods for PDEs, the probabilistically-motivated theory of gamblets for PDEs [Owhadi, 2017, Owhadi and Scovel, 2017a, Owhadi and Zhang, 2017 has gone hand-in-hand with the development of fast solvers for structured matrix inversion and approximation problems [Schäfer et al., 2017]; see also [Yoo and Owhadi, 2019].
Returning to the point made at the beginning of this section, however, motivation for the development of probabilistic numerical methods has become closely linked to the traditional motivations of UQ (e.g. accurate and honest estimation of parameters of a so-called forward model ), with a role for PN due to the need to employ numerical methods to simulate from a forward model. The idea to substitute a probability distribution in place of the (in general erroneous) output of a traditional numerical method can be used to prevent undue bias and over-confidence in the UQ task and is analogous to robust likelihood methods in statistics [Bissiri et al., 2016, Greco et al., 2008. This motivation is already present in [Conrad et al., 2017], and forms a major theme in [Cockayne et al., 2019a, Oates et al., 2019a. Analysis of the impact of probabilistic numerical methods in simulation of the forward model within the context of Bayesian inversion has been provided in [Lie et al., 2018, Stuart and].

Related Fields and Their Development
The field of PN did not emerge in isolation and the research cited above was undoubtedly influenced by parallel developments in mathematical statistics, some of which are now discussed.
First, the mathematical theory of optimal approximation using splines was applied by Schoenberg [Schoenberg, 1965[Schoenberg, , 1966 and Karlin [Karlin, 1969[Karlin, , 1971[Karlin, , 1972[Karlin, , 1976 in the late 1960s and early 1970s to the linear problem of quadrature. Larkin was aware of the work of Karlin, citing [Karlin, 1969] in [Larkin, 1974]. However, the works cited above were not concerned with randomness and equivalent probabilistic interpretations were not discussed; in contrast, the Bayesian interpretation of spline approximation was highlighted by [Kimeldorf and Wahba, 1970a].
Second, the experimental design literature of the late 1960s and early 1970s, including a sequence of contributions from Sacks and Ylvisacker [Sacks and Ylvisaker, 1968,b, 1966, considered optimal selection of a design 0 ≤ t 1 < t 2 < · · · < t J ≤ 1 to minimise the covariance of the best linear estimator of β given discrete observations of stochastic process where Z is a stochastic process with E[Z(t)] = 0 and E[Z(t) 2 ] < ∞, based on the data {(t j , Y (t j ))} J j=1 . As such, the mathematical content of these works concerns optimal approximation in RKHSs, e.g. Ylvisaker, 1970a, p. 2064, Theorem 1]; we note that Larkin [Larkin, 1970] simultaneously considered optimal approximation in RKHSs. However, the extent to which probability enters these works is limited to the measurement error process Z that is entertained.
Third, the literature on emulation of black-box functions that emerged in the late 1970s and 1980s, with contributions including [O'Hagan, 1978, Sacks et al., 1989, provided Bayesian and frequentist statistical perspectives (respectively) on interpolation of a black-box function based on a finite number of function evaluations. This literature did not present interpolation as an exemplar of other more challenging numerical tasks, such as the solution of differential equations, which could be similarly addressed but rather focused on the specific problem of black-box interpolation in and of itself. The authors of [Sacks et al., 1989] were aware of the work of Sul din but Larkin's work was not cited. The challenges of proposing a suitable stochastic process model for a deterministic function were raised in the accompanying discussion of [Sacks et al., 1989] and were further discussed in [Currin et al., 1991].

Conceptual Evolution -A Summary
To conclude and summarise this section, we perceive the following evolution of the concepts used in, and interpretation applied to, probability in numerical analysis: 1. In the traditional setting of numerical analysis, as seen circa 1950, all objects and operations are seen as being strictly deterministic. Even at that time, however, it was accepted by some that these deterministic objects are sometimes exceedingly complicated, to the extent that they may be treated as being stochastic,à la von Neumann and Goldstine. 2. Sard and Sul din considered the questions of optimal performance of a numerical method in, respectively, the worst-case and the average-case context. Though it is a fact that some of the average-case performance measures amount to variances of point estimators, they were not viewed as such and in the early 1960s these probabilistic aspects were not a motivating factor. 3. Larkin's innovation, in the late 1960s and early 1970s, was to formulate numerical tasks in terms of a joint distribution over latent quantities and quantities of interest, so that the quantity-of-interest output can be seen as a stochastic object. However, perhaps due to the then-prevailing statistical culture, Larkin summarised his posterior distributions using a point estimator accompanied by a credible interval. 4. The fully modern viewpoint, circa 2019, is to explicitly think of the output as a probability measure to be realised, sampled, and possibly summarised.
In this section we wish to emphasise how some of the recent developments mentioned in the previous section have brought greater clarity to the philosophical status of probabilistic numerics, clearing up some old points of disagreement or providing some standardised frameworks for the comparison of tasks and methods.

A Means to an End, or an End in Themselves?
One aspect that has become clearer over the last few years, stimulated to some extent by disagreements between statisticians and numerical analysts over the role of probability in numerics, is that there are (at least) two distinct use cases or paradigms: • (P1) a probability-based analysis of the performance of a (possibly classical) numerical method; • (P2) a numerical method whose output carries the formal semantics of some statistical inferential paradigm (e.g. the Bayesian paradigm; cf. Section 3.2). Representatives of the first class of methods include [Abdulle andGaregnani, 2018, Conrad et al., 2017], which consider stochastic perturbations to explicit numerical integrators for ODEs in order to generate an ensemble of plausible trajectories for the unknown solution of the ODE. In some sense, this can be viewed as a probabilistic sensitivity/stability analysis of a classical numerical method. This first paradigm is also, clearly, closely related to ACA.
The second class of methods is exemplified by the Bayesian probabilistic numerical methods, discussed in [Cockayne et al., 2019a] and Section 3.2. We can further enlarge the second class to include those methods that only approximately carry the appropriate semantics, e.g. because they are only approximately Bayesian, or only Bayesian for a particular quantity of interest or up to a finite time horizon, e.g. the filtering-based solvers for ODEs [Kersting and Hennig, 2016, Kersting et al., 2018, Schober et al., 2014, 2018.
Note that the second class of methods can also be pragmatically motivated, in the sense that formal statistical semantics enable techniques such as ANOVA to be brought to bear on the design and optimisation of a computational pipeline (to target the aspect of the computation that contributes most to uncertainty in the computational output) . In this respect, statistical techniques can in principle supplement the expertise that is typically provided by a numerical analyst.
We note that paradigm (P1), with its close relationship to the longer-established field of ACA, tends to be more palatable to the classical numerical analysis community. The typical, rather than worstcase, performance of a numerical method is of obvious practical interest [Trefethen, 2008]. Statisticians, especially practitioners of Bayesian and fiducial inference, are habitually more comfortable with paradigm (P2) than numerical analysts are. As we remark in Section 4.5, this difference stems in part from a difference of opinion in which quantities are / can be regarded as "random" by the two communities; this difference of opinion affects (P2) much more strongly than (P1).

Bayesian Probabilistic Numerical Methods
A recent research direction, which provides formal foundations for the approach pioneered by Larkin, is to interpret both traditional numerical methods and probabilistic numerical methods as particular solutions to an ill-posed inverse problem [Cockayne et al., 2019a]. Given that the latent quantities involved in numerical tasks are frequently functions, this development is in accordance with recent years' interest in non-parametric inversion in infinite-dimensional function spaces [Stuart, 2010, Sullivan, 2015.
From the point of view of [Cockayne et al., 2019a], which echoes IBC, the common structure of numerical tasks such as quadrature, optimisation, and the solution of an ODE or PDE, is the following: • two known spaces: U, where the unknown latent variable lives, and Q, where the quantity of interest lives; • and a known function Q : U → Q, a quantity-of-interest function; and the traditional role of the numerical analyst is to select/design • a space Y, where data about the latent variable live; • and two functions: Y : U → Y, an information operator that acts on the latent variable to yield information, and B : Y → Q such that B • Y ≈ Q in some sense to be determined.
With respect to this final point, Larkin [Larkin, 1970] observed that there are many senses in which B • Y ≈ Q. One might ask, as Gaussian quadrature does, that the residual operator R := B • Y − Q vanish on a large enough finite-dimensional subspace of U; one might ask, as worst-case analysis does, that R be small in the supremum norm [Sard, 1949]; one might ask, as ACA does, that R be small in some integral norm against a probability measure on U. In the chosen sense, numerical methods aim to make the following diagram approximately commute 9 : A statistician might say that a deterministic numerical method B : Y → U as described above uses observed data y := Y (u) to give a point estimator B(y) ∈ Q for a quantity of interest Q(u) ∈ Q derived from a latent variable u ∈ U.
Example 3.1. The general structure is exemplified by univariate quadrature, in which U : corresponds to pointwise evaluation of the integrand at J given nodes a ≤ t 1 < · · · < t J ≤ b, and the quantity of interest is Thus, we are interested in the definite integral of u, and we estimate it using only the information Y (u), which does not completely specify u. Notice that some but not all quadrature methods B : Y → Q construct an estimate of u and then exactly integrate this estimate; Gaussian quadrature does this by polynomially interpolating the observed data Y (u); by way of constrast, vanilla Monte Carlo builds no such functional estimate of u, since its estimate for the quantity of interest, forgets the locations t j at which the integrand u was evaluated and uses only the values z j := u(t j ) of u.
(Of course, the accuracy of B MC is based on the assumption that the nodes t j are uniformly distributed in [a, b].) This formal framework enables a precise definition of a probabilistic numerical method (PNM) to be stated [Cockayne et al., 2019a, Section 2]. Assume that U, Y, and Q are measurable spaces, that Y and Q are measurable maps, and let P U etc. denote the corresponding sets of probability distributions on these spaces. Let Q : P U → P Q denote the push-forward 10 of the map Q, and define Y etc. similarly.
Definition 3.2. A probabilistic numerical method for the estimation of a quantity of interest Q consists of an information operator Y : U → Y and a map β : P U × Y → P Q , the latter being termed a belief update operator.
That is, given a belief µ about u, β(µ, · ) converts observed data y ∈ Y about u into a belief β(µ, y) ∈ P Q about Q(u), as illustrated by the dashed arrow in the following (not necessarily commutative) diagram: Recall that a diagram such as (3.1) or (3.4) is called commutative if all routes that follow the arrows (functions) from any starting point to any endpoint yield the same result. Thus, commutativity of (3.1) means exactly that B(Y (u)) = Q(u) for all u ∈ U . 10 I.e. Q µ(S) = µ(Q −1 (S)) for all measurable S ⊆ Q As shown by the dotted arrows in (3.3), this perspective is general enough to contain classical numerical methods B : Y → Q as the special case β(µ, y) = δ B(y) , where δ q ∈ P Q is the unit Dirac measure at q ∈ Q.
One desideratum for a PNM β is that its point estimators (e.g. mean, median, or mode) should be closely related to standard deterministic numerical methods B. This aspect is present in works such as [Schober et al., 2014], which considers probabilistic ODE solvers with Runge-Kutta schemes as their posterior means, and [Cockayne et al., 2016[Cockayne et al., , 2017, which consider PDE solvers with the symmetric collocation method as the posterior mean. However, this aspect is by no means universally stressed.
A second, natural, desideratum for a PNM β is that the spread (e.g. the variance) of the distributional output should provide a fair reflection of the accuracy to which the quantity of interest is being approximated. In the statistics literature this amounts to a deside for credible intervals to be well calibrated [Robins and van der Vaart, 2006]. In particular, one might desire that the distribution β contract to the true value of Q(u) at an appropriate rate as the data dimension (e.g. the number of quadrature nodes) is increased. 11 Diagram (3.1), when it commutes, characterises the "ideal" classical numerical method B; there is, as yet, no closed loop in diagram (3.3) involving β, which we would need in order to describe an "ideal" PNM β. This missing map in (3.3) is intimately related to the notion of a Bayesian PNM as defined by [Cockayne et al., 2019a].
The key insight is that, given a prior belief expressed as a probability distribution µ ∈ P U and the information operator Y : U → Y, a Bayesian practitioner has a privileged map from Y into P U to add to diagram (3.3), namely the conditioning operator that maps any possible value y ∈ Y of the observed data to the corresponding conditional distribution µ y ∈ P U for u given y. In this situation, in contrast to the freedom 12 enjoyed by the designer of an arbitrary PNM, a Bayesian has no choice in her/his belief β(µ, y) about Q(u): it must be nothing other than the image under Q of µ y .

Definition 3.3. A probabilistic numerical method is said to be
In this situation µ is called a prior (for u) and β(µ, y) a posterior (for Q(u)).
In other words, being Bayesian means that the following diagram commutes: Note that Definition 3.3 does not insist that a Bayesian PNM actually calculates µ y and then computes the push-forward; only that the output of the PNM is equal to Q µ y . Thus, whether or not a PNM is Bayesian is specific to the quantity of interest Q. Note also that a PNM β(µ, · ) can be Bayesian for some priors µ yet be non-Bayesian for other choices of µ; for details see [Cockayne et al., 2019a, Sec. 5.2].
To be more formal for a moment, in Definition 3.3 the conditioning operation y → µ y is interpreted in the sense of a disintegration, as advocated by [Chang and Pollard, 1997]. This level of technicality is needed in order to make rigorous sense of the operation of conditioning on the µ-negligible event that Y (u) = y. Thus, • for each y ∈ Y, µ y ∈ P U is supported only on those values of u compatible with the observation Y (u) = y, i.e. µ y ({u ∈ U | Y (u) = y}) = 0; • for any measurable set E ⊆ U, y → µ y (E) is a measurable function from Y into [0, 1] satisfying the reconstruction property, or law of total probability, 11 Here we abuse notation slightly: strictly speaking, we should refer not to one PNM β with input data y of varying dimension but to a one-parameter family of PNMs β J parametrised by the data dimension J. 12 The large and rapidly-growing canon of PNMs, only some of which are cited in this article, is strong evidence of just how great this freedom is! Under mild conditions 13 such a disintegration always exists, and is unique up to modification on Y µ-null sets.
Observe that the fundamental difference between ACA (i.e. the probabilistic assessment of classical numerical methods) and Bayesianity of PNMs is that the former concerns the commutativity of diagram (3.1) in the average (i.e. the left-hand half of diagram (3.3)), whereas the latter concerns the commutativity of diagram (3.4).
The prime example of a Bayesian PNM is the following example of kernel quadrature, due to [Larkin, 1972]: Example 3.4. Recall the set-up of Example 3.1. Take a Gaussian distribution µ on C 0 ([a, b]; R), with mean function m : [a, b] → R and covariance function k : [a, b] 2 → R. Then, given the data the disintegration µ y is again a Gaussian on C 0 ([a, b]; R) with mean and covariance functions The Bayesian PNM output β(µ, y), i.e. the push-forward Q µ y , is a Gaussian on R with mean m y and variance (σ y ) 2 given by integrating (3.5) and (3.6) respectively, i.e.
From a practical perspective, k is typically taken to have a parametric form k θ and the parameters θ are adjusted in a data-dependent manner, for example to maximise the marginal likelihood of the information y under the Gaussian model. One may also seek point sets that minimise the posterior variance (σ y ) 2 of the estimate of the integral. For the Brownian covariance kernel k(t, t ) = min(t, t ), the posterior Q µ = N (m y , (σ y ) 2 ) for b a u(t) dt is given by (2.4), the variance of which is clearly minimised by an equally-spaced point set {t j } J j=1 . For more general kernels k, an early reference for selecting the point set {t j } J j=1 to minimise (σ y ) 2 is [O'Hagan, 1991]. This perspective, in which the Bayesian update is singled out from other possible belief updates, is reminiscent of foundational discussions such as [Bissiri et al., 2016, Zellner, 1988. Interestingly, about half of the papers published on PN can be viewed as being (at least approximately) Bayesian; see the survey in the supplement of [Cockayne et al., 2019a]. This includes the work of Larkin, though, as previously mentioned, Larkin himself did not use the terminology of the Bayesian framework. Quite aside from questions of computational cost, non-Bayesian methods come into consideration because the requirement to be fully Bayesian can impose non-trivial constraints on the design of a practical numerical method, particularly for problems with a causal aspect or "time's arrow"; this point was discussed in detail for the numerical solution of ODEs in .
As well as providing a clear formal benchmark, [Cockayne et al., 2019a, Section 5] argue that a key advantage of Bayesian probabilistic numerical methods is that they are closed under composition, so that the output of a computational pipeline composed of Bayesian probabilistic numerical methods will inherit Bayesian semantics itself. This is analogous to the Markov condition that underpins directed acyclic graphical models [Lauritzen, 1996] and may be an advantageous property in the context of large and/or distributed computational codes -an area where performing a classical numerical analysis can often be difficult. For non-Bayesian PNMs it is unclear how these can/should be combined, but we note an analogous discussion of statistical "models made of modules" in the recent work of [Jacob et al., 2017] (who observe, like , that strictly Bayesian models can be brittle under model misspecification, whereas non-Bayesianity confers additional robustness) and also the numerical analysis of probabilistic forward models in Bayesian inverse problems in .

Discussion and Outlook
"Det er vanskeligt at spaa, isaer naar det gaelder Fremtiden." [Danish proverb] As it stands in 2019, our view is that there is much to be excited about. An intermittent stream of ad hoc observations and proposals, which can be traced back to the pioneering work of Larkin and Sul din, has been unified under the banner of probabilistic numerics  and solid statistical foundations have now been established [Cockayne et al., 2019a]. In this section we comment on some of the most important aspects of research that remain to be addressed.

Killer Apps
The most successful area of research to date has been on the development of Bayesian methods for global optimisation [Snoek et al., 2012], which have become standard to the point of being embedded into commercial software [The MathWorks Inc.] and deployed in realistic [Acerbi, 2018, Paul et al., 2018 and indeed high-profile [Chen et al., 2018] applications. Other numerical tasks have yet to experience the same level of practical interest, though we note applications of probabilistic methods for cubature in computer graphics [Marques et al., 2013] and tracking [Prüher et al., 2018], as well as applications of probabilistic numerical methods in medical tractography [Hauberg et al., 2015] and nonlinear state estimation [Oates et al., 2019a] in an industrial context.
It has been suggested that probabilistic numerics is likely to experience the most success in addressing numerical tasks that are fundamentally difficult [Owen, 2019]. One area that we highlight, in particular, in this regard is the solution of high-dimensional PDEs. There is considerable current interest in the deployment of neural networks as a substitute for more traditional numerical methods in this context, e.g. [Sirignano and Spiliopoulos, 2018], and the absence of interpretable error indicators for neural networks is a strong motivation for the development of more formal probabilistic numerical methods for this task. We note also that nonlinear PDEs in particular are prone to non-uniqueness of solutions. For some problems, physical reasoning may be used to choose among the various solutions, from the probabilistic or statistical perspective lack of uniqueness presents no fundamental philosophical issues: the multiple solutions are simply multiple maxima of a likelihood, and the prior is used to select among them, as in e.g. the treatment of Painlevé's transcendents in [Cockayne et al., 2019a].
It has also been noted that the probabilistic approach provides a promising paradigm for the analysis of rounding error in mixed-precision calculations, where classical bounds "do not provide good estimates of the size of the error, and in particular [. . . ] overestimate the error growth, that is, the asymptotic dependence of the error on the problem size" [Higham and Mary, 2018].

Adaptive Bayesian Methods
The presentation of a PNM in Section 3.2 did not permit adaptation. It has been rigorously established that for linear problems adaptive methods (e.g., in quadrature, sequential selection of the notes t j ) do not outperform non-adaptive methods according to certain performance metrics such as worst-case error [Woźniakowski, 1985, Section 3.2]. However, adaptation is known to be advantageous in general for nonlinear problems [Woźniakowski, 1985, Section 3.8]. At a practical level, adaptation is usually an essential component in the development of stopping rules that enable a numerical method to terminate after an error indicator falls below a certain user-specified level. An analysis of adaptive PNMs would constitute a non-trivial generalisation of the framework of [Cockayne et al., 2019a], who limited attention to static directed acyclic graph representation of conditional dependence structure. The generalisation to adaptive PNM necessitates the use of graphical models with a natural filtration, as exemplified by a dynamical Bayesian network [Murphy, 2002].
It has been suggested that numerical analysis is a natural use case for empirical Bayes methods [Carlin andLouis, 2000, Casella, 1985], as opposed to related -but usually more computationally intensive -approaches such as hierarchical modelling and cross-validation. Empirical Bayes methods can be characterised as a specific instance of adaptation in which the observed data are used not only for inference but also to form a point estimator for the prior. For example, in a quadrature setting, the practitioner is in the fortunate position of being able to use evaluations of the integrand u both to estimate the regularity of u and the value of the integral. Empirical Bayesian methods are explored in [Schober et al., 2018] and in [Jagadeeswaran and Hickernell, 2018].

Design of Probabilistic Numerical Methods
Paradigmatic questions in the IBC literature are those of (i) an optimal information operator Y for a given task, and (ii) the optimal numerical method B for a given task, given information of a known type [Traub et al., 1983]. In the statistical literature, there is also a long history of Bayesian optimal experimental design, in parametric and non-parametric contexts [Lindley, 1956, Piiroinen, 2005. The extent to which these principles can be used to design optimal numerical methods automatically (rather than by inspired guesswork on the mathematician's part,à la Larkin) remains a major open question, analogous to the automation of statistical reasoning envisioned by Wald and subsequent commentators on his work [Owhadi and Scovel, 2017b].

Probabilistic Programming
The theoretical foundations of probabilistic numerics have now been laid, but at present a library of compatible code has not been developed. In part, this is due to the amount of work needed in order to make a numerical implementation reliable and efficient, and in this respect PN lies far behind classical numerical analysis at present. Nevertheless, we anticipate that such efforts will be undertaken in coming years, and will lead to the wider adoption of probabilistic numerical methods. In particular, we are excited at the prospect of integrating probabilistic numerical methods into a probabilistic programming language, e.g. [Carpenter et al., 2017], where tools from functional programming and category theory can be exploited in order to automatically compile codes built from probabilistic numerical methods [Ścibior et al., 2015].

Bridging the Numerics-Statistics Gap
"Numerical analysts and statisticians are both in the business of estimating parameter values from incomplete information. The two disciplines have separately developed their own approaches to formalizing strangely similar problems and their own solution techniques; the author believes they have much to offer each other." [Larkin, 1979c] A major challenge faced by researchers in this area is the interdisciplinary gap between numerical analysts on the one hand and statisticians on the other. Though there are some counterexamples, as a first approximation it is true to say that classically-trained numerical analysts lack deep knowledge of probability or statistics, and classically-trained statisticians are not well versed in numerical topics such as convergence and stability analysis. Indeed, not only do these two communities take interest in different questions, they often fail to even see the point of the other group's expertise and approaches to their common problems.
A caricature of this mutual incomprehension is the following: A numerical analyst will quite rightly point out that almost all problems have numerical errors that are provably non-Gaussian, not least because s/he can exhibit a rigorous a-priori or a-posteriori error bound. Therefore, to the numerical analyst it seems wholly inappropriate to resort to Gaussian models for any purpose at all; these are often the statistician's first models of choice, though they should not be the last. This non-paradox was explained in detail by Larkin in [Larkin, 1974]. (As a side note, it seems to us from our discussions that numerical analysts are happier to discuss the modelling of errors than the latent quantities which they regard as fixed, whereas statisticians seems to have the opposite preference; this is a difference in views that echoes the famous frequentist-subjectivist split in statistics.) The numerical analyst also wonders why, in the presence of an under-resolved integral, the practitioner does not simply apply an adaptive quadrature scheme and run it until an a posteriori global error indicator falls below a pre-set tolerance.
We believe that these difficulties are not fundamental and can be overcome by a more careful statement of the approach being taken to address the numerical task. In particular, the meeting ground for the numerical analysts and statisticians, and the critical arena of application for PN, consists of problems that cannot be run to convergence more cheaply than quantifying the uncertainties of the coarse solution -or, at least, where there is an interesting cost-v.-accuracy tradeoff to be had, which is a central enabling factor for multilevel methods [Giles, 2015].
More generally, we are encouraged to see that epistemic uncertainty is being used once again and an analytical device in numerical analysis in the sense originally described by von Neumann and Goldstine [von Neumann and Goldstine, 1947]; see e.g. [Higham and Mary, 2018].

Summary
The first aim of this article was to better understand probabilistic numerics through its historical development. Aside from the pioneering work of Larkin, it was only in the 1990s that probabilistic numerical methods -i.e. algorithms returning a probability distribution as their output -were properly developed. A unified vision of probabilistic computation was powerfully presented in  and subsequently formalised in [Cockayne et al., 2019a].
The second aim of this article was to draw a distinction between PN as a means to an end, as a form of probabilistic sensitivity / stability analysis, and PN as an end in itself. In particular, we highlighted the Bayesian sub-class of PNMs as being closed under composition, a property that makes these particularly well suited for use in UQ; we also remarked that many problems -for reasons of problem structure, computational cost, or robustness to model misspecification -call for methods that are not formally Bayesian.
Finally, we highlighted areas for further development, which we believe will be essential if the full potential of probabilistic numerics highlighted in  is to be realised. From our perspective, the coming to fruition of this vision will require demonstrable success on problems that were intractable with the computational resources of previous decades and a wider acceptance of Larkin's observation quoted above, with which we wholeheartedly agree: numerical analysts and statisticians are indeed in the same business and do have much to offer one other!