1 Introduction

The field of probabilistic numerics (PN), loosely speaking, attempts to provide a statistical treatment of the errors and/or approximations that are made en route to the output of a deterministic numerical method, e.g. the approximation of an integral by quadrature, or the discretised solution of an ordinary or partial differential equation. This decade has seen a surge of activity in this field. In comparison with historical developments that can be traced back over more than a hundred years, the most recent developments are particularly interesting because they have been characterised by simultaneous input from multiple scientific disciplines: mathematics, statistics, machine learning, and computer science. The field has, therefore, advanced on a broad front, with contributions ranging from the building of over-arching general theory to practical implementations in specific problems of interest. Over the same period of time, and because of increased interaction among researchers coming from different communities, the extent to which these developments were—or were not—presaged by twentieth-century researchers has also come to be better appreciated.

Thus, the time appears to be ripe for an update of the 2014 Tübingen Manifesto on probabilistic numerics (Hennig 2014; Osborne 2014a, b, c, d) and the position paper of Hennig et al. (2015) to take account of the developments between 2014 and 2019, an improved awareness of the history of this field, and a clearer sense of its future directions and potential.

In this article, we aim to summarise some of the history of probabilistic perspectives on numerics (Sect. 2), to place more recent developments into context (Sect. 3), and to articulate a vision for future research in, and use of, probabilistic numerics (Sect. 4).

The authors are grateful to the participants of Prob Num 2018, 11–13 April 2018, at the Alan Turing Institute, UK—and in particular the panel discussants Oksana Chkrebtii, Philipp Hennig, Youssef Marzouk, Mike Osborne, and Houman Owhadi—for many stimulating discussions on these topics. However, except where otherwise indicated, the views that we present here are our own, and if we have misquoted or misrepresented the views of others, then the fault is entirely ours.

2 Historical developments

The first aim of this article is to reflect on the gradual emergence of probabilistic numerics as a research field. The account in this section is not intended to be comprehensive in terms of the literature that is cited. Rather, our aim is to provide an account of how the philosophical status of probabilistic approaches to numerical tasks has evolved, and in particular to highlight the parallel, pioneering, but often-overlooked contributions of Sul\('\)din in the USSR and Larkin in the UK and Canada.

2.1 Prehistory (–1959)

The origins of PN can be traced to a discussion of probabilistic approaches to polynomial interpolation by Poincaré in his Calcul des Probabilités (Poincaré 1896, Ch. 21; Poincaré 1912, Ch. 25). Poincaré considered what, in modern terms, would be a particular case of a Gaussian infinite product measure prior on a function f, expressing it as a power series

$$\begin{aligned} f(x) = \sum _{k = 0}^{\infty } A_{k} x^{k} \end{aligned}$$

with independent normally-distributed coefficients \(A_{k}\); one is then given n pointwise observations of the values of f and seeks the probable values of f(x) for another (not yet observed) value of x.

“I suppose it to be known a priori that the function f(x) can be expanded, in a certain domain, in increasing powers of x,

$$\begin{aligned} f(x) = A_{0} + A_{1} x + \dots . \end{aligned}$$

We know nothing about the A, except that the probability that one of them, \(A_{i}\), lies between certain limits, y and \(y + {\text {d}}y\), is

$$\begin{aligned} \sqrt{\frac{h_{i}}{\pi }} e^{- h_{i} y^{2}} \, {\text {d}}y . \end{aligned}$$

We know, from n observations, that

$$\begin{aligned} f(a_{1})&= B_{1} , \\ f(a_{2})&= B_{2} , \\ \cdots \cdots&\cdots \cdots \\ f(a_{n})&= B_{n} . \end{aligned}$$

We seek the probable value of f(x) for another value of x.” (Poincaré 1912, p. 292; translated from the French)

Note that, in using a Gaussian prior, Poincaré was departing from the Laplacian principle of indifference (Laplace 1812), which would have mandated a uniform prior.

Poincaré’s analytical treatment predates the first digital multipurpose computers by decades, yet it clearly illustrates a non-trivial probabilistic perspective on a classic numerical task, namely function approximation by interpolation, a hybrid approach that is entirely in keeping with Poincaré’s reputation as one of the last universalist mathematicians (Ginoux and Gerini 2013).
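
To make the construction concrete, the following sketch (ours, in Python, with illustrative choices of truncation degree, prior variances, and observation values; not Poincaré's notation) truncates the power series at degree d, places independent zero-mean Gaussian priors on the coefficients \(A_{k}\), conditions on n noiseless evaluations of f, and reports the posterior mean and variance of f at an unseen point, i.e. the "probable value" that Poincaré sought.

```python
import numpy as np

# A truncated version of Poincare's setting: f(x) = sum_k A_k x^k with
# independent coefficient priors A_k ~ N(0, s_k^2); here s_k^2 = 1 / (2 h_k),
# and both the degree and the variances are illustrative choices.
d = 6
s2 = 0.5 ** np.arange(d + 1)                     # prior variances of A_0, ..., A_d

def phi(x):
    """Feature map x -> (1, x, x^2, ..., x^d)."""
    return np.vander(np.atleast_1d(x), d + 1, increasing=True)

# Noiseless observations f(a_i) = B_i (the observed values themselves are arbitrary).
a = np.array([-0.5, 0.0, 0.4, 0.8])
B = np.array([0.2, 0.0, -0.3, 0.5])

# Under the prior, (f(a_1), ..., f(a_n), f(x)) is jointly Gaussian with
# covariance k(x, x') = sum_k s_k^2 x^k (x')^k, so conditioning is linear algebra.
S = np.diag(s2)
Phi_a = phi(a)
K_aa = Phi_a @ S @ Phi_a.T
x_new = 0.6
k_x = (Phi_a @ S @ phi(x_new).T).ravel()
w = np.linalg.solve(K_aa, k_x)
post_mean = w @ B                                 # the "probable value" of f(x_new)
post_var = (phi(x_new) @ S @ phi(x_new).T)[0, 0] - w @ k_x
print(post_mean, post_var)
```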

However, our focus here is on the development of probabilistic numerical methods for use on a computer. The limited nature of the earliest computers led authors to focus initially on the phenomenon of round-off error (Henrici 1962; Hull and Swenson 1966; von Neumann and Goldstine 1947), whether of fixed-point or floating-point type, without any particular statistical inferential motivation; more recent contributions to the statistical study of round-off error include those of Barlow and Bareiss (1985), Chatelin and Brunet (1990), and Tienari (1970). According to von Neumann and Goldstine, writing in 1947,

“[round-off errors] are strictly very complicated but uniquely defined number theoretical functions [of the inputs], yet our ignorance of their true nature is such that we best treat them as random variables.” (von Neumann and Goldstine 1947, p. 1027).

Thus, von Neumann and Goldstine seem to have held a utilitarian view that probabilistic models in computation are useful shortcuts, simply easier to work with than the unwieldy deterministic truth.

Concerning the numerical solution of ordinary differential equations (ODEs), Henrici (1962, 1963) studied classical finite difference methods and derived expected values and covariance matrices for accumulated round-off error, under an assumption that individual round-off errors can be modelled as independent random variables. In particular, given posited means and covariance matrices of the individual errors, Henrici demonstrated how these moments can be propagated through the computation of a finite difference method. In contrast with more modern treatments, Henrici was concerned with the analysis of an established numerical method and did not attempt to statistically motivate the numerical method itself.
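
A minimal sketch of the kind of moment propagation Henrici described is as follows (ours; the method, the test equation, and the posited error moments are illustrative only). For the explicit Euler method applied to \(u' = \lambda u\), the accumulated round-off error \(e_{n}\) obeys \(e_{n+1} = (1 + h \lambda ) e_{n} + \rho _{n}\), where \(\rho _{n}\) is the round-off error committed at step n; if the \(\rho _{n}\) are modelled as independent with known mean and variance, then the mean and variance of \(e_{n}\) can be propagated alongside the iteration.

```python
import numpy as np

# Explicit Euler for u' = lam * u on [0, T], with an independent round-off error
# of mean mu_r and variance var_r committed at each step (a modelling assumption
# in the spirit of Henrici's analysis, with illustrative numbers).
lam, u0, T, n_steps = -2.0, 1.0, 1.0, 100
h = T / n_steps
mu_r, var_r = 0.0, 1e-12          # posited moments of a single round-off error

amp = 1.0 + h * lam               # error amplification factor per step
mean_err, var_err = 0.0, 0.0
for _ in range(n_steps):
    # e_{n+1} = amp * e_n + rho_n, with rho_n independent of e_n
    mean_err = amp * mean_err + mu_r
    var_err = amp ** 2 * var_err + var_r

print("accumulated round-off error: mean %.3e, std %.3e"
      % (mean_err, np.sqrt(var_err)))
```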

2.2 The parallel contributions of Larkin and Sul\('\)din (1959–1980)

One of the earliest attempts to motivate a numerical algorithm from a statistical perspective was due to Al\('\!\)bert Valentinovich Sul\('\)din (1924–1996) (Fig. 1), working at Kazan State University in the USSR (now Kazan Federal University in the Russian Federation) (Norden et al. 1978; Zabotin et al. 1996). After first making contributions to the study of Lie algebras, towards the end of the 1950s Sul\('\)din turned his attention to computational and applied mathematics, and in particular to probabilistic and statistical methodology. His work in this direction led to the establishment of the Faculty of Computational Mathematics and Cybernetics (now Institute of Computational Mathematics and Information Technologies) in Kazan, of which he was the founding Dean.

Sul\('\)din began by considering the problem of quadrature. Suppose that we wish to approximate the definite integral \(\int _{a}^{b} u(t) \, {\text {d}}t\) of a function \(u\in {\mathcal {U}}:=C^{0}([a, b]; {\mathbb {R}})\), the space of continuous real-valued functions on [a, b], under a statistical assumption that \((u(t) - u(a))_{t \in [a, b]}\) follows a standard Brownian motion (Wiener measure, \(\mu _{W }\)). For this task, we receive pointwise data about the integrand \(u\) in the form of the values of \(u\) at \(J \in {\mathbb {N}}\) arbitrarily located nodes \(t_{1}, \dots , t_{J} \in [a, b]\), although for convenience we assume that

$$\begin{aligned} a = t_{1}< t_{2}< \dots < t_{J} = b. \end{aligned}$$

In more statistical language, anticipating the terminology of Sect. 3.2, our observed data or information concerning the integrand \(u\) is \(y:=(t_{j}, u(t_{j}))_{j = 1}^{J}\), which takes values in the space \({\mathcal {Y}}:=([a, b] \times {\mathbb {R}})^{J}\).

Fig. 1 Al\('\!\)bert Valentinovich Sul\('\)din (1924–1996). (Sul\('\)din 2018, reproduced with permission)

Since \(\mu _{W }\) is a Gaussian measure and both the integral and pointwise evaluations of \(u\) are linear functions of \(u\), Sul\('\)din (1959, 1960, 1963b) showed by direct calculation that the quadrature rule \(B:{\mathcal {Y}}\rightarrow {\mathbb {R}}\) that minimises the mean squared error

$$\begin{aligned} \int _{{\mathcal {U}}} \left| \int _{a}^{b} u(t) \, {\text {d}}t - B\bigl ( (t_{j}, u(t_{j}))_{j = 1}^{J} \bigr ) \right| ^{2} \, \mu _{W } ({\text {d}}u) \end{aligned}$$
(1)

is the classical trapezoidal rule

$$\begin{aligned}&B_{tr } \bigl ( (t_{j}, z_{j})_{j = 1}^{J} \bigr ) \nonumber \\&\quad :=\frac{1}{2} \sum _{j = 1}^{J - 1} (z_{j + 1} + z_{j}) (t_{j + 1} - t_{j}) \end{aligned}$$
(2)
$$\begin{aligned}&\quad = z_{1} \frac{t_{2} - t_{1}}{2} + \sum _{j = 2}^{J - 1} z_{j} \frac{t_{j + 1} - t_{j - 1}}{2} + z_{J} \frac{t_{J} - t_{J - 1}}{2} , \end{aligned}$$
(3)

i.e. the definite integral of the piecewise linear interpolant of the observed data. This result was a precursor to a sub-field of numerical analysis that became known as average-case analysis; see Sect. 2.3.
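
A direct transcription of (2)–(3) is given below (our illustrative sketch, in Python); the two expressions are algebraically identical, the second merely collecting the weight attached to each observed value \(z_{j}\).

```python
import numpy as np

def B_tr(t, z):
    """Trapezoidal rule (2): integrate the piecewise linear interpolant of the data."""
    t, z = np.asarray(t, float), np.asarray(z, float)
    return 0.5 * np.sum((z[1:] + z[:-1]) * (t[1:] - t[:-1]))

def B_tr_weights(t):
    """Equivalent weight form (3): B_tr = sum_j w_j z_j."""
    t = np.asarray(t, float)
    w = np.zeros_like(t)
    w[0] = (t[1] - t[0]) / 2
    w[-1] = (t[-1] - t[-2]) / 2
    w[1:-1] = (t[2:] - t[:-2]) / 2
    return w

t = np.linspace(0.0, 1.0, 11)          # nodes t_1 = a, ..., t_J = b
z = np.sin(t)                          # observed values z_j = u(t_j)
assert np.isclose(B_tr(t, z), B_tr_weights(t) @ z)
print(B_tr(t, z))                      # close to 1 - cos(1) = 0.4597
```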

Sul\('\)din was aware of the connection between his methods and statistical regression (Sul\('\)din 1963a) and conditional probability (Sul\('\)din 1963c), although it is difficult to know whether he considered his work to be an expression of statistical inference as such. Indeed, since Sul\('\)din’s methods were grounded in Hilbert space theory (Sul\('\)din 1968, 1969), the underlying mathematics (the linear conditioning of Gaussian measures on Hilbert spaces) is linear algebra which can be motivated without recourse to a probabilistic framework.

In any case, Sul\('\)din’s contributions were something entirely novel. Up to this point, the role of statistics in numerical analysis was limited to providing insight into the performance of a traditional numerical method. The 1960s brought forth a new perspective, namely the statistically motivated design of numerical methods. Indeed,

“A.V. Sul\('\!\)din’s 1969 habilitation thesis concerned the development of probabilistic methods for the solution of problems in computational mathematics. His synthesis of two branches of mathematics turned out to be quite fruitful, and deep connections were discovered between the robustness of approximation formulae and their precision. Building on the general concept of an enveloping Hilbert space, A.V. Sul\('\!\)din proved a projection theorem that enabled the solution of a number of approximation-theoretic problems.” (Zabotin et al. 1996)

However, Sul\('\)din was not alone in arriving at this point of view. On the other side of the Iron Curtain, between 1957 and 1969, Frederick Michael (“Mike”) Larkin (1936–1982) (Fig. 2) worked for the UK Atomic Energy Authority in its laboratories at Harwell and Culham (the latter as part of the Computing and Applied Mathematics Group), as well as working for two years at Rolls Royce, England. Following a parallel path to that of Sul\('\)din, over the next decade Larkin would further blend numerical analysis and statistical thinking (Kuelbs et al. 1972; Larkin 1969, 1972, 1974, 1979a, b, c), arguably laying the foundations on which PN would be developed. At Culham, Larkin worked on building some of the first graphical calculators, the GHOST graphical output system and the accompanying GHOUL graphical output language. It can be speculated that an intimate familiarity with the computational limitations of GHOST and GHOUL may have motivated Larkin to seek a richer description of the numerical error associated to their output.

Fig. 2 Frederick Michael Larkin (1936–1982). (Larkin et al. 1967, reproduced with permission)

The perspective developed by Larkin was fundamentally statistical and, in modern terminology, the probabilistic numerical methods he developed would be described as Bayesian, which we discuss further in Sect. 3.2. Nevertheless, the pioneering nature of this research motivated Larkin to focus on specific numerical tasks, as opposed to establishing a unified framework. In particular, he considered in detail the problems of approximating a non-negative function (Larkin 1969), quadrature (Larkin 1972, 1974), and estimating the zeros of a complex function (Larkin 1979a, b). In the context of the earlier numerical integration example of Sul\('\)din, the alternative proposal of Larkin was to consider the Wiener measure as a prior, the information \((t_{j}, u(t_{j}))_{j=1}^J\) as (noiseless) data, and to output the posterior marginal for the integral \(\int _{a}^{b} u(t) \, {\text {d}}t\). That is, Larkin took the fundamental step of considering a distribution over the solution space of the numerical task to be the output of a computation—this is what we would now recognise as the defining property of a probabilistic numerical method:

“Among other things, this permits, at least in principle, the derivation of joint probability density functions for [both observed and unobserved] functionals on the space and also allows us to evaluate confidence limits on the estimate of a required functional (in terms of given values of other functionals).” (Larkin 1972)

Thus, in contrast to Sul\('\)din’s description of the trapezoidal rule \(B_{tr }\) from (2) as a frequentist point estimator obtained from minimising (1), which just happens to produce an unbiased estimator with variance \(\frac{1}{12} \sum _{j = 1}^{J - 1} (t_{j + 1} - t_{j})^{3}\), the Larkin viewpoint is to see the normal distribution

$$\begin{aligned} {\mathcal {N}}\left( B_{tr } \bigl ( (t_{j}, z_{j})_{j = 1}^{J} \bigr ) , \frac{1}{12} \sum _{j = 1}^{J - 1} (t_{j + 1} - t_{j})^{3} \right) \end{aligned}$$
(4)

on \({\mathbb {R}}\) as the measure-valued output of a probabilistic quadrature rule, of which \(B_{tr } \bigl ( (t_{j}, z_{j})_{j = 1}^{J} \bigr )\) is a convenient point summary. Note also that the technical development in this pioneering work made fundamental contributions to the study of Gaussian measures on Hilbert spaces (Kuelbs et al. 1972; Larkin 1972).
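
The shift in viewpoint can be made concrete. The sketch below (ours, with illustrative settings) packages Sul\('\)din's point estimate and the variance \(\frac{1}{12} \sum _{j} (t_{j + 1} - t_{j})^{3}\) into the Gaussian output (4), and then checks by Monte Carlo simulation of Brownian motion paths that this variance agrees with the average-case squared error (1) of the trapezoidal rule.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, J = 0.0, 1.0, 8
fine = np.linspace(a, b, 501)                        # fine grid used as ground truth
idx = np.concatenate(([0],
                      np.sort(rng.choice(np.arange(1, fine.size - 1), J - 2,
                                         replace=False)),
                      [fine.size - 1]))
t = fine[idx]                                        # J nodes with t_1 = a, t_J = b

def larkin_output(t, z):
    """Gaussian output (4): mean = trapezoidal rule (2), variance in closed form."""
    mean = 0.5 * np.sum((z[1:] + z[:-1]) * (t[1:] - t[:-1]))
    var = np.sum((t[1:] - t[:-1]) ** 3) / 12.0
    return mean, var

# Monte Carlo check that the variance in (4) matches the average-case error (1):
# simulate Brownian paths with u(a) = 0, integrate each on the fine grid, and
# compare with the trapezoidal estimate built from the J coarse observations.
n_paths = 10000
dW = rng.normal(0.0, np.sqrt(np.diff(fine))[:, None], (fine.size - 1, n_paths))
paths = np.vstack([np.zeros(n_paths), np.cumsum(dW, axis=0)])
true_int = 0.5 * ((paths[1:] + paths[:-1]) * np.diff(fine)[:, None]).sum(axis=0)
obs = paths[idx, :]
est = 0.5 * ((obs[1:] + obs[:-1]) * np.diff(t)[:, None]).sum(axis=0)
print("empirical mean squared error:", np.mean((true_int - est) ** 2))
print("variance appearing in (4)   :", larkin_output(t, obs[:, 0])[1])
```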

Larkin moved to Canada in 1969 to start work as a Consultant in Numerical Methods and Applied Mathematics within the Computing Centre and, subsequently in 1974, as Associate Professor in the Department of Computing and Information Science (now the School of Computing) at Queen’s University in Kingston, Ontario. He received tenure in 1977 and was promoted to full professor in 1980.

“He worked in isolation at Queen’s in that few graduate students and fewer faculty members were aware of the nature of his research contributions to the field. [...] Michael pioneered the idea of using a probabilistic approach to give an alternative local approximation technique. In some cases this leads to the classical methods, but in many others leads to new algorithms that appear to have practical advantages over more classical methods. This work has finally begun to attract attention and I expect that the importance of his contribution will grow in time.” (Queen’s University at Kingston, 11 Feb. 1982)

From our perspective, writing in 2019, it seems that Sul\('\)din and Larkin were working in parallel but were ahead of their time. Their probabilistic perspectives on approximation theory were similar, but limited to a Gaussian measure context. Naturally, given the linguistic barriers and nearly disjoint publication cultures of their time, it would not have been easy for Larkin and Sul\('\)din to be conversant with each other’s work, though these barriers were not always as great as is sometimes thought (Hollings 2016). At least by 1972 (Larkin 1972), Larkin was aware of and cited Sul\('\)din’s work on minimal variance estimators for the values of linear functionals on Wiener space (Sul\('\)din 1959, 1960), but apparently did not know of Sul\('\)din’s 1969 habilitation thesis, which laid out a broader agenda for the role of probability in numerics. Conversely, Soviet authors writing in 1978 were aware of Sul\('\)din’s influence on, e.g. Ulf Grenander and Walter Freiberger at Brown University, but made no mention of Larkin (Norden et al. 1978). Sul\('\)din, for his part, at least as judged by his publication record, seems to have turned his attention to topics such as industrial mathematics [perhaps an “easier sell” in the production-oriented USSR (Hollings 2016)], mathematical biology, and of course the pressing concerns of faculty administration.

Finally, concerning the practicality of Sul\('\)din and Larkin’s ideas, one has to bear in mind the limited computational resources available at even cutting-edge facilities in the 1960s: probabilistic numerics was an idea ahead of its time, and the computational power needed to make it a reality simply did not exist.

2.3 Optimal numerical methods are Bayes rules (1980–1990)

In the main, research contributions until 1990 continued to focus on deriving insight into traditional numerical methods through probabilistic analyses. In particular, the average-case analysis (ACA) of numerical methods received interest and built on the work of Kolmogorov (1936) and Sard (1963). In ACA, the performance of a numerical method is assessed in terms of its average error over an ensemble of numerical problems, with the ensemble being represented by a probability measure over the problem set; a prime example is univariate quadrature with the average quadratic loss (1) given earlier. Root-finding, optimisation, etc. can all be considered similarly, and we defer to, e.g. Ritter (2000) and Traub et al. (1983) for comprehensive treatments of this broad topic.

A traditional (deterministic) numerical method can also be regarded as a decision rule and the probability measure used in ACA can be used to instantiate the Bayesian decision-theoretic framework (Berger 1985). The average error is then recognised as the expected loss, also called the risk. The fact that ACA is mathematically equivalent to Bayesian decision theory (albeit limited to the case of an experiment that produces a deterministic dataset) was noted by Kimeldorf and Wahba (1970a, b), and Parzen (1970)—and also by Larkin (1970).

Armed with an optimality criterion for a numerical method, it is natural to ask about the existence and performance of method(s) that minimise it. Such methods are called average-case optimal in ACA and are recognised as Bayes rules or Bayes acts in the decision-theoretic context. A key result in this area is the insight of Kadane and Wasilkowski (1985) that ACA-optimal methods coincide with (non-randomised) Bayes rules when the measure used to define the average error is the Bayesian prior; for a further discussion of the relationships among these optimality criteria, including the Bayesian probabilistic numerical methods of Sect. 3.2, see Cockayne et al. (2019a) and Oates et al. (2019b).

Many numerical methods come in parametric families, being parametrised by, e.g. the number of quadrature nodes, a mesh size, or a convergence tolerance. For any “sensible” method, the error can be driven to zero by sending the parameter to infinity or zero as appropriate. If one is prepared to pay an infinite computational cost, then essentially any method can be optimal! Thus, when asking about the optimality of a numerical method, it is natural to consider the optimality of methods of a given computational cost or complexity.

With such concerns in mind, the field of information-based complexity (IBC) (Novak 1988; Traub et al. 1983; Traub and Woźniakowski 1980) developed simultaneously with ACA, with the aim of relating the computational complexity and optimality properties of algorithms to the available information on the unknowns, e.g. the partial nature of the information and any associated observational costs and errors. For example, Smale (1985, Theorem D) compared the accuracies (with respect to mean absolute error) for a given cost of the Riemann sum, trapezoidal, and Simpson quadrature rules; in the same paper, Smale also considered root-finding, optimisation via linear programming, and the solution of systems of linear equations.

The example of Bayesian quadrature was again discussed in detail by Diaconis (1988), who repeated Sul\('\)din’s observation that the posterior mean for \(\int _{a}^{b} u(t) \, {\text {d}}t\) under the Wiener measure prior is the trapezoidal method (2), which is an ACA-optimal numerical method. However, Diaconis posed a further question: can other classical numerical integration methods, or numerical methods for other tasks, be similarly recovered as Bayes rules in a decision-theoretic framework? For linear cubature methods, a positive and constructive answer was recently provided by Karvonen et al. (2018), but the question remains open in general.

2.4 Probabilistic numerical methods (1991–2009)

After a period in which probabilistic numerical methods were all but forgotten, research interest was again triggered by various contributions on numerical integration (Minka 2000; O’Hagan 1991; Rasmussen and Ghahramani 2003), each to a greater or lesser extent a rediscovery of earlier work due to Larkin (1972). In each case, the output of computation was considered to be a probability distribution over the quantity of interest.

The 1990s saw an expansion in the PN agenda, first with early work on an area that was to become Bayesian optimisation (Močkus 1975, 1977, 1989) and then with an entirely novel contribution on the numerical solution of ODEs by Skilling (1992). Skilling presented a Bayesian perspective on the numerical solution of initial value problems of the form

$$\begin{aligned} u'(t) \equiv \frac{{\text {d}}u}{{\text {d}}t}&= f(t, u(t)) , \quad t \in [0, T] , \\ u(0)&= u_{0} , \end{aligned}$$
(5)

and considered, for example, how regularity assumptions on f should be reflected in correlation functions and the hypothesis space, how to choose a prior and likelihood, and potential sampling strategies. Despite this work’s then-new explicit emphasis on its Bayesian statistical character, Skilling himself considered his contributions to be quite natural:

“This paper arose from long exposure to Laplace/Cox/Jaynes probabilistic reasoning, combined with the University of Cambridge’s desire that the author teach some (traditional) numerical analysis. The rest is common sense. [...] Simply, Bayesian ideas are ‘in the air’.” (Skilling 1992)

2.5 Modern perspective (2010–)

The last two decades have seen an explosion of interest in uncertainty quantification (UQ) for complex systems, with a great deal of research taking place in this area at the meeting point of applied mathematics, statistics, computational science, and application domains (Le Maître and Knio 2010; Smith 2014; Sullivan 2015):

“UQ studies all sources of error and uncertainty, including the following: systematic and stochastic measurement error; ignorance; limitations of theoretical models; limitations of numerical representations of those models; limitations of the accuracy and reliability of computations, approximations, and algorithms; and human error. A more precise definition is UQ is the end-to-end study of the reliability of scientific inferences.” (U.S. Department of Energy 2009, p. 135)

Since 2010, perhaps stimulated by this activity in the UQ community, a perspective on PN has emerged that sees PN as part of UQ (broadly understood), holding that it should be performed with a view to propagating uncertainty in computational pipelines. This is discussed further in Sects. 3.1 and 3.2.

A notable feature of PN research since 2010 is the way that it has advanced on a broad front. The topic of quadrature/cubature, in the tradition of Sul\('\)din and Larkin, continues to be well represented: see, e.g. Briol et al. (2019); Gunter et al. (2014); Karvonen et al. (2018); Oates et al. (2017); Osborne et al. (2012a, b); Särkkä et al. (2016), and Xi et al. (2018), as well as Ehler et al. (2019); Jagadeeswaran and Hickernell (2019); Karvonen et al. (2019a), and Karvonen et al. (2019b) in this special issue. The Bayesian approach to global optimisation continues to be widely used (Chen et al. 2018; Snoek et al. 2012), whilst probabilistic perspectives on quasi-Newton methods (Hennig and Kiefel 2013) and line search methods (Mahsereci and Hennig 2015) have been put forward. In the context of numerical linear algebra, Bartels and Hennig (2016); Cockayne et al. (2019b), and Hennig (2015), as well as Bartels et al. (2019) in this special issue, have approached the solution of a large linear system of equations as a statistical learning task and developed probabilistic alternatives to the classical conjugate gradient method.

Research has been particularly active in the development and analysis of statistical methods for the solution of ordinary and partial differential equations (ODEs and PDEs). One line of research has sought to cast the solution of ODEs in the context of Bayesian filtering theory by building a Gaussian process (GP) regression model for the solution \(u\) of the initial value problem of the form (5). The observational data consist of the evaluations of the vector field f, interpreted as imperfect observations of the true time derivative \(u'\), since one evaluates f at the “wrong” points in space. In this context, the key result is the Bayesian optimality of evaluating f according to the classical Runge–Kutta (RK) scheme, so that the RK methods can be seen as point estimators of GP filtering schemes (Kersting and Hennig 2016; Schober et al. 2014, 2018); see also Tronarp et al. (2019) in this special issue. Related iterative probabilistic numerical methods for ODEs include those of Abdulle and Garegnani (2018); Chkrebtii et al. (2016); Conrad et al. (2017); Kersting et al. (2018); Teymur et al. (2016, 2018). The increased participation of mathematicians in the field has led to correspondingly deeper local and global convergence analysis of these methods in the sense of conventional numerical analysis, as performed by Conrad et al. (2017); Kersting et al. (2018); Schober et al. (2018), and Teymur et al. (2018), as well as Lie et al. (2019) in this special issue; statistical principles for time step adaptivity have also been discussed, e.g. by Chkrebtii and Campbell (2019) in this special issue.
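
A minimal sketch of such a filtering-based solver is given below (ours; it uses a once-integrated Wiener process prior on \((u, u')\) and a zeroth-order linearisation, in the spirit of the cited works but not a transcription of any of them, and all parameter choices are illustrative). Each step performs a Kalman prediction under the Gauss–Markov prior and then conditions on the evaluation of f at the predicted value of \(u\), treated as a noiseless observation of \(u'\).

```python
import numpy as np

def ode_filter(f, u0, du0, T, N, sigma2=1.0):
    """Filtering-based solver for u' = f(t, u), u(0) = u0, on [0, T].

    Gauss-Markov prior: once-integrated Wiener process on the state (u, u').
    Update: condition on z_n = f(t_n, predicted u), treated as a noiseless
    observation of u'.  Returns filtering means and variances of u on the grid.
    """
    h = T / N
    A = np.array([[1.0, h], [0.0, 1.0]])                     # state transition
    Q = sigma2 * np.array([[h ** 3 / 3, h ** 2 / 2],
                           [h ** 2 / 2, h]])                 # process noise
    H = np.array([[0.0, 1.0]])                               # observe u'
    m, P = np.array([u0, du0]), np.zeros((2, 2))
    means, vars_ = [u0], [0.0]
    for n in range(1, N + 1):
        m, P = A @ m, A @ P @ A.T + Q                        # predict
        z = f(n * h, m[0])                                   # "data" about u'
        S = H @ P @ H.T                                      # innovation variance
        K = P @ H.T / S                                      # Kalman gain
        m = m + (K * (z - H @ m)).ravel()
        P = P - K @ H @ P
        means.append(m[0]); vars_.append(P[0, 0])
    return np.array(means), np.array(vars_)

# Logistic growth example: the filter mean tracks the solution, while the
# variance gives a (prior-dependent) indication of numerical uncertainty.
mean, var = ode_filter(lambda t, u: u * (1 - u), u0=0.1, du0=0.1 * 0.9, T=5.0, N=50)
print(mean[-1], var[-1])
```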

For PDEs, recent research includes Chkrebtii et al. (2016); Cockayne et al. (2016, 2017), and Owhadi (2015), with these contributions making substantial use of reproducing kernel Hilbert space (RKHS) structure and Gaussian processes. Unsurprisingly, given the deep connections between linear algebra and numerical methods for PDEs, the probabilistically motivated theory of gamblets for PDEs (Owhadi 2017; Owhadi and Scovel 2017a; Owhadi and Zhang 2017) has gone hand-in-hand with the development of fast solvers for structured matrix inversion and approximation problems (Schäfer et al. 2017); see also Yoo and Owhadi (2019) in this special issue.

Returning to the point made at the beginning of this section, however, motivation for the development of probabilistic numerical methods has become closely linked to the traditional motivations of UQ (e.g. accurate and honest estimation of parameters of a so-called forward model), with a role for PN due to the need to employ numerical methods to simulate from a forward model. The idea to substitute a probability distribution in place of the (in general erroneous) output of a traditional numerical method can be used to prevent undue bias and over-confidence in the UQ task and is analogous to robust likelihood methods in statistics (Bissiri et al. 2016; Greco et al. 2008). This motivation is already present in Conrad et al. (2017) and forms a major theme of Cockayne et al. (2019a); Oates et al. (2019a). Analysis of the impact of probabilistic numerical methods in simulation of the forward model within the context of Bayesian inversion has been provided by Lie et al. (2018) and Stuart and Teckentrup (2018).

2.6 Related fields and their development

The field of PN did not emerge in isolation and the research cited above was undoubtedly influenced by parallel developments in mathematical statistics, some of which are now discussed.

First, the mathematical theory of optimal approximation using splines was applied by Schoenberg (1965, 1966) and Karlin (1969, 1971, 1972, 1976) in the late 1960s and early 1970s to the linear problem of quadrature. Indeed, Larkin (1974) cites Karlin (1969). However, the works cited above were not concerned with randomness and equivalent probabilistic interpretations were not discussed; in contrast, the Bayesian interpretation of spline approximation was highlighted by Kimeldorf and Wahba (1970a).

Second, the experimental design literature of the late 1960s and early 1970s, including a sequence of contributions from Sacks and Ylvisaker (1966, 1968, 1970a, b), considered optimal selection of a design \(0 \le t_{1}< t_{2}< \dots < t_{J} \le 1\) to minimise the covariance of the best linear estimator of \(\beta \) given discrete observations of the stochastic process

$$\begin{aligned} Y(t) = \sum _{i = 1}^{m} \beta _{i} \phi _{i} (t) + Z(t) , \end{aligned}$$

where Z is a stochastic process with \({\mathbb {E}}[ Z(t) ] = 0\) and \({\mathbb {E}}[ Z(t)^{2} ] < \infty \), based on the data \(\{ ( t_{j}, Y(t_{j}) ) \}_{j=1}^J\). As such, the mathematical content of these works concerns optimal approximation in RKHSs, e.g. Sacks and Ylvisaker (1970a, p. 2064, Theorem 1); we note that Larkin (1970) simultaneously considered optimal approximation in RKHSs. However, the extent to which probability enters these works is limited to the measurement error process Z that is entertained.

Third, the literature on emulation of black-box functions that emerged in the late 1970s and 1980s, with contributions from, e.g. O’Hagan (1978) and Sacks et al. (1989), provided Bayesian and frequentist statistical perspectives (respectively) on interpolation of a black-box function based on a finite number of function evaluations. This literature did not present interpolation as an exemplar of other more challenging numerical tasks, such as the solution of differential equations, which could be similarly addressed, but rather focused on the specific problem of black-box interpolation in and of itself. Sacks et al. (1989) were aware of the work of Sul\('\)din but Larkin’s work was not cited. The challenges of proposing a suitable stochastic process model for a deterministic function were raised in the accompanying discussion of Sacks et al. (1989) and were further discussed by Currin et al. (1991).

2.7 Conceptual evolution—a summary

To conclude and summarise this section, we perceive the following evolution of the concepts used in, and interpretation applied to, probability in numerical analysis:

  1.

    In the traditional setting of numerical analysis, as seen circa 1950, all objects and operations are seen as being strictly deterministic. Even at that time, however, it was accepted by some that these deterministic objects are sometimes exceedingly complicated, to the extent that they may be treated as being stochastic, à la von Neumann and Goldstine.

  2.

    Sard and Sul\('\)din considered the questions of optimal performance of a numerical method in, respectively, the worst-case and the average-case context. Though it is a fact that some of the average-case performance measures amount to variances of point estimators, they were not viewed as such and in the early 1960s these probabilistic aspects were not a motivating factor.

  3.

    Larkin’s innovation, in the late 1960s and early 1970s, was to formulate numerical tasks in terms of a joint distribution over latent quantities and quantities of interest, so that the quantity-of-interest output can be seen as a stochastic object. However, perhaps due to the then-prevailing statistical culture, Larkin summarised his posterior distributions using a point estimator accompanied by a credible interval.

  4.

    The fully modern viewpoint, circa 2019, is to explicitly think of the output as a probability measure to be realised, sampled, and possibly summarised.

3 Probabilistic numerical methods come into focus

In this section, we wish to emphasise how some of the recent developments mentioned in the previous section have brought greater clarity to the philosophical status of probabilistic numerics, clearing up some old points of disagreement or providing some standardised frameworks for the comparison of tasks and methods.

3.1 A means to an end, or an end in themselves?

One aspect that has become clearer over the last few years, stimulated to some extent by disagreements between statisticians and numerical analysts over the role of probability in numerics, is that there are (at least) two distinct use cases or paradigms:

  • (P1) a probability-based analysis of the performance of a (possibly classical) numerical method;

  • (P2) a numerical method whose output carries the formal semantics of some statistical inferential paradigm (e.g. the Bayesian paradigm; cf. Sect. 3.2).

Representatives of the first class of methods include Abdulle and Garegnani (2018) and Conrad et al. (2017), which consider stochastic perturbations to explicit numerical integrators for ODEs in order to generate an ensemble of plausible trajectories for the unknown solution of the ODE. In some sense, this can be viewed as a probabilistic sensitivity/stability analysis of a classical numerical method. This first paradigm is also, clearly, closely related to ACA.
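
The simplest such construction can be sketched as follows (ours; the perturbation scale and its form are illustrative choices rather than those of the cited works): each step of the explicit Euler method is perturbed by an independent Gaussian draw whose standard deviation mimics the local truncation error, and repeated runs yield an ensemble of plausible trajectories whose spread indicates sensitivity to the discretisation.

```python
import numpy as np

def perturbed_euler_ensemble(f, u0, T, N, n_samples=30, scale=1.0, seed=None):
    """Explicit Euler with an additive random perturbation at each step.

    The perturbation standard deviation scale * h**1.5 is an illustrative
    choice mimicking the local error of Euler's method; the spread of the
    resulting ensemble indicates sensitivity to the discretisation.
    """
    rng = np.random.default_rng(seed)
    h = T / N
    U = np.full(n_samples, float(u0))
    traj = [U.copy()]
    for n in range(N):
        U = U + h * f(n * h, U) + scale * h ** 1.5 * rng.standard_normal(n_samples)
        traj.append(U.copy())
    return np.array(traj)                     # shape (N + 1, n_samples)

traj = perturbed_euler_ensemble(lambda t, u: -2.0 * u, u0=1.0, T=1.0, N=40)
print(traj[-1].mean(), traj[-1].std())        # ensemble spread at t = T
```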

The second class of methods is exemplified by the Bayesian probabilistic numerical methods, discussed in Cockayne et al. (2019a) and Sect. 3.2. We can further enlarge the second class to include those methods that only approximately carry the appropriate semantics, e.g. because they are only approximately Bayesian, or only Bayesian for a particular quantity of interest or up to a finite time horizon, e.g. the filtering-based solvers for ODEs (Kersting and Hennig 2016; Kersting et al. 2018; Schober et al. 2014, 2018).

Note that the second class of methods can also be pragmatically motivated, in the sense that formal statistical semantics enable techniques such as ANOVA to be brought to bear on the design and optimisation of a computational pipeline (to target the aspect of the computation that contributes most to uncertainty in the computational output) (Hennig et al. 2015). In this respect, statistical techniques can in principle supplement the expertise that is typically provided by a numerical analyst.

We note that paradigm (P1), with its close relationship to the longer-established field of ACA, tends to be more palatable to the classical numerical analysis community. The typical, rather than worst-case, performance of a numerical method is of obvious practical interest (Trefethen 2008). Statisticians, especially practitioners of Bayesian and fiducial inference, are habitually more comfortable with paradigm (P2) than numerical analysts are. As we remark in Sect. 4.5, this difference stems in part from a difference of opinion about which quantities are, or can be, regarded as “random” by the two communities; this difference of opinion affects (P2) much more strongly than (P1).

3.2 Bayesian probabilistic numerical methods

A recent research direction, which provides formal foundations for the approach pioneered by Larkin, is to interpret both traditional numerical methods and probabilistic numerical methods as particular solutions to an ill-posed inverse problem (Cockayne et al. 2019a). Given that the latent quantities involved in numerical tasks are frequently functions, this development is in accordance with recent years’ interest in non-parametric inversion in infinite-dimensional function spaces (Stuart 2010; Sullivan 2015).

From the point of view of Cockayne et al. (2019a), which echoes IBC, the common structure of numerical tasks such as quadrature, optimisation, and the solution of an ODE or PDE, is the following:

  • two known spaces: \({\mathcal {U}}\), where the unknown latent variable lives, and \({\mathcal {Q}}\), where the quantity of interest lives;

  • and a known function \(Q:{\mathcal {U}}\rightarrow {\mathcal {Q}}\), a quantity-of-interest function;

and the traditional role of the numerical analyst is to select/design

  • a space \({\mathcal {Y}}\), where data about the latent variable live;

  • and two functions: \(Y:{\mathcal {U}}\rightarrow {\mathcal {Y}}\), an information operator that acts on the latent variable to yield information, and \(B:{\mathcal {Y}}\rightarrow {\mathcal {Q}}\) such that \(B\circ Y\approx Q\) in some sense to be determined.

With respect to this final point, Larkin (1970) observed that there are many senses in which \(B\circ Y\approx Q\). One might ask, as Gaussian quadrature does, that the residual operator \(R :=B\circ Y- Q\) vanish on a large enough finite-dimensional subspace of \({\mathcal {U}}\); one might ask, as worst-case analysis does, that R be small in the supremum norm (Sard 1949); one might ask, as ACA does, that R be small in some integral norm against a probability measure on \({\mathcal {U}}\). In the chosen sense, numerical methods aim to make the following diagram approximately commute:

[Diagram (6): the composition \({\mathcal {U}}\xrightarrow {Y}{\mathcal {Y}}\xrightarrow {B}{\mathcal {Q}}\) should agree, in the chosen sense, with the quantity-of-interest map \(Q:{\mathcal {U}}\rightarrow {\mathcal {Q}}\).]
(6)

A statistician might say that a deterministic numerical method \(B:{\mathcal {Y}}\rightarrow {\mathcal {Q}}\) as described above uses observed data \(y:=Y(u)\) to give a point estimator \(B(y) \in {\mathcal {Q}}\) for a quantity of interest \(Q(u) \in {\mathcal {Q}}\) derived from a latent variable \(u\in {\mathcal {U}}\).

Example 1

The general structure is exemplified by univariate quadrature, in which \({\mathcal {U}}:=C^{0}([a, b]; {\mathbb {R}})\), the information operator

$$\begin{aligned} Y(u) :=(t_{j}, u(t_{j}))_{j = 1}^{J} \in {\mathcal {Y}}:=([a, b] \times {\mathbb {R}})^{J}, \end{aligned}$$

corresponds to pointwise evaluation of the integrand at J given nodes \(a \le t_{1}< \dots < t_{J} \le b\), and the quantity of interest is

$$\begin{aligned} Q(u) :=\int _{a}^{b} u(t) \, {\text {d}}t \in {\mathcal {Q}}:={\mathbb {R}}. \end{aligned}$$

Thus, we are interested in the definite integral of \(u\), and we estimate it using only the information \(Y(u)\), which does not completely specify \(u\). Notice that some but not all quadrature methods \(B:{\mathcal {Y}}\rightarrow {\mathcal {Q}}\) construct an estimate of \(u\) and then exactly integrate this estimate; Gaussian quadrature does this by polynomially interpolating the observed data \(Y(u)\); by way of contrast, vanilla Monte Carlo builds no such functional estimate of \(u\), since its estimate for the quantity of interest,

$$\begin{aligned} B_{MC } \left( (t_{j}, z_{j})_{j = 1}^{J} \right) = \frac{b - a}{J} \sum _{j = 1}^{J} z_{j} , \end{aligned}$$
(7)

forgets the locations \(t_{j}\) at which the integrand \(u\) was evaluated and uses only the values \(z_{j} :=u(t_{j})\) of \(u\). (Of course, the accuracy of \(B_{MC }\) is based on the assumption that the nodes \(t_{j}\) are uniformly distributed in [a, b].)
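
Example 1 can be transcribed directly into code; the sketch below (ours, with an arbitrary test integrand) implements the information operator \(Y\), the quantity of interest \(Q\) (computed to high accuracy on a reference grid), and two choices of \(B\): the trapezoidal rule, which integrates an interpolant of the data, and the Monte Carlo rule (7), which uses only the observed values.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, J = 0.0, 2.0, 50
u = np.cos                                    # latent integrand, an element of U

def Y(u, t):
    """Information operator: pointwise evaluation of u at the nodes t."""
    return list(zip(t, u(t)))

def Q(u):
    """Quantity of interest: the definite integral, computed on a reference grid."""
    s = np.linspace(a, b, 100001)
    v = u(s)
    return 0.5 * np.sum((v[1:] + v[:-1]) * np.diff(s))

def B_trapezoid(data):
    """Integrates the piecewise linear interpolant of the data."""
    t, z = map(np.asarray, zip(*data))
    return 0.5 * np.sum((z[1:] + z[:-1]) * (t[1:] - t[:-1]))

def B_MC(data):
    """Monte Carlo rule (7): forgets the node locations, keeps only the values."""
    _, z = map(np.asarray, zip(*data))
    return (b - a) * np.mean(z)

t = np.sort(rng.uniform(a, b, J))             # nodes, uniform as assumed by B_MC
data = Y(u, t)
print(Q(u), B_trapezoid(data), B_MC(data))    # the exact value is sin(2) = 0.909...
```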

This formal framework enables a precise definition of a probabilistic numerical method (PNM) to be stated (Cockayne et al. 2019a, Section 2). Assume that \({\mathcal {U}}\), \({\mathcal {Y}}\), and \({\mathcal {Q}}\) are measurable spaces, that \(Y\) and \(Q\) are measurable maps, and let \({\mathcal {P}}_{{\mathcal {U}}}\) etc. denote the corresponding sets of probability distributions on these spaces. Let \(Q_{\sharp } :{\mathcal {P}}_{{\mathcal {U}}} \rightarrow {\mathcal {P}}_{{\mathcal {Q}}}\) denote the push-forward of the map \(Q\), and define \(Y_{\sharp }\) etc. similarly.

Definition 1

A probabilistic numerical method for the estimation of a quantity of interest \(Q\) consists of an information operator \(Y:{\mathcal {U}}\rightarrow {\mathcal {Y}}\) and a map \(\beta :{\mathcal {P}}_{{\mathcal {U}}} \times {\mathcal {Y}}\rightarrow {\mathcal {P}}_{{\mathcal {Q}}}\), the latter being termed a belief update operator.

That is, given a belief \(\mu \) about \(u\), \(\beta (\mu , \cdot )\) converts observed data \(y\in {\mathcal {Y}}\) about \(u\) into a belief \(\beta (\mu , y) \in {\mathcal {P}}_{{\mathcal {Q}}}\) about \(Q(u)\), as illustrated by the dashed arrow in the following (not necessarily commutative) diagram:

[Diagram (8): the belief update operator \(\beta (\mu , \cdot ):{\mathcal {Y}}\rightarrow {\mathcal {P}}_{{\mathcal {Q}}}\) (the dashed arrow) maps the observed data \(y= Y(u)\) to a belief about \(Q(u)\), alongside the maps \(Y\), \(Q\), and \(Q_{\sharp }\); dotted arrows embed a classical method \(B\) via \(y\mapsto \delta _{B(y)}\).]
(8)

As shown by the dotted arrows in (8), this perspective is general enough to contain classical numerical methods \(B:{\mathcal {Y}}\rightarrow {\mathcal {Q}}\) as the special case \(\beta (\mu , y) = \delta _{B(y)}\), where \(\delta _{q} \in {\mathcal {P}}_{{\mathcal {Q}}}\) is the unit Dirac measure at \(q\in {\mathcal {Q}}\).
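
Definition 1 amounts to a small interface, which the following sketch (ours, with hypothetical names) makes explicit: a belief update operator is a callable taking a prior and observed data and returning a distribution on \({\mathcal {Q}}\), and a classical method \(B\) is embedded as the Dirac special case \(\beta (\mu , y) = \delta _{B(y)}\).

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Dirac:
    """Unit point mass delta_q on the quantity-of-interest space Q."""
    q: Any

# Only the belief update operator beta : P_U x Y -> P_Q is modelled here,
# as a callable taking a prior (possibly None) and the observed data.
BeliefUpdate = Callable[[Any, Any], Any]

def classical_to_pnm(B: Callable[[Any], Any]) -> BeliefUpdate:
    """Embed a classical method B : Y -> Q as beta(mu, y) = delta_{B(y)}."""
    return lambda mu, y: Dirac(B(y))

def B_tr(data):
    """Trapezoidal rule, as an example of a classical method B."""
    t, z = zip(*data)
    return sum(0.5 * (z[j + 1] + z[j]) * (t[j + 1] - t[j])
               for j in range(len(t) - 1))

beta = classical_to_pnm(B_tr)
print(beta(None, [(0.0, 0.0), (0.5, 0.25), (1.0, 1.0)]))   # Dirac(q=0.375)
```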

One desideratum for a PNM \(\beta \) is that its point estimators (e.g. mean, median, or mode) should be closely related to standard deterministic numerical methods \(B\). This aspect is present in works such as Schober et al. (2014), which considers probabilistic ODE solvers with Runge–Kutta schemes as their posterior means, and Cockayne et al. (2016, 2017), which consider PDE solvers with the symmetric collocation method as the posterior mean. However, this aspect is by no means universally stressed.

A second, natural, desideratum for a PNM \(\beta \) is that the spread (e.g. the variance) of the distributional output should provide a fair reflection of the accuracy to which the quantity of interest is being approximated. In the statistics literature, this amounts to a desire for credible intervals to be well calibrated (Robins and van der Vaart 2006). In particular, one might desire that the distribution \(\beta \) contract to the true value of \(Q(u)\) at an appropriate rate as the data dimension (e.g. the number of quadrature nodes) is increased.

Diagram (6), when it commutes, characterises the “ideal” classical numerical method \(B\); there is, as yet, no closed loop in diagram (8) involving \(\beta \), which we would need in order to describe an “ideal” PNM \(\beta \). This missing map in (8) is intimately related to the notion of a Bayesian PNM as defined by Cockayne et al. (2019a).

The key insight is that, given a prior belief expressed as a probability distribution \(\mu \in {\mathcal {P}}_{{\mathcal {U}}}\) and the information operator \(Y:{\mathcal {U}}\rightarrow {\mathcal {Y}}\), a Bayesian practitioner has a privileged map from \({\mathcal {Y}}\) into \({\mathcal {P}}_{{\mathcal {U}}}\) to add to diagram (8), namely the conditioning operator that maps any possible value \(y\in {\mathcal {Y}}\) of the observed data to the corresponding conditional distribution \(\mu ^{y} \in {\mathcal {P}}_{{\mathcal {U}}}\) for \(u\) given \(y\). In this situation, in contrast to the freedom enjoyed by the designer of an arbitrary PNM, a Bayesian has no choice in her/his belief \(\beta (\mu , y)\) about \(Q(u)\): it must be nothing other than the image under \(Q\) of \(\mu ^{y}\).

Definition 2

A probabilistic numerical method is said to be Bayesian for \(\mu \in {\mathcal {P}}_{{\mathcal {U}}}\) if,

$$\begin{aligned} \beta (\mu , y) = Q_{\sharp } \mu ^{y} \text { for } Y_{\sharp } \mu \text {-almost all }y\in {\mathcal {Y}}. \end{aligned}$$

In this situation \(\mu \) is called a prior (for \(u\)) and \(\beta (\mu , y)\) a posterior (for \(Q(u)\)).

In other words, being Bayesian means that the following diagram commutes:

[Diagram (9): the belief update \(\beta (\mu , \cdot )\) coincides with the composition \({\mathcal {Y}}\xrightarrow {y\mapsto \mu ^{y}}{\mathcal {P}}_{{\mathcal {U}}}\xrightarrow {Q_{\sharp }}{\mathcal {P}}_{{\mathcal {Q}}}\).]
(9)

Note that Definition 2 does not insist that a Bayesian PNM actually calculates \(\mu ^{y}\) and then computes the push-forward; only that the output of the PNM is equal to \(Q_{\sharp } \mu ^{y}\). Thus, whether or not a PNM is Bayesian is specific to the quantity of interest \(Q\). Note also that a PNM \(\beta (\mu , \cdot )\) can be Bayesian for some priors \(\mu \) yet be non-Bayesian for other choices of \(\mu \); for details see Cockayne et al. (2019a, Sec. 5.2).

To be more formal for a moment, in Definition 2 the conditioning operation \(y\mapsto \mu ^{y}\) is interpreted in the sense of a disintegration, as advocated by Chang and Pollard (1997). This level of technicality is needed in order to make rigorous sense of the operation of conditioning on the \(\mu \)-negligible event that \(Y(u) = y\). Thus,

  • for each \(y\in {\mathcal {Y}}\), \(\mu ^{y} \in {\mathcal {P}}_{{\mathcal {U}}}\) is supported only on those values of \(u\) compatible with the observation \(Y(u) = y\), i.e. \(\mu ^{y} ( \{ u\in {\mathcal {U}}\mid Y(u) \ne y\} ) = 0\);

  • for any measurable set \(E \subseteq {\mathcal {U}}\), \(y\mapsto \mu ^{y}(E)\) is a measurable function from \({\mathcal {Y}}\) into [0, 1] satisfying the reconstruction property, or law of total probability,

    $$\begin{aligned} \mu (E) = \int _{{\mathcal {Y}}} \mu ^{y}(E) \, (Y_{\sharp } \mu ) ({\text {d}}y) . \end{aligned}$$

Under mild conditions such a disintegration always exists, and is unique up to modification on \(Y_{\sharp } \mu \)-null sets.

Observe that the fundamental difference between ACA (i.e. the probabilistic assessment of classical numerical methods) and Bayesianity of PNMs is that the former concerns the commutativity of diagram (6) in the average (i.e. the left-hand half of diagram (8)), whereas the latter concerns the commutativity of diagram (9).

The prime example of a Bayesian PNM is the following example of kernel quadrature, due to Larkin (1972):

Example 2

Recall the setup of Example 1. Take a Gaussian distribution \(\mu \) on \(C^{0}([a, b]; {\mathbb {R}})\), with mean function \(m :[a, b] \rightarrow {\mathbb {R}}\) and covariance function \(k :[a, b]^2 \rightarrow {\mathbb {R}}\). Then, given the data

$$\begin{aligned} y = (t_{j}, z_{j})_{j = 1}^{J} \equiv (t_{j}, u(t_{j}))_{j = 1}^{J} , \end{aligned}$$

the disintegration \(\mu ^{y}\) is again a Gaussian on \(C^{0}([a, b]; {\mathbb {R}})\) with mean and covariance functions

$$\begin{aligned}&m^{y} (t) = m(t) + k_{T} (t)^\top k_{TT}^{-1} (z_{T} - m_{T}) , \end{aligned}$$
(10)
$$\begin{aligned}&k^{y} (t, t') = k(t, t') - k_{T} (t)^{\top } k_{TT}^{-1} k_{T}(t') , \end{aligned}$$
(11)

where \(k_{T} :[a, b] \rightarrow {\mathbb {R}}^{J}\), \(k_{TT} \in {\mathbb {R}}^{J \times J}\), \(z_{T} \in {\mathbb {R}}^{J}\), and \(m_{T} \in {\mathbb {R}}^{J}\) are given by

$$\begin{aligned}&[k_{T} (t)]_j :=k(t, t_{j}) ,&[k_{TT}]_{i,j}&:=k(t_{i}, t_{j}) , \\&[z_{T}]_{j} :=z_{j} \equiv u(t_{j}) ,&[m_{T}]_j&:=m(t_{j}) . \end{aligned}$$

The Bayesian PNM output \(\beta (\mu , y)\), i.e. the push-forward \(Q_{\sharp } \mu ^{y}\), is a Gaussian on \({\mathbb {R}}\) with mean \(\overline{m}^{y}\) and variance \((\overline{\sigma }^{y})^{2}\) given by integrating (10) and (11) respectively, i.e.

$$\begin{aligned}&\overline{m}^{y} = \int _{a}^{b} m(t) \, {\text {d}}t + \left[ \int _{a}^{b} k_{T} (t) \, {\text {d}}t \right] ^\top k_{TT}^{-1} (z_{T} - m_{T}) , \\&(\overline{\sigma }^{y})^{2} = \int _{a}^{b} \int _{a}^{b} k(t, t') \, {\text {d}}t \, {\text {d}}t' \\&\qquad \qquad \quad - \left[ \int _{a}^{b} k_{T} (t) \, {\text {d}}t \right] ^\top k_{TT}^{-1} \left[ \int _{a}^{b} k_{T} (t') \, {\text {d}}t' \right] . \end{aligned}$$

From a practical perspective, k is typically taken to have a parametric form \(k_\theta \) and the parameters \(\theta \) are adjusted in a data-dependent manner, for example to maximise the marginal likelihood of the information \(y\) under the Gaussian model.

One may also seek point sets that minimise the posterior variance \((\overline{\sigma }^{y})^{2}\) of the estimate of the integral. For the Brownian covariance kernel \(k(t, t') = \min (t, t')\), the posterior \(Q_{\sharp } \mu ^{y} = {\mathcal {N}}(\overline{m}^{y}, (\overline{\sigma }^{y})^{2})\) for \(\int _{a}^{b} u(t) \, {\text {d}}t\) is given by (4), the variance of which is clearly minimised by an equally spaced point set \(\{ t_{j} \}_{j=1}^{J}\). For more general kernels k, an early reference for selecting the point set \(\{ t_{j} \}_{j=1}^{J}\) to minimise \((\overline{\sigma }^{y})^{2}\) is O’Hagan (1991).
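
To make Example 2 concrete, the sketch below (ours) implements the integrated versions of (10)–(11) for the Brownian covariance \(k(t, t') = \min (t, t')\) with zero prior mean and \(a = 0\), for which the required kernel integrals \(\int _{a}^{b} k_{T}(t) \, {\text {d}}t\) and \(\int _{a}^{b} \int _{a}^{b} k(t, t') \, {\text {d}}t \, {\text {d}}t'\) are available in closed form; for other kernels these integrals would be computed analytically where possible or otherwise approximated. The final lines verify the connection to (4) noted above.

```python
import numpy as np

a, b = 0.0, 1.0
t = np.linspace(0.2, 1.0, 5)       # nodes t_1, ..., t_J with t_J = b; the node a is
                                   # omitted because this prior already pins u(a) = 0
z = np.sin(3 * t)                  # observed integrand values z_j = u(t_j)

# Brownian covariance k(t, t') = min(t, t') with zero prior mean, together with
# its integrals over [a, b] (here a = 0), all available in closed form.
K_TT = np.minimum.outer(t, t)                       # [k(t_i, t_j)]
int_kT = t * b - t ** 2 / 2                         # int_a^b k(s, t_j) ds
int_int_k = b ** 3 / 3                              # int_a^b int_a^b k(s, s') ds ds'

w = np.linalg.solve(K_TT, int_kT)                   # quadrature weights
bq_mean = w @ z                                     # integral of (10)
bq_var = int_int_k - int_kT @ w                     # double integral of (11)

# For this kernel the output coincides with (4): the mean is the trapezoidal rule
# through (a, 0) and the data, and the variance is (1/12) * sum of cubed spacings.
t_aug, z_aug = np.concatenate(([a], t)), np.concatenate(([0.0], z))
trap = 0.5 * np.sum((z_aug[1:] + z_aug[:-1]) * np.diff(t_aug))
var4 = np.sum(np.diff(t_aug) ** 3) / 12.0
assert np.isclose(bq_mean, trap) and np.isclose(bq_var, var4)
print(bq_mean, bq_var)
```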

This perspective, in which the Bayesian update is singled out from other possible belief updates, is reminiscent of foundational discussions such as those of Bissiri et al. (2016) and Zellner (1988). Interestingly, about half of the papers published on PN can be viewed as being (at least approximately) Bayesian; see the survey in the supplement of Cockayne et al. (2019a). This includes the work of Larkin, though, as previously mentioned, Larkin himself did not use the terminology of the Bayesian framework. Quite aside from questions of computational cost, non-Bayesian methods come into consideration because the requirement to be fully Bayesian can impose non-trivial constraints on the design of a practical numerical method, particularly for problems with a causal aspect or “time’s arrow”; this point was discussed in detail for the numerical solution of ODEs by Wang et al. (2018).

As well as providing a clear formal benchmark, Cockayne et al. (2019a, Section 5) argue that a key advantage of Bayesian probabilistic numerical methods is that they are closed under composition, so that the output of a computational pipeline composed of Bayesian probabilistic numerical methods will inherit Bayesian semantics itself. This is analogous to the Markov condition that underpins directed acyclic graphical models (Lauritzen 1996) and may be an advantageous property in the context of large and/or distributed computational codes—an area where performing a classical numerical analysis can often be difficult. For non-Bayesian PNMs, it is unclear how these can/should be combined, but we note an analogous discussion of statistical “models made of modules” in the recent work of Jacob et al. (2017) [who observe, like Owhadi et al. (2015), that strictly Bayesian models can be brittle under model misspecification, whereas non-Bayesianity confers additional robustness] and also the numerical analysis of probabilistic forward models in Bayesian inverse problems by Lie et al. (2018).

4 Discussion and outlook

“Det er vanskeligt at spaa, især naar det gælder Fremtiden.” (“It is difficult to make predictions, especially about the future.”) [Danish proverb]

As it stands in 2019, our view is that there is much to be excited about. An intermittent stream of ad hoc observations and proposals, which can be traced back to the pioneering work of Larkin and Sul\('\)din, has been unified under the banner of probabilistic numerics (Hennig et al. 2015) and solid statistical foundations have now been established (Cockayne et al. 2019a). In this section, we comment on some of the most important aspects of research that remain to be addressed.

4.1 Killer apps

The most successful area of research to date has been on the development of Bayesian methods for global optimisation (Snoek et al. 2012), which have become standard to the point of being embedded into commercial software (The MathWorks Inc. 2018) and deployed in realistic (Acerbi 2018; Paul et al. 2018) and indeed high-profile (Chen et al. 2018) applications. Other numerical tasks have yet to experience the same level of practical interest, though we note applications of probabilistic methods for cubature in computer graphics (Marques et al. 2013) and tracking (Prüher et al. 2018), as well as applications of probabilistic numerical methods in medical tractography (Hauberg et al. 2015) and nonlinear state estimation (Oates et al. 2019a) in an industrial context.

It has been suggested that probabilistic numerics is likely to experience the most success in addressing numerical tasks that are fundamentally difficult (Owen 2019). One area that we highlight, in particular, in this regard is the solution of high-dimensional PDEs. There is considerable current interest in the deployment of neural networks as a substitute for more traditional numerical methods in this context, e.g. Sirignano and Spiliopoulos (2018), and the absence of interpretable error indicators for neural networks is a strong motivation for the development of more formal probabilistic numerical methods for this task. We note also that nonlinear PDEs in particular are prone to non-uniqueness of solutions. For some problems, physical reasoning may be used to choose among the various solutions; from the probabilistic or statistical perspective, lack of uniqueness presents no fundamental philosophical issues: the multiple solutions are simply multiple maxima of a likelihood, and the prior is used to select among them, as in e.g. the treatment of Painlevé’s transcendents by Cockayne et al. (2019a).

It has also been noted that the probabilistic approach provides a promising paradigm for the analysis of rounding error in mixed-precision calculations, where classical bounds “do not provide good estimates of the size of the error, and in particular [...] overestimate the error growth, that is, the asymptotic dependence of the error on the problem size” (Higham and Mary 2018).

4.2 Adaptive Bayesian methods

The presentation of a PNM in Sect. 3.2 did not permit adaptation. It has been rigorously established that for linear problems adaptive methods (e.g. in quadrature, sequential selection of the nodes \(t_{j}\)) do not outperform non-adaptive methods according to certain performance metrics such as worst-case error (Woźniakowski 1985, Section 3.2). However, adaptation is known to be advantageous in general for nonlinear problems (Woźniakowski 1985, Section 3.8). At a practical level, adaptation is usually an essential component in the development of stopping rules that enable a numerical method to terminate after an error indicator falls below a certain user-specified level. An analysis of adaptive PNMs would constitute a non-trivial generalisation of the framework of Cockayne et al. (2019a), who limited attention to a static directed acyclic graph representation of conditional dependence structure. The generalisation to adaptive PNMs necessitates the use of graphical models with a natural filtration, as exemplified by a dynamic Bayesian network (Murphy 2002).

It has been suggested that numerical analysis is a natural use case for empirical Bayes methods (Carlin and Louis 2000; Casella 1985), as opposed to related—but usually more computationally intensive—approaches such as hierarchical modelling and cross-validation. Empirical Bayes methods can be characterised as a specific instance of adaptation in which the observed data are used not only for inference but also to form a point estimator for the prior. For example, in a quadrature setting, the practitioner is in the fortunate position of being able to use evaluations of the integrand \(u\) to estimate both the regularity of \(u\) and the value of the integral. Empirical Bayesian methods are explored by Schober et al. (2018) and by Jagadeeswaran and Hickernell (2019) in this special issue.
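
As a minimal illustration of this idea (our sketch, not a method from the cited works): if the covariance is parametrised as \(k_{\theta }(t, t') = \theta \, r(t, t')\) for a fixed correlation function r, then, for zero prior mean, the marginal likelihood of the observed values is maximised at \({\hat{\theta }} = z_{T}^{\top } R_{TT}^{-1} z_{T} / J\), the posterior mean is unchanged, and the posterior variance simply scales by \({\hat{\theta }}\).

```python
import numpy as np

def bq_empirical_bayes(t, z, b):
    """Bayesian quadrature on [0, b] with covariance theta * min(t, t'), zero
    prior mean, and the amplitude theta set by maximum marginal likelihood."""
    R = np.minimum.outer(t, t)                  # fixed correlation matrix R_TT
    int_rT = t * b - t ** 2 / 2                 # int_0^b min(s, t_j) ds
    int_int_r = b ** 3 / 3                      # int_0^b int_0^b min(s, s') ds ds'
    theta_hat = z @ np.linalg.solve(R, z) / len(t)   # empirical Bayes estimate
    w = np.linalg.solve(R, int_rT)
    mean = w @ z                                # unchanged by the choice of theta
    var = theta_hat * (int_int_r - int_rT @ w)  # scales linearly with theta_hat
    return mean, var, theta_hat

t = np.linspace(0.25, 1.0, 4)
print(bq_empirical_bayes(t, np.sin(3 * t), b=1.0))
```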

4.3 Design of probabilistic numerical methods

Paradigmatic questions in the IBC literature are those of (i) an optimal information operator \(Y\) for a given task, and (ii) the optimal numerical method \(B\) for a given task, given information of a known type (Traub et al. 1983). In the statistical literature, there is also a long history of Bayesian optimal experimental design, in parametric and non-parametric contexts (Lindley 1956; Piiroinen 2005). The extent to which these principles can be used to design optimal numerical methods automatically (rather than by inspired guesswork on the mathematician’s part, à la Larkin) remains a major open question, analogous to the automation of statistical reasoning envisioned by Wald and subsequent commentators on his work (Owhadi and Scovel 2017b).

4.4 Probabilistic programming

The theoretical foundations of probabilistic numerics have now been laid, but at present a library of compatible code has not been developed. In part, this is due to the amount of work needed in order to make a numerical implementation reliable and efficient, and in this respect PN lies far behind classical numerical analysis at present. Nevertheless, we anticipate that such efforts will be undertaken in coming years, and will lead to the wider adoption of probabilistic numerical methods. In particular, we are excited at the prospect of integrating probabilistic numerical methods into a probabilistic programming language, e.g. Carpenter et al. (2017), where tools from functional programming and category theory can be exploited in order to automatically compile codes built from probabilistic numerical methods (Ścibior et al. 2015).

4.5 Bridging the numerics–statistics gap

“Numerical analysts and statisticians are both in the business of estimating parameter values from incomplete information. The two disciplines have separately developed their own approaches to formalizing strangely similar problems and their own solution techniques; the author believes they have much to offer each other.” (Larkin 1979c)

A major challenge faced by researchers in this area is the interdisciplinary gap between numerical analysts on the one hand and statisticians on the other. Though there are some counterexamples, as a first approximation it is true to say that classically trained numerical analysts lack deep knowledge of probability or statistics, and classically trained statisticians are not well versed in numerical topics such as convergence and stability analysis. Indeed, not only do these two communities take interest in different questions, they often fail to even see the point of the other group’s expertise and approaches to their common problems.

A caricature of this mutual incomprehension is the following: A numerical analyst will quite rightly point out that almost all problems have numerical errors that are provably non-Gaussian, not least because s/he can exhibit a rigorous a priori or a posteriori error bound. Therefore, to the numerical analyst it seems wholly inappropriate to resort to Gaussian models for any purpose at all; these are often the statistician’s first models of choice, though they should not be the last. This non-paradox was explained in detail by Larkin (1974). (As a side note, it seems to us from our discussions that numerical analysts are happier to discuss the modelling of errors than the latent quantities which they regard as fixed, whereas statisticians seem to have the opposite preference; this is a difference in views that echoes the famous frequentist–subjectivist split in statistics.) The numerical analyst also wonders why, in the presence of an under-resolved integral, the practitioner does not simply apply an adaptive quadrature scheme and run it until an a posteriori global error indicator falls below a pre-set tolerance.

We believe that these difficulties are not fundamental and can be overcome by a more careful statement of the approach being taken to address the numerical task. In particular, the meeting ground for the numerical analysts and statisticians, and the critical arena of application for PN, consists of problems for which running a numerical method to convergence would cost more than quantifying the uncertainty in a coarse solution—or, at least, problems for which there is an interesting cost-versus-accuracy tradeoff to be had, which is a central enabling factor for multilevel methods (Giles 2015).

More generally, we are encouraged to see that epistemic uncertainty is being used once again as an analytical device in numerical analysis in the sense originally described by von Neumann and Goldstine (1947); see e.g. Higham and Mary (2018).

4.6 Summary

The first aim of this article was to better understand probabilistic numerics through its historical development. Aside from the pioneering work of Larkin, it was only in the 1990s that probabilistic numerical methods—i.e. algorithms returning a probability distribution as their output—were properly developed. A unified vision of probabilistic computation was powerfully presented by Hennig et al. (2015) and subsequently formalised by Cockayne et al. (2019a).

The second aim of this article was to draw a distinction between PN as a means to an end, as a form of probabilistic sensitivity / stability analysis, and PN as an end in itself. In particular, we highlighted the Bayesian subclass of PNMs as being closed under composition, a property that makes these particularly well suited for use in UQ; we also remarked that many problems—for reasons of problem structure, computational cost, or robustness to model misspecification—call for methods that are not formally Bayesian.

Finally, we highlighted areas for further development, which we believe will be essential if the full potential of probabilistic numerics highlighted by Hennig et al. (2015) is to be realised. From our perspective, the coming to fruition of this vision will require demonstrable success on problems that were intractable with the computational resources of previous decades and a wider acceptance of Larkin’s observation quoted above, with which we wholeheartedly agree: numerical analysts and statisticians are indeed in the same business and do have much to offer one another!