Understanding the stochastic partial differential equation approach to smoothing

Correlation and smoothness are terms used to describe a wide variety of random quantities. In time, space, and many other domains, they both imply the same idea: quantities that occur closer together are more similar than those further apart. Two popular statistical models that represent this idea are basis-penalty smoothers (Wood, 2017) and stochastic partial differential equations (SPDE) (Lindgren et al., 2011). In this paper, we discuss how the SPDE can be interpreted as a smoothing penalty and can be fitted using the R package mgcv, allowing practitioners with existing knowledge of smoothing penalties to better understand the implementation and theory behind the SPDE approach.


Introduction
Data collected over space or time are often obtained with the desire to elicit an underlying pattern. The stochastic partial differential equation (SPDE) approach introduced by Lindgren et al. (2011) and implemented in the R-INLA software package (Rue et al., 2009) is a flexible, efficient method to analyse such data. Despite this, wider application is inhibited by two obstacles. First, the methods are presented using mathematical concepts and terms more usually found in applied mathematics and physics, making it difficult for practitioners in other fields to understand and adapt these methods to their own needs. Second, available software implementations are difficult to customise without high-level technical knowledge, limiting application to only those models available in the software or specially requested from software developers.
Here, we aim to mitigate these two issues: we describe (i) how the SPDE model can be interpreted as a basis-penalty smoother, a modelling framework more familiar to practitioners who use smoothing techniques (Wood, 2017), and (ii) how software to fit these smoothers (e.g., mgcv), regularly extended and customised for application, can be used to fit SPDE models or, to go further, used to incorporate SPDE methods into larger models.
In this paper, we consider the following situation. Let z(x) be a random variable observed at location x or time x, depending on the domain. A statistical model for z is constructed in three components or terms: z(x) = η(x) + f (x) + (x). The first component, η(x), is the fixed effect, often a linear combination of observed covariates with unknown parameters.
The third component, (x), represents the measurement error or unstructured error, often (x) ∼ N (0, σ 2 ) for unknown parameter σ and every location x. The second component is a stochastic process, representing the structured dependence among observations: observations made closer together in time or space are more likely to be similar than those further apart. Gaussian with mean zero and covariance matrix Σ with (i, j) th entry c(x i , x j ). We can extend this formulation to non-Gaussian responses by using a link function, g, so the response is then modelled with a specified distribution and a mean g −1 (η(x) + f (x)).
An example of this kind of data might be a time series of counts. The left panel of Figure 1 show human cases of campylobacterosis (a common form of food poisoning, often originating in under-cooked poultry) in northern Québec, every 28 days from 1 January 1990 to 31 October 2000. We may expect a given time period's count to be similar to its neighbours (e.g. due to seasonal variation), so our aim is to build a model that can capture this dependence. Using the above formulation, we model the counts as Poisson z(x) ∼ Po(exp(f (x))) where x represents time, z(x) is the number of cases at time x, f is a function of time representing the underlying process and exp is the appropriate inverse link function. Dependence structures become more complex when we move to a spatial domain.
The right panel of Figure 1 shows remotely-sensed log chlorophyll A levels in the Aral sea, derived from satellite data. In this case we expect that pixels close to each other have similar chlorophyll A levels. We now have x represent a location in space, and the log chlorophyll A level at that location, z(x) is modelled by z(x) = f (x)+ (x), where now f is a 2-dimensional stochastic process and (x) ∼ N (0, σ 2 ). We revisit these examples in Section 4, below.  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000  Including f in the model raises two issues: how to specify the covariance function c and, once specified, how to fit the model. There are many possible solutions to these questions, including the SPDE and smoothing penalty approaches, and each uses different theoretical and numerical approximations; however, there is a common element: for observation locations {x 1 , . . . , x n }, each method aims to define the covariance between these locations, by constructing an approximation to the precision matrix, defined as the inverse of the covariance matrix (Q = Σ −1 ). The precision matrix is a fundamental quantity required to fit these models (Simpson et al., 2012). The size of Σ and Q makes the necessary computations expensive, in particular, if one of these matrices needs to be inverted so as to compute the other. It is in trying to avoid this computational burden that approximations are used.
The SPDE approach provides a method to approximate Q without the common substantial computational burden. The SPDE is an equation to be solved. Solutions to this equation are stochastic processes whose covariance structure is chosen to satisfy the relationship the SPDE specifies. The SPDE approach involves finding an SPDE whose solutions have the covariance structure, and implied precision matrix, that is desired for f . Lindgren et al. (2011) show how to find an approximate solution to the SPDE by representing f as a sum of basis functions multiplied by coefficients; this provides a computationally efficient way to compute Q: Lindgren et al. (2011) show that the coefficients of these basis functions form a Gaussian Markov random field, for which methods for fast computation of precision matrices already exist (Rue and Held, 2005 for spatial modelling with a focus on the SPDE approach. They give an intuition for the method by introducing the notion of a "discretised" differential operator and describing the finite element methods that are used to solve the SPDE (Brenner and Scott, 2007;Bakka, 2018). See also Krainski et al. (2019) for a collection of worked examples of modelling with SPDEs using R-INLA, Wood (2019) for an approach to nested Laplace approximations without sparse Gaussian Markov random field structures, and Blangiardo and Cameletti (2015) for a comprehensive textbook on spatio-temporal modelling with R-INLA. We note that the R-INLA implementation of the SPDE approach has been applied in a wide variety of domains such as spatial epidemiology (Arab, 2015), species distribution mapping (de Rivera et al., 2018), spatial point processes (Simpson et al., 2016;Yuan et al., 2017;Soriano-Redondo et al., 2019) and environmental science (Huang et al., 2017), to name just a few examples.
Our presentation here differs from the above resources in that we explicitly draw links with another well-known modelling framework.
The basis-penalty smoothing approach (Wood, 2017) is similar to the SPDE approach: the function f is a sum of basis functions multiplied by coefficients. Rather than specify an SPDE and deduce a covariance structure between the coefficients, a smoothing penalty is used to induce correlation between the coefficients. This penalty measures how smooth f is in its domain; intuitively, if f changes more smoothly then values of f at nearby locations are more correlated. Jointly optimising a measure of fit (sum of squares or log-likelihood) and smoothing penalty leads to an optimal curve, the smoothing spline (Wahba, 1990).
This is a well-established approach with several excellent introductory resources (Hastie and Tibshirani, 1990;Ramsay and Silverman, 2005;Wood, 2017) and has been applied in many There is a direct correspondence between smoothing splines and stochastic processes (Kimeldorf and Wahba, 1970): the smoothing spline is a minimum variance unbiased linear estimator of the posterior mean of the stochastic process. For a stochastic process with a given covariance function, there is a corresponding SPDE and smoothing penalty such that one can estimate the posterior mean of f using the SPDE approach or the basis-penalty 5 approach: both methods estimate the same quantity with the only differences being in numerical approximations and terminology. This means that the SPDE can be interpreted as a smoothing penalty and vice-versa.
This equivalence has been confirmed by Fahrmeir and Lang (2001), Lindgren and Rue (2008), and Yue et al. (2014) who show how basis-penalty smoothers in a Bayesian framework can be interpreted within the SPDE paradigm. Simpson et al. (2012) remark that the SPDE formulation is useful because it provides those with a background in physics or applied mathematics a way to understand and apply the model. In contrast, less emphasis has been placed on discussing this equivalence the other way around: SPDE methods can be formulated as basis-penalty smoothers. The SPDE formulation can seem opaque and fundamentally different for those unfamiliar with the mathematical concepts used. For this reason, showing the approach within the familiar smoothing framework demystifies the workings of the model and allows researchers in other fields to understand, adapt, and use the methods.
We note that our approach is aligned with the general aim of emphasising links between Gaussian processes and the reproducing kernel Hilbert spaces theory that underpins the basis-penalty smoothing approach (Kanagawa et al., 2018), although here we take a more applied perspective.
Our aim in this paper is to show that the SPDE model as introduced by Lindgren et al.
(2011) (usually fitted using R-INLA) can be described as a basis-penalty smoother and fitted using mgcv. To do this, we first describe the SPDE method for those unfamiliar with the mathematical concepts used, highlighting the key steps in the method. Afterward, we show the equivalences and differences between the SPDE method and the analogous basis-penalty smoother.

What is an SPDE?
A stochastic partial differential equation involves stochastic processes and differential operators. Examples of differential operators (D) are the first derivative, the second derivative, the gradient operator in two-dimensions or the Laplacian in two dimensions. Combinations of these are also differential operators, e.g., for a function f . Here, we consider only linear differential operators, that is, Df is a linear combination of derivatives of f , of different orders. Differential operators of stochastic processes can be treated similarly to those applied to ordinary functions, there is one key difference that we will highlight below. Overall, an SPDE states that the differential of a function f is equal to some known stochastic process, most commonly the white noise process, . The white noise process is completely uncorrelated and (x) is a Normal random variable with mean zero and finite variance for every x.
In general, the SPDE states that Df = for some differential operator D. A stochastic process, f , is called a solution to the SPDE if it satisfies this equation. Consider an example, let D be the first derivative of the function. The SPDE Df = therefore states that the first derivative of f has mean zero and finite variance at every point; furthermore, it states that the value of the derivative at points x and y are uncorrelated for all x = y. Approximately, this means that for a small δ and point variable. Consider if the SPDE has a parameter τ such that Df = /τ such that τ controls the variance in the white noise process. This means that changes in f are more variable when τ is reduced and less variable for higher τ . In other words, the parameters of the SPDE control the smoothness of f . It is important to note that here the term "smoothness" is not used in a mathematical way, meaning differentiability, nor in a strictly statistical way, referring to correlation range, but in a qualitative way-when we speak of differentiability or correlation we shall use those terms explicitly.
For a given D, the mathematical form of the solution to the SPDE Df = is known: where w is a function you can derive given you know D. The function w is called Green's function; in the appendix (Proposition 1) we show how this function is derived from D. Intuitively, w acts as a weighting function such that the value of the stochastic process at x is a weighted sum over the white noise process; this is called a convolution. Suppose w were set to give infinite weight to distance 0, w(0) = ∞, and In summary, the solutions to the SPDE Df = have a covariance structure that is induced by the choice of D. This means that one could describe a system using an SPDE and then deduce the associated covariance function from it. The power of the SPDE approach is realised by doing the opposite: find a D that induces the covariance function that you want. The power of finding the SPDE corresponding to a desired covariance function is that the precision matrix can be efficiently computed using the SPDE.

Solving the SPDE
The SPDE involves applying a differential operator D to a stochastic process, f , but this cannot be done in the same way as when you apply D to a known function. This is because f is random and, in many cases, realisations of f will not be suitably differentiable. For example, the Brownian motion stochastic process has a derivative equal to the white noise process, but it is also known that simulated trajectories of Brownian motion are nowhere differentiable. Df = is a convenient shorthand way to think about the SPDE, but technically, the SPDE only has meaning when stated in an integral form. That with compact support. The function φ is often called the test function. For brevity, let f, g = f (x)g(x) dx and so the integral form is Df, φ = , φ . The notation f, g is called the inner product of f, g, it has many nice mathematical properties, including being lin- . . , f n , g 1 , . . . , g m and constants a 1 , . . . , a n , b 1 , . . . , b m .
In the integral form, the equation makes sense because any stochastic process can be integrated, but not every one can be differentiated. By requiring the equation to hold for every φ, we require the left-hand stochastic process Df and the right-hand process to have the same integral, no matter how we average over space. For example, if the stochastic processes were one-dimensional, we could split the real line into intervals [n, n + 1] and select a function φ n to be one on this interval and zero outside. Since the integral equation must hold for all such functions, we therefore require Df to have the same average value as on each and every interval.
Given an SPDE, Lindgren et al. (2011) show how to derive an approximate solution using the finite element method. The domain (e.g., time or space) is split into "elements", e.g., a grid or a triangulation, often called a mesh. To each point j = 1, . . . , M on this mesh, a basis function ψ j is associated. The solution to the SPDE is then a weighted sum of the basis functions and random variables β j : f (x) = M j=1 β j ψ j (x). The integral form of the SPDE then implies that for any function φ, M j=1 β j Dψ j , φ = , φ . We cannot, however, check this equation for infinitely many test functions φ, so instead we restrict to only testing with the functions that can be written in our chosen basis. As D is a linear operator, this is equivalent to solving the system of equations M j=1 β j Dψ j , ψ i = , ψ i for every i = 1, . . . , M . This system can be written as a matrix equation: Pβ = e where P has (i, j) th entry Dψ i , ψ j and e has j th entry , ψ j .
To summarise, the SPDE is written in an integral form, sometimes using inner products, since stochastic processes are well defined when integrated but not when differentiated.
Given this, the solution is represented in a chosen basis. The integral form is then solved by considering only test functions within that basis. This leads to a matrix equation involving the coefficients β, the matrix P, and the random vector e. The random vector e has known distribution, because it depends only on the basis functions and the white noise process: e has a multivariate Gaussian distribution with mean zero and a precision matrix Q e where Q −1 e has (i, j) th entry ψ i , ψ j . It follows from Pβ = e that β ∼ N (0, Q −1 ) where Q = P T Q e P.
The SPDE is therefore a way to specify a prior for β.
This provides an approximate solution to the SPDE. For example, given an SPDE, one can use the finite element method to compute Q and therefore simulateβ from a multivariate Gaussian distribution with precision Q. The functionf = M j=1β j ψ j would then be a realisation from a stochastic process which is a solution to the SPDE, a stochastic process with the covariance structure implied by D.

Matérn SPDE
The focus of Lindgren et al. (2011) where ν, κ, τ are parameters, K ν is the modified Bessel function of the second kind, and d is the dimension of the domain. It is difficult to fit models with this covariance structure due to the computational issues mentioned above. Lindgren et al. (2011) apply the finite element method to approximate stochastic processes with Matérn covariance (a comparison of the notation used in Section 2.2 and that used in Lindgren et al. (2011) is given in the appendix, Section 5). To do this, they present the differential operator that corresponds to this covariance function: When α = 2, this is called a fractional differential operator; for this paper, we consider only the case when α = 2 and so D is again a linear differential operator. In practice, α is poorly identified and difficult to estimate from data, so its value is often assumed to be fixed (Zhang, 2004). Lindgren et al. (2011) solve the SPDE κ 2 f − ∆f = /τ using the finite element method.
By deriving the weighting function and computing the covariance from this SPDE, Whittle (1954) shows that the solutions have Matérn covariance, as desired. In other words, the precision matrix computed from the finite element method is an approximation to the precision matrix one would obtain if you computed the variance-covariance matrix Σ with the Matérn covariance function and then, at great computational cost, inverted this matrix. 3 The basis-penalty approach 3.1 What is a basis-penalty smoother?
The basis-penalty approach refers to models where f is assumed to have the form f (x) = M j=1 β j ψ j for M basis functions ψ 1 , . . . , ψ M and parameters β = (β 1 , . . . , β M ). A model is then assumed for the observations given this form for f to provide a measure of fit, the loglikelihood l(β). Alternatively, the sum-of-squares can be used as a measure of fit. Optimising to obtainβ leads to a functionf that, given M is large enough, will interpolate the observed data: capturing the noise in the observations as well as the underlying signal (such overfitting stops us from making inference on the signal). A smoothing penalty, J(β, λ), is subtracted from the log-likelihood to penalize functions that are too wiggly. The smoothing parameter, λ, controls the extent of the penalization (a larger value of λ leads to a smootherf ).
The estimates for β are defined to be those that optimise the joint measure of fit and smoothness, the penalised likelihood: l p (β, λ) = l(β)−J(β, λ). This involves estimating both the optimal smoothing parameter λ and coefficients β. In practice, REstricted Maximum Likelihood (REML; Wood, 2011) is used to to do this.
There are several choices for the smoothing penalty. Most are defined using a differential operator D. For example, in one dimension, the smoothing penalty J(β, λ) = λ (∂ 2 f /∂x 2 ) 2 dx, i.e., where D is the second derivative, is often used. For this penalty, functions with rapidly changing gradients are penalised while functions with constant gradient, straight lines, have no penalty. In higher dimensions, the thin-plate spline (Wood, 2003) is often used with penalty: J(β, λ) = λ (∂ 2 f /∂x 2 ) 2 + 2 (∂ 2 f /∂x∂y) 2 + (∂ 2 f /∂y 2 ) 2 dxdy for two-dimensions. This penalty takes the total variation in the gradient of f including the interaction between the coordinates. The penalty for smoothing splines takes the form J(β, λ) = λ (Df ) 2 dx for some chosen differential operator D (see Yue and Speckman (2010) and Yue et al. (2014) who show this for the thin plate spline penalty). This can also be written as an inner product J(β, λ) = λ Df, Df .

13
When f (x) = M j=1 β j ψ j (x), the penalty based on the differential operator D can be written in matrix form: J(β, λ) = λβ Sβ where S is a M × M matrix with (i, j) th entry Dψ i , Dψ j .
In summary, a basis-penalty smoother is specified by selecting a basis, e.g., a B-spline basis of specified order, and a smoothing penalty. The parameters are then estimated by optimising the penalised likelihood: l p (β, λ) = l(β) − λβ Sβ.

Connection between SPDE and penalty
Rewriting the penalized log-likelihood as a likelihood we obtain exp{l p (β, λ)} = exp{l(β)} × exp(−λβ Sβ). A Bayesian interpretation of the penalised likelihood as proportional to a posterior implies that exp(−λβ Sβ) is an improper prior for β (Silverman, 1985;Wood, 2017). Since exp(−λβ Sβ) is proportional to a multivariate normal distribution with mean zero and precision matrix S λ = λS, the penalized likelihood is equivalent to assigning the prior β ∼ N (0, S −1 λ ). The connection between the SPDE approach and the basis-penalty approach can now be made clear. It can be shown that for a given differential operator D, the approximate precision matrix for the SPDE Df = is the same as the precision matrix S λ computed using the smoothing penalty Df, Df (appendix, Proposition 3). This connection has two implications. First, it means that the differences between the basis-penalty approach and the SPDE finite element approximation, when using the same basis and differential operator, are differences in implementation only, as both should lead to the same approximate precision matrix. Second, the connection means that any SPDE of the form Df = can be understood and interpreted as a smoothing penalty of the form Df, Df = {Df (x)} 2 dx, and vice-versa.

Matérn penalty
The SPDE specified in Lindgren et al. (2011) has the differential operator D = τ (κ 2 − ∆). Given the connection described above, this can be interpreted as a smoothing penalty: τ (κ 2 f − ∆f ) 2 dx. This penalty is different from those considered above because it contains two smoothing parameters: τ and κ. This offers it more flexibility. The penalty can still, however, be interpreted as such: it is a trade-off between the value of the function f and the second derivative ∆f in each direction. As κ is increased, the value of κ 2 f increases, meaning that ∆f can be higher, the function be less smooth, while keeping the penalty the same. Alternatively, κ can be described as the inverse correlation range: higher values of κ lead to less smooth functions meaning values of the function become less correlated. The smoothing parameter τ controls the overall smoothness of f .
The Matérn penalty can be written in matrix form as above, but for computational convenience, it is first broken into three parts: Df, Df = τ (κ 4 f, f + 2κ 2 ∇f, ∇f + ∆f, ∆f ).
Notice that it appears that the Laplacian ∆ has been replaced with the gradient operator ∇: this relationship holds here using Green's first identity and the Neumann boundary condition, see Bakka et al. (2018) for more detail. This leads to the smoothing matrix S = τ (κ 4 C + 2κ 2 G 1 + G 2 ) where C, G 1 , G 2 are all M × M matrices with (i, j) th entries ψ i , ψ j , ∇ψ i , ∇ψ j , and ∆ψ i , ∆ψ j , respectively. All of these matrices are sparse and so computation of the smoothing penalty, β Sβ, is computationally efficient. The matrix S is equal to the matrix Q = P Q e P computed using the finite element method (Appendix, Proposition 3).

Fitting the Matérn SPDE in mgcv
The mgcv R package allows the specification of new basis-penalty smoothers by writing new "smooth.construct" functions which build an appropriate design matrix (containing evaluations of the basis functions), penalty matrices and other optional components. Within this framework we can fit the SPDE model in mgcv providing a smooth.construct.spde.smooth.spec constructor. R-INLA provides helper functions to construct the required design and penalty matrices. Here we sketch an algorithm for setting-up SPDE models in mgcv.
Given we have a response {y i ; i = 1, . . . , n} and covariates in an n × n c matrix X c , we construct our model as follows.
3. We need to connect the basis representation of f to the observation locations. At present β contains the value of f at each mesh point, not at each observation location.
A matrix multiplication is used to project the values at all mesh points to the observations locations, it is called the projection matrix A (found using INLA::inla.spde.make.A).
The full design matrix X is then given by combining the fixed effects design matrix X c and the contribution for f , A.
4. Having constructed the design matrix and penalty matrices, use REML to find optimal κ, τ and β subject to the penalty matrix κ 4 C + 2κ 2 G 1 + G 2 . (Parametrisation for this model in mgcv is given in Supplementary Material section 4.) As REML is an empirical Bayes procedure, we expect point estimates forβ to coincide for the procedure outlined above and R-INLA. A uniform prior is implied for the smoothing parameters (τ or κ); R-INLA allows for similar estimation by just using the modes of the hyperparameters κ and τ (the int.strategy="eb" option). Proper priors could be used if step (4), above, was replaced by an MCMC scheme.

Examples
We now compare the SPDE and basis-penalty models applied to three example datasets.
We fitted the SPDE Matérn model in both R-INLA and mgcv. Code for these examples is available as supplementary material.     Plots of the mean of these differences and their standard deviations are shown in Figure   4. The mean plot shows structure to the differences in the models, though differences are relatively small (range of log chlorophyll A values in original data: 1. 905-19.275). This is to be expected if the models produce similar predicted values to each other, which are consistent through each realisation.

MODIS land surface temperatures
To compare the two approaches on a large data set, we now turn to land surface temperature data collected by the Terra instrument on the MODIS satellite. The data consist of a 500 × 300 grid of measurements in the latitude range 34.29519-37.06811 and longitude range -95.91153--91.28381 on August 4, 2016. The training data (105,569 observations) as defined in Heaton et al. (2018) were used to fit the model, but a significantly simpler mesh was used (see Supplementary Figure 5). Following from Heaton et al. (2018) we assumed a Gaussian response for temperature and fitted a 2-dimensional SPDE model on latitude and longitude.
As the dataset was quite large, we used the bam ("big additive model") function in mgcv to fit the SPDE model (additionally discretizing covariate values for efficient storage and computation; Wood et al., 2017). The SPDE fitted by R-INLA used the empirical Bayes integration strategy. We timed the fitting for both approaches (ignoring mesh setup, which was shared across methods), taking only the time for inla and bam to run. The mgcv model was slightly faster (4.71 minutes versus 5.23 in R-INLA running on a Windows server using 1 core of a Xeon Gold 6152 at 2.1GHz with 512Gb RAM). Supplementary Figure 6 compares the predictions from the two methods, the largest absolute difference between predictions is 4.761, which is small compared to the range of the data.

Discussion
We have drawn links between two approaches to fitting the same model: the stochastic partial differential equation method as implemented in R-INLA and the basis-penalty smoothing approach as fitted in mgcv. This paper aims to make accessible what is equivalent between the approaches, what is a matter of choice, and what is fundamentally different. Yue et al. (2014) show how splines can be specified using the SPDE approach, benefitting those familiar with SPDEs. Here, we do the opposite for the benefit of those familiar with the (penalized likelihood/empirical Bayes) GAM framework. Supplementary Figure 1

is a flow diagram
showing the parallels between the smoothness and correlation approaches we have discussed.
Similarities between many smoothing techniques can be drawn. Smoothing splines, kriging, Gaussian Markov random fields, and SPDEs approximate similar models, but their explanations make it difficult for practitioners to appreciate their commonalities and determine precisely what is a necessary and what is a coincidental association. Taking the precision matrix as the common currency between these methods, a modelling framework emerges: 1. Choose a covariance model: explicitly, as in kriging, through the smoothness penalty as with smoothing splines, or with an SPDE; 2. Approximate the precision matrix Q: reduce dimension (fixed rank kriging, thin plate splines) or induce sparsity in Q (B-splines, SPDE); 3. Draw approximate inference using a software implementation: e.g., with mgcv, MCMC (e.g., Stan (Carpenter et al., 2017); JAGS, (Plummer, 2017)), R-INLA (Rue et al., 2009), lme4 (Bates et al., 2015) or TMB (Kristensen et al., 2016). This paper is an example of comparing two methods according to this framework. Doing so for other smoothing methods will allow alternative modelling approaches to be compared on the grounds of genuine differences: in the covariance function, in the approximation for Q, in the estimation procedure, or, simply, in the software implementation.