# Finite area smoothing with generalized distance splines

## Abstract

Most conventional spatial smoothers smooth with respect to the Euclidean distance between observations, even though this distance may not be a meaningful measure of spatial proximity, especially when boundary features are present. When domains have complicated boundaries, leakage (the inappropriate linking of parts of the domain which are separated by physical barriers) can occur. To overcome this problem, we develop a method of smoothing with respect to generalized distances, such as within-domain distances. We obtain the generalized distances between our points and then use multidimensional scaling to find a configuration of our observations in a Euclidean space of 2 or more dimensions, such that the Euclidean distances between points in that space closely approximate the generalized distances between the points. Smoothing is performed over this new point configuration, using a conventional smoother. To mitigate the problems associated with smoothing in high dimensions we use a generalization of thin plate spline smoothers proposed by Duchon (Constructive theory of functions of several variables, pp 85–100, 1977). This general method for smoothing with respect to generalized distances improves on the performance of previous within-domain distance spatial smoothers, and often provides a more natural model than the soap film approach of Wood et al. (J R Stat Soc Ser B Stat Methodol 70(5):931–955, 2008). The smoothers are of the linear basis with quadratic penalty type, and so are easily incorporated into a range of statistical models.

### Keywords

Finite area smoothing, Generalized additive model, Multidimensional scaling, Spatial modelling, Splines

## 1 Introduction

When smoothing over a bounded geographical domain (*finite area smoothing*), problems can occur when the smoother does not respect the boundary shape appropriately, especially when the shape of the boundary is complex. This complexity may manifest itself as some peninsula-like feature(s) in the domain with notably different observation values on either side of the feature. Features such as peninsulae give rise to a phenomenon known as *leakage*. The top two panels of Fig. 1 show an example of leakage (taken from Wood et al. 2008) where the high values in the upper half of the domain (top panel) leak across the gap to the lower values below and vice versa (second panel). The phenomenon is problematic since it causes the fitted surface to be mis-estimated; this can then lead to incorrect inference, in particular the mis-identification of “hot spots” in biological populations.

The problem of leakage arises because spatial smoothers consider proximate data to be similar, but in almost all cases distance between data locations is measured using straight line (Euclidean) distance. This approach is flawed in cases in which Euclidean distance is not a meaningful measure of proximity. For example, since whales do not travel on land, the meaningful distance between sightings of two whales on either side of the Antarctic peninsula is not the straight line distance across the peninsula, but the shortest path between them that stays entirely in open water. This issue is ubiquitous in spatial ecology. Natural and man-made barriers carve up the landscape (and seascape), partitioning biological populations; spatial models should take this into account.

In this article we propose a general method for smoothing, based on generalized distances between points. We apply this to produce a finite area smoother, based on the *within-area distances* between points in the domain of interest. The general approach uses multidimensional scaling (MDS; e.g. Chatfield and Collins 1980, Chapter 10) to associate a location in a \(\fancyscript{D}\)-dimensional Euclidean space (*p-space*) with each original data point. The Euclidean distances between points in p-space then approximate the original generalized distances between the points. Smoothing is then performed with respect to locations in p-space. Reasonable approximation of the generalized distances by the Euclidean distances in p-space can require \(\fancyscript{D}\) to be greater than the 2–4 dimensions in which conventional multidimensional smoothers work well. For this reason we revisit the general class of smoothers proposed in Duchon (1977), selecting a smoother that behaves well with increasing dimension. Note that when applied to the finite area problem our generalized distance smoother can be viewed as an extension of Wang and Ranalli (2007), albeit somewhat better founded (as we argue below).

The use of multidimensional scaling in spatial statistics is not new, especially in the kriging literature. For example, Sampson and Guttorp (1992) model spatial covariance functions by computing a distance measure based on the observed spatial covariances, then project the points using MDS before kriging (points out of sample are found using a thin plate spline). Their approach differs from ours in a number of ways: (i) they require multiple observations at each location in order to calculate the covariances, (ii) only projections in 2 dimensions are considered and (iii) non-metric MDS is used. Further applications in geostatistics are described in Sect. 5 and compared to the method proposed here.

The smoother proposed here has the attractive property of being representable using a linear basis expansion with an associated quadratic penalty. Such basis-penalty smoothers have a dual interpretation as Gaussian random fields (Rue and Held 2005), and are appealing because of the ease with which they can be incorporated as components of other models. Such components include, for example, varying-coefficient models, random/mixed effects models and signal regression models, as well as the focus for this article: generalized additive models (see e.g. Ruppert et al. 2003; Wood 2006, for overviews). This flexibility is vital in ecological applications, where a spatial smooth is usually only one part of a much larger model.

Before presenting our proposed method in detail we now briefly review spline type spatial smoothers, and previous approaches to the finite area smoothing problem in the additive model literature.

### 1.1 Spline smoothing for spatial data

Within this framework it is straightforward to allow \(\eta _i\) to depend on multiple smooth functions of various predictor variables, as well as on conventional parametric terms that are linear in any unknown parameters (Hastie and Tibshirani 1990). Such models are widely used in quantitative ecology, for example in the creation of density maps which can then be integrated over the domain to obtain an abundance estimate (e.g., Williams et al. 2011; Miller et al. 2013) or as part of a larger model, taking into account nuisance spatial effects (e.g., Augustin et al. 2009).

## 2 Previous approaches to the problem of leakage

There are three main types of existing approaches for dealing with the finite area smoothing problem.

### 2.1 Partial differential equation methods

Wood et al. (2008) use the physical analogy of a soap film to motivate an alternative which can be represented as a basis-penalty smoother, and has better boundary behaviour. First consider the domain boundary to be made of wire, then dip this wire into a bucket of soapy water; a soap film with the same shape as the boundary will then have formed. If the wire lies in the spatial plane, the height of the soap film at a given point is the value of the smooth at that point. This film is then distorted smoothly toward each datum, while minimising the overall surface tension in the film. Mathematically the soap film consists of two sets of basis functions, one that is based entirely inside the domain (a set of interior knot locations is specified) and one that is induced by the (known or estimated) boundary values. These functions are found by solving Poisson's and Laplace’s equations in two dimensions. The penalty associated with the former set is again (2).

The soap film approach has the basis-penalty form that is convenient for applied work and solves the boundary leakage problem, but basis setup is quite computationally expensive, and for many applications the approach is less natural than smoothing using within domain distances. A further problem with the soap film approach is that no distinction exists between ‘open’ boundaries (for example a boundary that is simply the limit of the region surveyed) and ‘hard’ boundaries (real physical barriers).

### 2.2 Within-area distances

Wang and Ranalli (2007) propose to replace straight-line distances with ‘geodesic’ distances in a smoother that is a sort of approximate thin plate spline (Geodesic Low-rank Thin Plate Splines, GLTPS). To calculate the geodesic distances, a graph is constructed in which each vertex is the location of an observation and is connected only to its \(k\) nearest neighbours. The within-area distance between each pair of vertices is approximated using Floyd’s algorithm (Floyd 1962) to find the shortest path through the graph. This algorithm is cubic in the number of data, making the approach costly for large datasets. At large sample sizes the geodesic distances will tend towards the ‘within-area distance’, i.e. the length of the shortest path between two points that lies entirely within the domain of interest (Bernstein et al. 2000).
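The graph construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the GLTPS implementation: the function name `knn_graph_distances`, the undirected treatment of neighbour edges and the toy U-shaped point set are all choices made here for illustration.

```python
import math

def knn_graph_distances(points, k):
    """All-pairs shortest-path lengths through a k-nearest-neighbour graph."""
    n = len(points)
    euclid = [[math.dist(p, q) for q in points] for p in points]
    # Start with no edges: infinite distance everywhere except the diagonal.
    dist = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        # Indices of the k nearest neighbours of point i (excluding i itself).
        others = [j for j in range(n) if j != i]
        neighbours = sorted(others, key=lambda j: euclid[i][j])[:k]
        for j in neighbours:
            # Assumption: edges are undirected, so store both directions.
            dist[i][j] = dist[j][i] = euclid[i][j]
    # Floyd's algorithm: allow each vertex in turn as an intermediate stop.
    for via in range(n):
        for i in range(n):
            for j in range(n):
                through = dist[i][via] + dist[via][j]
                if through < dist[i][j]:
                    dist[i][j] = through
    return dist

# Toy U-shaped configuration: two arms joined along the top.
u_shape = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (3, 2), (3, 1), (3, 0)]
graph_dist = knn_graph_distances(u_shape, k=2)
tip_to_tip_graph = graph_dist[0][7]                   # path follows the U
tip_to_tip_euclid = math.dist(u_shape[0], u_shape[7])  # straight across the gap
```

The triple loop makes the cubic cost visible. In this toy example the graph distance between the two tips is more than twice the straight-line distance, which is the separation that a within-area smoother relies on.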

Wang and Ranalli use their geodesic distances in place of the usual Euclidean distances in the radial basis functions used to define a thin plate spline. They leave the basis for the nullspace of the thin plate spline penalty (i.e. those functions which are not penalised; in the case of 2-dimensional smoothing, linear functions of the coordinates and the plane) unchanged, so some linkage across boundary features remains in the smoother (since nullspace functions are defined over \(\mathbb {R}^2\) in the 2-dimensional case). The principal difficulty in interpreting the results of their method is that it is unclear what their penalty term penalizes. The interpretational difficulty arises because Wang and Ranalli’s expressions (3) and (9) involve the square roots of matrices that are not positive semi-definite. In the case of their expression (3), which relates to a thin plate spline, this problem would be rectifiable if the spline coefficients had the usual thin plate spline linear constraints applied in order to force positive definiteness on the spline penalty. However, in the case of (9), which defines their geodesic splines, there appears to be no sensible way to obtain positive semi-definiteness. This is a problem because matrix square roots in general only exist for positive semi-definite matrices plus some rather special cases not useful here (see e.g., Higham 1987). It appears that for computational purposes Wang and Ranalli have used the generalization of a matrix square root given in appendix A.2.11 of Ruppert et al. (2003), but this square root lacks the basic properties that would allow Wang and Ranalli’s (2) to be interpretable exactly as a (reparameterized) thin plate spline, or for it to be possible to work out what the penalty on their geodesic spline is actually penalizing.

The Complex Region Spatial Smoother (CReSS) of Scott-Hayward et al. (2013) adapts GLTPS in two ways. First, in building the graph, an edge is only drawn between two points if the straight line between them lies entirely within the boundary (boundary vertices are also included, in addition to observations). Second, a set of local radial basis functions is used (with a tuneable parameter controlling the locality of the basis). An AICc-weighted average over a series of models with different basis sizes, knot locations and locality of the basis functions is used for prediction. Unlike Wang and Ranalli (2007), the nullspace of the basis is removed.

The combination of an unmodified nullspace, an opaque penalty and the \(O(n^3)\) computational cost of distance calculation is of some concern for practical work. In the case of CReSS, the necessity of running many models also creates a substantial computational burden. For both interpretive and computational reasons it seems worthwhile to investigate alternative ways of using the within-area distance idea that avoid these difficulties.

### 2.3 Domain warping

Paul Eilers (in a seminar at University of Munich in 2006) suggested conformally mapping the smoothing domain to a convex one via the Schwarz-Christoffel transformation (Driscoll and Trefethen 2002). The idea is that smoothing can then be conducted on the convex domain, without leakage problems. The first author has extensively investigated such an approach (Miller 2012, Chapter 3). Although it is possible to warp the boundary of the region into a shape such as a rectangle, the resulting distortions in the positions of observations inside the region lead to observations with vastly differing response values being “squashed” together, while other areas contain no observations. These distortions in observation density make smoothing more difficult and cause artefacts that are significantly more problematic than the leakage effects that the method seeks to avoid.

The methods proposed in the next section can be viewed as an attempt to put within-area distance methods on a more interpretable foundation by using an extension of the notion of domain warping.

## 3 The generalized distance smoothing model

*Duchon spline* (Duchon 1977), a generalization of the familiar thin plate spline.

The key idea here is that we smooth over a Euclidean space in which the Euclidean inter-observation distances are approximately equal to the original generalized distances. That is, \(\Vert \mathbf{x}(\mathbf{d}_i) - \mathbf{x}(\mathbf{d}_j) \Vert \approx d_{ij}\), where \(d_{ij}\) is the generalized distance between points \(i\) and \(j\) (\(\Vert \cdot \Vert \) is the Euclidean norm). The choice of \(\fancyscript{D}\) determines the accuracy of the distance approximation. This can either be part of model specification, in which case \(\fancyscript{D}\) is chosen to achieve some specified level of approximation accuracy, or, more pragmatically, can be chosen to optimize estimated prediction error (e.g. GCV score).

In the case of finite area smoothing, the elements of \(\mathbf{d}_i\) are ‘within-area’ distances between points, that is to say the length of the shortest path between two points, such that the path lies entirely within the domain of interest. We will refer to the original 2-dimensional data co-ordinates as being elements of the ‘o-space’, while \(\fancyscript{D}\)-dimensional co-ordinates in the MDS projection will be referred to as elements of the ‘p-space’. Web Appendix B gives an algorithm for calculating within-area distances for simple polygons.
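To make the notion of within-area distance concrete, the sketch below approximates it on a rasterised domain using Dijkstra's algorithm. This is explicitly not the polygon algorithm of Web Appendix B, which is not reproduced here; it is a cruder grid-based stand-in, and the function name, the 8-neighbour move set and the toy domain are all assumptions made for illustration.

```python
import heapq
import math

def grid_within_area_distance(inside, start, goal):
    """Approximate within-area distance on a boolean grid (Dijkstra).

    inside[r][c] is True for cells inside the domain. Moves go to the eight
    neighbouring cells, so paths are limited to 45-degree steps and the
    result is an upper bound on the true shortest path between cell centres.
    """
    rows, cols = len(inside), len(inside[0])
    steps = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        d, (r, c) = heapq.heappop(queue)
        if (r, c) == goal:
            return d
        if d > best[(r, c)]:
            continue  # stale queue entry
        for dr, dc in steps:
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols and inside[nr][nc]):
                continue
            # Forbid diagonal moves that would clip the corner of a barrier cell.
            if dr and dc and not (inside[r][nc] and inside[nr][c]):
                continue
            nd = d + math.hypot(dr, dc)
            if nd < best.get((nr, nc), math.inf):
                best[(nr, nc)] = nd
                heapq.heappush(queue, (nd, (nr, nc)))
    return math.inf  # goal cannot be reached without leaving the domain

# Toy domain: a barrier (False cells) juts down from the top edge.
domain = [
    [True, True, False, True, True],
    [True, True, False, True, True],
    [True, True, False, True, True],
    [True, True, True,  True, True],
]
within_area = grid_within_area_distance(domain, (0, 0), (0, 4))
straight_line = math.hypot(0 - 0, 4 - 0)
```

Here the two corner cells are 4 units apart in a straight line, but the shortest path that stays inside the domain has to go around the tip of the barrier and is roughly 8.8 units long.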

These smoothers will be henceforth referred to as MDSDS (Multi-Dimensionally Scaled Duchon Splines), and the next three subsections provide the details for the MDS, smoothing and \(\fancyscript{D}\) selection steps.

### 3.1 MDS as a transformation of space

In this section we consider the construction of the mapping \(\mathbf{x}(\mathbf{d})\) by multidimensional scaling (MDS; Gower 1968). We start the process by choosing a representative set of locations of size \(n_s\) within the domain of interest (i.e. in o-space). This set might be all the locations at which we have observations, but in the case of finite area smoothing we would usually choose a set of locations spread uniformly over the region of interest in order to ensure that all the important geographic features in o-space will be represented in p-space.

**1** is a vector of \(1\)s) we can obtain the double centred version of \(\mathbf{D}\), \(\mathbf{S} = - \mathbf{HDH}/2\), which is then eigen-decomposed

So, MDS combined with Gower’s interpolation formula provides a means for constructing and computing with \(\mathbf{x}(\mathbf{d})\). We now turn to the construction of a suitable smoother in p-space.
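The two steps of this subsection, classical MDS followed by insertion of a new point, can be sketched with NumPy. This is an illustrative sketch rather than the code in the `msg` package. It assumes that the matrix passed in holds unsquared distances (they are squared before double centring), that the retained eigenvalues are positive, and it uses the standard form of Gower's (1968) insertion formula; the function names `classical_mds` and `gower_insert` are invented here.

```python
import numpy as np

def classical_mds(dist, dim):
    """Embed n points in `dim` dimensions from an n x n distance matrix."""
    n = dist.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    S = -H @ (dist ** 2) @ H / 2               # double-centred squared distances
    eigvals, eigvecs = np.linalg.eigh(S)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dim]    # indices of the `dim` largest
    lam, U = eigvals[order], eigvecs[:, order]
    X = U * np.sqrt(lam)                       # rows are p-space coordinates
    return X, lam

def gower_insert(X, lam, dist_new):
    """Place a new point in p-space from its distances to the original points."""
    b = np.sum(X ** 2, axis=1)                 # squared norms of the rows of X
    return 0.5 * (X.T @ (b - dist_new ** 2)) / lam

# Check on a case where the answer is known: Euclidean distances in the plane.
rng = np.random.default_rng(1)
pts = rng.uniform(size=(6, 2))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
X, lam = classical_mds(D, dim=2)
D_hat = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

new = np.array([0.3, 0.7])
x_new = gower_insert(X, lam, np.linalg.norm(pts - new, axis=1))
```

When the input distances are already Euclidean, the embedding reproduces them exactly (up to rotation, reflection and translation of the configuration) and the inserted point sits at the correct distances from the others. With within-area distances the reproduction is only approximate, which is why the projection dimension matters.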

### 3.2 Smoothing with Duchon splines

To combat this problem we use a more general version of the thin plate spline from the larger class of functions considered in Duchon (1977), which will allow us to obtain a smoother for which \(M = {\fancyscript{D}}+1\). Duchon (1977) is somewhat inaccessible, and this larger class has been almost entirely ignored in the statistical literature, so we provide a brief summary here.

Although MDS only produces a unique configuration up to rotation and translation, this is not problematic for smoothing with Duchon splines, as they are isotropic smoothers and hence rotation invariant.

We are now in a position to produce a spline suitable for smoothing in p-space. Specifically we choose a (reduced rank) Duchon spline with \(m=2\) and \(s = {\fancyscript{D}}/2 - 1\), which will give us a smooth \(f\) for which \(M={\fancyscript{D}}+1\) (i.e. the unpenalized component of \(f\) grows only linearly with \(\fancyscript{D}\)). We choose \(s = {\fancyscript{D}}/2 - 1\) for \(m=2\) since this is the minimum \(s\) we can use. Using a higher \(s\) will increase the weighting on the higher frequency components of the smooth in the penalty, reducing flexibility. Since our aim here is simply to minimise the effect of the nullspace, we simply choose \(s\) as small as possible to avoid any other effects.
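The practical benefit of fixing \(m=2\) can be seen by counting unpenalized functions. The sketch below relies on two standard thin plate spline facts that are not quoted in this extract and are therefore stated as assumptions: the nullspace consists of polynomials of total degree less than \(m\), of which there are \(\binom{m+d-1}{d}\) in \(d\) variables, and a conventional thin plate spline needs \(2m > d\). The function names are hypothetical.

```python
from math import comb

def polynomial_nullspace_dim(m, d):
    """Number of polynomial terms of total degree < m in d variables."""
    return comb(m + d - 1, d)

def thin_plate_nullspace_dim(d):
    """Nullspace size for a thin plate spline using the smallest m with 2m > d."""
    m = d // 2 + 1
    return polynomial_nullspace_dim(m, d)

def duchon_nullspace_dim(d):
    """Nullspace size with m = 2 (paired with s = d/2 - 1): a constant plus d linear terms."""
    return polynomial_nullspace_dim(2, d)

# (dimension, thin plate nullspace size, Duchon nullspace size)
table = [(d, thin_plate_nullspace_dim(d), duchon_nullspace_dim(d)) for d in (2, 4, 6, 10)]
```

In two dimensions the two counts coincide, but under these assumptions a ten-dimensional thin plate spline would carry thousands of unpenalized functions, against eleven for the Duchon spline with \(m=2\).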

### 3.3 Selecting \(\fancyscript{D}\)

Having accepted the need for \({\fancyscript{D}}>2\), we need some means of choosing \(\fancyscript{D}\). Rather than setting a maximum difference between the distances in \(\mathbf {D}\) and the distances in the projection, we choose \(\fancyscript{D}\) in order to minimize GCV. Selecting \(\fancyscript{D}\) is typically a small part of the computational burden, since the MDS and smoothing are cheap relative to the computation of distances (at least in the finite area smoothing case). Figure 6 shows the relationship between \(\fancyscript{D}\) and GCV score for the Aral sea data analysed below.
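The selection rule amounts to fitting the model once per candidate dimension and keeping the fit with the lowest score. The sketch below uses the usual Gaussian-case GCV score \(n\,\mathrm{RSS}/(n-\mathrm{edf})^2\); the helper names and the numbers in the example are hypothetical placeholders, not results from the paper, and a real analysis would take the residual sum of squares (or deviance) and effective degrees of freedom from the fitted models.

```python
def gcv_score(n, rss, edf):
    """Generalized cross-validation score for a linear smoother (Gaussian case)."""
    return n * rss / (n - edf) ** 2

def select_dimension(n, fits):
    """Return the projection dimension with the smallest GCV score.

    `fits` maps each candidate dimension to a pair
    (residual sum of squares, effective degrees of freedom).
    """
    scores = {dim: gcv_score(n, rss, edf) for dim, (rss, edf) in fits.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical summaries of three fits to n = 100 observations.
example_fits = {2: (40.0, 20.0), 3: (30.0, 25.0), 4: (29.5, 35.0)}
best_dim, gcv_by_dim = select_dimension(100, example_fits)
```

In this made-up example the four-dimensional fit has the smallest residual sum of squares but pays for it in degrees of freedom, so the three-dimensional projection is selected.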

## 4 Examples

To illustrate the utility of the model two simulation studies are shown, followed by examples using real data. All concentrate on the finite area smoothing problem. In each case MDSDS was compared with thin plate splines as described in Wood (2003) (which do not account for the boundary), geodesic low-rank thin plate splines (GLTPS) and the soap film smoother (which both do account for the boundary). The GLTPS model was as described in Wang and Ranalli (2007), but with the within-area distances calculated as described in Web Appendix B (i.e. the same as for MDSDS); knots were placed using the cover.design method in the package fields (again, as in Wang and Ranalli 2007). In all cases smoothing parameters were selected by GCV. The R packages mgcv (available from CRAN) and msg (available from https://github.com/dill/msg) were used to fit the models. Code for fitting the GLTPS is available at https://github.com/dill/gltps.

In all the cases below the basis size specified refers to the maximum basis size allowed. Since the penalty will reduce the complexity of the smoother, we simply need to specify an upper bound on the basis size.

### 4.1 Ramsay’s horseshoe

The horseshoe shape shown in the top panel of Fig. 1 is an obvious benchmark for techniques that aim to combat leakage. Although perhaps unrealistic (and bordering on pathological), any new method that works well on the horseshoe should have a good chance of working well in more realistic situations. A simulation experiment was run with the same setup as in Wood et al. (2008): 200 replicates were generated at each of three error levels (standard normal noise multiplied by 0.1, 1 and 10) with sample size 600. A thin plate regression spline with basis size 100 was fitted, along with a soap film smoother with 32 interior knots and a 40-knot cyclic spline to estimate the boundary. For the MDSDS model, the basis size was set to 100 and a 20 by 20 initial grid was used for the MDS projection (see Web Appendix A); the MDS projection dimension was selected by GCV in the range from 2 to the number of dimensions that explained 95% of the variation in the distance matrix of the initial grid. For the GLTPS, 40 knot locations were selected as in Wang and Ranalli (2007). For each realisation the mean squared error (MSE) between the true function and the fitted surface was calculated over a prediction grid of 720 points.
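The upper end of the search range for the projection dimension can be computed from the MDS eigenvalues. The exact definition of "variation explained" used in the study is not given in this extract, so the sketch below assumes one common reading: the cumulative share of the positive eigenvalues of the double-centred distance matrix. The function name and the example eigenvalues are hypothetical.

```python
def dims_for_variation(eigenvalues, target=0.95):
    """Smallest number of leading eigenvalues whose share of the positive total reaches `target`."""
    positive = sorted((v for v in eigenvalues if v > 0), reverse=True)
    total = sum(positive)
    running = 0.0
    for count, value in enumerate(positive, start=1):
        running += value
        if running >= target * total:
            return count
    return len(positive)

# Hypothetical spectrum; non-Euclidean distances can give negative eigenvalues.
example_eigenvalues = [50.0, 30.0, 12.0, 5.0, 2.0, 1.0, -0.5]
max_dim = dims_for_variation(example_eigenvalues)
candidate_dims = list(range(2, max_dim + 1))
```

Negative eigenvalues, which arise when the distances are not exactly Euclidean, are ignored in this reading; other conventions (for example using absolute values) would give a different bound.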

### 4.2 Peninsulae domain

The results from the modified Ramsay horseshoe are encouraging. However, the domain is not particularly realistic. To further explore the performance of MDSDS, a more realistic domain was used. The domain, which attempts to mimic a coastline, is shown in the left panel of Fig. 3.

Simulations were run at signal-to-noise ratios of 0.50, 0.75 and 0.95 (equating to adding standard normal noise multiplied by 0.35, 0.9 and 1.55, respectively). The soap film smoother used 109 internal knots and 60 knots for the cyclic boundary smooth. The MDSDS models used an initial grid of 120 by 126 points and a basis size of 140. The thin plate regression spline basis size was also 140. For the GLTPS, 80 knots were selected using the space filling design.

### 4.3 Aral sea

The Aral sea is located between Kazakhstan and Uzbekistan and has been steadily shrinking since the 1960s when the Soviet government diverted the sea’s two tributaries in order to irrigate the surrounding desert. The NASA SeaWiFS satellite collected data on chlorophyll levels in the Aral sea over a series of 8 day observation periods from 1998 to 2002 (Wood et al. 2008). The 496 data are averages of the \(38^\text {th}\) observation period. Smooths were fitted to the spatial coordinates (Northings and Eastings; kilometres from a specified latitude and longitude) with the logarithm of chlorophyll concentration (modelled with a Gamma distribution) as the response.

## 5 Discussion

Our MDSDS approach appears to have competitive performance compared to existing methods, while providing a number of possible advantages. Relative to the soap film smoother the method has a more natural handling of open and closed boundaries, and is also often the more natural model when the linkage between geographic areas is via movement of organisms. Relative to Wang and Ranalli (2007) our approach is somewhat more transparent in terms of what is being penalized when smoothing, and also uses a nullspace basis that avoids leakage, unlike the Wang and Ranalli method for which the nullspace does not respect boundary features. As mentioned above, MDSDS fits easily into a generalized additive model, which may have many more components, which are often necessary for ecological work.

As mentioned above, using MDS to build covariance functions for kriging has been investigated previously. When non-Euclidean distances are used, covariance functions may no longer be positive definite or conditionally negative definite (Curriero 2005), so MDS can be used to project the data, creating a set of Euclidean distances. For example, Løland and Høst (2003) used river network distances in the construction of a variogram, and overcame the problem of lack of positive definiteness by using MDS and then constructing the variogram in MDS space. Projection dimension selection is partially addressed in Jensen et al. (2006): the authors suggest using the proportion of variation explained or the Bayesian criterion of Oh and Raftery (2001) as possible metrics but do not fully explore the issue, resorting to 2-dimensional projections. The use of Duchon splines in MDSDS allows for high-dimensional projections, thus allowing for more accurate approximation of the distance matrix whilst ensuring that the points maintain ordering (this second point has not been addressed in the kriging literature to our knowledge).

Further work would involve considering more biologically motivated measures of distance, for example distances based on the minimum energetic cost of moving between two locations. It is also of interest to investigate the use of MDSDS for smoothing with non-geographic distances outside of ecology, such as the socio-economic similarity of parliamentary constituencies, or measures of genetic relatedness.

## Notes

### Acknowledgments

We are especially grateful to Jean Duchon for generous help in understanding Duchon (1977). David wishes to thank EPSRC for financial support during his PhD at the University of Bath.

### References

- Augustin N, Musio M, von Wilpert K, Kublin E, Wood SN, Schumacher M (2009) Modeling spatiotemporal forest health monitoring data. J Am Stat Assoc 104(487):899–911
- Bernstein M, De Silva V, Langford J, Tenenbaum J (2000) Graph approximations to geodesics on embedded manifolds. Technical report, Department of Psychology, Stanford University. ftp://ftp-sop.inria.fr/prisme/boissonnat/ImageManifolds/isomap.pdf
- Chatfield C, Collins AJ (1980) Introduction to multivariate analysis. Science paperbacks, Chapman and Hall
- Curriero F (2005) On the use of non-euclidean isotropy in geostatistics. Technical report 94, Johns Hopkins University, Department of Biostatistics. http://www.bepress.com/cgi/viewcontent.cgi?article=1094&context=jhubiostat
- Driscoll TA, Trefethen L (2002) Schwarz-Christoffel mapping. Cambridge University Press, Cambridge
- Duchon J (1977) Splines minimizing rotation-invariant semi-norms in Sobolev spaces. Constructive theory of functions of several variables, pp 85–100
- Floyd RW (1962) Algorithm 97: shortest path. Commun ACM 5(6):345
- Gower J (1968) Adding a point to vector diagrams in multivariate analysis. Biometrika 55(3):582
- Hastie TJ, Tibshirani RJ (1990) Generalized additive models. Monographs on statistics and applied probability. Taylor & Francis, New York
- Higham NJ (1987) Computing real square roots of a real matrix. Linear Algebra Appl 88–89:405–430. doi:10.1016/0024-3795(87)90118-2
- Jensen OP, Christman MC, Miller TJ (2006) Landscape-based geostatistics: a case study of the distribution of blue crab in Chesapeake Bay. Environmetrics 17(6):605–621. doi:10.1002/env.767
- Løland A, Høst G (2003) Spatial covariance modelling in a complex coastal domain by multidimensional scaling. Environmetrics 14(3):307–321. doi:10.1002/env.588
- Miller DL (2012) On smooth models for complex domains and distances. PhD thesis, University of Bath
- Miller DL, Burt ML, Rexstad EA (2013) Spatial models for distance sampling data: recent developments and future directions. Methods in Ecology and Evolution
- Oh MS, Raftery AE (2001) Bayesian multidimensional scaling and. J Am Stat Assoc 96(455):1031
- Ramsay T (2002) Spline smoothing over difficult regions. J R Stat Soc Ser B Stat Methodol 64(2):307–319
- Rue H, Held L (2005) Gaussian Markov random fields: theory and applications. Monographs on statistics and applied probability. Taylor & Francis, New York
- Ruppert D, Wand M, Carroll RJ (2003) Semiparametric regression. Cambridge series on statistical and probabilistic mathematics. Cambridge University Press, Cambridge
- Sampson PD, Guttorp P (1992) Nonparametric estimation of nonstationary spatial covariance structure. J Am Stat Assoc 87(417):108–119
- Scott-Hayward LAS, MacKenzie ML, Donovan CR, Walker CG, Ashe E (2013) Complex region spatial smoother (CReSS). J Comput Graph Stat. doi:10.1080/10618600.2012.762920
- Vretblad A (2003) Fourier analysis and its applications. Graduate texts in mathematics. Springer, Berlin
- Wang H, Ranalli M (2007) Low-rank smoothing splines on complicated domains. Biometrics 63(1):209–217
- Williams R, Hedley SL, Branch TA, Bravington MV, Zerbini AN, Findlay KP (2011) Chilean blue whales as a case study to illustrate methods to estimate abundance and evaluate conservation status of rare species. Conserv Biol 25(3):526–535. doi:10.1111/j.1523-1739.2011.01656.x
- Wood SN (2003) Thin plate regression splines. J R Stat Soc Ser B Stat Methodol 65(1):95–114
- Wood SN (2006) Generalized additive models: an introduction with R. Chapman & Hall/CRC, London
- Wood SN (2011) Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J R Stat Soc Ser B Stat Methodol 73(1):3–36
- Wood SN, Bravington MV, Hedley SL (2008) Soap film smoothing. J R Stat Soc Ser B Stat Methodol 70(5):931–955