Abstract
Given the common problem of missing data in realworld applications from various fields, such as remote sensing, ecology and meteorology, the interpolation of missing spatial and spatiotemporal data can be of tremendous value. Existing methods for spatial interpolation, most notably Gaussian processes and spatial autoregressive models, tend to suffer from (a) a tradeoff between modelling local or global spatial interaction, (b) the assumption there is only one possible path between two points, and (c) the assumption of homogeneity of intermediate locations between points. Addressing these issues, we propose a value propagationbased spatial interpolation method called VPint, inspired by Markov reward processes (MRPs), and introduce two variants thereof: (i) a static discount (SDMRP) and (ii) a datadriven weight prediction (WPMRP) variant. Both these interpolation variants operate locally, while implicitly accounting for global spatial relationships in the entire system through recursion. We evaluated our proposed methods by comparing the mean absolute error, root mean squared error, peak signaltonoise ratio and structural similarity of interpolated grid cells to those of 8 common baselines. Our analysis involved detailed experiments on a synthetic and two realworld datasets, as well as experiments on convergence and scalability. Empirical results demonstrate the competitive advantage of VPint on randomly missing data, where it performed better than baselines in terms of mean absolute error and structural similarity, as well as spatially clustered missing data, where it performed best on 2 out of 3 datasets.
1 Introduction
Under perfect lab conditions, a data scientist can train models, infer variables of interest and discover new knowledge from neatly organised, consistent and complete datasets. However, in realworld scenarios, one is rarely so lucky. Whether it is random measurement noise, inconsistent annotation, missing data or another problem, realworld data can be messy, and tricky to process in such a way that downstream models and processes can use it effectively.
In this work, we aim to address the problem of missing data in the specific case of spatial gridded data by proposing a computational method for spatial interpolation. Prominent examples of such missing data in realworld scenarios include satellite imagery in remote sensing data (due to orbits, swaths and cloud clover Carrasco et al. 2019), mapping of ecological field measurements and samples collected at a limited set of locations (Fang et al. 2019), and local precipitation forecasting from meteorological measuring stations covering a limited set of locations (Tabios III and Salas 1985). As such, spatial interpolation is a problem highly relevant to many fields, and a large body of literature is dedicated to it in statistical domains (Heaton et al. 2019; Jiang 2018; Montero et al. 2015). Data in spatial settings is particularly susceptible to missing values, due to, among other reasons, (i) limited and/or variable spatial and temporal resolutions, (ii) limited availability of measuring locations, (iii) measurements being acquired at different times and different locations, and (iv) the characteristics of the locations in question (e.g., cloud cover or inaccessible areas). As a simplified example, consider the task of mapping the temperature at a certain time throughout the Himalayas. Since resources are limited and parts of the terrain are inaccessible, it is infeasible to collect measurements at every 100\(m^2\). This gives rise to the problem of filling in the entire grid based on measurements from a limited number of locations. In this case, we could also use additional information on the elevation of the terrain to help inform our decisions – a location with a higher elevation than a reference value will likely have a lower temperature, and vice versa.
Spatial interpolation methods, such as Kriging (also known as Gaussian processes) (Jiang 2018; Cressie 2015), tend to be founded on an assumption of autocorrelation, meaning that values are more strongly correlated with one another the closer their spatial proximity is. Our method is no exception in this regard. However, existing methods can be categorised into local methods and distancebased methods. Local methods, such as spatial autoregressive models (Anselin 1988; Haining 1978) or convolutional neural networks (Dong et al. 2015; Shi et al. 2016), rely on adding the information of a strictly defined local neighbourhood around a target cell to enhance their predictions. The downside of these methods is that potentially valuable information outside the predefined neighbourhood is disregarded. Moreover, if local information is not available, local methods may require imputation methods to perform their estimations. Distancebased methods, on the other hand, most notably including various Gaussian processbased approaches (Jiang 2018; Cressie 2015), can use any measurement available, but rely on a distancebased weighting to use this information for their predictions. The downside of these methods is that, in most spatial settings, paths cannot be assumed to be homogeneous, and thus distance alone may not be sufficient to reliably predict values. For example, in the case of temperature measurements in the Himalayas, the difference in elevation between pairs of locations may vary despite the distance being the same. This problem is further exacerbated by the twodimensionality of spatial problems, allowing for the existence of multiple paths between any two locations, some of which may be more important than others for the propagation of values (for example, a longer path around a mountain as opposed to a shorter path over it).
In this work, we propose a method that incorporates a systemoriented perspective, illustrated in Fig. 1. In this perspective, we use a local neighbourhood to perform estimations, but we rely on recursion to propagate known values through direct neighbours over a network of (mostly indirectly) mutually interacting cells, iteratively updated until an equilibrium is reached. At every recursive call, a weight is applied to the values being propagated to represent spatial autocorrelation. This weight can furthermore be assigned dynamically in a datadriven manner, based on the features of the underlying spatial configuration. This allows for higher autocorrelation weights between, for example, two neighbouring blocks of a city, and lower weights between an industrious port and the open sea. The update rules for every cell were based on the Bellman equation for Markov reward processes (Bellman 1957), canonically used to estimate the value of a particular state (cell). With this perspective, we can address both the limitations of local methods and distancebased methods.
Our main contributions in this work are as follows:

We propose a novel method, VPint, for spatial interpolation, incorporating a systemoriented perspective aimed at overcoming the limitations of existing local or distancebased methods.

We introduce two variants of our value propagation interpolation algorithm, both of which incorporate elements of Markov reward processes: SDMRP, using a static discount throughout the grid and requiring no additional data, and WPMRP, exploiting spatial features to predict neighbourspecific spatial weights.

We provide a vectorised implementation of our methods allowing for high degrees of parallel processing to speed up the algorithm running time, which we make publicly available.^{Footnote 1}

We empirically evaluate our methods on synthetic data and 2 realworld datasets and compare their performance against that of popular baselines from the Kriging, machine learning and deep learning fields in terms of mean absolute error, root mean squared error, peak signaltonoise ratio and structural similarity. We also conducted experiments testing convergence, scalability, and whether the proposed method generalises to spatiotemporal data.
2 Related work
To date, various spatial interpolation methods, both local and distancebased, have been proposed. We will discuss a selection of popular methods in this section.
Gaussian processes Given its widespread use, the first set of methods of note are Gaussian processes (GP), also known as Krige (1951). GPs Cressie (2015); Schabenberger and Gotway (2017) are a set of interpolation techniques based on learning the covariance of target values over distance using variogram (kernel) functions fitted to the data. Popular variants of GPs are discussed in Jiang (2018) and include ordinary Kriging (OK), universal Kriging (UK) and regression Kriging (Kriging after detrending). Contemporary contributions to GP methods include a scalable gradientbased surrogate function method (Bouhlel and Martins 2019) and a neural networkbased method to overcome GPs’ limitation of disregarding the characteristics of intermediate locations in paths between pairs of locations (Sato et al. 2019). Although the assumptions made differ per variant, all GPbased methods are limited by their reliance on pairwise distancebased covariance models. Moreover, traditional GP methods tend to scale poorly to larger datasets (\(O((nm)^4)\)). An overview of modern GP methods aimed at increasing the viability of GPs for largescale datasets is given in Heaton et al. (2019), including local approximate GPs Gramacy and Apley (2015), stochastic partial differential equation approaches (Rue et al. 2017) and multiresolution approximations (Katzfuss 2017).
Gapfill Gapfill Gerber et al. (2018) is a local method utilising no explanatory variables that, unlike GPs, does not build an explicit statistical model. Instead, as a local method, it relies on using subsets of the available data for its predictions. Although its local perspective and cellspecific independent predictions allow gapfill to be highly parallelised, its performance in terms of accuracy tends to fall short of GPs Heaton et al. (2019), and its dependency on the presence of sufficient amounts of nonmissing values within its neighbourhood renders it infeasible for cases where missing values are clustered together.
Belief propagation This family of methods, particularly loopy belief propagation (Pearl 1982), has been used successfully for image denoising (Song et al. 2011), image restoration (Zheng et al. 2020) and image completion tasks (Levin et al. 2003). It generally considers graphical models (Lauritzen 1996), such as Markov random fields, and gridded datasets can also be converted to this representation. The key idea is to compute the marginal distributions of nodes in a network, based on the beliefs (estimations) of the values of the child nodes connected to them. This is done through a process of message passing, which iteratively propagates beliefs over the network. Conceptually, this type of method is similar to our proposed method, although it does not leverage the Bellman equation or datadriven spatial weights, and unlike belief propagation, our method computes predicted values rather than distributions thereof. While belief propagation can be considered to take the systemoriented perspective, its exact form operating on junction trees scales poorly to larger graphs (\(\mathcal {O}(M \cdot N^{3})\), where M denotes the number of nodes and N is the number of discrete states per node) (McAuley and Caetano 2010) and high precision continuous variables (Zheng et al. 2012), rendering it computationally infeasible for most practical applications of gridbased interpolation tasks. Similarly, the standard approximate loopy belief propagation algorithm may have high errors compared to other methods (Satorras and Welling 2021), or even oscillate rather than converge (Murphy et al. 2013).
Spatial regression Spatial autoregressive models (Anselin 1988) (SAR) have remained relatively consistent, but have been expanded in some recent work (Yang and Lee 2017; Fix et al. 2021). Moving average (MA) models are often used in the context of timeseries modelling (Durbin 1959), but can also be used for spatial regression problems using the “MA by AR” approach (Haining 1978). Highly related to SAR and MA models, autoregressive moving average (ARMA) models have seen recent work of particular relevance to the COVID19 pandemic, modelling a transmission network of influenza (Qiu et al. 2021). Apart from SAR, MA, and ARMA models, which include additional features for the spatial lag and/or residuals, there are also approaches using an explicit spatial, temporal or spatiotemporal data representation, such as the tensor decompositionbased work by Corizzo et al. (2021). This latter work seems particularly relevant for spatial interpolation problems where collinearity exists within the explanatory variables. Given that the explanatory variables are being leveraged for their shared spatial structure with the target variable, collinearity in the explanatory variables is a likely scenario. However, unlike our method, these types of method are not interpolation methods, aiming instead at predicting target values directly from the spatiotemporal features or a latent representation thereof. Other recent work using spatial regression approaches include house price estimation using geographically weighted regression (Soltani et al. 2021), varying coefficient spatiotemporal regression (Lee et al. 2021), ambient blackcarbon prediction (Awad et al. 2017) and an analysis of the spatial patterns of COVID19 (Wu et al. 2021). The spatial autoregressive regression models suffer from the limitations of a local perspective: their use of a predefined local neighbourhood dismisses information outside of the neighbourhood radius.
Neural networks and deep learning Deep learning techniques, and convolutional neural networks (CNN) in particular, have been used to great effect in many computer vision applications (Dong et al. 2015; Shi et al. 2016). These computer visionbased interpolation CNNs could also be applied to general spatial interpolation. Moreover, in their 2020 publication, Hashimoto and Suto formulated a CNN architecture for the specific purpose of spatial interpolation (Hashimoto and Suto 2020). Apart from CNNs, graph neural networks (GNNs) have also been applied recently to spatiotemporal interpolation by Wu et al. (2020), utilising fully connected networks with distancebased weights determined using a random subgraph sampling strategy. Like autoregressive models, CNNs have a local perspective and therefore, dismiss potentially meaningful information outside their predefined neighbourhood. Conversely, similar to GPs, GNNs suffer from the reliance on distancebased weights, dismissing potential nonhomogeneity of intermediate locations on paths between locations.
By adopting a systemoriented perspective, the method we propose in this work aims to be situated between these two main categories of existing work (local and distancebased). Moreover, like Gapfill, it offers a computational alternative to existing methods with an emphasis on explicit statistical spatial modelling.
3 Problem statement
Let us define a spatial grid \(\mathbf {G}\) as an \((n \times m)\) matrix, where n corresponds to the number of rows and m to the number of columns. At every cell c in \(\mathbf {G}\), where \(c = \mathbf {G}_{i,j}\), with i and j corresponding to the row and column indices in \(\mathbf {G}\), respectively, there exists a true value \(y^{*}_{c}\) that may be either known or unknown. If \(y^{*}_{c}\) is known, we set the cell value \(y_{c} = y^{*}_{c}\). If it is not known, we mark this location as unknown: \(y_{c} = \varnothing \). The \((n \times m)\) matrix \(\mathbf {Y}\) contains \(y_c\) for all c in \(\mathbf {G}\).
We further define a feature grid \(\mathbf {X}\) as an \((n \times m \times f)\) tensor, where f denotes the number of features per cell. Thus \(\mathbf {x}_c\) in \( \mathbf {X}\) is a feature vector corresponding to location c in \(\mathbf {G}\). We can now define a prediction model \(\mathcal {M}(\mathbf {Y},\mathbf {X})\) that takes as input the available data in \(\mathbf {Y}\), along with the corresponding feature vectors per location in \(\mathbf {X}\), and returns a prediction matrix \(\hat{\mathbf {Y}}\). The objective of spatial interpolation is to find a model \(\mathcal {M}^{*}\) that minimises the mean absolute error (MAE) for all locations c in \(\mathbf {G}\), given the predictions in \(\hat{\mathbf {Y}}\). Concretely:
4 Methods
In this section, we will describe our proposed interpolation method in four steps. The general procedure and main philosophy will first be illustrated, after which we introduce some background for our update rules, and propose the two concrete variants of our method that we implemented. Finally, we will discuss our approach for ensuring efficient computation allowed by parallel matrix operations.
4.1 General interpolation procedure
The core of our proposed method relies on iterative elementwise updates to an estimation grid. We first instantiate \(\hat{\mathbf {Y}}\), with missing values given by \(\mathbf {Y}\) being set to arbitrary real values as initial predictions (the mean of known values in our experiments). Next, for every cell \(c \in \mathbf {G}\), if \(\mathbf {Y}_{c}\) is known, we use it as a static prediction. If it is not known, we update its value using the estimated value of its neighbours \(\{c' : c' \in N_S(c)\}\), where \(N_S(c)\) denotes the set of spatial neighbours to cell c. Thus, by iterating this procedure, our algorithm recursively propagates known values throughout chains of estimated values in \(\hat{\mathbf {Y}}\), through all possible paths in the system, anchored by known values.
4.2 Background: update rule
Our update rule is based on Markov reward processes (MRPs). MRPs (Bellman 1957) are models of the form \(M = \{S,T,R\}\), where S is a set of states \(\{s_1,s_2,...,s_{S}\}\), \(\mathbf {T}\) is an \(S \times S\) matrix of transition probabilities \(T_{(s,s')}\) between all pairs of states s and \(s'\), and R is a set of rewards \(\{r_{s_1},r_{s_2},...,r_{s_{S}}\}\) associated with being in a state s. MRPs extend Markov chains, which do not incorporate rewards R, and have been successfully used to model the behaviour of a single variable over time (Sato and Trivedi 2007; Bianchi and Presti 2016). In these temporal models, a state s represents a set of attribute values at a particular time t in a sample trajectory over time. At every t a state s can probabilistically transition from s to any of a set of successor states (given the current state) \(S's = \{s's_{1},s's_{2},...,s's_{S}\}\) based on transition probabilities given by \(T_{(s,s')}\), until an absorbing state is reached from which no further transitions are possible: \(S's = 0\). Since MRPs are Markovian, the transition probability to go from s to \(s'\) are contingent solely on s, and are unaffected by the history of previous states in the trajectory. If a reward \(r_{s}\) is associated with the state s, this gives information about the desirability of state s. However, aside from this immediate reward \(r_{s}\), intuitively the expected future rewards \(\mathbf {E}(s')\) from all \(s' \in S's\) should also be considered, as states leading to successor states with high future rewards would be more desirable. This leads to a notion of state values, where the rewards of all possible successor states \(s'\) are used to recursively compute state values v(s) for all \(s \in S\). This is typically done by iterating the Bellman equation (Bellman 1957), where the immediate reward \(r_{(s,s')}\) is added to the discounted (using the discount parameter \(\gamma \)) average expected values of the successor states:
We opted to use this equation as our interpolation update rule. In the case of interpolation, a location c (at a certain time) can be seen as a state s, with the set of spatial neighbours \(N_S(c)\) being analogous to the set of successor states \(S's\) in MRPs. The state values v(s), then, would be the target variable \(\hat{y}_{c}\) to be estimated, with immediate rewards given by known values and the discount \(\gamma \) representing spatial autocorrelation. Using the Bellman equation as an update rule, we can define the set of spatial neighbours \(N_{S}(c)\) as the cells \(\{c' : c' \in \mathbf {G}\}\) that share a border with c, such that our spatial interpolation algorithm takes the form of:
Here, \(A_S(c)\) denotes an aggregation function over the spatial neighbourhood of c. While in principle, it is possible to add any userdefined aggregation function, we opted to stay close to the canonical Bellman equation, by taking the mean (spatial lag) of \(N_S(c)\):
Using Eq. 4 also allows us to provide an efficient vectorised implementation of our method. While the method runs for a set amount of iterations in principle, it can also incorporate an early stopping criterion by introducing a variable \(\delta \), representing the change of a configuration over iterations, and using \(\hat{y}_{c}^{(1)}\) to denote the predictions from the previous iteration:
This then allows for the early stopping of the algorithm if \(\delta \) drops below a userspecified threshold.
One could also consider generalising this approach to spatiotemporal interpolation problems. In that case, the algorithm cannot solely rely on Eq. 3. Whereas two spatial dimensions share the same scale, and can thus both use the same weight \(\gamma \) as a spatial discount, a temporal dimension may behave very differently. As a result, to generalise to a spatiotemporal domain, we need to introduce an additional parameter \(\tau \) for discounts representing temporal autocorrelation. This also leads to the set of temporal neighbours \(N_{T}(c)\), which represent the same location at different time steps. Thus, the spatiotemporal update rule becomes:
where \(A_T(c)\) will generally use the temporal lag aggregation function:
4.3 Variants
We propose two variants of our value propagation interpolation method. The first, SDMRP (static discountMRP), uses a single spatial weight parameter \(\gamma \) for the entire dataset, which can be tuned using random search on subsampled data from known values. The second variant, WPMRP (weight predictionMRP) exploits spatial data as explanatory variables to inform its prediction of neighbourspecific weights. Unlike SDMRP, WPMRP would therefore not assume isotropy (the same spatial effects in all directions), although it would necessitate the use of Eq. 4 as an aggregation function. The two variants applied to the example of Fig. 2b, visualising the interpolation problem of temperature measurements in the Himalayas, are shown in Fig. 2.
4.3.1 Basic static discounts: SDMRP
The most basic variant of our proposed method stays closest to the canonical form of the Bellman equation in Eq. 3. It uses a single discount parameter \(\gamma \), ranging between 0 and 1, to represent spatial autocorrelation. This means that, for SDMRP, values can only decrease over subsequent recursive calls, making known values reminiscent of a light source in the fog, radiating values around itself and merging with other light sources, but decaying over distance. In the example of spatial interpolation of temperatures, it would propagate the known temperature values over the grid, at an intensity decreasing with every recursive call, like a heat source dissipating over distance. This can be seen in Fig. 2a. The advantage of this method is that it does not require additional features to be applied to a dataset, nor does it require a prediction model to be explicitly trained. It will also regress to the initialisation value (such as 0, or the mean value) over distance, which can be a desirable property as uncertainty increases, but can also be considered a downside as it does not provide much additional information. Its main hyperparameter \(\gamma \) also requires tuning, which can be done automatically by subsampling known values and performing interpolation using randomly searched \(\gamma \) settings. Furthermore, the spatial characteristics of the grid are not taken into account, and isotropy is assumed. SDMRP has a time complexity of \(\mathcal {O}(4 \cdot \mathbf {Y} \cdot k)\), if k is the number of times Eq. 3 is iterated (every \(c \in \mathbf {Y}\) can have at most 4 neighbours).
4.3.2 WPMRP
In an ideal case, rather than using a single static weight \(\gamma \), we would use a method allowing us to use locationspecific weights \(\gamma _{c',c}\). For example, when spatially interpolating temperatures in the Himalayas, knowing the difference in elevation between two neighbouring locations would enable us to know whether one value is likely to be the same, lower, or higher than the other, as illustrated in Fig. 2c. To this end, we created the weight prediction variant WPMRP, in which we use the spatial feature vectors \(\mathbf {x}_{c} \in \mathbf {X}\) and \(\mathbf {x}_{c'} \in \mathbf {X}\) as inputs to a weight prediction model \(\mathcal {M}_{w}\). This model predicts an individual weight of the location pair \((c,c')\) as \(\gamma _{c',c} = \mathcal {M}_{w}(\mathbf {x}_{c'},\mathbf {x}_{c})\) from spatial data describing the locations (such as houses, shops and land use). \(\mathcal {M}_{w}\) could consist of any machine learning model, ensemble or pipeline, but could also leverage functions directly operating on the feature space such as distance measures and inverse similarity metrics. Mapping features to a dense data manifold of lower dimensionality could also be considered.
In the case of machine learning models and pipelines, in order to train the model, we use the available cells with known true values in \(\mathbf {Y}\) to supervise the training. For all pairs of neighbours \(\{(c,c') : c,c' \in \mathbf {Y} \wedge y_{c} \ne \varnothing \wedge y_{c'} \ne \varnothing \}\), we would compute the true weight using the fraction \(\gamma ^{*}_{c',c} = \frac{y^{*}_{c}}{y^{*}_{c'}}\), resulting in a ground truth vector \(\Gamma ^{*}\) that can be used as the targets for the training of a regression model. The method for matching the elements of \(\Gamma ^{*}\) to predictive features is a design choice: the location features \(\mathbf {x}_{c'}\) and \(\mathbf {x}_{c}\) of every location pair \((c',c)\) would need to be combined, and this could be done in any manner the situation calls for, such as adding the vectors or computing a distance metric. In our experiments we opted to simply concatenate \(\mathbf {x}_{c'}\) and \(\mathbf {x}_{c}\). Thus, with \(\Gamma ^{*}\) and \(\mathbf {x}_{c'}\), \(\mathbf {x}_{c}\) for all \((c',c)\) pairs, we can train a regression model \(\mathcal {M}^{*}_{w}(\mathbf {x})\), such that, if \(\mathbf {Y}_{N} := \{(c',c) : (c',c \in \mathbf {Y}) \wedge (c' \in N(c) )\}\):
Here we propose to train \(\mathcal {M}_{w}\) on \(\Gamma ^{*}\) using any regression (machine learning) algorithm. The full pipeline of WPMRP using machine learning weight prediction is outlined in Algorithm 1 (which assumes available functions for model fitting). Lines 17 generate the elements of the true weight vector \(\Gamma ^{*}\), and line 8 fits a weight prediction model to the weights found in line 4. Lines 922 show the iterative updates of cells in \(\mathbf {Y}\), and lines 2327 create and return the predictions in the form of a grid \(\hat{\mathbf {Y}}\). The time complexity to run WPMRP is the same as that of SDMRP, but with the added cost of the model used for \(\mathcal {M}_{w}\) (which can be chosen freely): \(\mathcal {O}(4\mathbf {Y} \cdot k) + \mathcal {O}_{\mathcal {M}_{w}}\), if \(\mathcal {O}_{\mathcal {M}_{w}}\) is the time complexity of making predictions with \(\mathcal {M}_{w}\).
4.4 Vectorbased update rule for parallel computation
For the efficient processing of the main iterative loop of lines of our algorithms as in lines 922 Algorithm 1, we reformulated our update function as a series of matrix operations, allowing updates to be carried out in a highly parallelised manner through vectorisation. This approach does, however, necessitate the use of weighted averaged (spatial lag) as an aggregation function. We will illustrate the procedure on the simpler case (spatial MRP), but the approach can be generalised to spatiotemporal MRP as well. The main idea of this approach is to shuffle neighbouring values around in matrices and tensors \(\mathbf {T}^{y_i}\), where the subscript y indicates this tensor contains values, and i indicates the stage of operations the data is currently in, with the accompanying neighbour weights in \(\mathbf {T}^{\gamma _i}\), where \(\gamma \) indicates this tensor contains weights. These operations are performed in order to compute weighted sums of neighbouring values for all cells in the grid as a matrix dot product in Eq. 11.
Concretely, let \(\hat{\mathbf {Y}}\) denote a matrix of size \((n \times m)\) containing predicted values \(\hat{y}_{c}\) at every cell where the true value is not known, and \(y_{c}\) otherwise. We first turn this matrix into a threedimensional tensor \(\mathbf {T}^{y_0}\) of size \((n \times m \times d)\), where d denotes the maximum number of neighbours \(max(N_S(c))\) for any cell \(c \in \mathbf {G}\) (in practice, this will generally be 4 as a cell can share at most 4 edges in a grid). For all c, the entries along the daxis of \(\mathbf {T}^{y_0}\) will contain the values of the neighbours of c. Concretely:
If \(N_S(c) < d\), the remaining values of the third dimension of \(\mathbf {T}^{y_0}_{c}\) are set to 0. We similarly construct a tensor \(\mathbf {T}^{\gamma _0}\) of size \((h \times w \times d)\), of which the entries match those of \(\mathbf {T}^{y_0}\). However, the values of this tensor contain weights \(\gamma _{c',c}\) from neighbour \(c'\) to cell c, rather than the values:
Next, we systematically stack all columns of \(\mathbf {T}^{y_0}\) and \(\mathbf {T}^{\gamma _0}\) as additional rows, resulting in the new matrices \(\mathbf {T}^{y_1}\) and \(\mathbf {T}^{\gamma _1}\) of size \((h \cdot w \times d)\). Now every row represents a single location c in a single dimension, although the information on the original columns of \(\mathbf {Y}\) is kept through the order of the rows. The columns of \(\mathbf {T}^{y_1}\) now show the values of the neighbouring values for a row’s location’s neighbours \(N_S(c)\), and the columns of \(\mathbf {T}^{\gamma _1}\) contain the corresponding weights. We now perform an MRP update by computing the dot product of \(\mathbf {T}^{y_1}\) and the transpose of \((\mathbf {T}^{\gamma _1})\), and placing its diagonal values into a new vector \(\mathbf {T}^{y_2}\) of size \((h \cdot w)\):
Since this vector has the same order as the rows of \(\mathbf {T}^{y_1}\), we can reshape this vector into a matrix \(\mathbf {T}^{y_3}\) of size \((h \times w)\), corresponding to the shape of \(\hat{\mathbf {Y}}\). We now create another \((h \times w)\) matrix \(\mathbf {T}^{n}\), where \(\mathbf {T}^{n}_{c} = N_S(c)\), allowing us to divide \(\mathbf {T}^{y_3} / \mathbf {T}^{n}\) elementwise, resulting in an updated prediction matrix \(\hat{\mathbf {Y}}\):
Finally, since this operation needlessly updated known values, we substitute original known values in \(\hat{\mathbf {Y}}\): \(\hat{\mathbf {Y}}_{c} = \mathbf {Y}_{c}\) for all \(c : \mathbf {Y}_{c} \ne \varnothing \).
Using this vectorised approach, we found that the complexity of our algorithms in terms of wallclock time improved by a factor between 10 and 100. In order to adapt this approach to spatiotemporal MRP interpolation, which adds an extra dimension for time, \(\mathbf {Y}\) is of size \((h \times w \times t)\), \(\mathbf {T}^{y_0}\) and \(\mathbf {T}^{\gamma _0}\) are of size \((h \times w \times t \times d)\), and d becomes equal to 6, as any cell can now have up to 6 neighbours. Since all neighbours are already included in the fourth dimension, there is no reason to keep the spatial and temporal dimensions separate. Thus, we can still generate the 2D matrices \(\mathbf {T}^{y_1}\) and \(\mathbf {T}^{\gamma _1}\), as we simply add another dimension to the stacking operation (resulting in \(h\cdot w\cdot t\) rows instead of \(h \cdot w\)). As a result, with these exceptions, the pipeline can remain the same as it was for the spatial case.
5 Experiments
In this section, we will share the details of our experiments. We will first introduce the research questions we were interested in, after which we will list the baselines we compared our method to and the datasets used in our experiments.
5.1 Research questions
We were interested in answering the following research questions with our experiments:

R1: How does VPint compare to baseline methods in terms of mean absolute error, root mean squared error, peak signaltonoise ratio and structural similarity?

R2: Does VPint converge to stable prediction values?

R3: Can VPint be generalised to spatiotemporal problems?

R4: Can WPMRP leverage spatial features to perform better than SDMRP, given sufficiently informative features?

R5: How do VPint and baseline methods scale as the size of the dataset increases?
In addition to these main research questions, we were also interested in whether different patterns of missing data would give different results.
5.2 Baselines
Our selection of baselines was aimed at including competitive interpolation and regression methods used for spatial and geospatial modelling in practice. The selection we made consists of:

Ordinary Kriging (OK), using an implementation by the Python library PyKrige (Murphy 2020). Like our proposed methods, ordinary Kriging predicts values using weighted sums:
$$\begin{aligned} \hat{y}_{c} = \sum _{c' \in N_{S}(c)} \gamma _{c',c} \cdot y_{c'} \end{aligned}$$(13)Here \(\gamma _{c',c}\) is the distancebased weight between known cell \(c'\) and unknown cell c. However, OK uses \(y_{c'}\) instead of \(\hat{y}_{c'}\), \(N_S(c)\) will contain more cells than only direct neighbours, and weights are determined using a distancebased variogram model.

Universal Kriging (UK), also using PyKrige’s implementation. UK is highly similar to OK, but it compensates for the possible existence of a trend in the data. For both OK and UK, while more advanced methods exist, such as local approximate Gaussian processes (Gramacy and Apley 2015), as these are aimed at improving the scalability of Kriging rather than its accuracy, we consider OK and UK to be suitable representative methods for this class of algorithm.

Loopy belief propagation, using a Python implementation for denoising images (Grampurohit 2021), which can be applied to interpolation problems by treating missing values as noise (generated from a uniform distribution centred around the mean of the known values, with a range based on their standard deviation). In a basic form, belief propagation is centred around the equation:
$$\begin{aligned} L(\hat{y}_{c}) = \Pi _{c' \in N_S(c)} \lambda (\hat{y}_{c'}) \end{aligned}$$(14)Here, \(L(\hat{y}_c)\) refers to the likelihood of \(y^{*}_c\) being equal to \(\hat{y}_c\), and \(\lambda (\hat{y}_{c'})\) is the likelihood of the neighbouring values (children) \(c' \in N_S(c)\). To ultimately produce a single predicted value, the most likely value can be used:
$$\begin{aligned} \hat{y}_{c} \in \mathop {\mathrm{argmax}}\limits {L(\hat{y}_{c})} \end{aligned}$$(15) 
Nonspatial regression, using autosklearn (Feurer et al. 2019) to select the best performing regression model (or ensemble) and hyperparameter settings out of a large collection of algorithms, including linear regression, support vector regression, gradient boosted methods and others.^{Footnote 2} We denote this model, which will typically be an ensemble of multiple powerful machine learning models, as \(\mathcal {F}\). The resulting general form of the predictions from nonspatial regression is:
$$\begin{aligned} \hat{y}_{c} = \mathcal {F}(\mathbf {x}_c) \end{aligned}$$(16)We allowed autosklearn 150 seconds per run to find the best performing ensemble.

Spatial autoregressive (SAR), moving average (MA) and autoregressive moving average (ARMA) models, using autosklearn to find the best performing regression model. Canonically, these models are ordinary least squares (OLS)based linear regression methods, with extra spatial (SAR) or error (MA) terms (both in the case of ARMA). However, since we use autosklearn, though OLS is also a possible model, the final model will generally have a different formula, such as the potentially nonlinear support vector regression models. For SAR, the spatial term is based on a spatial weight matrix \(\mathbf {WM}\) and a vector \(\mathbf {y}\) containing all the known values of the grid, corresponding to the rows of \(\mathbf {WM}\). The general form of SAR is:
$$\begin{aligned} \hat{y}_{c} = \mathcal {F}(\mathbf {x}_c,\mathbf {WM},\mathbf {y}) \end{aligned}$$(17)For MA models, we used the “MA by AR” approach (Haining 1978). Its formula, using the prediction error vector \(\mathbf {\epsilon }\) instead of SAR’s \(\mathbf {y}\), is:
$$\begin{aligned} \hat{y}_{c} = \mathcal {F}(\mathbf {x}_c,\mathbf {WM},\mathbf {\epsilon }) \end{aligned}$$(18)Following the “MA by AR” approach, before we can use Eq. 18, we first needed to determine \(\mathbf {\epsilon }\) using:
$$\begin{aligned} \epsilon _c = y_c  \mathcal {F}_{s}(\mathbf {x}_c) \end{aligned}$$(19)Here, \(\mathcal {F}_{s}\) represents a separate nonspatial regression model, as in Eq. 16, to compute prediction errors on all known values. These errors can then be used by Eq. 18 by putting the values of \(\epsilon _c\) for all c into a single vector \(\mathbf {\epsilon }\). For ARMA, we again use the “MA by AR” approach for the MA component. As ARMA is a combination of SAR and MA, its formula is:
$$\begin{aligned} \hat{y}_{c} = \mathcal {F}(\mathbf {x}_c,\mathbf {WM},\mathbf {y},\mathbf {\epsilon }), \end{aligned}$$(20)where \(\mathbf {\epsilon }\) is obtained using Eq. 19.

Convolutional neural networks (CNN), optimised using automated neural architecture search (NAS). The CNN regression predicted \(\hat{y}_{c}\) from \(\mathbf {x}_{c}\) and \(\mathbf {x}_{c'}\) for all \(c' \in N_{S}(c)\), where \(N_{S}(c)\) is determined by the convolutional filters of the network, similar to CNN approaches used in computer vision (Dong et al. 2015; Shi et al. 2016). We used NAS implemented by autokeras (Jin et al. 2019) for all training sets (50 trials, 1000 epochs). Although the model architectures for CNNs can be quite complex, on an abstract level these networks are still regression models of the same form as Eq. 16.
5.3 Datasets
Our main experiments involved a synthetic spatial dataset as well as two realworld datasets (GDP and COVID19 trajectories), with an additional synthetic spatiotemporal dataset used to address R3. The implementation of our data generation algorithms used to create the experimental synthetic datasets is included in our public code repository; likewise, the realworld datasets are available for public use at their respective sources, allowing others to reproduce our results.
5.3.1 Synthetic data
Spatial targets For this synthetic dataset, based on a parameterised mean \(\mu \) and standard deviation \(\sigma \), the interpolation grid \(\mathbf {Y}\) of userspecified size \((n \times m)\) (set to \(n=50\) and \(m=50\) in our experiments) was generated, where each cell c was assigned a base value \(y^{b}_{c}\) by sampling from the normal distribution \(\mathcal {N}(\mu ,\sigma )\). Next, to assign true values \(y^{*}\) affected by spatial interaction, we updated every cell c as a weighted average (based on a spatial autocorrelation parameter \(a^{s}\)) of its own value and the mean of its neighbouring values:
Spatiotemporal targets To address R3, we also generated synthetic spatiotemporal data. For this type of data we introduced additional parameters for the number of timesteps d and the temporal autocorrelation coefficient \(a^{t}\). We then built a threedimensional tensor \(\mathbf {Y}\) of size \((n \times m \times d)\) by using Eq. 21 at every time step. Since, at this point, the temporal layers of \(\mathbf {Y}\) are still fully independent, we use the temporal neighbourhood function \(N_{T}(c)\) to perform a final update on the cells of \(\mathbf {Y}\) ensuring temporal interaction:
Synthetic features For our synthetic data, we created a feature vector \(\mathbf {x} = (x_{c_{1}}^{b},x_{c_{2}}^{b},...,x_{c_{{\mathbf {x}^{b}}}})\) for every location \(c \in \mathbf {Y}\). Every base feature \(x^{b}_{c_{i}} \in \mathbf {x}^{b}_{c}\) was generated using a uniform distribution \(\mathcal {U}(min,max)\) with userspecified min and max values. These features were then updated in a similar manner to the cell values y, using a parameter called the feature correlation coefficient f:
A feature correlation coefficient f of 0 would result in fully random features, whereas a coefficient of 1 would result in features identical to the targets.
5.3.2 Realworld data
In the case of realworld data, the variables being measured, such as GDP or COVID19 incidence, are generally not gridded in nature. As a result, to generate these datasets, data needs to be aggregated, e.g., by taking the mean (estimated) GDP per capita for residents in the area covered by a grid cell, or the sum of COVID19 incidence at that location. The granularity of these datasets thus introduces a tradeoff: a high granularity increases the computational cost and may result in relatively sparse datasets (as was the case in our COVID19 dataset), but does provide a high level of detail. Meanwhile, a low granularity may result in data too lowgrained to draw meaningful conclusions from, or cells that simply all regress to a global mean due to the erasure of local spatial patterns, but will be faster to compute and likely results in a higher density dataset. There is no minimal or maximum granularity cutoff point at which an interpolation method becomes infeasible. However, when gauging how applicable an interpolation method is to a users’ gridded dataset, this tradeoff merits consideration.
Gross domestic product (GDP) targets For GDP data, we used a gridded spatial dataset containing worldwide GDP estimates sourced from World Bank (DECRG 2010) at a resolution of 1km \(\times \) 1km. We specifically looked at the city of Taipei in Taiwan and its surroundings, including both heavily populated urban areas expected to have high GDP values, and surrounding sparsely populated mountainous areas with low GDP values. The resulting grid had a size of \(51 \times 51\) pixels.
Aggregated COVID19 trajectory targets This dataset consisted of trajectories of confirmed COVID19 patients prior to their diagnosis in South Korea (DACON 2020). Although this data was spatiotemporal in principle, we opted to aggregate over time both due to the relative sparsity of the data (as it was gathered at the start of the COVID19 pandemic), and to alleviate potential privacyrelated concerns in this relatively sensitive dataset. Thus, every \(c \in G\) had a value corresponding to the total number of visits by people infected with COVID19 over the entire time period. The city of interest in this dataset was Daegu, which was the main hotspot of the epidemic in South Korea at the time the data was collected. A visualisation of this data can be found in Fig. 3b. Since the target data did not come in gridded form, we set the resolution of this dataset to \(35 \times 51\) pixels, putting it at a similar scale to the GDP dataset used for Taipei.
Mapbased features To generate features for GDP and COVID19 trajectories in South Korea and Taiwan, we aggregated a selection of vector and point map data sourced from OpenStreetMap (OpenStreetMap 2019). For all \(c \in \mathbf {G}\), every element in \(\mathbf {x}_{c}\) represented the count of all objects in the map data corresponding to a certain type, such as apartments, houses and shops. There are various design choices available for preprocessing this type of data, such as dealing with objects without an annotated type (drop or replace), feature selection (none, manually created highlevel taxonomy, or keeping the most frequent types) and feature normalisation (none, unit length scaling, mean normalisation or Zscore normalisation). In accordance with the design philosophy of programming by optimisation (PbO) (Hoos 2012), we did not commit to any of these choices, and instead used a commonly used Bayesian optimisationbased automated algorithm configurator, version 0.12.0 of SMAC3 (Hutter et al. 2011), to select the best possible feature construction pipeline per method (time budget 24 hours per algorithm per dataset).
5.4 Experimental setup
The following section will explain the procedures and experimental conditions necessary to carry out our experiments.
5.4.1 Missing data procedures
In order to evaluate our methods, we required data that was fully available to compute error metrics, while also having access to grids with missing data. To this end, we introduced two methods for ‘hiding’ known values, resulting in different patterns of missing data.
Random missing values This missing value approach was straightforward. Given a proportion of known values p, for all other cells there is a probability of being randomly obscured if a number \(z = \mathcal {U}(0,1)\) sampled from a uniform distribution between 0 and 1 is smaller than p. That is, for every cell \(c \in G\):
In our experiments, p was set to 0.8.
Spatially clustered hidden values Much like the spatial data itself, the missing data points in a grid may not be independent, and instead subject to spatial autocorrelation themselves. For example, some locations may have missing data due to natural barriers making measurements difficult, or due to local phenomena such as clouds obscuring parts of the measurements. In this missing value approach, we were inspired by optical satellite data, where clouds are the biggest source of missing data in the field. This approach is also why algorithms like Gapfill (Gerber et al. 2018) could not be considered for our experiments, as it requires a part of the data in a neighbourhood to be available. Our method for creating clusters of missing data was based on random walks. Given a number of points k, a number of walks w and the number of steps per walk r, the algorithm creating artificial clusters is outlined in Algorithm 2. When applied to spatiotemporal data, the spatially missing data was applied independently to every time step.
5.4.2 Experimental setup
The general form of our experiments was to run 10 algorithms (SDMRP, WPMRP and 8 baselines) 30 times for two types of missing data (random and spatially clustered) on every dataset (3 in total, with 1 additional dataset for spatiotemporal data), including both synthetic data and realworld datasets and addressing R1 and R3. The performance of the methods was compared according to their ranks based on the Wilcoxon signedrank test (Wilcoxon 1992), which is similar to a ttest but does not assume normality.
For spatial synthetic data we set the size of the grid to \(n=100\) and \(m=100\), and for the spatiotemporal synthetic data used to address R3, we set the size to \(n=50\), \(m=50\) and the number of timesteps \(d=5\). Tracking \(\delta \) allowed us to visualise the convergence of our method (R2), using different settings for f in Eq. 23 allowed us to gauge the effectiveness of WPMRP relative to SDMRP as a function of the correlation between the features and true values of locations (R4), and varying n and m allowed us to see how well all methods scaled to larger datasets (R5). Thus, in addition to the general performance results, we used the synthetic data to run additional experiments to address research questions 2 through 5. Conversely, we used the realworld datasets to gauge how well the performance on synthetic data, and the analysis thereof, would generalise to realworld cases. For the scalability analysis we set \(n=m\), with n ranging from 20 to 200 in steps of 20. The spatially clustered missing data was generated using \(k=5\) centre points, \(r = \frac{n+m}{2}\) steps per walk, and \(w=\frac{r}{2}\) walks.
For every combination of a dataset with a type of missing data, all algorithms were run 30 times, and used automated algorithm configuration (for the feature preprocessing pipeline explained in Sect. 5), automated machine learning (methods using autosklearn, automating the selection of machine learning algorithms and their hyperparameters, as explained in Sect. 5), NAS (in the case of CNN, automating the neural network architecture explained in Sect. 5) and random search (in the case of SDMRP’s \(\gamma \), explained in Sect. 4).
All experiments were run on a computing cluster consisting of 26 homogeneous nodes containing 94 GBs of memory and using Intel Xeon E52683 v4 CPUs running at 2.10GHz.
6 Results
In this section, we will report on the results of our experiments. We first explain the performance metrics used, after which we will cover detailed results for all individual datasets (synthetic spatial data, GDP and COVID19). These results were computed as the mean of 30 runs per algorithm and dataset for every performance metric. After covering the datasetspecific performance metrics, we investigate other properties of VPint: qualitative visual plausibility, the convergence of Eq. 3, the degree to which it can be generalised to spatiotemporal problems, the required feature correlation for WPMRP to perform better than SDMRP, and the scaling of different methods to larger datasets. Finally, we will provide a highlevel summary our findings.
6.1 Performance metrics
Since multiple properties can be desirable in an interpolation method, we evaluated our method based on 4 performance metrics. The first of these was the mean absolute error (MAE):
MAE is the main error metric reflecting the accuracy of the predictions obtain from the methods we studied, with all errors weighted equally. We also added root mean squared error (RMSE), which penalises extreme errors relatively more severely:
In addition to MAE and RMSE as basic error metrics, we also included two metrics common in the computer vision and image processing fields. The first of these is peak signaltonoise ratio (PSNR):
PSNR is highly related to the root mean squared error (RMSE) metric, and in fact contains it as a component. It computes the logarithm of the RMSE scaled by the maximal (true) value \(max(\mathbf {Y}^{*})\); as such, it is expected to show similar patterns to RMSE. The main motivation for using PSNR is that it scales values by the maximal (true) value \(max(\mathbf {Y}^{*})\); therefore, PSNR results will have a similar range for the GDP dataset (where errors of over \(30\,000\) were common) and the COVID19 dataset (where errors were typically under 10). Finally, we looked at the structural similarity index (SSIM):
SSIM aims to quantify the similarity of two images in a manner consistent with human perception, emphasising spatial structure over absolute errors.
6.2 Empirical performance (R1)
The results presented in this subsection are dedicated to answering R1. We ran detailed experiments on the synthetic spatial dataset as well as the realworld GDP per capita and COVID19 datasets.
Synthetic spatial data
The results for synthetic spatial data are shown in Table 1. On this data, a fairly consistent pattern can be observed for all performance metrics: on randomly missing data WPMRP performs best, whereas ARMA performs best on spatially clustered hidden data. It is not surprising that ARMA, as well as other regressionbased methods, suffer less from missing data being clustered together since they are based on predicting values directly from features. It is more surprising that WPMRP shows very extreme values for this type of missing data. Since SDMRP does not suffer from the same problem, it seems that the problem lies in the weight prediction model \(\mathcal {M}_w\), rather than being inherent to VPint. One possible cause for the behaviour on spatially clustered missing data may be that a mispredicted (high) weight will get disproportionately amplified with subsequent recursive calls where the target value is supposed to go up. Although these types of runs only seemed to happen on the synthetic and GDP datasets, it is a downside of WPMRP, and one could consider constraining weights, or applying normalisation techniques, to alleviate the issue. Apart from these cases, there was no big difference between the results of randomly missing and spatially clustered missing data.
GDP per capita The results for GDP data are shown in Table 2. In terms of MAE and SSIM, WPMRP was the best performing method among all methods for randomly missing data. For spatially clustered missing data, while belief propagation performed better than SDMRP and WPMRP suffered from extreme values hampering its performance, both VPint variants still performed well in terms of SSIM. In terms of RMSE and PSNR, belief propagation performed best in most cases, though SDMRP performed best together with belief propagation for spatially clustered missing data, and WPMRP, OK and UK performed better in terms of RMSE on randomly hidden data. Also, worth noting is that the performance of all methods was rather poor, with all methods achieving high error rates and low similarity scores. This may imply that it is hard to predict GDP based on spatial patterns alone (OK, UK, SDMRP), while the mapbased features were also not informative enough to make any worthwhile predictions (all other methods).
COVID19 trajectories The results for COVID19 trajectories are shown in Table 3. On this dataset, SDMRP is performing best out of all methods in terms of MAE and SSIM, though none of the methods was clearly better in terms of RMSE and PSNR than the others in terms of statistical significance. It is, however, unfortunate to see WPMRP as one of the two only methods performing significantly worse than all others on this dataset in these metrics, despite a high SSIM compared to baseline methods. Since other methods using feature data (apart from CNN) perform better than Kriging, it seems unlikely that the mapderived features are the cause of WPMRP not performing well on this dataset. Instead, it appears that they are more effective for directly predicting the COVID19 incidence at a particular location, rather than the relationship between neighbouring locations. This may be caused by the COVID19 grid being relatively sparse; propagating values from 0 is difficult to do with a spatial weight alone. Thus, for sparse grids with mostly 0 values, SDMRP with its decay over distance may be more appropriate, whereas WPMRP, which can increase or decrease values based on the weights that follow from feature data, may be more appropriate in cases where all cells contain values in a nonzero range.
6.3 Other properties (R2R5)
We now present the results of the experiments addressing the remaining research questions, R2–R5, exploring various properties of our proposed method.
Visual plausibility An example of hidden synthetic data (\(n=m=50\)) is shown in Fig. 4, with a visual comparison between its reconstruction by the different methods. The reconstructed images caution against relying too much on mean absolute error, as all methods (apart from belief propagation and CNN) were able to reach a similar mean absolute error as WPMRP or better (around 2.2 for random, 0.9 for clustered). However, our methods (and WPMRP in particular) appear to capture the spatial characteristics of the original image, captured by the structural similarity index (\(SSIM=0.34\) for random, \(SSIM=0.69\) for spatially clustered), better than OK (which seemingly simply predicts the average value, \(SSIM=0.04\) for random, \(SSIM=0.65\) for spatially clustered). Compared to nonspatial regression, our method delivers less noisy interpolations, while also blurring less than ARMA. The results for belief propagation and CNN catch particular attention, as in these cases belief propagation vastly underestimated the values, whereas CNN overestimated them orders of magnitude higher than other methods. The latter implies that there may be a large risk of overfitting for the neural networks, due to the models being too complex for the limited amount of training data available.
Algorithm convergence (R2)
An example of the convergence of WPMRP over iterations can be seen in Fig. 5. As the figure shows, as the algorithm iterates Eq. 3, it converges to a stable configuration which we use for our predictions. Moreover, the variability of this convergence was fairly low, indicating that the running time of the algorithm will be relatively stable regardless of the situation. This example considered the convergence of WPMRP on randomly hidden synthetic spatial data, but similar behaviour could be observed for SDMRP, and on different datasets with spatially clustered hidden data. This includes the convergence of WPMRP on spatially clustered hidden data for synthetic spatial data, where Table 1 earlier indicated that WPMRP did not perform well.
Generalising to spatiotemporal problems (R3)
Addressing R3, to gauge whether our method could also be applied to 3dimensional spatiotemporal problems, we ran an additional set of experiments on synthetic spatiotemporal data. The results of this experiment are shown in Table 4. Unfortunately, it appears that our proposed method does not (yet) generalise well to 3dimensional problems, as both VPint variants were the worst performing out of all methods. However, other modifications adapting VPint to the spatiotemporal domain may be more successful. Interestingly, on this synthetic dataset, ARMA performed best across the board – one might have expected a more inherently spatiotemporal method, like Kriging, would have performed better. Given this, it may be the case that the spatiotemporal version of VPint performs badly not due to an inherent problem with the method, but rather a lack of exploitable patterns in the temporal dimension of the data. Whereas Kriging methods will tend to give lower weights to noninformative variables, in its current form, our method weights all dimensions equally, meaning that a noninformative dimension would harm the performance of our method rather than helping it. In the synthetic data we used in our experiments, independent spatial data was generated for all time steps individually, after which temporal autocorrelation was simulated using Eq. 22. As every time step had two essentially independent neighbours, the autocorrelation between the two may have cancelled out one another on some cells, leading to a diminished performance. As a result, the model may perform better on realworld spatiotemporal data under favourable conditions. However, as this is outside the scope of this work focused on spatial interpolation, further research would be necessary to establish exactly what those favourable conditions would be.
Performance of WPMRP compared to SDMRP as a function of feature correlation (R4) To address R4, we ran an additional experiment on synthetic data with settings of the feature correlation coefficient f in Eq. 23 ranging between 0.05 and 1.0 in steps of 0.05. Figure 6 compares the performance of the two methods based on MAE as a function of feature correlations. The figure shows the error distribution acquired from 30 runs. As expected, Fig. 6 shows that WPMRP performs better than SDMRP for high values of f, and conversely, SDMRP appears more successful for low featuretarget correlations. However, there appear to be diminishing returns for higher f after 0.4, and already at a correlation of 0.1 WPMRP performed better than SDMRP on the synthetic data. In conclusion for R4, this experiment shows that WPMRP leverages spatial features to perform better than SDMRP in situations where the features are sufficiently informative.
Scaling to larger datasets (R5) Addressing R5, we ran a scalability analysis by running every algorithm once on synthetic data for grid sizes ranging from 20 to 200 (height and width) in steps of 20. The results of these experiments, based on the total running time of methods (including training, if any, but excluding NAS, SMAC and other algorithm configuration as they are optional) can be seen in Fig. 7a (random) and b (spatially clustered). The figures show that SDMRP, while faster than CNNs, does not scale well to larger datasets, and that WPMRP scales similarly compared to nonspatial regression, SAR, MA and ARMA. This tells us that the iterative MRPderived update rule likely does not account for a large portion of the running time; instead, it appears that the autosklearn training procedure, much like in the case of nonspatial regression and SAR, MA and ARMA is the main bottleneck for WPMRP. The reason, then, for SDMRP to scale poorly, would be the random searchbased subsampling procedure used to find an optimal static discount \(\gamma \) explained in Sect. 5.
We can also see in both figures that universal Kriging scales very poorly to larger datasets; in fact, its runs timed out after grids of the size \(100 \times 100\). While CNN was slightly less affected than UK by the increasing size, its running times were still exceedingly high, and likewise hit a timeout threshold after \(100 \times 100\) grids. Similarly, while the running times of ordinary Kriging were similar to those of other methods, its memory usage became prohibitively large by exceeding its allotted 13GB at \(120 \times 120\) grids. Thus, this experiment showed another weakness of GPs, namely their high memory usage, which is also detrimental to their scalability. Newer GP methods, like local approximation GPs (Gramacy and Apley 2015), may scale better in terms of running time and memory usage by using local approximations, although this may come at the expense of a decreased ability to capture global information.
In conclusion for R5, our methods scale better than Kriging to larger datasets, on par with nonspatial regression, SAR, MA, and ARMA, though SDMRP did take longer than these methods on randomly missing data. Generally, our methods use substantially less memory than ordinary Kriging and universal Kriging.
6.4 Highlevel summary
Tables 1, 2, 3 show the competitive advantage of the VPint variants, in terms of MAE and SSIM. For randomly missing data, the two VPint variants together performed better in terms of MAE than baseline methods on all 3 spatial datasets, although individually both methods only performed better than all baselines on 2 out of 3 datasets. WPMRP performed better than all other methods on synthetic and GDP data, though SDMRP also performed better than baseline methods on the GDP data, and SDMRP performed better than all other methods on the COVID19 dataset. In terms of SSIM, the two VPint variants together were again the best performing methods on all 3 datasets, where WPMRP again performed best on the synthetic (though tied with SAR and ARMA, with SDMRP following one ranking lower) and GDP datasets, and SDMRP also performed better than the baseline methods on GDP data. SDMRP performed better on the COVID19 dataset, where WPMRP was the third best performing method after CNNs. In terms of RMSE, the results were less consistent, with no method clearly outperforming the others across all datasets, and as expected, the rankings for PSNR and RMSE were almost always the same. This difference implies that, while the VPint variants will often perform better on average, when they do fail to produce good results, the errors will be more extreme than those of baseline methods. This was seen especially clearly in runs where WPMRP obtained error values orders of magnitude higher than all other methods.
On spatially clustered missing data, in terms of MAE, SDMRP still performed best on the COVID19 dataset, but was outperformed by belief propagation on GDP data and by ARMA on synthetic data. WPMRP also failed to significantly outperform baseline methods on any of the datasets for this type of missing data. However, in terms of SSIM, both VPint variants performed significantly better than all baselines on GDP and COVID19 data, and SDMRP tied with SAR and ARMA for synthetic data. Since it seems that SDMRP preserves the spatial structure of spatially clustered hidden data better than baseline methods on all 3 datasets, and WPMRP did so on 2 out of 3 datasets, we conclude that VPint would be a better option for this type of missing data if the spatial structure of the interpolations is important. Interestingly, SSIM is higher for all methods for spatially clustered data; this is likely caused by this type of missing data being considered a substantial structural element, thus affecting SSIM less than random missing data.
Regarding our additional experiments, we found that the convergence of VPint tends to progress smoothly and has very little variance between runs (as seen in Fig. 5). Table 4 shows that, in its current form, our method does not yet generalise well from the spatial case to the spatiotemporal case. Figure 6 shows that WPMRP will perform better than SDMRP starting from a feature correlation coefficient f of around 0.1 for synthetic data, implying that a high correlation between features and targets is not required for the feature data to have added value to the method. Finally, Fig. 7 showed favourable scalability of our proposed method, particularly compared to Kriging.
7 Conclusion and future work
In this work we proposed VPint, a value propagationbased method for spatial interpolation, establishing a systemoriented perspective. To this end, we introduced two variants of our interpolation method (SDMRP and WPMRP), the latter of which exploits spatial features describing the characteristics of the grid. In our experiments, VPint was found to perform significantly better than baseline methods in terms of mean absolute error and structural similarity on randomly missing data in 3 datasets, and 2 out of 3 datasets for spatially clustered missing data. Overall, whether VPint is the appropriate choice of algorithm appears to depend on the type of data in question, and the goals of the user. In the common case where a low error rate is the objective, particularly in a way that preserves the spatial structure of a grid, VPint (and especially WPMRP) will generally be the best option for randomly missing values. On spatially clustered missing data, SDMRP is usually a better option, though other methods are more competitive on this type of missing data. Despite these advantages offered by VPint, if the user is looking for a method that does not suffer from outliers of particularly bad predictions, and is willing to accept higher average errors as a result, other methods may be a better option.
In future work, it would be interesting to focus on exploring the performance of our methods on other realworld datasets, particularly when using other sets of features not derived from map data. Furthermore, we see value in further analysis of the spatiotemporal variant of VPint, focussed on the circumstances under which it will perform well. Alternatively, a different approach to spatiotemporal interpolation could be a temporally layered version of WPMRP, using a representation similar to the tensorbased approach adopted by Corizzo et al. (2021). Such an approach would eliminate the need for explicit feature data, and would instead use known values at different time steps as features to derive spatial weights. This type of approach may well be worth exploring. Finally, applying this method to specific fields frequently suffering from missing data, such as cloud cover in remote sensing data, may contribute greatly to those fields. The need for cloud removal techniques, given that about 70% of Earth is covered by clouds at any time, is great; existing techniques are limited and often either difficult to apply for nonexperts in deep learning, or fail to produce actionable new information (such as simply predicting a mean, or replacing cloudy pixels). A datadriven, universally and easily applicable method able to fill in cloud cover (or gaps caused by faulty sensors), informed by the highly correlated features of previous imagery at the same place, could increase the availability of data and therefore the efficacy of the many highimpact methods dependent thereon to a large extent.
Availability of data and material
The code used to generate synthetic data is included in the project’s code repository. All realworld data used is publicly available: the GDP dataset can be found at https://datacatalog.worldbank.org/search/dataset/0037850, and the COVID19 dataset can be found at https://www.dacon.io/competitions/official/235590/data/.
Code Availability
The code used for the experiments is publicly available at https://github.com/ADAresearch/VPint.
Notes
Autosklearn is an automated machine learning (AutoML) package that allows automatic algorithm selection, hyperparameter optimisation and feature preprocessing ensuring that a highperforming pipeline is selected on given dataset.
References
Anselin L (1988) Spatial econometrics: methods and models (vol. 4). Studies in Operational Regional Science Dordrecht: Springer Netherlands
Awad YA, Koutrakis P, Coull BA, Schwartz J (2017) A spatiotemporal prediction model based on support vector machine regression: Ambient black carbon in three new england states. Environ Res 159:427–434
Bellman R (1957) A markovian decision process. J Math Mech pp 679–684
Bianchi F, Presti FL (2016) A markov reward model based greedy heuristic for the virtual network embedding problem. In: 2016 IEEE 24th international symposium on modeling. Analysis and simulation of computer and telecommunication systems (MASCOTS), IEEE, pp 373–378
Bouhlel MA, Martins JR (2019) Gradientenhanced kriging for highdimensional problems. Eng Comput 35(1):157–173
Carrasco L, O’Neil AW, Morton RD, Rowland CS (2019) Evaluating combinations of temporally aggregated sentinel1, sentinel2 and landsat 8 for land cover mapping with google earth engine. Remote Sens 11(3):288
Corizzo R, Ceci M, FanaeeT H, Gama J (2021) Multiaspect renewable energy forecasting. Inf Sci 546:701–722
Cressie N (2015) Statistics for spatial data. Wiley, New York
DACON (2020) Corona Data Visualization AI Contest. https://www.dacon.io/competitions/official/235590/data/, accessed: 02052021
DECRG WB (2010) Gross domestic product 2010. https://datacatalog.worldbank.org/search/dataset/0037850 Accessed 31 Oct 2021
Dong C, Loy CC, He K, Tang X (2015) Image superresolution using deep convolutional networks. IEEE Trans Pattern Anal Mach Intell 38(2):295–307
Durbin J (1959) Efficient estimation of parameters in movingaverage models. Biometrika 46(3/4):306–316
Fang H, Baret F, Plummer S, SchaepmanStrub G (2019) An overview of global leaf area index (lai): methods, products, validation, and applications. Rev Geophys 57(3):739–799
Feurer M, Klein A, Eggensperger K, Springenberg JT, Blum M, Hutter F (2019) Autosklearn: efficient and robust automated machine learning. Automated machine learning. Springer, Cham, pp 113–134
Fix MJ, Cooley DS, Thibaud E (2021) Simultaneous autoregressive models for spatial extremes. Environmetrics 32(2):e2656
Gerber F, de Jong R, Schaepman ME, SchaepmanStrub G, Furrer R (2018) Predicting missing values in spatiotemporal remote sensing data. IEEE Trans Geosci Remote Sens 56(5):2841–2853
Gramacy RB, Apley DW (2015) Local gaussian process approximation for large computer experiments. J Comput Graph Stat 24(2):561–578
Grampurohit S (2021) loopybpdenoise. https://github.com/sanjeevg15/loopybpdenoise, accessed: 22102021
Haining R (1978) The moving average model for spatial interaction. Trans Inst Br Geogr pp 202–225
Hashimoto R, Suto K (2020) Sicnn: spatial interpolation with convolutional neural networks for radio environment mapping. In: 2020 international conference on artificial intelligence in information and communication (ICAIIC), IEEE, pp 167–170
Heaton MJ, Datta A, Finley AO, Furrer R, Guinness J, Guhaniyogi R, Gerber F, Gramacy RB, Hammerling D, Katzfuss M et al (2019) A case study competition among methods for analyzing large spatial data. J Agric Biol Environ Stat 24(3):398–425
Hoos HH (2012) Programming by optimization. Commun ACM 55(2):70–80
Hutter F, Hoos HH, LeytonBrown K (2011) Sequential modelbased optimization for general algorithm configuration. In: International conference on learning and intelligent optimization, Springer, pp 507–523
Jiang Z (2018) A survey on spatial prediction methods. IEEE Trans Knowl Data Eng 31(9):1645–1664
Jin H, Song Q, Hu X (2019) Autokeras: An efficient neural architecture search system. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp 1946–1956
Katzfuss M (2017) A multiresolution approximation for massive spatial datasets. J Am Stat Assoc 112(517):201–214
Krige DG (1951) A statistical approach to some basic mine valuation problems on the witwatersrand. J South Afr Inst Min Metall 52(6):119–139
Lauritzen SL (1996) Graphical models, vol 17. Clarendon Press
Lee J, Kamenetsky ME, Gangnon RE, Zhu J (2021) Clustered spatiotemporal varying coefficient regression model. Stat Med 40(2):465–480
Levin A, Zomet A, Weiss Y (2003) Learning how to inpaint from global image statistics. ICCV 1:305–312
McAuley J, Caetano T (2010) Exploiting withinclique factorizations in junctiontree algorithms. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics, JMLR Workshop and Conference Proceedings, pp 525–532
Montero JM, FernándezAvilés G, Mateu J (2015) Spatial and spatiotemporal geostatistical modeling and kriging. Wiley, New York
Murphy BS (2020) pykrige. https://pypi.org/project/PyKrige/ Accessed 31 Sept 2020
Murphy K, Weiss Y, Jordan MI (2013) Loopy belief propagation for approximate inference: an empirical study. arXiv preprint arXiv:1301.6725
OpenStreetMap (2019) OpenStreetMap. https://www.openstreetmap.org/, accessed: 27122019
Pearl J (1982) Reverend Bayes on inference engines: a distributed hierarchical approach. Cognitive Systems Laboratory, School of Engineering and Applied Science
Qiu J, Wang H, Hu L, Yang C, Zhang T (2021) Spatial transmission network construction of influenzalike illness using dynamic bayesian network and vectorautoregressive moving average model. BMC Infect Dis 21(1):1–9
Rue H, Riebler A, Sørbye SH, Illian JB, Simpson DP, Lindgren FK (2017) Bayesian computing with inla: a review. Ann Rev Stat Appl 4:395–421
Sato K, Inage K, Fujii T (2019) On the performance of neural network residual kriging in radio environment mapping. IEEE Access 7:94557–94568
Sato N, Trivedi KS (2007) Accurate and efficient stochastic reliability analysis of composite services using their compact markov reward model representations. In: IEEE international conference on services computing (SCC 2007), IEEE, pp 114–121
Satorras VG, Welling M (2021) Neural enhanced belief propagation on factor graphs. In: International conference on artificial intelligence and statistics, PMLR, pp 685–693
Schabenberger O, Gotway CA (2017) Statistical methods for spatial data analysis. CRC Press
Shi W, Caballero J, Huszár F, Totz J, Aitken AP, Bishop R, Rueckert D, Wang Z (2016) Realtime single image and video superresolution using an efficient subpixel convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1874–1883
Soltani A, Pettit CJ, Heydari M, Aghaei F (2021) Housing price variations using spatiotemporal data mining techniques. J Housing Built Environ pp 1–29
Song L, Gretton A, Bickson D, Low Y, Guestrin C (2011) Kernel belief propagation. In: Gordon G, Dunson D, Dudík M (eds) Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, PMLR, Fort Lauderdale, FL, USA, Proceedings of Machine Learning Research, vol 15, pp 707–715, https://proceedings.mlr.press/v15/song11a.html
Tabios GQ III, Salas JD (1985) A comparative analysis of techniques for spatial interpolation of precipitation 1. JAWRA J Am Water Resour Assoc 21(3):365–380
Wilcoxon F (1992) Individual comparisons by ranking methods. In: Breakthroughs in statistics, Springer, pp 196–202
Wu C, Zhou M, Liu P, Yang M (2021) Analyzing covid19 using multisource data: An integrated approach of visualization, spatial regression, and machine learning. GeoHealth 5(8):e2021GH000439
Wu Y, Zhuang D, Labbe A, Sun L (2020) Inductive graph neural networks for spatiotemporal kriging. arXiv preprint arXiv:2006.07527
Yang K, Lf L (2017) Identification and qml estimation of multivariate and simultaneous equations spatial autoregressive models. J Econ 196(1):196–214
Zheng L, Mengshoel O, Chong J (2012) Belief propagation by message passing in junction trees: Computing each message faster using gpu parallelization. arXiv preprint arXiv:1202.3777
Zheng X, Lin X, Wu P (2020) Outdoor image restoration based on belief propagation algorithm and formalized mtf. In: Journal of Physics: Conference Series, IOP Publishing, vol 1651, p 012168
Acknowledgements
This publication is part of the project “Physicsaware Spatiotemporal Machine Learning for Earth Observation Data” (with project number OCENW.KLEIN.425) of the research programme Open Competition ENW which is partly financed by the Dutch Research Council (NWO). This research was also partially supported by TAILOR, a project funded by EU Horizon 2020 research and innovation programme under GA No 952215.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Responsible editor: Albrecht Zimmermann and Peggy Cellier.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Arp, L., Baratchi, M. & Hoos, H. VPint: value propagationbased spatial interpolation. Data Min Knowl Disc 36, 1647–1678 (2022). https://doi.org/10.1007/s10618022008432
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618022008432