Spatio-temporal prediction of missing temperature with stochastic Poisson equations

This paper presents our winning entry for the EVA 2019 data competition, the aim of which is to predict Red Sea surface temperature extremes over space and time. To achieve this, we used a stochastic partial differential equation (Poisson equation) based method, improved through a regularization to penalize large magnitudes of solutions. This approach is shown to be successful according to the competition’s evaluation criterion, i.e. a threshold-weighted continuous ranked probability score. Our stochastic Poisson equation and its boundary conditions resolve the data’s non-stationarity naturally and effectively. Meanwhile, our numerical method is computationally efficient at dealing with the data’s high dimensionality, without any parameter estimation. It demonstrates the usefulness of stochastic differential equations on spatio-temporal predictions, including the extremes of the process.


Introduction
The aim of EVA 2019 data competition is to predict spatio-temporal extremes of Red Sea surface temperature (Huser 2020). The dataset of the competition consists of daily sea surface temperature (SST) anomalies for the entire Red Sea from 1985 to 2015 whilst part of the data is masked. The main goal of this data competition is to predict the distribution of

X(s, t) = min (s,t)∈N (s,t)Â (s,t),
for space-time validation points (s, t), whereÂ(·, ·) is the predicted SST anomalies. The competition evaluates the performance via a threshold-weighted continuous ranked probability score (twCRPS). The main challenge is the data's high dimensionality (16703 locations and 11315 days) and strong non-stationarity in space and time (Section 3.3 of Huser (2020)). More details of the data competition, including the definition of the anomaly, plots of the data along with some exploratory analysis, as well as the competition's rules and goals, are available in the Editorial (Huser 2020).
To predict the extremes on the spatio-temporal domain, it is sufficient to predict the whole field of values on it. That is to say, we can predictÂ(·, ·) instead of directly predicting X(·, ·) in Eq. (1). Given the partial data ofÂ(·, t) for someday t, the predicted data in the unobserved region should comply with some continuity constraints. According to Fourier's analytical theory of heat, it is reasonable to assume a temperature field has a second order derivative, implying that the anomaly field has its Laplacian. We formulate the missing value prediction problem as a Poisson equation based model. The Poisson equations are defined on the two-dimensional spatial domain, not the spatio-temporal domain. For each time step (day), Poisson equations are solved to interpolate the unobserved regions. To generate the stochasticity in predictions, our Poisson equation is not deterministic but stochastic. The stochastic components (i.e. the Laplace fields) in our SPDEs (stochastic partial differential equations) are sampled from the observed data, assuming temporal stationarity, which allows bypassing explicit estimation of parameters defining a stochastic model. The method is further improved with a regularization term to penalize large magnitudes of solutions. Experimental results show that it is effective and efficient to use the regularized stochastic Poisson equation to fill in the missing data in this problem.
A lot of approaches have been proposed on spatio-temporal extremes in the field of extreme value theory (EVT). One category of approaches is to model block maxima using max-stable processes (Davison et al. 2012;Davis et al. 2013;Huser and Davison 2014). The other category is to model the exceedances of a threshold (Pickands 1975;Bacro et al. 2020) using extreme distributions such as generalized Pareto distribution (GPD). Both categories of models have been used on climate data (Chavez-Demoulin and Davison 2005;Reich and Shaby 2012). SPDE-based approaches have been made popular by the INLA method in spatial and spatiotemporal statistics (Krainski et al. 2018). In Lindgren et al. (2011), a type of SPDE similar to our Poisson equations is studied, which will be discussed in the last section of this paper. Comparing to EVT-based methods, our method is more efficient since we do not need any parameter estimation. The drawback is its performance will be inferior comparing to methods tailored to the prediction of the upper tail. Our approach is very similar to the Poisson image editing method (Pérez et al. 2003) from the image processing community, which blends multiple images seamlessly with Poisson equations.
The remainder of this paper is structured as follows. In Section 2, we give the details about the stochastic Poisson equation based model. Its numerical method and the regularized version are given in Section 3. In Section 4, we explain our scheme to generate predictions, and the cross-validation dataset we prepared to evaluate our approach, followed by some concluding discussions in Section 5.

Notation
The following notation is used: Italic lower-case letters refer to scalar values (x) or scalar-valued functions (f (·)). Lower-case bold letters denote vectors (x) or vector-valued functions (f(·)). Finally, upper-case bold letters represent matrices (X).

Method
The missing data prediction problem can be studied from the perspective of data interpolation. By considering a 1D curve y = f (x) with missing interval (x 0 , x 1 ), as shown in Fig. 1, the interpolation result should have at least C 0 continuity. It is straightforward to connect two endpoints of the missing interval with a straight line segment, on which the second derivative is zero. Thus finding such a line segment is equivalent to solving the differential equation (2) The one-dimensional case above can be easily generalized to 2D, i.e. assuming the second derivative on the missing region is zero. Given a function u(x, y), (x, y) ∈ S with masked region M ⊂ S, the unobserved region can be interpolated by solving the equation (3)

Fig. 1 One-dimensional data interpolation
Equations taking the form u = 0 are known as Laplace equations. An interpolation example is demonstrated in Fig. 2 (a-b). It is observed that the predicted region is over-smooth, lacking variations, compared to the known area, indicating that the right-hand side (RHS) in Eq. (3) should be some function f which is not zero. Thus the missing region can be predicted by solving the equation u = f , which is known as Poisson equation.
The physical meaning of Poisson equation u = f here is that, given the partial temperature data u of someday t ∈ T , the Laplacian of u (we call it Laplace field) in the unobserved region should follow some target field f . Immediately, it is necessary to model the target Laplace field f conditioning on the known data. We assume f is random since its exact value is unknown. Then the equation we are solving is actually a stochastic Poisson equation, instead of a deterministic one. Although we can construct some statistical model for random f , it is more straightforward to sample existing fields from known data directly, with the assumption that the Laplace field is independent and identically distributed (i.i.d.) in time for the sake of simplicity. That is to say, the Laplace fields of day t 1 , t 2 , . . . are independent and identically distributed realizations of the random function f . The stochastic Poisson equation is solved by solving a series of deterministic ones. In terms of solving one of the Poisson equations, we take a reference day t ∈ T as an example. Day t is picked as a reference day, then Laplace field of day t is a reference target Laplace field for day t. Therefore, missing data interpolation for day t can be accomplished by solving in which M t is the unobserved region of the day t. In general, this function does not always have a unique solution, but our numerical method gives a unique solution in our specific setting, which is described in Section 3. There are still missing data in u(x, y; t ) since u(x, y; t ) is incomplete for any t . These values in u(x, y; t ) will be simply filled with zeros. An example is shown in Fig. 2 (c-d).
As indicated in Huser (2020), the temperature anomalies are not stationary in space and time. The Poisson equation u = f can handle it effectively. In fact, the solution of Poisson equation can be decomposed as the sum of the solutions of the following two equations, In these two equations, the left one is the Laplace equation that imposes the continuity constraints, and the right one provides the randomness. The left one, which is only related to the boundary conditions, ensures the solution is adaptive to the known data of the current day, tackling the non-stationarity in time. The right one has RHS f = u(x, y; t ), making the Laplace field be the same as some other day in a location-wise sense. Thus it resolves the temperature anomalies' non-stationarity in space. In conclusion, the sum of them can handle the non-stationarity in both space and time.

Numerical solution of the Poisson equation
The numerical solution of Poisson equation on a rectangular domain with Dirichlet boundary condition can be found with the finite difference method (Knabner and Angerman 2003). The key idea is discretizing the domain as a grid and approximating the Laplace operator with a linear combination of neighbor grid nodes. The differential equation is then converted to a linear system, of which the left-hand side (LHS) matrix is named the Laplacian matrix. In this paper, Poisson equation Eq. (4) is solved with the finite difference method. First, the spatial domain is represented as a mesh grid . The Red Sea S has been discretized into S = 16703 grid cells in the dataset of this competition. The grid cells are represented with their row/column indices (i, j ), in which i ∈ {1, . . . , 359}, j ∈ {1, . . . , 233}. Each cell (i, j ) has at most four neighbors (i ± 1, j) and (i, j ± 1). Then the mesh grid is established after taking cells (i, j ) as mesh nodes and the neighboring relation as mesh edges.
Then, the Laplacian matrix is determined after defining the mesh grid . Noting that the mesh grid is equivalent with an undirected graph G with its nodes and edges, the Laplacian matrix L of graph G is defined according to Godsil and Royle (2001), as a generalization of the Laplacian matrix of a rectangular domain. The Poisson equation Eq. (4) is converted to the linear system in which matrix L has shape S × S, u is an S-dimensional column vector containing the temperature anomalies of all locations on the day t, and u is the temperature anomalies vector of some other day t . In this equation, the LHS Lu is the Laplace field of day t, and the RHS Lu is the reference Laplace field taken from day t . Both of them are computed with the Laplacian matrix L defined on the mesh domain. As discussed above, some elements in (Lu ) are not available since u is incomplete (containing values represented with NaNs). These invalid values in (Lu ) are filled with zeros. Finally, unknown values of u are determined by solving the equation. Without loss of generality, it can be assumed that vector u is split into two parts u 0 and u 1 , of which the first one u 0 contains all unobserved locations while the known values appear in the second part u 1 . This form can be obtained by multiplying permutation matrices on both sides of Eq. (6). Then L and u are partitioned accordingly, resulting in the linear system L 00 L 01 L 10 L 11 which implies that L 00 u 0 = L 00 u 0 + L 01 (u 1 − u 1 ).
( 8 ) This equation has a unique solution since L 00 is invertible, which will be proved in the next subsection. At last, Eq. (8) can be solved with LU decomposition of matrix L 00 .

An optimization perspective: least squares
A discrete approximation of this optimization problem is in which G is the gradient matrix. The gradient matrix is defined as transpose of graph G's incidence matrix, when an arbitrary orientation is given to G (Godsil and Royle 2001). Rows and columns of matrix G are the numbers of edges and nodes in graph G respectively. If G is written in the block form according to values' availability as before, i.e. G = G 0 G 1 , finding unknown values u 0 is to solve the least squares problem min in which h = G 0 u 0 − G 1 u 1 − u 1 . Invalid values (NaNs) in h are also filled with zeros.
The least squares problem (12) is solved with the pseudo inverse theory, i.e.
Noting that L 00 = G 0 G 0 since L = G G, solution of the least squares problem should be identical with that of Eq. (8) if there is no missing value on the reference day. When there is missing data, the only difference between them is that it is the invalid gradient values that are filled with zeros in Eq. (13), while it is the invalid Laplacian values that are replaced with zeros in Eq. (8).
To make L 00 = G 0 G 0 invertible, block G 0 in the gradient matrix is required to have linearly independent columns. This is true because each unknown node on graph G is connected with at least one known node by a path (Godsil and Royle 2001), which can be concluded from the fact that the spatial domain has only one connected component and all days have at least one known node.

Least squares with regularization
Solution of the least squares problem in Eq. (12) gives a prediction of the missing data that is smooth and vibrant (Fig. 3 (c)). However, it is suffering from numerical issues. Although the Laplacian matrix L 00 is invertible, its condition number is large from the viewpoint of computing, making the numerical solution instable. If the Poisson equation has a closed domain with Dirichlet boundary condition, the solution will be well bounded by known values. However, this is not our case.
Regularization is commonly used to stabilize least squares, based on which Eq. (12) becomes with λ ≥ 0. The regularized least squares problem takes the form of ridge regression (Hastie et al. 2009), of which the solution is given by Here, f is G 0 h, and I is the identity matrix. Please note that f can also take the value of RHS in Eq. (8) This minimization objective means the gradient field of the target day should be close to that of the reference day. At the same time, large values of u are penalized. It is necessary to determine the value of λ in the regularized problem. Observing the results with different λ values in Fig. 3 (c-f), we can see that larger λ generates flatter result. By considering the limit situation when λ → +∞, the predicted temperature anomalies will degenerate to all zeros (Fig. 3 (f)). In this case, there is a step between observed and unobserved regions ("boundary step"), which can be evaluated as follows where P is the set of all neighboring location pairs (p, q) where p is unobserved and q is observed. Unavailable data u 0 is uniquely determined once given λ ≥ 0, thus u 0 is a function of λ. So is the "boundary step" BS(λ; t). The "boundary step" is large when the regularization parameter λ is too large or too small. However, finding the optimal λ is not trivial. We pick the value of λ that minimizes the "boundary step" among all candidates in a finite set , i.e.
The candidate set will be given numerically in the next section. Please note that the best weight λ * is determined for each day t separately. We do not estimate a fixed value of λ for all days.

Inference
To predict the distribution of X(s, t) (see Eq. (1)) for space-time validation point (s, t), we will predict the whole temperature anomaly fieldÂ(s, t) for all days that fall in the ±3 days temporal neighborhoods of validations days first, then compute the spatio-temporal neighborhood minimums directly. And last, we use the empirical distribution of the minimums as the prediction.

Inference procedure
Given a space-time validation point (s 0 , t 0 ), the temperature anomaly field of that target day,Â(s, t 0 ), s ∈ S, is predicted first. To achieve this, a guidance day t is picked to provide a reference of Laplace or gradient field. The sampled day t should be as complete as possible, thus it is sampled from the period 1985-2006 with only 20% missing data. After that, linear systems are constructed and solved following the method presented in the previous section. All space-time validation points from the day t 0 are computed together from the same linear system, to avoid redundant computing, which saves time, energy, and helps fight climate change.
After predicting the field of the target day t 0 , we repeat the same procedure to predict all seven days in its temporal neighborhood, i.e. the days t 0 + i, −3 ≤ i ≤ 3. Hence seven reference days should be sampled. Instead of choosing them independently, it is better to sample seven successive days, i.e. the days t + i, −3 ≤ i ≤ 3. The advantage is that the temporal dependency of the temperature anomalies is implicitly maintained in the data of sampled seven days. Until now, allÂ(·, ·) values in N (s 0 , t 0 ) are obtained. Then we calculate the minimum to get one realization of X(s, t).
An empirical distribution of X(s, t) is predicted if the whole procedure above is performed repeatedly for t 1 , t 2 , . . . , t N . We repeated N times to generate N values of X(s, t) for each validation point (s, t), and used the empirical cumulative distribution function (ECDF) of them as the prediction of that validation point. It is noted that the LHS Laplacian matrix in Eq. (6-8) only depends on the data availability, i.e. the binary mask, of the target day; its temperature anomaly values and that of the reference day only affect the RHS of the linear systems. We take advantage of this to pre-factorize the Laplacian matrix with LU decomposition only once. Then all of those days sharing the same data availability can be solved directly by forward and backward substitutions.

Leave-one-out validation
Since the validation set of this competition is not available to the teams, a crossvalidation set is prepared during the competition. Data in the range of 1985-2006 is used for 22-fold cross-validation. For the i-th fold (i = 0, . . . , 21), data of the year 1985 + i is left out for validation while the other 21 years are used for training. For each day in the left-out year, some of the locations have already been unobserved in the original training set (gray regions in the left of Fig. 4) and more locations are masked out (gray regions in the right of Fig. 4) for cross-validation. For example, as demonstrated in Fig. 4, 1985 is left out and part of Jan 5's data is masked out (new gray regions in the right figure comparing to the left). Unmasked regions (colored regions in the right figure) are used as training data in the cross-validation stage. For some points in the gray region, they can be sampled as cross-validation points if their spatio-temporal neighborhood N (·, ·) is completely available in the original training set, which are shown as black dots in the right of Fig. 4. Ground truth X(·, ·) is available on these points. Therefore, we have generated the training set and validation points for cross-validation. Since the average temperature anomalies of the first 22 years are lower than that of the last 9 years (where the competition's final predictions should be made), as indicated in Huser (2020), we translated the data by the differences of empirical means between the last 9 years and our cross-validation set in order to match the means.
We evaluated different configurations of our method on the generated crossvalidation set, including Poisson equation Eq. At last, an ensemble is proposed, which pools all 25 resulting candidate u 0 s (one from the Poisson equation and 12 from each of the regularized methods) and picks the one with the least "boundary step" from them as Eq. (18). In all methods above, the sample size N is 1000. These methods are compared to the benchmark proposed in Huser (2020), as shown in Tab. 1. The ensemble gives the lowest error in terms of the average twCRPS on the cross-validation set, thus we used its result as the final submission for the competition.
After sampling N = 1000 times, the ECDF curves are already smooth enough. We have tried to fit the ECDFs with the generalized extreme value (GEV) distribution, while there is little difference according to the twCRPS.

Final submission
Our approach is implemented with Python to perform parallel computing with an Intel ® Core™ i7-9700K CPU on a desktop computer. In our final submission to the competition, we used the ensemble method with sample size N = 1000 and candidate weight set as described above. The total computational cost is less than 70 minutes.

Discussion
This approach is shown to be successful in the context of this competition, thanks to its advantages. First, the Laplacian matrix is a nice encoding of the spatial domain's geometry. It contains the links between locations and its neighbors, and helps to build the spatial dependence between locations. Second, the temporal dependence is considered implicitly via borrowing the Laplace fields from successive days, i.e. reference weeks, instead of independent reference days. Third, the non-stationarity over time and space is embedded in our model since the values in unobserved regions can be decomposed into two parts (Eq. (5)); one is associated only with the current day's information and the other considers the spatial variation. Last, this method is training-free. Training data are used directly as references without any time-consuming parameter estimation. In Lindgren et al. (2011), the SPDE (λ − ) α/2 u = f is studied, where α > 0, λ ≥ 0 and f corresponds to Gaussian white noise. First, it should be noted that the − in Lindgren et al. (2011) is actually the + in this paper. The latter notation is used here to make it compatible with the discretization of Laplace operator in Godsil and Royle (2001). Therefore, the above SPDE coincides with the (screened) Poisson equation in this work, when α = 2, λ ≥ 0. The difference is that our model is derived from an interpolation perspective. Our regularization term with λ is motivated by penalizing large magnitudes of solutions. At the same time, our model is parameterfree since the value of λ is determined according to the "boundary step" on each day. Also, we do not assume the RHS f corresponds to Gaussian white noise.
However, there are still some aspects that are neglected, which will be further studied to improve the current approach. For example, this approach is not tailored specifically to the evaluation criterion twCRPS. The performance might be further improved by putting the emphasis on the upper tail, according to the weight function in twCRPS. Also, reference weeks are just sampled randomly, which means some of the sampled reference fields might not be compatible with the known data. It will be helpful to study which Laplace field is more likely to present subject to the known data.
Thanks to the efficiency of our method, we were able to simulate a large number of temperature anomaly fields and compute local minimums directly. That is why extreme value theory methods were not used in our competition contribution. In the future, it will be studied to improve this model with EVT methods. For example, the sample size N can be largely reduced if EVT methods are used to model the distributions of reference Laplace or gradient fields. This model can also be combined with the EVT method to further reduce its error on large temperature anomalies.