1 Introduction

Thanks to the development of sensors, GPS devices, and satellite systems, a wide variety of spatial data are being accumulated, including climate (Stralberg et al. 2015; Wang et al. 2016b), traffic (Zheng et al. 2014; Yuan et al. 2011), economic, and social data (Haining 1993; Shadbolt et al. 2012). Analyzing such spatial data is critical in various fields, such as environmental sciences (Jerrett et al. 2005; Hession and Moore 2011), urban planning (Yuan et al. 2012), socio-economics (Smith-Clarke et al. 2014; Rupasingha and Goetz 2007), and public security (Bogomolov et al. 2014; Wang et al. 2016a).

Collecting data on some attributes is difficult if the attribute-specific sensing devices are very expensive, or if experts with extensive domain knowledge are required to observe the data. Also, collecting data in some regions is difficult if they are not readily accessible. To counter these problems, many spatial regression methods have been proposed (Gao et al. 2006a, b; Ward and Gleditsch 2018); they predict missing attribute values given data observed at some locations in the region. Although Gaussian processes (GPs) (Rasmussen and Williams 2006; Banerjee et al. 2008) have been successfully used for spatial regression, they fail when the data observed in the target region are insufficient.

In this paper, we propose a few-shot learning method for spatial regression. Our model learns from spatial datasets on various attributes in various regions, and predicts values when the observed data in the target task are scant, where both the attribute and region of the target task differ from those in the training datasets. Figure 1 illustrates the framework of the proposed method. Some attributes in some regions are expected to exhibit spatial patterns similar to those of the target task. Our model uses the knowledge learned from such attributes and regions in the training datasets to realize prediction in the target task.

Fig. 1

Our framework. In a training phase, our model learns from training datasets containing various attributes from various regions. In a test phase, our model predicts spatial values of a target attribute in a target region given a few observations; the target attribute and region are not present in the training datasets

Our model uses a neural network to embed a few labeled data into a task representation. Then, target spatial data are predicted based on a GP with neural network-based mean and kernel functions that depend on the inferred task representation. We call our model the neural embedding-based Gaussian processes. Using the task representation yields a task-specific prediction function. By basing the modeling on GPs, the prediction function can be rapidly adapted to small labeled data in a closed form without iterative optimization, which enables efficient back-propagation through the adaptation. As the mean and kernel functions employ neural networks, we can flexibly model spatial patterns in various attributes and regions. By sharing the neural networks across different tasks in our model, we can learn from multiple attributes and regions, and use the learned knowledge to handle new attributes and regions. The neural network parameters are estimated by maximizing the expected prediction performance when a few observed data are given, which is calculated using training datasets by an episodic training framework (Ravi and Larochelle 2017; Santoro et al. 2016; Snell et al. 2017; Finn et al. 2017; Li et al. 2019).

The main contributions of this paper are as follows:

  1. We present a framework of few-shot learning for spatial regression.

  2. We propose a GP-based model that uses neural networks to learn spatial patterns from various attributes and regions.

  3. We empirically demonstrate that the proposed method performs well in few-shot spatial regression tasks.

The remainder of this paper is organized as follows. Section 2 briefly reviews related work. In Sect. 3, we define our task, propose a few-shot learning model for spatial regression based on neural embedding-based Gaussian processes, and develop its training procedure. Section 4 experimentally demonstrates the effectiveness of the proposed method using climate data. Finally, we present concluding remarks and discuss future work in Sect. 5.

2 Related work

GPs, or kriging (Cressie 1990), have been widely used for spatial regression (Banerjee et al. 2008; Luttinen and Ilin 2009; Park et al. 2011; Stein 2012; Gu and Hu 2012). They achieve high prediction performance at locations that are close to the observed locations. However, if the target region is large and only a few observed data are given, performance falls at locations far from the observed locations. To improve generalization performance, neural networks have been used for the mean and/or kernel functions of GPs (Wilson et al. 2011; Huang et al. 2015; Calandra et al. 2016; Wilson et al. 2016a, b; Iwata and Ghahramani 2017; Iwata and Otsuka 2019; Jean et al. 2018). However, these methods require large amounts of training data.

Many few-shot learning, or meta-learning, methods have been proposed (Schmidhuber 1987; Bengio et al. 1991; Ravi and Larochelle 2017; Andrychowicz et al. 2016; Vinyals et al. 2016; Snell et al. 2017; Bartunov and Vetrov 2018; Finn et al. 2017; Li et al. 2017; Kim et al. 2018; Finn et al. 2018; Rusu et al. 2019; Yao et al. 2019; Edwards and Storkey 2016; Garnelo et al. 2018a; Kim et al. 2019; Hewitt et al. 2018; Bornschein et al. 2017; Reed et al. 2017; Rezende et al. 2016). Since our task is regression, few-shot classification methods, such as matching networks (Vinyals et al. 2016) and prototypical networks (Snell et al. 2017), are not applicable. Existing few-shot learning methods that can handle regression tasks, such as model-agnostic meta-learning (Finn et al. 2017) and conditional neural processes (Garnelo et al. 2018a), are applicable to our task. However, they are not intended for spatial regression. In contrast, our model is based on GPs, which have been successfully used for spatial regression. Some few-shot learning methods based on GPs have been proposed (Harrison et al. 2018; Tossou et al. 2019; Fortuin et al. 2019). Adaptive learning for probabilistic connectionist architectures (ALPaCA) (Harrison et al. 2018) and adaptive deep kernel learning (Tossou et al. 2019) incorporate the information in small labeled data in their kernel functions using neural networks, but they assume zero mean functions. Although the meta-learning mean functions of Fortuin et al. (2019) use a neural network for the mean function, the mean function does not change its outputs depending on the given small labeled data. In contrast, the proposed method uses a neural network-based mean function that outputs task-specific values by extracting a task representation from the small labeled data. The effectiveness of our mean function is shown in the ablation study in our experiments. Task-similarity aware nonparametric meta-learning (TANML) (Venkitaraman and Wahlberg 2020) is related to the proposed method since both are meta-learning methods that use kernels. TANML uses kernels for calculating the similarity between tasks, whereas the proposed method uses kernels for calculating the covariance between locations within each task.

Our model is related to conditional neural processes (NPs) (Garnelo et al. 2018a, 2018b) as both use neural networks for task representation inference and for prediction with inferred task representations. However, since NP prediction is based on fully parametric models, NPs are less flexible in adapting to the given target observations than GPs, which are nonparametric models. In contrast, our GP-based model enjoys the benefit of the nonparametric approach, namely swift adaptation to the target observations, even though the mean and kernel functions are modeled parametrically. Our model is also related to similarity-based meta-learning methods, such as matching networks (Vinyals et al. 2016) and prototypical networks (Snell et al. 2017), since the kernel function represents similarities between data points. Whereas existing similarity-based meta-learning methods were designed for classification tasks, our model is designed for regression tasks.

The proposed method is also related to model-agnostic meta-learning (MAML) (Finn et al. 2017) in the sense that both methods train models so that the expected error on unseen data is minimized when adapted to a few observed data. For the adaptation, MAML requires costly back-propagation through iterative gradient descent steps. In contrast, the proposed method achieves an efficient adaptation in a closed form using a GP framework. Ridge regression differentiable discriminator (R2D2) (Bertinetto et al. 2018) is a neural network-based meta-learning method, where the last layer is adapted by solving a ridge regression problem in a closed form. Although R2D2 and the method of Lee et al. (2019) adapt with a linear model, the proposed method adapts with a nonlinear GP model, which enables us to adapt to complicated patterns more flexibly.

Adaptively initialized task optimizer (AVIATOR) (Ye et al. 2020) and multimodal MAML (MMAML) (Vuorio et al. 2019) extended MAML by modifying models using task representations. In particular, AVIATOR uses task representations for generating initial model parameters, and MMAML uses task representations for generating parameters that modulate models. The proposed method is related to them in that the task representation is used to define a model, then the model is adapted by minimizing a loss on the support set.

Transfer learning methods, such as multi-task GPs (Yu et al. 2005; Bonilla et al. 2008; Wei et al. 2017) and co-kriging (Myers 1982; Stein and Corsten 1991), have been proposed; they transfer knowledge derived from source tasks to target tasks. However, they do not assume that only a few observations are available in target tasks. In addition, since these methods use target data to learn the relationship between source and target tasks, they require computationally costly re-training given new tasks that are not present in the training phase. In contrast, the proposed method can be applied to unseen tasks by inferring task representations from a few observations without re-training.

3 Proposed method

3.1 Task

Fig. 2

Our model. Each pair of location vector \({\mathbf {x}}_{n}\) and attribute value \(y_{n}\) in a small labeled dataset (support set) is fed to neural network \(f_{\mathrm {z}}\). By averaging the outputs of the neural network, we obtain task representation \({\mathbf {z}}\). The task representation and query location vector \({\mathbf {x}}\) are fed to neural networks \(f_{\mathrm {b}}\), \(f_{\mathrm {k}}\), and \(f_{\mathrm {m}}\) to calculate kernel k and mean m. Attribute value \({\hat{y}}\) of the query location vector is predicted by using the kernel and mean based on a GP. Shaded nodes represent observed data

In a training phase, we are given spatial datasets for \(|{\mathcal {R}}|\) regions, \({\mathcal {D}}=\{{\mathcal {D}}_{r}\}_{r\in {\mathcal {R}}}\), where \({\mathcal {R}}\) is the set of regions, and \({\mathcal {D}}_{r}\) is the dataset for region r. For each region, there are \(|{\mathcal {C}}_{r}|\) attributes, \({\mathcal {D}}_{r}=\{{\mathcal {D}}_{rc}\}_{c\in {\mathcal {C}}_{r}}\), where \({\mathcal {C}}_{r}\) is the set of attributes in region r, and \({\mathcal {D}}_{rc}\) is the dataset of attribute c in region r. The attribute sets can be different across the regions. Each dataset consists of a set of location vectors and attribute values, \({\mathcal {D}}_{rc}=\{({\mathbf {x}}_{rcn},y_{rcn})\}_{n=1}^{N_{rc}}\), where \({\mathbf {x}}_{rcn}\in \mathbb {R}^{2}\) is a two-dimensional vector specifying the location of the nth point, e.g., longitude and latitude, and \(y_{rcn}\in \mathbb {R}\) is the scalar value on attribute c at that location.

In a test phase, we are given a few labeled observations in a target region, \({\mathcal {D}}_{r^{*}c^{*}}=\{({\mathbf {x}}_{r^{*}c^{*}n},y_{r^{*}c^{*}n})\}_{n=1}^{N_{r^{*}c^{*}}}\), where target region \(r^{*}\) is not one of the regions in the training datasets, \(r^{*}\notin {\mathcal {R}}\), and target attribute \(c^{*}\) is not contained in the training datasets, \(c^{*}\notin {\mathcal {C}}_{r}\) for all \(r\in {\mathcal {R}}\). Our task is to predict target attribute value \({\hat{y}}_{r^{*}c^{*}}\) at location \({\mathbf {x}}_{r^{*}c^{*}}\) in the target region.

Location vector \({\mathbf {x}}_{rcn}\) represents the relative position of the point in region r. In our experiments, we used longitudes and latitudes normalized to zero mean as the location vectors. Spatial data sometimes include auxiliary information such as elevation. In that case, we can include the auxiliary information in \({\mathbf {x}}_{rcn}\in \mathbb {R}^{M+2}\), where M is the number of types of auxiliary information.
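The following is a minimal sketch of how such location vectors might be constructed; the function name and the centering-only normalization are our own illustration (the experiments in Sect. 4 additionally scale to unit standard deviation).

```python
import numpy as np

def make_location_vectors(lon, lat, aux=None):
    """Build relative location vectors for one region: longitude/latitude centered
    to zero mean, optionally concatenated with auxiliary features such as elevation."""
    X = np.stack([lon - lon.mean(), lat - lat.mean()], axis=-1)   # shape (N, 2)
    if aux is not None:
        X = np.concatenate([X, aux], axis=-1)                     # shape (N, 2 + M)
    return X
```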

3.2 Preliminaries: Gaussian processes

Before introducing the proposed model, we review GP regression, which forms the basis of the proposed model. In GP regression, a GP is used as the prior of a nonlinear function,

$$\begin{aligned} g({\mathbf {x}})\sim \mathcal {GP}(m({\mathbf {x}}),k({\mathbf {x}},{\mathbf {x}}')), \end{aligned}$$
(1)

where \(m({\mathbf {x}})\) is a mean function, and \(k({\mathbf {x}},{\mathbf {x}}')\) is a kernel function. Let \({\mathbf {y}}=(y_{n})_{n=1}^{N}\) be the N-dimensional vector of the attribute values, and \({\mathbf {X}}=({\mathbf {x}}_{n})_{n=1}^{N}\) be the matrix whose rows are the location vectors. The joint distribution of \({\mathbf {y}}\) given \({\mathbf {X}}\) with GP regression follows a Gaussian distribution,

$$\begin{aligned} p({\mathbf {y}}|{\mathbf {X}})={\mathcal {N}}({\mathbf {m}},{\mathbf {K}}), \end{aligned}$$
(2)

where \({\mathbf {m}}=(m({\mathbf {x}}_{n}))_{n=1}^{N}\) is the N-dimensional vector with the values of mean function \(m(\cdot )\) at location vectors \({\mathbf {X}}\), \({\mathbf {K}}\) is the \(N\times N\) matrix of the kernel function evaluated between location vectors \({\mathbf {X}}\), and \({\mathbf {K}}_{nn'}=k({\mathbf {x}}_{n},{\mathbf {x}}_{n'})\). The predictive distribution at test point \({\mathbf {x}}\) given \({\mathbf {X}}\) and \({\mathbf {y}}\) as the training data is,

$$\begin{aligned} p(y|{\mathbf {x}},{\mathbf {y}},{\mathbf {X}})={\mathcal {N}}(m({\mathbf {x}}) +{\mathbf {k}}^{\top }{\mathbf {K}}^{-1}({\mathbf {y}}-{\mathbf {m}}),k({\mathbf {x}},{\mathbf {x}})-{\mathbf {k}}^{\top }{\mathbf {K}}^{-1}{\mathbf {k}}), \end{aligned}$$
(3)

where \({\mathbf {k}}\) is the N-dimensional vector of the kernel function between the test point and training points, \({\mathbf {k}}=(k({\mathbf {x}},{\mathbf {x}}_{n}))_{n=1}^{N}\).
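As a concrete illustration, the following is a minimal NumPy sketch of the predictive distribution in Eq. (3) for a single test point; the Gaussian kernel, the explicit noise term added to the kernel matrix, and all function names are our own assumptions rather than part of the model described later.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Gaussian (RBF) kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * length_scale ** 2))

def gp_predict(x_query, X, y, mean_fn, noise=1e-2):
    """Predictive mean and variance of Eq. (3) for one test location x_query."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))      # N x N kernel matrix on training points
    k = rbf_kernel(x_query[None, :], X)[0]             # kernel vector between test and training points
    m = np.array([mean_fn(x) for x in X])              # mean function at training points
    alpha = np.linalg.solve(K, y - m)
    mean = mean_fn(x_query) + k @ alpha
    var = rbf_kernel(x_query[None, :], x_query[None, :])[0, 0] - k @ np.linalg.solve(K, k)
    return mean, var
```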

3.3 Model

Let \({\mathcal {S}}=\{({\mathbf {x}}_{n},y_{n})\}_{n=1}^{N}\) be a small set of labeled observations, which is called the support set. We here present our neural embedding-based Gaussian processes for predicting scalar attribute value \({\hat{y}}\) at location vector \({\mathbf {x}}\), which is called the query, given support set \({\mathcal {S}}\). Our model is used for training as described in Sect. 3.4 as well as for target spatial regression in a test phase. Figure 2 illustrates our model. Our model infers task representation \({\mathbf {z}}\) from support set \({\mathcal {S}}\) as described in Sect. 3.3.1. Then, using the inferred task representation \({\mathbf {z}}\), we predict scalar attribute value \({\hat{y}}\) at location vector \({\mathbf {x}}\) by a neural network-based GP as described in Sect. 3.3.2. We omit the indices for regions and attributes in this subsection for simplicity.

3.3.1 Inferring task representation

First, each pair of the location vector and attribute value, (\({\mathbf {x}}_{n},y_{n}\)), in the support set is converted into K-dimensional latent vector \({\mathbf {z}}_{n}\in \mathbb {R}^{K}\) by a neural network: \({\mathbf {z}}_{n}=f_{\mathrm {z}}([{\mathbf {x}}_{n},y_{n}])\), where \(f_{\mathrm {z}}\) is a feed-forward neural network with an \((M+3)\)-dimensional input layer and a K-dimensional output layer, and \([\cdot ,\cdot ]\) represents concatenation. Second, the set of latent vectors \(\{{\mathbf {z}}_{n}\}_{n=1}^{N}\) of the support set is aggregated into K-dimensional latent vector \({\mathbf {z}}\in \mathbb {R}^{K}\) by averaging: \({\mathbf {z}}=\frac{1}{N}\sum _{n=1}^{N}{\mathbf {z}}_{n}\), which is a representation of the task extracted from support set \({\mathcal {S}}\). We can use other aggregation functions, such as summation (Zaheer et al. 2017), attention (Kim et al. 2019), and recurrent neural networks (Vinyals et al. 2016).
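A minimal PyTorch sketch of this task-representation inference might look as follows; the layer sizes follow the settings in Sect. 4.2, but the class and variable names are our own.

```python
import torch
import torch.nn as nn

class TaskEncoder(nn.Module):
    """f_z: embeds each (location, value) pair and averages over the support set."""
    def __init__(self, input_dim=3, hidden_dim=256, latent_dim=256):
        super().__init__()
        self.f_z = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, x_support, y_support):
        # x_support: (N, M + 2), y_support: (N, 1)
        pairs = torch.cat([x_support, y_support], dim=-1)   # (N, M + 3) concatenated pairs
        z_n = self.f_z(pairs)                                # per-pair latent vectors, (N, K)
        return z_n.mean(dim=0)                               # task representation z, shape (K,)
```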

3.3.2 Predicting attribute values

Our prediction function assumes a GP with neural network-based mean and kernel functions that depend on the inferred task representation \({\mathbf {z}}\). In particular, the mean function is modeled by

$$\begin{aligned} m({\mathbf {x}};{\mathbf {z}}) = f_{\mathrm {m}}([{\mathbf {x}},{\mathbf {z}}]), \end{aligned}$$
(4)

where \(f_{\mathrm {m}}\) is a feed-forward neural network that outputs a scalar value. The kernel function is modeled by

$$\begin{aligned} k({\mathbf {x}},{\mathbf {x}}';{\mathbf {z}})&= \exp \left( -\parallel f_{\mathrm {k}}([{\mathbf {x}},{\mathbf {z}}])-f_{\mathrm {k}}([{\mathbf {x}}',{\mathbf {z}}])\parallel ^{2} \right) +f_{\mathrm {b}}({\mathbf {z}})\delta ({\mathbf {x}},{\mathbf {x}}'), \end{aligned}$$
(5)

where \(f_{\mathrm {k}}\) is a feed-forward neural network, \(f_{\mathrm {b}}\) is a feed-forward neural network that outputs a positive scalar value, \(\delta (\cdot ,\cdot )\) is the Kronecker delta, \(\delta ({\mathbf {x}},{\mathbf {x}}')=1\) if \({\mathbf {x}}\) and \({\mathbf {x}}'\) are identical, and zero otherwise. The kernel function is positive definite since it is a Gaussian kernel and \(f_{\mathrm {b}}({\mathbf {z}})\) is positive. By incorporating task representation \({\mathbf {z}}\) in the mean and kernel functions using neural networks, we can model nonlinear functions that depend on the support set.

In GPs, zero mean functions are often used since GPs with zero mean functions can approximate an arbitrary continuous function, given enough data (Micchelli et al. 2006). However, GPs with zero mean functions predict zero in areas far from observed data points (Iwata and Ghahramani 2017), which is problematic in few-shot learning. Modeling the mean function by a neural network (4) allows us to predict values effectively even in areas far from observed data points in a target region, owing to the high generalization performance of neural networks.

Location vector \({\mathbf {x}}\) is transformed by neural network \(f_{\mathrm {k}}\) before computing the kernel function by the Gaussian kernel in (5). The use of the neural network yields flexible modeling of the correlation across locations depending on the task representation. The noise parameter is also modeled by neural network \(f_{\mathrm {b}}\), which enables us to infer the noise level from the support set without re-training.
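A minimal sketch of the task-conditioned mean, feature, and noise networks in Eqs. (4) and (5) is shown below; the class name, the use of a softplus to keep \(f_{\mathrm {b}}\) positive, and the layer sizes (following Sect. 4.2) are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskConditionedGP(nn.Module):
    """Mean m(x; z), feature map f_k([x, z]), and noise f_b(z) of Eqs. (4)-(5)."""
    def __init__(self, x_dim=2, z_dim=256, hidden_dim=256, feat_dim=256):
        super().__init__()
        def mlp(in_dim, out_dim):
            return nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, out_dim))
        self.f_m = mlp(x_dim + z_dim, 1)
        self.f_k = mlp(x_dim + z_dim, feat_dim)
        self.f_b = mlp(z_dim, 1)

    def mean(self, X, z):
        """Eq. (4): task-dependent mean function evaluated at each row of X."""
        return self.f_m(torch.cat([X, z.expand(X.size(0), -1)], dim=-1)).squeeze(-1)

    def kernel(self, X1, X2, z, same=False):
        """Eq. (5): Gaussian kernel on neural features; same=True adds the noise term."""
        h1 = self.f_k(torch.cat([X1, z.expand(X1.size(0), -1)], dim=-1))
        h2 = self.f_k(torch.cat([X2, z.expand(X2.size(0), -1)], dim=-1))
        K = torch.exp(-torch.cdist(h1, h2) ** 2)
        if same:  # Kronecker delta term; softplus keeps f_b(z) positive (our assumption)
            K = K + F.softplus(self.f_b(z)) * torch.eye(X1.size(0))
        return K
```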

The predicted value for query \({\mathbf {x}}\) is given by

$$\begin{aligned} {\hat{y}}({\mathbf {x}},{\mathcal {S}};{\varvec{\varPhi }})=f_{\mathrm {m}}([{\mathbf {x}},{\mathbf {z}}])+{\mathbf {k}}^{\top }{\mathbf {K}}^{-1}({\mathbf {y}}-{\mathbf {m}}), \end{aligned}$$
(6)

where \({\mathbf {K}}\) is the \(N\times N\) matrix of the kernel function evaluated between location vectors in the support set, \({\mathbf {K}}_{nn'}=k({\mathbf {x}}_{n},{\mathbf {x}}_{n'})\), \({\mathbf {k}}\) is the N-dimensional vector of the kernel function between the query and support set, \({\mathbf {k}}=(k({\mathbf {x}},{\mathbf {x}}_{n}))_{n=1}^{N}\), \({\mathbf {y}}\) is the N-dimensional vector of attribute values in the support set, \({\mathbf {y}}=(y_{n})_{n=1}^{N}\), \({\mathbf {m}}\) is the N-dimensional vector of the mean function evaluated on locations in the support set, \({\mathbf {m}} = (f_{\mathrm {m}}([{\mathbf {x}}_{n},{\mathbf {z}}]))_{n=1}^{N}\), and \({\varvec{\varPhi }}\) are the parameters of neural networks \(f_{\mathrm {z}}\), \(f_{\mathrm {m}}\), \(f_{\mathrm {k}}\), and \(f_{\mathrm {b}}\). An advantage of our model is that the predicted value given the support set is analytically calculated without iterative optimization, by which we can minimize the expected prediction error efficiently based on gradient-descent methods.

When noise \(f_{\mathrm {b}}({\mathbf {z}})\) is small, the predicted value approaches the observed values at locations close to the observed locations. This property of GPs is beneficial for few-shot regression without re-training. If a neural network without GPs is used for the prediction function, the predicted values might differ from the observations even at the observed locations when re-training based on the observations is not conducted. The first term in (6) is similar to conditional neural processes, where a neural network is used for the prediction function. The second term in (6) is related to similarity-based meta-learning methods since the second term uses the similarities between the query and support set that are calculated by the kernel function. Therefore, our model can be seen as an extension of the conditional neural process and similarity-based meta-learning approach, where both of them are naturally integrated within a GP framework. When \({\mathbf {x}}\) is far from (close to) the observed locations, the first (second) term becomes dominant due to kernel \({\mathbf {k}}\) (Iwata and Ghahramani 2017). This is reasonable since similarity-based approaches are more reliable when there are observations nearby. The variance of the predicted attribute value of the query is given by

$$\begin{aligned} \mathbb {V}[y|{\mathbf {x}},{\mathcal {S}};{\varvec{\varPhi }}] = k({\mathbf {x}},{\mathbf {x}};{\mathbf {z}})-{\mathbf {k}}^{\top }{\mathbf {K}}^{-1}{\mathbf {k}}. \end{aligned}$$
(7)
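Under the sketches above, the closed-form predictions of Eqs. (6) and (7) for a single query might be computed as follows; this is again a sketch with our own function names, not the authors' implementation.

```python
def predict(model, encoder, x_query, X_s, y_s):
    """Predictive mean (Eq. 6) and variance (Eq. 7) for one query location."""
    z = encoder(X_s, y_s.unsqueeze(-1))                          # task representation from the support set
    m_s = model.mean(X_s, z)                                     # mean function at support locations
    K = model.kernel(X_s, X_s, z, same=True)                     # (N_S, N_S), includes noise term
    k = model.kernel(x_query.unsqueeze(0), X_s, z)[0]            # (N_S,), query-support kernel vector
    alpha = torch.linalg.solve(K, (y_s - m_s).unsqueeze(-1)).squeeze(-1)
    mean = model.mean(x_query.unsqueeze(0), z)[0] + k @ alpha
    var = model.kernel(x_query.unsqueeze(0), x_query.unsqueeze(0), z, same=True)[0, 0] \
          - k @ torch.linalg.solve(K, k.unsqueeze(-1)).squeeze(-1)
    return mean, var
```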

3.4 Learning

We estimate neural network parameters \({\varvec{\varPhi }}\) by minimizing the expected prediction error on a query set given a support set using an episodic training framework (Ravi and Larochelle 2017; Santoro et al. 2016; Snell et al. 2017; Finn et al. 2017; Li et al. 2019). Although training datasets \({\mathcal {D}}\) contain many observations, they should be used in a way that closely simulates the test phase. Therefore, with the episodic training framework, support and query sets are generated by a random subset of training datasets \({\mathcal {D}}\) for each training iteration. In particular, we use the following objective function:

$$\begin{aligned} \hat{{\varvec{\varPhi }}}= \arg \min _{{\varvec{\varPhi }}} \mathbb {E}_{r\sim {\mathcal {R}}}\left[ \mathbb {E}_{c\sim {\mathcal {C}}_{r}}\left[ \mathbb {E}_{({\mathcal {S}},{\mathcal {Q}})\sim {\mathcal {D}}_{rc}}\left[ L({\mathcal {S}},{\mathcal {Q}};{\varvec{\varPhi }})\right] \right] \right] , \end{aligned}$$
(8)

where \(\mathbb {E}\) represents an expectation,

$$\begin{aligned} L({\mathcal {S}},{\mathcal {Q}};{\varvec{\varPhi }}) = \frac{1}{N_{\mathrm {Q}}} \sum _{({\mathbf {x}},y)\in {\mathcal {Q}}} \parallel {\hat{y}}({\mathbf {x}},{\mathcal {S}};{\varvec{\varPhi }})-y \parallel ^{2}, \end{aligned}$$
(9)

is the mean squared error on query set \({\mathcal {Q}}\) given support set \({\mathcal {S}}\), and \(N_{\mathrm {Q}}\) is the number of instances in the query set. Usually, GPs are trained by maximizing the marginal likelihood of training data (support set), where test data (query set) are not used. On the other hand, the proposed method minimizes the prediction error on a query set when a support set is observed, by which we can simulate a test phase and learn a model that improves the prediction performance on target tasks. When we want to improve the predictive density for each test location, we can use the following negative predictive log likelihood:

$$\begin{aligned} L({\mathcal {S}},{\mathcal {Q}};{\varvec{\varPhi }}) =-\frac{1}{N_{\mathrm {Q}}} \sum _{({\mathbf {x}},y)\in {\mathcal {Q}}} \log {\mathcal {N}}(y|{\hat{y}}({\mathbf {x}},{\mathcal {S}};{\varvec{\varPhi }}),\mathbb {V}[y|{\mathbf {x}},{\mathcal {S}};{\varvec{\varPhi }}]), \end{aligned}$$
(10)

instead of the mean squared error (9). This is related to training GPs with the log pseudo-likelihood (Rasmussen and Williams 2006), where the leave-one-out predictive log likelihood is used as the objective function. When we want to improve the predictive joint density for a set of test locations, we can use the following negative predictive log joint likelihood:

$$\begin{aligned} L({\mathcal {S}},{\mathcal {Q}};{\varvec{\varPhi }}) =-\log {\mathcal {N}}({\mathbf {y}}_{\mathrm {Q}}|\hat{{\mathbf {y}}}({\mathbf {X}},{\mathcal {S}};{\varvec{\varPhi }}),\mathbb {V}[{\mathbf {y}}_{\mathrm {Q}}|{\mathbf {X}},{\mathcal {S}};{\varvec{\varPhi }}]), \end{aligned}$$
(11)

where \({\mathbf {y}}_{\mathrm {Q}}\) is the \(N_{\mathrm {Q}}\)-dimensional vector of attribute values in the query set, \(\hat{{\mathbf {y}}}({\mathbf {X}},{\mathcal {S}};{\varvec{\varPhi }})=({\hat{y}}({\mathbf {x}},{\mathcal {S}};{\varvec{\varPhi }}))_{{\mathbf {x}}\in {\mathcal {Q}}}\) is the \(N_{\mathrm {Q}}\)-dimensional vector of predicted attribute values for the query set given by Eq. (6), \(\mathbb {V}[{\mathbf {y}}_{\mathrm {Q}}|{\mathbf {X}},{\mathcal {S}};{\varvec{\varPhi }}]={\mathbf {K}}_{\mathrm {QQ}}-{\mathbf {K}}_{\mathrm {SQ}}^{\top }{\mathbf {K}}^{-1}{\mathbf {K}}_{\mathrm {SQ}}\in \mathbb {R}^{N_{\mathrm {Q}}\times N_{\mathrm {Q}}}\) is the covariance of the query set, \({\mathbf {K}}_{\mathrm {QQ}}\in \mathbb {R}^{N_{\mathrm {Q}}\times N_{\mathrm {Q}}}\) is the kernel matrix evaluated on the query set by Eq. (5), and \({\mathbf {K}}_{\mathrm {SQ}}\in \mathbb {R}^{N_{\mathrm {S}}\times N_{\mathrm {Q}}}\) is the kernel matrix evaluated between the support and query sets. The predictive likelihood has been used for a meta-learning method (Chen et al. 2020) instead of the marginal likelihood.
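For illustration, a minimal sketch of the per-point negative predictive log likelihood in Eq. (10), built on the predict() sketch above, might be written as follows; the function name and the small variance floor are our own.

```python
def nll_loss(model, encoder, X_q, y_q, X_s, y_s):
    """Negative predictive log likelihood of Eq. (10), averaged over the query set."""
    nll = 0.0
    for x, y in zip(X_q, y_q):
        mean, var = predict(model, encoder, x, X_s, y_s)
        dist = torch.distributions.Normal(mean, var.clamp_min(1e-6).sqrt())
        nll = nll - dist.log_prob(y)
    return nll / len(X_q)
```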

The training procedure of our model is shown in Algorithm 1. In each iteration, we randomly generate support and query sets (Lines 2-5) from dataset \({\mathcal {D}}_{rc}\) by randomly selecting region r and attribute c, which simulates a test phase. Given the support and query sets so generated, we calculate the loss (Line 6). We update the model parameters by a stochastic gradient-descent method, such as Adam (Kingma and Ba 2015) (Line 7); a minimal sketch of this loop is shown below. By training the model using randomly generated support and query sets, the trained model can predict values for a wide variety of observed location distributions, attributes, and regions in a test phase.
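The following sketch of the episodic training loop uses the predict() sketch above and the mean squared error of Eq. (9); the data layout (a nested dict mapping regions and attributes to location/value tensors) and all names are our own assumptions.

```python
import random
import torch

def train(model, encoder, datasets, n_iters=100000, N_s=5, N_q=64, lr=1e-3):
    """Episodic training (Algorithm 1): sample a task, split into support/query, minimize Eq. (9)."""
    params = list(model.parameters()) + list(encoder.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(n_iters):
        r = random.choice(list(datasets))                 # sample a region
        c = random.choice(list(datasets[r]))              # sample an attribute
        X, y = datasets[r][c]                             # locations (N_rc, 2+M) and values (N_rc,)
        idx = torch.randperm(len(X))
        S, Q = idx[:N_s], idx[N_s:N_s + N_q]              # support and query indices
        preds = torch.stack([predict(model, encoder, X[q], X[S], y[S])[0] for q in Q])
        loss = ((preds - y[Q]) ** 2).mean()               # mean squared error on the query set, Eq. (9)
        opt.zero_grad(); loss.backward(); opt.step()
```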

The computational complexity for evaluating losses (9) and (10) is \(O(N_{\mathrm {Q}}+N_{\mathrm {S}}^{3})\), where \(N_{\mathrm {S}}\) is the number of instances in the support set, since we need the inverse of the kernel matrix of size \(N_{\mathrm {S}}\times N_{\mathrm {S}}\). In few-shot learning, the number of target observed data is very small, and so a very small support size \(N_{\mathrm {S}}\) is used in training. Therefore, our model can be optimized efficiently with the episodic training framework. This is in contrast to the high computational complexity of training standard GP regression, which is cubic in the number of training instances. The computational complexity for evaluating loss (11) is \(O(N_{\mathrm {Q}}^{3}+N_{\mathrm {S}}^{3})\) since we need the inverse of the covariance of size \(N_{\mathrm {Q}}\times N_{\mathrm {Q}}\). When we use a large query set size for training, losses (9) and (10) are preferable to (11) in terms of computational efficiency.

Algorithm 1 Training procedure of the proposed model

4 Experiments

4.1 Data

We evaluated the proposed method using the following three spatial datasets: NAE, NA, and JA. NAE and NA were climate data in North America, which were obtained from https://sites.ualberta.ca/~ahamann/data/climatena.html. As the location vector, NA used longitude and latitude. With NAE, elevation in meters above sea level was additionally included in the location vector. With the NAE and NA data, we used the 26 bio-climate attributes shown in Table 1. We generated 1829 non-overlapping regions covering North America, where the size of each region was 100 \(\times \) 100 km, and attribute values were observed at 1 \(\times \) 1 km grid squares in each region. JA was climate data in Japan, which was obtained from http://nlftp.mlit.go.jp/ksj/gml/datalist/KsjTmplt-G02.html. We used the seven climate attributes shown in Table 2. The data contained 273 regions, where attribute values were observed at 1 \(\times \) 1 km grid squares, and there were at most 6,400 locations in a region. For all data, we randomly selected training, validation, and target regions without replacement. We also split the attributes into training, validation, and target attributes. The statistics of each dataset are shown in Table 3. In each target region, values of a target attribute at five locations were observed, and values at the other locations were used for evaluation. The location vectors and attributes were normalized to zero mean and unit standard deviation for each region and each attribute.

Table 1 Attributes in NAE and NA data
Table 2 Attributes in JA data
Table 3 Statistics of NA, NAE and JA data

4.2 Proposed method setting

As the neural networks in our model, \(f_{\mathrm {z}}\), \(f_{\mathrm {b}}\), \(f_{\mathrm {k}}\), and \(f_{\mathrm {m}}\), we used three-layered feed-forward neural networks with 256 hidden units. The dimensionality of the output layer of \(f_{\mathrm {z}}\) and \(f_{\mathrm {k}}\) was 256, and that of \(f_{\mathrm {b}}\) and \(f_{\mathrm {m}}\) was one. We used the rectified linear unit, \(\mathrm {ReLU}(x)=\max (0,x)\), as the activation function. Optimization was performed using Adam (Kingma and Ba 2015) with learning rate \(10^{-3}\) and dropout rate 0.1. The maximum number of training epochs was 5000, and the validation datasets were used for early stopping. The support set size was \(N_{\mathrm {S}}=5\), and the query set size was \(N_{\mathrm {Q}}=64\).

4.3 Comparison methods

We compared the proposed method with conditional neural processes (NP), Gaussian process regression (GPR), Gaussian process variational autoencoder (GPVAE), neural network (NN), fine-tuning with NN (FT), model-agnostic meta-learning with NN (MAML), adaptively initialized task optimizer (AVIATOR), multimodal MAML (MMAML), prototypical networks (PN), and ridge regression differentiable discriminator (R2D2).

With NP, a task representation was inferred from the support set using a neural network as in the proposed method, and then the attribute values of queries were predicted using another neural network. We used the same neural network architecture as the proposed method for inferring task representations, \(f_{\mathrm {z}}\). The architecture of the neural network for prediction was the same as that of \(f_{\mathrm {m}}\) in the proposed method, which is used as the mean function.

GPR predicted the attribute values by a GP regression with a Gaussian kernel given the support set. The kernel parameters, which were the signal variance, length scale, and noise variance, were estimated from the training datasets by minimizing the expected prediction error using the episodic training framework.

GPVAE was a variational autoencoder (Kingma and Welling 2014) with GP priors on latent variables (Casale et al. 2018; Ashman et al. 2020). With GPVAE, latent variables were encoded using a neural network from location vectors. Then, attribute values were predicted by a decoder neural network from the latent variables. The parameters of encoder and decoder neural networks were estimated using the training datasets.

NN used a three-layered feed-forward neural network with 256 hidden units, and the ReLU activation was used. The input of the NN was a location vector, and its output was the predicted value of the attribute. NN parameters, which were shared across all tasks, were estimated using the training datasets. The NN did not use labeled data in target tasks.

FT fine-tuned the parameters of the trained NN with labeled data for each target task. For fine-tuning, we used Adam with learning rate \(10^{-3}\). The number of epochs for fine-tuning was 100, which was selected from \(\{10,100\}\) based on the target performance.

MAML used the same neural network as NN. The parameters were trained so that the prediction performance was improved when fine-tuned with a support set. The number of fine-tuning epochs was five. MAML was implemented with Higher, which is a library for higher-order optimization (Grefenstette et al. 2019).

AVIATOR and MMAML obtained a task representation using a neural network as in the proposed method. Their neural networks were trained as in MAML, where the neural network was defined by the task representation. AVIATOR generated the initial parameters of a neural network using the task representation. MMAML generated parameters that modulate a neural network using the task representation.

With PN, a three-layered feed-forward neural network with 256 hidden and output units was used for embedding location vectors. Attribute values were predicted by a weighted average of the support instances, where the weights were calculated by softmaxed negative squared Euclidean distance between the query and support embedded instances.
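As an illustration of the PN baseline described above, a minimal sketch of its prediction step might be as follows; the embedding network `embed` and all names are our own.

```python
def pn_predict(embed, x_query, X_s, y_s):
    """Prototypical-network-style regression: softmax-weighted average of support values,
    with weights from negative squared distances between embedded locations."""
    h_q = embed(x_query.unsqueeze(0))                        # (1, D) query embedding
    h_s = embed(X_s)                                         # (N_S, D) support embeddings
    w = torch.softmax(-torch.cdist(h_q, h_s) ** 2, dim=-1)   # (1, N_S) attention weights
    return (w @ y_s.unsqueeze(-1)).squeeze()
```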

R2D2 used a neural network with the same architecture as PN for embedding. A ridge regression model was used for predicting attribute values given the embedded instances, where the ridge regression parameters were adapted using the support set for each region.

NP, GPR, GPVAE, NN, MAML, AVIATOR, MMAML, PN, and R2D2 used the episodic training framework in the same way as the proposed method. All the methods were optimized with Adam with learning rate \(10^{-3}\), and implemented with PyTorch (Paszke et al. 2017).

4.4 Results

Table 4 (a) Test mean squared errors and (b) test log likelihoods averaged over ten experiments

The test mean squared errors (a) and test log likelihoods (b) in the target tasks averaged over ten experiments are shown in Table 4. The test log likelihoods were calculated for each test location. For the test mean squared error evaluations, all methods were trained with the mean squared error objective function in Eq. (9), and for the test log likelihood evaluations, all methods were trained with the negative log likelihood objective function in Eq. (10). The proposed method achieved the best performance in all cases except for the test likelihood with JA data. NP was worse than the proposed method because its predictions were poor when task representations were not properly inferred. In contrast, the proposed method performed well on all tasks, at least in areas close to the observations, since its GP framework offers a smooth nonlinear function that passes through the observations. GPR was worse than the proposed method since GPR only shares kernel parameters across different tasks. In contrast, the proposed method shares neural networks across different tasks, which enables us to learn flexible spatial patterns in various attributes and regions and use them for target tasks. NN and GPVAE suffered from low performance since they cannot use the target data. Fine-tuning (FT) decreased the error, but it remained worse than the proposed method. This is because FT consisted of two separate steps, pretraining and fine-tuning, and did not learn how to transfer knowledge. In contrast, the proposed method trained the neural networks in a single step so that test performance is maximized when the support set is given in the episodic training framework. MAML performance was low since it had difficulty learning parameters that fine-tune well within a small number of epochs across various regions and attributes, where target function shapes vary drastically. Note that the high computational complexity of MAML, which requires calculating gradients through many gradient-descent steps, makes it infeasible to use a large number of fine-tuning epochs. With the proposed method, in contrast, since predicted values given the support set are calculated analytically based on a GP, the neural networks are optimized efficiently in terms of fitting the support set, and therefore the trained model attained high prediction performance for various attributes and regions. AVIATOR and MMAML used the task representation to obtain task-specific neural networks, and their performance was better than MAML. However, it was still worse than that of the proposed method, which uses GPs that are suitable for spatial regression. Since the number of training attributes was small with JA data, and the training data were insufficient to train neural networks, the test likelihood of the proposed method was not significantly different from that of GPR. The expressive power of PN and R2D2 is low since they are adapted based on a weighted average and linear regression, respectively. Therefore, their performance was worse than that of the proposed method, which is adapted based on GPs.

Fig. 3

Average test mean squared errors with (a) different target support sizes, (b) different numbers of training attributes, (c) different numbers of training regions, (d) different training support sizes, and (e) different training query sizes. The bar shows the standard error

Figure 3a shows the average test mean squared errors with different target support sizes for the proposed method, NP, and GPR. We omitted the results with NN, FT, and MAML since their performance was low, as shown in Table 4. All methods yielded decreased error as the target support size increased. The proposed method achieved low errors with different target support sizes since it uses neural networks to learn the relationship between support and query sets from the training datasets. NP achieved low errors when the target support size was small. However, NP had higher errors than GPR when the size was ten. Since NP used a fixed trained neural network to incorporate the support set information, it was difficult to adapt its prediction functions to a large support set. In contrast, since the proposed method and GPR can adapt easily to support sets by calculating the posterior in a closed form, their errors decreased effectively as the target support size increased.

Figure 3b shows the average test mean squared errors with different numbers of training attributes. The errors with the proposed method and NP decreased as the number of training attributes increased. This is reasonable since the possibility that tasks similar to target tasks are included in the training datasets increases as the number of training attributes increases. Since GPR shared only kernel parameters across different tasks, its performance was not improved even when many attributes were used. Figure 3c shows the average test mean squared errors with different numbers of training regions. The errors with the proposed method and NP decreased as the number of training regions increased. Figure 3d shows the average test mean squared errors with different training support sizes for the proposed method. When the support size in the training phase was the same as that in the test phase, i.e., \(N_{\mathrm {S}}=5\), the performance was best. When the test support size can differ across target tasks, we need to train with a wide range of training support sizes. Figure 3e shows the average test mean squared errors with different training query sizes for the proposed method. As the training query size increased, the performance improved. This is probably because larger training query sets provide a better estimate of the test error in the training phase.

Table 5 shows the average computation time in seconds for learning from the training datasets and the time for predicting test attribute values for each region on computers with 2.60 GHz CPUs. Although the proposed method had a slightly longer training time than NP, GPR, and NN, since it uses both neural networks and a GP, it was faster than the MAML-based methods (MAML, AVIATOR, and MMAML). All methods had short test times since the number of observed locations was small. The proposed method had a shorter test time than FT because the proposed method calculates predictions analytically given the target data, whereas FT required multiple update steps for optimization given the target data.

Table 5 Average computational time in seconds for learning from the training datasets and the time for predicting test attribute values for each region
Fig. 4

Predictions for five attributes and regions of target tasks yielded by the proposed method, NP, and GPR. The top row shows the true attribute values. Red circles indicate observed locations. Values below each plot show the mean squared error

Figure 4 visualizes the predictions for five attributes and regions of target tasks with the proposed method, NP, and GPR. The proposed method attained appropriate predictions in various attributes and regions. NP did not necessarily output predicted values that were similar to the observations. For example, in Fig. 4(a,NP), the predicted values of NP at the two leftmost observed locations differed from the true values. In contrast, the proposed method and GPR predicted values similar to the observations at those locations. Since GPR could not extract the rich knowledge present in the training datasets, it sometimes failed to predict values accurately. For example, in Fig. 4(a,GPR), the predicted values differed from the true values in the lower area. In contrast, the proposed method and NP predicted values in that area well by using neural networks. The proposed method improved prediction performance by adopting the advantages of both GPs and neural networks.

Table 6 Ablation study

Table 6 shows the results of the ablation study of the proposed method. In terms of the test mean squared error, the proposed method with the mean squared error objective function (ErrObj) was better than that with the likelihood objective function (LikeObj). In terms of the test log likelihood, LikeObj was better than ErrObj. These results imply that the objective function should be selected properly depending on the application. The proposed method with the marginal likelihood objective function (MarObjS and MarObjSQ) was worse than ErrObj and LikeObj. Although standard GPs are usually trained by maximizing the marginal likelihood of the training data, this objective differs from the test mean squared error and test log likelihood. In contrast, ErrObj and LikeObj directly minimize the evaluation measurements by simulating the test phase using the episodic training framework. This result demonstrates the effectiveness of using the test performance as the objective function for few-shot learning. The proposed method with the mean function without the support information (NoSptM) and that with the zero mean function (ZeroMean) performed worse than the proposed method. This result indicates the importance of using non-zero mean functions that incorporate the support information, and the advantage of the proposed method over existing GP-based meta-learning methods that use zero mean functions (Harrison et al. 2018; Tossou et al. 2019) or do not use the support information (Harrison et al. 2018). Although the test mean squared error of the proposed method did not deteriorate with the kernel function without the support information (NoSptK), the test log likelihood did. This result implies that the kernel function with the support information is important for predicting the uncertainty. The performance of NoSptM was lower than that of NoSptK. This result indicates that incorporating the support information in the mean function is more beneficial for spatial regression.

5 Conclusion

We proposed a few-shot learning method for spatial regression. The proposed method can predict attribute values given a few observations, even if the target attribute and region are not included in the training datasets. The proposed method uses a neural network to infer a task representation from a few observed data. Then, it uses the inferred task representation to calculate the predicted values in a neural network-based Gaussian process framework. Experiments on climate spatial data showed that the proposed method achieved better prediction performance than existing methods. Although our results are encouraging, we must extend our approach in several directions. Although the proposed method uses a Bayesian framework given the mean and covariance functions based on GPs, the mean and covariance functions are trained by point estimation. Therefore, when the number of tasks is small, there is a risk of meta-overfitting. We want to mitigate this risk by using Bayesian estimation of the mean and covariance functions (Rothfuss et al. 2020). In addition, we want to apply our framework to other types of tasks, such as spatio-temporal regression, regression for non-spatial data, and classification.