1 Introduction

Spatio-temporal predictions are critically important for planning and further developing smart cities. For instance, accurate bike and taxi demand prediction can enhance mobility services and city traffic management. Recently, deep learning-based approaches achieved high effectiveness in various spatio-temporal prediction tasks, including traffic forecasting and crowd flow prediction [1, 15]. The effectiveness of such approaches depends heavily on the availability of large amounts of training data. However, spatio-temporal data is typically (i) locked in organizational silos and rarely available across different cities and tasks and (ii) not sufficiently exploited for pre-training generalizable models. Consequently, we identify two crucial challenges in the spatio-temporal domain:

  • Lack of data of the target city and prediction task: Spatio-temporal prediction typically requires data from the specific target city and the task of interest. However, while rich data is available for a few selected cities and tasks, data for the specific city and task of interest is often unavailable.

  • Lack of pre-training strategy in the spatio-temporal domain: Pre-training requires a large amount of data from different tasks to learn an initialization for the target task, to converge better and faster. However, the adoption of pre-training in the spatio-temporal domain is currently limited.

Meta-learning is typically used to transfer knowledge from multiple cities (e.g., MetaStore [10] and MetaST [13]). However, existing approaches only learn from a specific task (e.g., bike demand prediction) and do not exploit the correlations between the tasks (e.g., taxi and bike demand prediction) and geographic regions. Examples of such correlations are illustrated in Fig. 1a and Fig. 1b, which indicate similar bike usage trends in central business areas in Chicago and New York City. Similarly, Fig. 1c and Fig. 1d indicate similar demand for bikes and taxis across regions. Therefore, we aim to build a predictive model that benefits from incorporating such correlations across tasks and cities.

Fig. 1.
figure 1

(a) and (b) depict bike usage patterns in Chicago (CHI) and New York City (NYC), whereas (c) and (d) illustrate the bike and taxi usage in Washington, D.C. (DC), between 9 and 10 am.

In this paper, we propose MetaCitta, a novel deep meta-learning approach for spatio-temporal predictions across cities and tasks. In contrast to other approaches [10, 13], MetaCitta adopts the knowledge not only from several cities but also different tasks. As mentioned above, different tasks in the same city can exhibit spatial correlations, and a task across cities can exhibit task-specific correlations. Therefore, we learn spatially invariant and task invariant feature representations using the Maximum Mean Discrepancy (MMD) [4]. These representations enable MetaCitta to make accurate predictions when data is unavailable for the target city and task. We also demonstrate how to adopt MetaCitta as a pre-training strategy when some target data is available for fine-tuning.

In summary, our contributions are as follows:

  • We propose MetaCittaFootnote 1, a novel deep meta-learning algorithm for prediction in spatio-temporal networks.

  • To the best of our knowledge, we are the first to utilize the knowledge of multiple tasks and cities to improve spatio-temporal prediction.

  • MetaCitta outperforms best-performing baselines by \(9.39\%\) (zero-shot) and \(3.86\%\) (pre-training + data-abundant fine-tuning) on average on six real-world datasets regarding RMSE and is more efficient than state-of-the-art meta-learning methods regarding training time.

2 Problem Statement

In the following, we provide a formal definition of the problem of spatio-temporal prediction (based on [12, 15]) via meta-learning from multiple cities and tasks.

Definition 1 (City and Region)

We represent a city c as an \(m \times n\) grid map [12] with equally sized cells. We refer to each cell as a region.

For example, we can split the city of Chicago into \(20 \times 20\) equally sized regions, where each region represents a squared area.

Definition 2 (Spatio-Temporal Image)

A spatio-temporal image (as coined in [12], an image in short) \(x^{c,\tau }_{t}\) of a city c and a task \(\tau \) is a multi-channel image having the dimension \(\mathbb {R}^{q \times m \times n}\) where q is the number of observations relevant for the task \(\tau \) at a given time point t.

For example, an image of Chicago for the taxi prediction task contains \(q=2\) observations (taxi pickup and drop-off) for each region at a time point t.

Definition 3 (Spatio-Temporal Prediction)

Given a sequence of spatio-temporal images of length i, \(\mathcal {X}_{c,\tau } = \langle x^{c,\tau }_{t-i-1},x^{c,\tau }_{t-i}, \dots ,x^{c,\tau }_{t-1} \rangle \), where \(t-i-1, t-i, \dots , t-1\) are consecutive time points, spatio-temporal prediction estimates the image at the next time point t, i.e., \(x^{c,\tau }_{t} \in \mathbb {R}^{q \times m \times n}\).

Given Chicago’s spatio-temporal images for the taxi prediction in a given period, we aim to predict the image, i.e., taxi drop-off and pickup, at the next time point.

Definition 4 (Meta-learning from Multiple Cities and Tasks)

Given are a set of cities \(\mathcal {C} = \{ c_1, c_2, \dots \}\), a target city \(c_{target} \in \mathcal {C}\), as well as a set of tasks \(\mathcal {T}= \{ {\tau }_1, {\tau }_2, \dots \}\) and a target task \(\tau _{target} \in \mathcal {T}\). Training data is available in the form of image sequences for each combination of a city \(c \in \mathcal {C}\) and a task \(\tau \in \mathcal {T}\) except for the target task \(\tau _{target}\) of the target city \(c_{target}\): \(\mathcal {D} = \{\mathcal {X}_{c,\tau } | c \in \mathcal {C}, \tau \in \mathcal {T}\} \setminus \{\mathcal {X}_{c_{target},\tau _{target}}\}\). The goal is to predict \(x^{c_{target},\tau _{target}}_{t}\) based on \(\mathcal {D}\).

As an example, consider three cities (Chicago (CHI), Washington, D.C. (DC) and New York City (NYC)) and two tasks (bike demand prediction (bike) and taxi demand prediction (taxi)). With NYC as the target city \(c_{target}\) and taxi as the target task \(\tau _{target}\), the following image sequences are available: \(\mathcal {D} = \{\mathcal {X}_{\textit{DC},\textit{bike}},\mathcal {X}_{\textit{CHI},\textit{bike}},\mathcal {X}_{\textit{DC},\textit{taxi}}, \mathcal {X}_{\textit{CHI},\textit{taxi}}, \mathcal {X}_{\textit{NYC},\textit{bike}}\}\). Based on these image sequences, the goal of meta-learning from multiple cities and tasks is to predict the image \(x^{\textit{NYC},\textit{taxi}}_{t}\).

We also consider a variation of Definition 4 where limited data for the target task of the target city is available, specifically \(\mathcal {D} = \{\mathcal {X}_{c,\tau } | c \in \mathcal {C}, \tau \in \mathcal {T}\}\) with \(\mathcal {X}_{c_{target},\tau _{target}}\) restricted to only a few days.

3 The MetaCitta Approach

MetaCitta extends a typical spatio-temporal network (ST-Net) which has a spatial encoder, a temporal encoder, and a prediction component. Figure 2 illustrates MetaCitta’s training with an example of three cities and two tasks. Spatio-temporal image sequences for each city-task pair are encoded by the spatial and temporal encoder. Spatial and task alignment is achieved through alignment losses (MMD loss). The outer optimization step trains on data from the target city (\(c_3\)) but not the target task (\(\tau _2\)).

Fig. 2.
figure 2

Training procedure of MetaCitta with three cities (\(c_1\), \(c_2\) and \(c_3\)) and two tasks (\(\tau _1\) and \(\tau _2\)). \(c_3\) is the target city and \(\tau _2\) is the target task. The fully connected layers \(\text {fc}_1\), \(\text {fc}_2\) and \(\text {fc}_3\) follow Eqs. (1), (2) and (3).

3.1 Spatial Encoder

The spatial encoder \(\boldsymbol{f_s}\) captures the spatial dependencies between regions (see Fig. 1c and Fig. 1d). Given an image \(x^{c,\tau }_{t}\) of a city c and a task \(\tau \) at the time t, \(F^{c,\tau }_{s,t} = \boldsymbol{f_s}(x^{c,\tau }_{t})\) is the encoded representation of the input image \(x^{c,\tau }_{t}\). In MetaCitta, the spatial encoder \(\boldsymbol{f_s}\) is a CNN consisting of three blocks where each block contains a convolution layer, followed by batch normalization and a ReLU activation layer. We reduce the dimensionality of the spatial encoder’s output \(F^{c,\tau }_{s,t}\) by using a linear layer \(\text {fc}_1\):

$$\begin{aligned} F^{c,\tau }_{\text {fc}_1,t} = \text {ReLU}( \boldsymbol{W_{{\textbf {fc}}_1}}F^{c,\tau }_{s,t} + \boldsymbol{b_{{\textbf {fc}}_1}}), \end{aligned}$$

where \(\boldsymbol{W_{{\textbf {fc}}_1}}\) and \(\boldsymbol{b_{{\textbf {fc}}_1}}\) are trainable parameters, and \(F^{c,\tau }_{\text {fc}_1,t}\) is the spatial representation of city c for the task \(\tau \) at the time t.

3.2 Temporal Encoder

The temporal encoder \(\boldsymbol{f_g}\) captures the temporal dependencies (see Fig. 1a and Fig. 1b). Given the spatial representations \(F^{c,\tau }_{\text {fc}_1,t-i-1}, F^{c,\tau }_{\text {fc}_1,t-i}, \dots , F^{c,\tau }_{\text {fc}_1,t-1}\) of a city c and a task \(\tau \) over time, \(F^{c,\tau }_{g,t} = \boldsymbol{f_g}(F^{c,\tau }_{\text {fc}_1,t-i-1}, F^{c,\tau }_{\text {fc}_1,t-i}, \dots , F^{c,\tau }_{\text {fc}_1,t-1})\), is the encoded temporal representation at a time t. In MetaCitta, we use a GRU as the temporal encoder \(\boldsymbol{f_g}\). The task-specific representation is generated by applying a linear layer \(\text {fc}_2\) on \(F^{c,\tau }_{g,t}\):

$$\begin{aligned} F^{c,\tau }_{\text {fc}_2,t} = \text {ReLU}( \boldsymbol{W_{{\textbf {fc}}_2}}F^{c,\tau }_{g,t} + \boldsymbol{b_{{\textbf {fc}}_2}}), \end{aligned}$$

where \(\boldsymbol{W_{{\textbf {fc}}_2}}\) and \(\boldsymbol{b_{{\textbf {fc}}_2}}\) are trainable parameters.

3.3 Prediction

In an ST-Net, a linear operation and activation function transform the output into the desired shape and range. Then, a task-specific loss function is applied to train the network. MetaCitta’s ST-Net uses a linear layer \(\text {fc}_3\) and the tanh activation on the task-specific representation \(F^{c,\tau }_{\text {fc}_2,t}\) of a city c and a task \(\tau \):

$$\begin{aligned} \hat{x}^{c,\tau }_{t} = tanh( \boldsymbol{W_{{\textbf {fc}}_3}}F^{c,\tau }_{\text {fc}_2,t} + \boldsymbol{b_{{\textbf {fc}}_3}}), \end{aligned}$$

where \(\boldsymbol{W_{{\textbf {fc}}_3}}\) and \(\boldsymbol{b_{{\textbf {fc}}_3}}\) are trainable parameters, and \(\hat{x}^{c,\tau }_{t} \in \mathbb {R}^{q \times m \times n}\) is the prediction at time t. We utilize the mean squared error as the task-specific loss:

$$\begin{aligned} L^{c, \tau } = \frac{1}{N} \sum _{i=1}^{N}(\hat{x}^{c,\tau }_{t}-x^{c,\tau }_{t})^2, \end{aligned}$$

where N is the number of images, \(\hat{x}^{c,\tau }_{t}\) and \(x^{c,\tau }_{t}\) are the predicted and ground truth labels, respectively.

3.4 Training Procedure

Different tasks originate from different distributions. MetaCitta should learn the distribution-invariant properties from the tasks across the cities to perform well on an unseen task. To learn distribution-invariant properties, we use the Maximum Mean Discrepancy (MMD) [4] to minimize the distance between two distributions. In this way, the network learns the common or invariant properties between the distributions and thus better generalizes to unseen distributions.

Different tasks of a city have the same underlying regions and thus share similar spatial characteristics. Therefore, to extract the spatially invariant representation between the tasks \(\tau _1\) and \(\tau _2\) in a city c, we apply the MMD constraint to the spatial representations generated by the spatial encoder. As a result, we compute the spatial alignment loss \(L^{spatial}_{mmd}\):

$$\begin{aligned} L^{spatial}_{mmd} = MMD(F^{c,\tau _1}_{\text {fc}_1,t}, F^{c,\tau _2}_{\text {fc}_1,t}). \end{aligned}$$

Similarly, to extract the task-invariant features from the task \(\tau \) in two different cities, \(c_1\) and \(c_2\), we apply the MMD constraint on the task-specific representations generated by the temporal encoder. As a result, we compute the task alignment loss \(L^{task}_{mmd}\):

$$\begin{aligned} L^{task}_{mmd} = MMD(F^{c_1,\tau }_{\text {fc}_2,t}, F^{c_2,\tau }_{\text {fc}_2,t}). \end{aligned}$$

Finally, to enrich the model with the target city features, we perform knowledge transfer from all available tasks of the target city.

figure a

The MetaCitta training procedure is depicted in Fig. 2 and described in Algorithm 1. The algorithm takes the training data \(\mathcal {D}\), a set of cities C, a set of tasks \(\mathcal {T}\), the target city \(c^{target}\) and target task \(\tau ^{target}\) as input. The goal is to learn the model parameters \(\theta \), where \(\theta _s\) are the parameters responsible for generating the spatial representation and \(\theta _{\tau }\) for the task-specific representation. The algorithm performs the following three steps:

  1. 1.

    The Spatial Alignment (lines 3–7) is performed between different tasks of a city on pairs of examples (line 5) and by updating the model parameters using the spatial alignment loss and the task-specific loss (lines 6–7).

  2. 2.

    The Task Alignment (lines 8–12) happens by extracting example pairs from the same task but of different cities (line 10), and with an update step using the task-alignment loss and the task-specific loss (lines 10–12).

  3. 3.

    The Outer Optimization (lines 13–16) step is done by directly performing an update step on all the available tasks of the target city.

Using the example from Sect. 2 with NYC as \(c_{target}\) and taxi as \(\tau _{target}\), the spatial alignment considers \((\mathcal {X}_{\textit{DC},\textit{bike}}, \mathcal {X}_{\textit{DC},\textit{taxi}})\) and \((\mathcal {X}_{\textit{CHI},\textit{bike}}, \mathcal {X}_{\textit{CHI},\textit{taxi}})\), while the task alignment involves \((\mathcal {X}_{\textit{CHI},\textit{bike}}, \mathcal {X}_{\textit{DC},\textit{bike}})\), and \((\mathcal {X}_{\textit{CHI},\textit{taxi}}, \mathcal {X}_{\textit{DC},\textit{taxi}})\). The outer alignment updates using \(\mathcal {X}_{\textit{NYC},\textit{bike}}\).

Pre-training. Following Definition 4, MetaCitta is tailored towards cases where no data from the target task \(\tau _{target}\) in the target city \(c_{target}\) is available. However, if such data \(\mathcal {X}_{c_{target},\tau _{target}}\) is (partially) available, MetaCitta can also be used for pre-training and fine-tuning, i.e., it first learns initialization weights from available data of other cities and tasks (\(\{\mathcal {X}_{c,\tau } | c \in \mathcal {C}, \tau \in \mathcal {T}\} \setminus \{\mathcal {X}_{c_{target},\tau _{target}}\}\)) and is then fine-tuned on data of the target task \(\mathcal {X}_{c_{target},\tau _{target}}\). To use MetaCitta for pre-training, we skip the task alignment and instead perform spatial alignment between all available pairs of source cities and tasks. This method extracts the more general properties of the input data typically captured in the initial layers of a network [8, 14], which is required for pre-training.

4 Evaluation Setup

This section describes the datasets, baselines, and experimental settings used to evaluate MetaCitta.

4.1 Datasets

We utilize two tasks (taxi and bike demand) from three cities: NYC, CHI, and DC. Each dataset consists of six months of data, with five months used for training and one month used as a test set. The data for TaxiNYCFootnote 2 (36M), TaxiCHIFootnote 3 (7M), TaxiDCFootnote 4 (3M), and BikeNYCFootnote 5 (9M) is from 01/2019 to 06/2019. The data for BikeCHIFootnote 6 (2.5M) and BikeDCFootnote 7 (1.2M) is from 07/2020 to 12/2020.

4.2 Baselines

We compare MetaCitta to the following seven baselines:

  • The Historical Average (HA) is calculated by taking the mean value regarding the specific hour for the specific region.

  • No pre-training (NoPretrain): The network is initialized with random weights and fine-tuned on the target task.

  • Joint : The model is trained jointly by mixing all the samples.

  • MAML [3] is a meta-learning method that learns an initialization from multiple tasks. We use the same underlying ST-Net as in MetaCitta.

  • MetaStore [10] is a MAML-based approach that learns to generate the city-specific parameters based on the city’s encoding during training.

  • MetaST [13] is also based on MAML and learns a pattern-based spatio-temporal memory from source cities.

  • MLDG [9] extends MAML for domain generalization. It randomly leaves one task out and updates its parameters on the left-out task during training.

4.3 Experimental Settings

In our experiments, each city is divided into \(20 \times 20\) grid cells of size \(1km^2\) each. We use the same underlying ST-Net and parameters for MetaCitta and the baselines (except for HA) to allow for a fair comparison. As spatial encoders, we use CNNs with 32 filters of size \(3 \times 3\). As temporal encoders, we use GRUs where the input and hidden sizes are set to 256. Their input sequence length is 12 at an interval of 1 hour. The size of the fully connected layers (\(\text {fc}_1\), \(\text {fc}_2\)) is 256. The batch size is 64, and each model is trained for 500 epochs on an NVIDIA GeForce GTX 1080 Ti (11 GB) at a learning rate of \(1e^{-5}\) using an Adam optimizer. For the MAML-based approaches, the inner and outer learning rates are set to \(1e^{-5}\), and there are 5 update steps.

5 Evaluation

In this section, we evaluate MetaCitta by comparing it to the baselines, by conducting an ablation study, and by training time comparison.

Table 1. Comparison of MetaCitta to the baselines, where RMSE is the Root Mean Squared Error and MAE is the Mean Absolute Error. Improv. (%) is the relative improvement of MetaCitta over the best baseline (underlined).

5.1 Comparison with Baselines

MetaCitta was evaluated in two settings: zero-shot and fine-tuning. Fine-tuning was done in two conditions: data-limited (15 days of target data) and data-abundant (5 months of target data). Results in Table 1 demonstrate that MetaCitta performs best in both settings and all datasets regarding RMSE.

In the zero-shot setting, HA performs worst, indicating that a simple heuristic approach cannot correctly capture the city’s complex spatio-temporal dynamics. In this setting, MLDG is the best-performing baseline in terms of the RMSE because it is trained to perform particularly well on an unseen task by updating its parameters based on the left-out task.

MetaCitta and the pre-trained baselines outperform NoPretrain in fine-tuning due to their ability to extract generic spatio-temporal properties that aid in convergence to the target task. These properties include traffic behavior, such as high demand during peak hours (morning and evening) and low off-peak demand (e.g., after midnight). Joint, MAML, and MetaST perform better than MLDG as they are trained to perform well on all source tasks and adapt quickly to new tasks. MetaStore performs relatively worse in fine-tuning than in the zero-shot setting, as it only fine-tunes the final prediction layers (\(\text {fc}_2\) and \(\text {fc}_3\)), reducing its adaptability to the target task.

On average across datasets, compared to the respective best baselines, MetaCitta is \(9.39\%\) better than MLDG in the zero-shot setting, \(5.04\%\) better than Joint in the data-limited fine-tuning setting, and \(3.86\%\) better than Joint in the data-abundant fine-tuning setting regarding the RMSE. According to the MAE, MetaCitta exceeds the baselines in all cases except three (e.g., zero-shot on TaxiCHI), where Joint performs better. Since MetaCitta consistently has a lower RMSE, we conclude that MetaCitta performs better in predicting extreme values, such as high or low demand values during or after peak periods, while Joint mainly predicts average values for each region.

Fig. 3.
figure 3

Comparison of Joint and MetaCitta on taxi pickup prediction of Chicago, June 6th 2019, between 4 and 5 am. Nine selected regions and their number of taxi pickups in the mentioned hour are shown in detail.

To further investigate this behavior, we visualize the predictions of Joint and MetaCitta in the data-limited fine-tuning setting on the TaxiCHI dataset. Figure 3 illustrates the ground truth and the predicted pickup demands in different city regions between 4 and 5 am. The regions to the south and southwest of the zoomed area represent the Chicago Loop area, which is Chicago’s central business district. The Northwest region is Chicago’s Near North Side, a residential area. While the taxi demand in the Chicago Loop area is high during the day, it is low during off-peak hours. On the other hand, the demand for taxis in the residential area is higher compared to the business area. This is because people use taxis, as public transport does not run at 4 am. While MetaCitta correctly captures this shift in demand, as illustrated in Fig. 3c, Joint continues to forecast average high values in the business district, as observed in Fig. 3b.

5.2 Ablation Study

We analyze the contribution of MetaCitta’s training components: spatial alignment, task alignment, and outer optimization. By removing one component at a time, we evaluate their effectiveness and present the results in Table 2.

Removing MetaCitta’s outer optimization step leads to the most significant drop in performance, indicating the importance of having some knowledge of the target city, even if from another task. This knowledge can be obtained directly from other tasks, as observed in Fig. 2a and 2b, where similar regions indicate similar spatio-temporal behavior. Also, the performance decreases when spatial and task alignment is removed, indicating the impact of adding the MMD losses.

Table 2. Change in performance after removal of a component from MetaCitta.

5.3 Training Time Comparison

The training times for MetaST, MAML, MLDG, MetaStore, MetaCitta and Joint on the Chicago bike prediction task are 107.21, 106.68, 87.71, 7.99, 5.65 and 3.18 hours, respectively. As MetaCitta applies an extra alignment loss to extract invariant features, it needs slightly more training time than Joint, which has a simplified design. MetaCitta requires less training time than MetaST (\(94.73\%\)), MAML (\(94.70\%\)), MLDG (\(93.56\%\)) and MetaStore (\(29.28\%\)), as these baselines perform a two-stage optimization requiring time-consuming calculation of higher-order derivatives. Overall, MetaCitta performs most precisely and trains much faster than state-of-the-art meta-learning.

6 Related Work

Deep learning has recently shown great success and is widely adopted for spatio-temporal predictions [1, 15]. However, the success of these approaches depends on the availability of large amounts of data, which is typically a bottleneck.

Typically, transfer learning is used to deal with data scarcity, where a network trained for a specific task is fine-tuned on the target task. Recently, meta-learning has gained popularity as a way to learn from multiple tasks. Unlike transfer learning, meta-learning does not specialize in a specific task. Instead, it focuses on finding the parameters from the source tasks, so it can be quickly adapted to the target task (i.e., learn to learn) [6]. Model Agnostic Meta Learning [3] (MAML) and Meta Learning for Domain Generalization [9] (MLDG) are state-of-the-art meta-learning approaches widely used in different domains. MAML performs two optimization steps: task-specific updates on the inner step and global meta-updates on the outer step. MLDG extends MAML by leaving out a task at the inner level and performing meta-updates based on the left-out task.

Motivated by these approaches, several attempts have been made to transfer knowledge in the spatio-temporal domain. [2, 5, 7, 11, 12], use transfer learning to transfer knowledge from a data-rich source city to a data-poor target city. MetaStore [10] and MetaST [13] are MAML-based approaches to transfer knowledge from multiple cities to a target city. However, existing approaches transfer the knowledge for a specific task, introduce additional parameters and require small amounts of data from the target city and target task.

In contrast to these methods, MetaCitta learns from different tasks in multiple cities and can be effectively used in zero-shot and fine-tuning settings.

7 Conclusion

In this paper, we presented MetaCitta, a novel deep meta-learning approach for spatio-temporal predictions in cases where no or only limited training data of a target city and the task is available. MetaCitta leverages knowledge across cities and tasks by learning spatial and task-invariant feature representations. MetaCitta outperforms state-of-the-art meta-learning approaches regarding the RMSE in zero-shot and fine-tuning settings and requires \(94.70\%\) less training time compared to the state-of-the-art meta-learning approach MAML.