1 Introduction

Multivariate time series data are gathered by a single sensor that monitors multiple parameters, including a target series and exogenous series. The target series is the time series designated for prediction, and the exogenous series are the other series that influence it. The monitoring range of a single sensor is limited and cannot represent the environmental information of an entire city or country. Therefore, multiple sensors must be deployed at different geospatial locations to observe the environment of the entire region simultaneously. The time series collected by these geospatially correlated sensors are called geo-sensory time series. Geo-sensory time series cover a broad range of applications, such as traffic prediction [1], weather prediction [2], air quality prediction [3], and urban water distribution prediction [4]. Hence, accurately forecasting geo-sensory time series is of growing importance.

One of the most significant challenges to realize accurate prediction is mining valuable knowledge from geo-sensory time series. Geo-sensory time series data are collected from certain geospatial and temporal scenarios; therefore, there are multi-scale spatial-temporal correlations in both spatial and temporal dimensions:

  1. (1)

    Inter-sensor spatial-temporal correlations. Numerous sensors are deployed at geospatial locations in different regions. The series generated by other sensors directly impact the series generated by the target sensor, as depicted in Fig. 1a. To distinguish them from the target series (the target sensor's series), we define the target series of other sensors as the relevant sensor series. Neighbouring sensors tend to influence one another because of their proximity; for instance, wind can blow air pollutants into neighbouring regions. In addition, some regions share similar exogenous features, such as meteorological features and traffic occupancy rates, and therefore report similar environmental information. These spatial correlations are thus highly dynamic, changing over time, and selecting relevant sensor series is of great significance to geo-sensory time series prediction. Additionally, historical information from different time points influences the current value to different degrees, as depicted in Fig. 1b.

  2. (2)

    Intra-sensor spatial-temporal correlations. For a certain (target) sensor, the intra-sensor time series contains the target series and exogenous series [5]. In the spatial dimension, the target series is usually affected by exogenous series, as depicted in Fig. 1a. In the temporal dimension, the current value is usually affected by historical information, as depicted in Fig. 1c.

Fig. 1

a Spatial correlation. The graph structure inside the rectangle indicates the intra-sensor spatial correlations. The graph structure outside the rectangle represents the inter-sensor spatial correlations. b Temporal correlation. The contributions of the exogenous series and relevant sensor series to the predictive value are different

In recent years, many deep learning models have been proposed for geo-sensory time series prediction. For instance, the hDS-RNN [4] model for water flow and pressure forecasting, a novel multi-channel attention model (MCAM) [6] for fine-grained air quality inference, the adaptive spatial-temporal graph attention network (ASTGAT) [7] for traffic flow forecasting, and the causal-based graph neural network (CausalGNN) [8] for COVID-19 pandemic forecasting have focused on extracting spatial-temporal correlations between sensors and have achieved state-of-the-art results. The drawback is that these models only consider sensor time series and ignore exogenous series. For geo-sensory time series prediction, the target series is affected by various exogenous series, such as meteorological conditions. Existing works have focused on blending the information of sensor series and exogenous series [9,10,11]. Some researchers have attempted to employ multilayer perceptrons (MLP) [12] and parametric-matrix-based methods [13] to extract exogenous features and then fuse them into the sensor features. Although these studies confirm that prediction performance can be improved by considering various exogenous series, they cannot explicitly and dynamically select relevant sensor series and exogenous series to make predictions. Hence, it remains challenging to distinguish the contributions of the exogenous series and relevant sensor series to the predictive value [14].

Recently, the multi-level attention network (GeoMAN) [15] employed local spatial attention to obtain the dynamic correlations between the target series and each exogenous series, and global spatial attention to capture the dynamic correlations between different sensors. GeoMAN also employs a temporal attention mechanism in the temporal dimension to model the dynamic temporal correlations in a time series. It is a typical geo-sensory time series prediction model for capturing inter- and intra-sensor spatial correlations. However, existing works, GeoMAN included, first blend all time series, consisting of target series, exogenous series, and relevant sensor series, and then capture temporal correlations between different time intervals. Sensor series change over time and vary geographically; hence, blindly blending the information of relevant sensor series and exogenous series makes it impossible to explore deep-seated temporal correlations. Although substantial work has been devoted to geo-sensory time series prediction, the aforementioned challenges are still not well addressed.

To overcome the aforementioned challenges, our focus is on capturing inter- and intra-sensor spatial-temporal correlations. Inspired by graph attention networks that capture spatial dependencies by assigning different weights to different neighbourhood nodes, we propose a joint network of non-linear graph attention and temporal attraction force (J-NGT) to achieve geo-sensory time series prediction. Our model can simultaneously capture inter-sensor and intra-sensor spatial-temporal correlations by two graph attention mechanisms.

The main contributions of our study are summarized as follows:

  1. (1)

    We propose a joint network containing two graph attention mechanisms, i.e., a non-linear graph attention mechanism and a temporal attraction force mechanism.

  2. (2)

    We design a non-linear graph attention mechanism to obtain the inter- and intra-sensor spatial correlations. In the non-linear graph attention mechanism, we first employ IMV-LSTM, a tensorized LSTM, to transform the input features into higher-level features with sufficient expressive power, and we then calculate the attention weights between nodes.

  3. (3)

    Inspired by the Law of universal gravitation, we propose a temporal attraction force mechanism to sufficiently capture inter- and intra-sensor temporal correlations.

  4. (4)

    We conduct extensive experiments on three real-world datasets to evaluate our model. The results show that our model outperforms comparison models from previous works. We further split the geo-sensory time series data into multivariate time series and sensor series and run our model on the sub-datasets; the results show that considering both inter- and intra-sensor spatial-temporal correlations enhances prediction accuracy. Finally, we replace each component with state-of-the-art methods in our model framework to demonstrate the effectiveness of the different components. The results suggest that each proposed component contributes to prediction accuracy.

The remainder of this paper is organized as follows. We provide a literature review on time series prediction methods and graph neural networks for geo-sensory time series prediction in Section 2. We define the notations and problem formulation in Section 3. We describe the non-linear graph attention mechanism and temporal attraction force mechanism to capture spatial-temporal correlations in Section 4. We design experiments to test the validity of our model in different fields and analyse the experimental results in Section 5. We summarize our work and future work in Section 6.

2 Related work

2.1 Geo-sensory time series prediction

The existing time series prediction methods can be divided into statistical models, machine learning models, and deep learning models. Statistical models are classical methods employed to predict stationary, autocorrelated time series data. Although autoregressive (AR) [16], moving average (MA) [17], autoregressive moving average (ARMA) [18], and autoregressive integrated moving average (ARIMA) [19] models have advantages in dealing with univariate time series, they cannot model the dynamic spatial-temporal correlations of multivariate time series. The vector autoregressive (VAR) [20] model considers the relationships between multiple stationary, autocorrelated time series, but the information it captures is still limited.

Machine learning models have advantages in dealing with nonlinear and non-stationary data. Among them, artificial neural networks (ANN) [21], Gaussian process regression [22], support vector regression [23], and ensemble learning [24] have achieved respectable results on small-scale multivariate time series prediction tasks. For example, Zhang et al. [25] presented least squares support vector regression for stock index and bond index prediction. Wang et al. [26] introduced sparse Gaussian conditional random fields for multivariate time series prediction. Although statistical models and machine learning models are widely used in time series forecasting, they do not scale well to geo-sensory time series because of its multi-scale spatial-temporal correlations [27].

Deep learning models have proven to be reliable for time series prediction. The recurrent neural network (RNN), with its capacity to capture short-term dependencies, can perform time series prediction tasks, but the vanishing gradient makes it difficult for a standard RNN to learn long-term dependencies. To overcome the vanishing gradient, LSTM [28] and GRU [29], two successful variants of the RNN, were constructed to learn long-term dependencies. Recently, more advanced RNN variants have been proposed. For instance, Feng et al. [30] introduced the clockwork RNN, which runs the hidden layer at different clock speeds to solve the long-term dependency problem. Zhang et al. [31] modified the GRU architecture so that gates explicitly regulate two distinct types of memories to predict medical records and multi-frequency phonetic time series. Ma et al. [32] designed a temporal pyramid RNN to capture long-term and multi-scale temporal dependencies. These RNN methods focus on capturing long-term dependencies and therefore cannot fully exploit the spatial relationships between variables, so they fall short for geo-sensory time series prediction.

Modelling spatial-temporal correlation is the key to achieving better prediction performance for geo-sensory time series. Ge et al. [10] calculated the sensors' similarity matrix and selected k similar sensors, which were combined with exogenous series, to extract spatial features. Liang et al. [15] introduced a multi-level attention mechanism to capture local and global spatial-temporal correlations. Although the above studies, which are the most relevant to ours, consider inter- and intra-sensor spatial correlations, they ignore the unique temporal characteristics of geo-sensory time series, e.g., the set of geospatially correlated sensors varies over time. In contrast, our model simultaneously models both inter- and intra-sensor spatial-temporal correlations to distinguish the contributions of the relevant sensor series and exogenous series in the spatial and temporal dimensions.

2.2 Graph neural network

Recently, graph neural networks (GNNs) have become popular due to their success on graph-structured data. Many studies formulate geo-sensory time series on graphs to fully utilize spatial information. In particular, graph convolutional networks (GCNs) and graph attention mechanisms have become widespread in practice. Existing studies that design graph convolutional networks select the neighbourhood of sensors to capture the spatial correlations. Yu et al. [33] introduced a spatial-temporal GCN to gain bidirectional spatial-temporal dependencies from the neighbourhood of central nodes for traffic forecasting. Wang et al. [34] designed a GCN to learn the topological structure of sensor networks to capture spatial correlations for traffic safety prediction. Song et al. [35] developed a spatial-temporal synchronous mechanism to obtain localized spatial-temporal correlations for traffic flow prediction. These dynamic spatial-temporal correlations are localized due to the restriction of the range of neighbourhoods. To address this issue, Wang et al. [36] introduced a geographical spatial convolution to obtain complex spatial relationships among regions for traffic accident risk forecasting.

The graph attention mechanism is a novel graph neural network architecture for node classification of graph data [37]. The goal of the graph attention mechanism is to judge the relationships between nodes. Kong et al. [7] designed the graph talking-heads attention layer to capture the highly dynamic relationships between nodes in the traffic network. Lu et al. [12] explored multi-layer graph spatial attention networks to capture the dependencies between inbound and outbound flows in metro passenger flow. Shi et al. [38] introduced graph attention evolving networks that preserve similarities between nodes to evolve graph attention network weights across all temporal graphs. Han et al. [39] employed a graph attention mechanism to calculate the weight between nodes for representing the temporal dependence. In our work, we extend the graph attention network with IMV-LSTM to capture inter- and intra-sensor spatial correlations. IMV-LSTM transforms the individual variables into higher-level features, which can obtain sufficient expressive power.

2.3 Law of universal gravitation

Chi et al. [40] introduced the Law of universal gravitation to calculate the attraction force between nodes as a measure of link similarity for link prediction, improving prediction accuracy. Motivated by this study, and considering that the temporal correlations between different time intervals can be expressed as the gravitational force between nodes, we propose a temporal attraction force mechanism to calculate inter- and intra-sensor temporal correlations.

3 Preliminaries

Assume there are M sensors, each of which generates N kinds of time series. We specify one sensor as the target sensor for making predictions, while the other sensors serve as relevant sensors. The target sensor generates N-1 kinds of exogenous series and one target series. We first construct two types of directed graphs to describe the spatial relationships of the time series. Both take the target series as the central node and the other series as neighbouring nodes. One graph structure indicates the spatial correlations between the target (sensor) series and the relevant sensors, and the number of nodes is M, as depicted in Fig. 2a. We employ \( \mathbf{Y}=\left({\mathbf{y}}^1,{\mathbf{y}}^2,\dots,{\mathbf{y}}^{M-1}\right)=\left({\mathbf{y}}_1,{\mathbf{y}}_2,\dots,{\mathbf{y}}_T\right)\in {\mathbb{R}}^{\left(M-1\right)\times T} \) to denote the relevant sensor series, \( {\mathbf{y}}_t=\left({y}_t^1,{y}_t^2,\dots, {y}_t^{M-1}\right)\in {\mathbb{R}}^{M-1} \) to denote the M-1 relevant sensor time series at time t, and \( {y}_t^i\in \mathbb{R} \) to represent the feature of node i. Let \( {\mathbf{y}}^{\boldsymbol{p}}=\left({y}_1^p,\dots, {y}_T^p\right)\ \mathrm{with}\ {y}_t^p\in \mathbb{R} \) represent the target series of the target sensor during the past T time points.

Fig. 2

Graph structure. Nodes represent the target series of relevant sensors (orange circles), exogenous series (green circles), and the target series (pink circles). a Graph structure between sensors. b Graph structure between the target series and exogenous series

Another graph indicates the spatial correlations between the target series and the exogenous series, as depicted in Fig. 2b. Among the N kinds of time series of the target sensor, one is the target series, and the others, i.e., \( {\mathbf{x}}_t=\left({x}_t^1,{x}_t^2,\dots, {x}_t^{N-1}\right)\in {\mathbb{R}}^{N-1} \), are the N-1 exogenous series at time t, where \( {x}_t^j\in \mathbb{R} \) represents the feature of node j. We employ \( \mathbf{X}=\left({\mathbf{x}}_1,\dots,{\mathbf{x}}_T\right)=\left({\mathbf{x}}^1,{\mathbf{x}}^2,\dots,{\mathbf{x}}^{N-1}\right)\in {\mathbb{R}}^{\left(N-1\right)\times T} \) to denote the exogenous series over a window of size T.

Given the previous readings of all sensors and the exogenous series, the model aims to predict the value of the target series at the next time point, denoted as

$$ {\hat{y}}_{T+1}^p=\mathcal{F}\left(\mathbf{X},\mathbf{Y},{\mathbf{y}}^{\boldsymbol{p}}\right) $$
(1)
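
For concreteness, the following is a minimal sketch of the input and output shapes implied by this formulation; the forecast body is a placeholder, not the actual J-NGT model, and all sizes are illustrative.

```python
# A minimal shape sketch of Eq. (1); the placeholder body stands in for
# J-NGT's forward pass.
import torch

M, N, T = 12, 7, 10           # sensors, series per sensor, window size

X = torch.randn(N - 1, T)     # exogenous series of the target sensor
Y = torch.randn(M - 1, T)     # target series of the M-1 relevant sensors
y_p = torch.randn(T)          # historical target series y^p

def forecast(X, Y, y_p):
    """y_hat_{T+1} = F(X, Y, y^p): map the three inputs to one scalar."""
    # Placeholder aggregation; a real model replaces this with J-NGT.
    return (X.mean() + Y.mean() + y_p.mean()) / 3.0

y_hat = forecast(X, Y, y_p)   # prediction for time point T + 1
```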

4 Model

Two-stage attention-based encoder-decoder networks are currently among the most popular methods for time series prediction. The encoder with spatial attention selects the relevant features, while the decoder with temporal attention captures the long-term dependencies. We propose a joint network of non-linear graph attention and temporal attraction force for geo-sensory time series prediction. First, we propose a non-linear graph attention mechanism to capture the inter-sensor spatial correlations for multiple sensors and the intra-sensor spatial correlations for a single sensor, as depicted in Fig. 3a. Then, we design a temporal attraction force to capture the inter- and intra-sensor temporal correlations between the current value and the previous values, as depicted in Fig. 3b.

Fig. 3

Graphical illustration of the joint network of non-linear GAT and temporal attraction force. a Non-linear graph attention mechanism. The non-linear GAT employs IMV-LSTM to transform the input features into higher-level features. Green, pink, and orange circles (e.g., the input data \( {x}_t^1 \), \( {\mathbf{y}}^p \), and \( {y}_t^1 \)) represent one-dimensional input features. Blocks containing rectangles with circles inside represent higher-level features. b Temporal attraction force mechanism. The output of the non-linear graph attention mechanism, i.e., \( {\overset{\sim }{\mathbf{D}}}_t \), \( {\mathbf{h}}_t \), and \( {\overset{\sim }{\mathbf{H}}}_t \), is used as the input to the temporal attraction force mechanism

4.1 IMV-LSTM

IMV-LSTM extends the standard LSTM by using hidden state tensors to update the gate units and memory cells. Hence, IMV-LSTM can directly encode individual variables into hidden states, reducing the time complexity. For example, given a two-variable input sequence, IMV-LSTM encodes the input features into a hidden matrix of size 4 × 2, i.e., a 4-dimensional hidden state per variable. Like a standard LSTM, IMV-LSTM contains an input gate \( {\overset{\sim }{\mathbf{i}}}_t \), a forget gate \( {\overset{\sim }{\mathbf{f}}}_t \), an output gate \( {\overset{\sim }{\mathbf{o}}}_t \), and a memory cell \( {\mathbf{s}}_t \). For multiple variables, the iterative update process of an IMV-LSTM unit is as follows:

$$ \left.\begin{array}{c}{\overset{\sim }{\mathbf{i}}}_t\\ {}{\overset{\sim }{\mathbf{f}}}_t\\ {}{\overset{\sim }{\mathbf{o}}}_t\end{array}\right\}=\sigma \left(\mathbf{W}\circledast {\mathbf{D}}_{t-1}+\mathbf{U}\circledast {\mathbf{y}}_t+\mathbf{b}\right) $$
(2)
$$ {\mathbf{j}}_t=\tanh \left({\mathbf{W}}_j\circledast {\mathbf{D}}_{t-1}+{\mathbf{U}}_j\circledast {\mathbf{y}}_t+{\mathbf{b}}_j\right) $$
(3)
$$ {\mathbf{s}}_t={\overset{\sim }{\mathbf{f}}}_t\odot {\mathbf{s}}_{t-1}+{\overset{\sim }{\mathbf{i}}}_t\odot {\mathbf{j}}_t $$
(4)
$$ {\mathbf{D}}_t={\overset{\sim }{\mathbf{o}}}_t\odot \tanh \left({\mathbf{s}}_t\right) $$
(5)

where \( {\mathbf{D}}_{t-1}\in {\mathbb{R}}^{\left(M-1\right)\times m} \) and \( {\mathbf{y}}_t\in {\mathbb{R}}^{M-1} \) are the previous hidden state and the current input, respectively. The cell update matrix \( {\mathbf{j}}_t\in {\mathbb{R}}^{\left(M-1\right)\times m} \) is used to update the memory cell \( {\mathbf{s}}_t \). The transition tensors \( \mathbf{W},{\mathbf{W}}_j\in {\mathbb{R}}^{\left(M-1\right)\times m\times m} \) and \( \mathbf{U},{\mathbf{U}}_j\in {\mathbb{R}}^{\left(M-1\right)\times m} \) are parameters to learn. ⊛ is the tensor-dot operation, i.e., the product of two tensors. σ and ⊙ denote the logistic sigmoid function and element-wise multiplication, respectively.
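
To make the per-variable update concrete, the following is a minimal single-step sketch of Eqs. (2)-(5), assuming V = M − 1 input variables with an m-dimensional hidden vector each; stacking the four transforms into one parameter tensor is our implementation choice, not the paper's notation.

```python
# A minimal IMV-LSTM step (Eqs. 2-5); initialisation and batching omitted.
import torch

def imv_lstm_step(y_t, D_prev, s_prev, W, U, b):
    """One IMV-LSTM step.

    y_t:    (V,)         current scalar input per variable
    D_prev: (V, m)       previous per-variable hidden states
    s_prev: (V, m)       previous per-variable memory cells
    W:      (4, V, m, m) transition tensors for gates i, f, o and update j
    U:      (4, V, m)    input tensors for the same four transforms
    b:      (4, V, m)    biases
    """
    # Tensor-dot (circled asterisk): each variable's hidden vector is
    # multiplied by its own m-by-m matrix, keeping variables separate.
    hidden = torch.einsum('gvij,vj->gvi', W, D_prev)   # (4, V, m)
    inputs = U * y_t[None, :, None]                    # (4, V, m)
    pre = hidden + inputs + b

    i_t = torch.sigmoid(pre[0])                        # input gate
    f_t = torch.sigmoid(pre[1])                        # forget gate
    o_t = torch.sigmoid(pre[2])                        # output gate
    j_t = torch.tanh(pre[3])                           # cell update, Eq. (3)
    s_t = f_t * s_prev + i_t * j_t                     # memory cell, Eq. (4)
    D_t = o_t * torch.tanh(s_t)                        # hidden state, Eq. (5)
    return D_t, s_t

V, m = 11, 32
D, s = torch.zeros(V, m), torch.zeros(V, m)
W, U, b = torch.randn(4, V, m, m), torch.randn(4, V, m), torch.zeros(4, V, m)
D, s = imv_lstm_step(torch.randn(V), D, s, W, U, b)
```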

4.2 Non-linear graph attention

Geo-sensory time series contain two kinds of spatial correlations, namely, inter-sensor spatial correlations across multiple sensors and intra-sensor spatial correlations within a single sensor. A non-linear graph attention mechanism is developed to capture both. Different from the original graph attention network (GAT), the non-linear graph attention mechanism depends on knowing the graph structure up front. We formulate the geo-sensory time series as two types of graph structures, as depicted in Fig. 2, where nodes represent the relevant sensor series (orange circles), exogenous series (green circles), and target series (pink circles). We take the target series as the central node and the other series as neighbourhood nodes.

4.2.1 Inter-sensor spatial correlation

Relevant sensors have decisive impacts on a target sensor. Hence, the non-linear graph attention mechanism aims to calculate the influence weights between the target sensor and the relevant sensors, namely, the inter-sensor spatial correlations, as depicted in Fig. 3a. The input of NGAT is a set of node features, \( \mathbf{y}=\left\{{y}_t^1,{y}_t^2,\dots, {y}_t^{M-1},{y}_t^p\right\} \), where \( {y}_t^i\in \mathbb{R} \) represents the target series of relevant sensor i, \( {y}_t^p\in \mathbb{R} \) represents the target series of the target sensor, and M is the number of nodes.

The node feature is an individual variable, which does not have sufficient expressive power. In the original GAT, a learnable linear transformation is applied to transform the input features into higher-level features [37]: each node is parametrized by a weight matrix, \( \mathbf{W}{y}_t^i \), with \( \mathbf{W}\in {\mathbb{R}}^{m} \). However, the expressive power of a linear transformation is limited. Therefore, we employ IMV-LSTM as the transformation and obtain the transformed representations as follows:

$$ {\mathbf{D}}_t={f}_{imv}^{sen}\left({\mathbf{D}}_{t-1},{\mathbf{y}}_t\right) $$
(6)
$$ {\mathbf{h}}_t={f}_{imv}^{pre}\left({\mathbf{h}}_{t-1},{y}_t^p\right) $$
(7)

where \( {\mathbf{D}}_t\in {\mathbb{R}}^{\left(M-1\right)\times m} \) and \( {\mathbf{h}}_t\in {\mathbb{R}}^{m} \) are the hidden state tensors of the relevant sensor series and the target series, respectively. \( {f}_{imv}^{sen} \) and \( {f}_{imv}^{pre} \) are IMV-LSTM units computed according to Eqs. (2)-(5). IMV-LSTM uses a tensor to represent the higher-level features of the sensor series at time t, such that each row vector of the hidden state tensor represents the higher-level feature of an individual variable:

$$ {\mathbf{D}}_t=\left({\mathbf{d}}_t^1,{\mathbf{d}}_t^2,\dots, {\mathbf{d}}_t^{M-1}\right) $$
(8)

where the element \( {\mathbf{d}}_t^i\in {\mathbb{R}}^m \) of Dt is the hidden state vector specific to relevant sensor i.

We exploit GAT to calculate the spatial correlations between the target series and the relevant sensors, i.e., to assign different weights to different relevant sensors. Here, we calculate the weight coefficient between the target series and relevant sensor i. First, a weight vector a is employed for parametrization, and then the LeakyReLU nonlinearity (with negative input slope α = 0.2) is applied. The correlation coefficient is expressed as:

$$ {e}_t^i=\mathrm{LeakyReLU}\left({a}^{\top}\left[{\mathbf{d}}_t^i;{\mathbf{h}}_t\right]\right) $$
(9)
$$ {\displaystyle \begin{array}{c}{\alpha}_t^i=\mathrm{softmax}\left({e}_t^i\right)\\ {}=\frac{\exp \left(\mathrm{LeakyReLU}\left({a}^{\top}\left[{\mathbf{d}}_t^i;{\mathbf{h}}_t\right]\right)\right)}{\sum_{{\mathbf{d}}_t^k\in {\mathbf{D}}_t}\exp \left(\mathrm{LeakyReLU}\left({a}^{\top}\left[{\mathbf{d}}_t^k;{\mathbf{h}}_t\right]\right)\right)}\end{array}} $$
(10)

where \( \left[{\mathbf{d}}_t^i;{\mathbf{h}}_t\right] \) denotes concatenation and ⊤ denotes transposition. The attention weight \( {\alpha}_t^i \) represents the importance of relevant sensor i to the target series, namely, the inter-sensor spatial correlation. A softmax function is applied to normalize the weights across all relevant sensors. Once obtained, the attention weights are assigned to the different relevant sensors:

$$ {\displaystyle \begin{array}{c}{\overset{\sim }{\mathbf{D}}}_t=\left({\alpha}_t^1{\mathbf{d}}_t^1,{\alpha}_t^2{\mathbf{d}}_t^2,\dots, {\alpha}_t^{M-1}{\mathbf{d}}_t^{M-1}\right)\\ {}=\left({\overset{\sim }{\mathbf{d}}}_t^1,{\overset{\sim }{\mathbf{d}}}_t^2,\dots, {\overset{\sim }{\mathbf{d}}}_t^{M-1}\right)\end{array}} $$
(11)

where \( {\overset{\sim }{\mathbf{d}}}_t^i \) is the new higher-level feature of relevant sensor i.
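
The following is a minimal sketch of the attention step in Eqs. (9)-(11), assuming the per-sensor features \( {\mathbf{d}}_t^i \) and the target feature \( {\mathbf{h}}_t \) were already produced by IMV-LSTM; the attention vector a is the learnable parameter from Eq. (9). The same routine, with \( {\mathbf{H}}_t \) in place of \( {\mathbf{D}}_t \), yields the intra-sensor weights of the next subsection.

```python
# A minimal non-linear graph attention step (Eqs. 9-11).
import torch
import torch.nn.functional as F

def ngat_weights(D_t, h_t, a, slope=0.2):
    """D_t: (V, m) neighbour features; h_t: (m,) target feature; a: (2m,)."""
    h_rep = h_t.expand_as(D_t)                        # broadcast target feature
    cat = torch.cat([D_t, h_rep], dim=-1)             # [d_t^i ; h_t], (V, 2m)
    e = F.leaky_relu(cat @ a, negative_slope=slope)   # scores e_t^i, Eq. (9)
    return torch.softmax(e, dim=0)                    # weights alpha_t^i, Eq. (10)

V, m = 11, 32
D_t, h_t, a = torch.randn(V, m), torch.randn(m), torch.randn(2 * m)
alpha = ngat_weights(D_t, h_t, a)
D_tilde = alpha[:, None] * D_t                        # reweighted features, Eq. (11)
```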

4.2.2 Intra-sensor spatial correlation

Inside the target sensor, there are complex correlations between the target series and the exogenous series. For instance, an air quality sensor reports several time series, such as PM2.5, PM10, and CO, and in the real world the PM2.5 concentration is affected by the PM10 and CO concentrations. To model this, we also apply the non-linear graph attention mechanism to calculate the correlations between the target series and the exogenous series, namely, the intra-sensor spatial correlations. The target series feature \( {y}_t^p\in \mathbb{R} \) represents the central node, and the exogenous series \( {\mathbf{x}}_t=\left[{x}_t^1,{x}_t^2,\dots, {x}_t^{N-1}\right]\in {\mathbb{R}}^{N-1} \) provide the neighbourhood nodes, where \( {x}_t^j\in \mathbb{R} \) represents the feature of neighbourhood node j. We construct a non-linear graph attention mechanism to calculate the intra-sensor spatial correlations, as depicted in Fig. 3a. First, we transform the input features into higher-level features via IMV-LSTM:

$$ {\mathbf{H}}_t={f}_{imv}^{\mathrm{exo}}\left({\mathbf{H}}_{t-1},{\mathbf{x}}_{\mathbf{t}}\right) $$
(12)
$$ {\mathbf{H}}_t=\left[{\mathbf{h}}_t^1,{\mathbf{h}}_t^2,\dots, {\mathbf{h}}_t^{N-1}\right] $$
(13)

where \( {\mathbf{H}}_t\in {\mathbb{R}}^{\left(N-1\right)\times n} \) represents the hidden state tensor of the exogenous series and \( {\mathbf{h}}_t^j\in {\mathbb{R}}^n \) is the higher-level feature of the j-th node. \( {f}_{imv}^{\mathrm{exo}} \) is an IMV-LSTM unit computed according to Eqs. (2)-(5) with the new inputs \( {\mathbf{H}}_{t-1} \) and \( {\mathbf{x}}_t \). The correlation coefficients between the target series and the exogenous series are calculated as follows:

$$ {l}_t^j=\mathrm{LeakyReLU}\left({a}^{\top}\left[{\mathbf{h}}_t^j;{\mathbf{h}}_t\right]\right) $$
(14)
$$ {\displaystyle \begin{array}{c}{\beta}_t^j=\mathrm{softmax}\left({l}_t^j\right)\\ {}=\frac{\exp \left(\mathrm{LeakyReLU}\left({a}^{\top}\left[{\mathbf{h}}_t^j;{\mathbf{h}}_t\right]\right)\right)}{\sum_{{\mathbf{h}}_t^k\in {\mathbf{H}}_t}\exp \left(\mathrm{LeakyReLU}\left({a}^{\top}\left[{\mathbf{h}}_t^k;{\mathbf{h}}_t\right]\right)\right)}\end{array}} $$
(15)

where the attention weight \( {\beta}_t^j \) measures the importance of the j-th exogenous series at time t, and \( {\mathbf{h}}_t \) is computed according to Eq. (7). With these attention weights, the new features are produced as follows:

$$ {\displaystyle \begin{array}{c}{\overset{\sim }{\mathbf{H}}}_t=\left[{\beta}_t^1{\mathbf{h}}_t^1,{\beta}_t^2{\mathbf{h}}_t^2,\dots, {\beta}_t^{N-1}{\mathbf{h}}_t^{N-1}\right]\\ {}=\left[{\overset{\sim }{\mathbf{h}}}_t^1,{\overset{\sim }{\mathbf{h}}}_t^2,\dots, {\overset{\sim }{\mathbf{h}}}_t^{N-1}\right]\end{array}} $$
(16)

4.3 Temporal attraction force mechanism

Most research on the temporal dependence of geo-sensory time series has focused on blindly blending the information of relevant sensors and exogenous series and then calculating the temporal dependence by a temporal attention mechanism. Hence, these studies rarely distinguish the contributions of the historical target series, exogenous series, and relevant sensor series to the predictive value.

In the temporal dimension, there may be temporal correlations between pairs of time points, which can be regarded as diverse attractive factors between time points. Therefore, we assume that a dynamic attraction force always exists between any pair of time points. To build intuition for this attraction force, consider the universe: planets revolve around a star in a galaxy, and the universal gravitation between a planet and the star prevents the planet from flying away. Similarly, the current time point gains information from the historical time points, which can be viewed as the historical time points revolving around the current one. Hence, we treat the gravitational force between the current time point and a historical time point as their temporal correlation, which measures the importance of the historical time point to the current one.

Given the mass of the planet and the star, i.e., m1 and m2, universal gravitation is presented as follows:

$$ F=G\ \frac{m_1{m}_2}{r^2} $$
(17)

where r denotes the distance between the planet and the star, and G is the gravitational constant.

The gravitation formula was used by [40] to measure the correlations between nodes for link prediction in social networks, which are typical graph-structured data: two people are predicted to have a link if their attraction force reaches a specified threshold. We can likewise map a time series to a graph structure in the temporal dimension, where a time point is regarded as a node and the edge weight is the temporal correlation (attraction force). Inspired by the Law of universal gravitation, we propose a modified gravitation formula, namely the temporal attraction force mechanism, for measuring temporal correlations. In our work, the feature of a time point is defined as its mass, which is a matrix rather than a constant, e.g., \( {\overset{\sim }{\mathbf{D}}}_t \) and \( {\overset{\sim }{\mathbf{H}}}_t \). The distance is the difference between two time points, i.e., r = t − i, where t is the current time point and i is a historical time point. In addition, the gravitational constant G is replaced by a learnable weight matrix.

From another point of view, we score the temporal correlation between the current and historical time points by applying a learnable linear transformation (i.e., G) to the fused features, which is further scaled by the squared distance between the two time points. Here, the features of the two time points are fused by element-wise multiplication. The purpose of the scaling is to increase sensitivity to time intervals. Finally, a softmax function is employed to obtain the weights.

The temporal attraction force mechanism takes the output of the non-linear graph attention mechanism as input, as depicted in Fig. 3b. The inter-sensor temporal correlation for relevant sensors is given by

$$ {p}_{ti}={\mathbf{G}}_p\frac{{\overset{\sim }{\mathbf{D}}}_t\ast {\overset{\sim }{\mathbf{D}}}_i}{{\left(t-i\right)}^2} $$
(18)
$$ {\alpha}_{ti}=\frac{\exp \left({p}_{ti}\right)}{\sum_{k=1}^{t-1}\exp \left({p}_{tk}\right)} $$
(19)

where \( {\overset{\sim }{\mathbf{D}}}_t \) is the feature of relevant sensors at time point t,\( {\overset{\sim }{\mathbf{D}}}_i \) is the feature of relevant sensors at time point i (t > i), and Gp is a parameter to learn. (t − i) is the temporal distance, which is the difference between time point t and time point i. * is element-wise multiplication. Then, a softmax function is applied to pti to make correlations easily comparable across all time points. The correlation coefficient αti represents the importance of the relevant sensor features at time point i for the relevant sensor features at time t. Once obtained, the correlation coefficients are utilized to compute a weighted sum of the features corresponding to them as the new temporal features at time t:

$$ {{\overset{\sim }{\mathbf{D}}}_t}^{\prime }={\sum}_{i=1}^{t-1}{\alpha}_{ti}{\overset{\sim }{\mathbf{D}}}_i $$
(20)
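
As a concrete illustration, the following is a minimal sketch of Eqs. (18)-(20), assuming the learnable transform \( {\mathbf{G}}_p \) is reduced to a weighted sum over the fused feature matrix (the text leaves its exact shape open, so this reduction is our assumption). The same routine, up to the 1/(t−1) factor, serves Eqs. (21)-(26).

```python
# A minimal temporal attraction force step (Eqs. 18-20), simplified so that
# G is a (V, m) weight matrix applied via an inner product.
import torch

def temporal_attraction(feats, G):
    """feats: (t, V, m) features for time points 1..t; G: (V, m) weights.

    Returns the new temporal feature for the last time point, Eq. (20).
    """
    t = feats.shape[0]
    cur, hist = feats[-1], feats[:-1]                     # current vs. past
    dist = torch.arange(t - 1, 0, -1, dtype=feats.dtype)  # distances t - i
    # Fuse features by element-wise product, apply the learnable transform
    # (reduced to a weighted sum here), and scale by 1 / r^2.  Eq. (18)
    p = (G * cur * hist).sum(dim=(1, 2)) / dist ** 2      # scores p_ti, (t-1,)
    alpha = torch.softmax(p, dim=0)                       # weights, Eq. (19)
    return (alpha[:, None, None] * hist).sum(dim=0)       # weighted sum

t, V, m = 10, 11, 32
feats, G = torch.randn(t, V, m), torch.randn(V, m)
D_new = temporal_attraction(feats, G)                     # (V, m)
```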

Similarly, the intra-sensor temporal correlation for two time points is given by

$$ {q}_{ti}={\mathbf{G}}_{\boldsymbol{q}}\frac{{\overset{\sim }{\mathbf{H}}}_t\ast {\overset{\sim }{\mathbf{H}}}_i}{{\left(t-i\right)}^2} $$
(21)
$$ {\beta}_{ti}=\frac{\exp \left({q}_{ti}\right)}{\sum_{k=1}^{t-1}\exp \left({q}_{tk}\right)} $$
(22)

where \( {\overset{\sim }{\mathbf{H}}}_t \) is the feature of the exogenous series at time point t, \( {\overset{\sim }{\mathbf{H}}}_i \) is the feature of the exogenous series at time point i (t > i), and Gq is a parameter to learn. (t − i) is the temporal distance, i.e., the difference between time point t and time point i, and * is element-wise multiplication. We normalize these weight coefficients by a softmax function. With these weight coefficients, the new temporal feature of the exogenous series is computed as

$$ {{\overset{\sim }{\mathbf{H}}}_t}^{\prime }=\frac{1}{t-1}{\sum}_{i=1}^{t-1}{\beta}_{ti}{\overset{\sim }{\mathbf{H}}}_i $$
(23)

Univariate time series prediction models analyse the temporal dependence between the previous values and the current value of the predicted series. However, multivariate time series prediction models generally ignore the long-term dependencies of the target series. To ensure that our model does not suffer from the loss of this historical information, we also consider the temporal correlations of the target series, which can be obtained by the temporal attraction force mechanism as follows:

$$ {o}_{ti}={\mathbf{G}}_{\boldsymbol{o}}\frac{{\mathbf{h}}_t\ast {\mathbf{h}}_i}{{\left(t-i\right)}^2} $$
(24)
$$ {\gamma}_{ti}=\frac{\exp \left({o}_{ti}\right)}{\sum_{k=1}^{t-1}\exp \left({o}_{tk}\right)} $$
(25)

Then, the new temporal feature of the target series is defined as follows:

$$ {\overset{\sim }{\mathbf{h}}}_t^{\prime }=\frac{1}{t-1}{\sum}_{i=1}^{t-1}{\gamma}_{ti}{\mathbf{h}}_i $$
(26)

Considering all the temporal correlations, we briefly integrate the new temporal features of the inter-sensor, intra-sensor, and target series as follows:

$$ {\mathbf{C}}_t=\left[{{\overset{\sim }{\mathbf{D}}}_t}^{\prime };{{\overset{\sim }{\mathbf{H}}}_t}^{\prime };{\overset{\sim }{\mathbf{h}}}_t^{\prime}\right] $$
(27)

The temporal attraction force mechanism sufficiently selects the historical information of the target series, exogenous series, and relevant sensor series, and it takes advantage of inter- and intra-sensor features to strengthen the temporal correlations. Finally, we use a linear transformation to produce the final output:

$$ {\hat{y}}_{T+1}^p=\mathcal{F}\left(\mathbf{X},\mathbf{Y},{\mathbf{y}}^{\boldsymbol{p}}\right)={\mathbf{v}}_y^{\top}\left({\mathbf{W}}_y{\mathbf{C}}_t+{\mathbf{b}}_w\right)+{b}_v $$
(28)
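
A minimal sketch of the fusion and readout in Eqs. (27)-(28), assuming the three temporal features are flattened into one vector before the linear transformation; the hidden dimension of 64 is illustrative, and \( {\mathbf{v}}_y \), \( {\mathbf{W}}_y \), \( {\mathbf{b}}_w \), and \( {b}_v \) are the learnable parameters named in the text.

```python
# Fusion (Eq. 27) and final linear readout (Eq. 28).
import torch

def readout(D_new, H_new, h_new, W_y, b_w, v_y, b_v):
    C_t = torch.cat([D_new.flatten(), H_new.flatten(), h_new])  # Eq. (27)
    return v_y @ (W_y @ C_t + b_w) + b_v                        # Eq. (28)

V, N1, m = 11, 6, 32
D_new, H_new, h_new = torch.randn(V, m), torch.randn(N1, m), torch.randn(m)
d = (V + N1 + 1) * m                     # length of the concatenated vector
W_y, b_w = torch.randn(64, d), torch.zeros(64)
v_y, b_v = torch.randn(64), 0.0
y_hat = readout(D_new, H_new, h_new, W_y, b_w, v_y, b_v)  # scalar prediction
```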

4.4 Complexity analyses

In this section, we analyse the complexity of our model. Our model has five parts: two non-linear graph attention mechanisms and three temporal attraction force mechanisms. For IMV-LSTM, the time complexity is O(D^2/N + N*D), where D is the total number of IMV-LSTM neurons and N is the number of input variables. The time complexity of a non-linear graph attention mechanism over N nodes is O(D^2 + N*D). The time complexity of the temporal attraction force mechanism is O(N*D^2). Therefore, the total time complexity is O((N + M + 3)*D^2 + (N + M)*D). In the actual training process, we use a GPU to improve the training speed.

5 Experiment

5.1 Dataset description

We utilize three real-world geo-sensory time series datasets to evaluate our model: the Beijing air quality dataset, the traffic flow (PEMS08) dataset, and the Ireland weather dataset, as shown in Table 1.

Table 1 Details of the datasets
  1.

    Air quality dataset

This dataset contains hourly concentrations of several pollutants (i.e., PM2.5, PM10, SO2, NO2, CO, O3) as well as meteorological readings (i.e., temperature, pressure, dew point temperature, precipitation, wind direction, wind speed) from 12 nationally controlled air quality monitoring sites. The time period is from March 1st, 2013, to February 28th, 2017. We employ the Aotizhongxin station as the target sensor and the others as relevant sensors. Since PM2.5 is generally the primary air pollutant, we take it as the target series and the concentrations of the other pollutants as the exogenous series.

  2.

    PEMS08 dataset

This dataset consists of data from 170 sensors collected by the Caltrans Performance Measurement System, aggregated into 5-minute windows from the raw data [41]. The dataset ranges from July 1, 2016, to August 31, 2016. We choose one station as the target sensor and the others as relevant sensors. We set the traffic flow as the target series and the other key attributes of the traffic observations (i.e., occupancy and speed) as exogenous series. Constrained by our experimental environment, we select 20 sensors to verify our model.

  3.

    Weather dataset

The dataset records hourly weather data from 23 Met Éireann weather stations in Ireland. The time period is from January 1, 2018, to February 1, 2022. We take the temperature as the target series and choose 8 relevant features as exogenous series. The Cork station is set as the target sensor, and the other 22 stations are set as relevant sensors.

In fact, there are missing values in all datasets due to sensor power outages or communication errors. We employ linear interpolation to fill the missing values. We partition the datasets into the training, validation, and test sets by a ratio of 6:2:2.
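
As an illustration, the following is a minimal preprocessing sketch matching these two steps (linear interpolation, then a chronological 6:2:2 split); the toy DataFrame stands in for a real sensor export, and the column name is hypothetical.

```python
# Fill missing readings, then split chronologically into 6:2:2.
import numpy as np
import pandas as pd

idx = pd.date_range("2013-03-01", periods=100, freq="h")
df = pd.DataFrame({"PM2.5": np.random.rand(100)}, index=idx)
df.iloc[10:13] = np.nan                       # simulate a sensor outage

# Linear interpolation along the time axis.
df = df.interpolate(method="linear", limit_direction="both")

# Chronological 6:2:2 split into training, validation, and test sets.
n = len(df)
train = df.iloc[: int(0.6 * n)]
val = df.iloc[int(0.6 * n): int(0.8 * n)]
test = df.iloc[int(0.8 * n):]
```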

5.2 Methods for comparison

We select statistical models, machine learning models, and state-of-the-art deep learning models as comparison models. The models are introduced as follows:

ARIMA [42]

ARIMA is a typical statistical model for univariate time series prediction. It converts nonstationary time series to stationary data by differencing.

SVR [43]

SVR is an application of support vector machines to time series prediction problems. A significant advantage of SVR is that it can deal with small numbers of high-dimensional datasets.

DA-RNN [5]

The attention-based encoder-decoder network for time series prediction employs an input attention mechanism to obtain spatial correlations and temporal attention to capture temporal dependencies.

DSTP [44]

The model employs a two-phase attention mechanism to strengthen the spatial correlations and a temporal attention mechanism to capture temporal dependencies for long-term, multivariate time series prediction.

hDS-RNN [4]

The model develops a hybrid spatial-temporal attention mechanism, which can enhance spatial-temporal correlation learning.

DAQFF [45]

DAQFF is a hybrid deep learning model that employs one-dimensional CNNs and bidirectional LSTM to extract trend features and possible spatial correlation features of multiple stations.

GeoMAN [15]

GeoMAN is a multi-level attention-based recurrent neural network for geo-sensory time series prediction. The model considers local spatial correlations between target series and exogenous series as well as global spatial correlations between sensors.

IMV-LSTM [14]

IMV-LSTM is an extension of LSTM. The model utilizes tensorized hidden states and an associated updating scheme to update gate control units and memory cells.

5.3 Experimental settings

We execute a grid search strategy and choose the best values for the key hyperparameters of J-NGT. For the window size T, we set T ∈ {5, 10, 15, 20, 25}. To determine the dimensions of the higher-level features for the relevant sensor series and the exogenous series, we set m = n ∈ {16, 32, 64, 128}. For all RNN-based models (i.e., DA-RNN, DSTP, hDS-RNN, DAQFF, GeoMAN, and IMV-LSTM), we similarly adopt a grid search strategy to determine the best performance of these models for a fair comparison. We take T as the window and m as the size of the hidden states for the RNN. For DAQFF, the three convolution layers have different filter sizes, which we set to 64, 32, and 16. The models are trained for 10 epochs with a batch size of 128. The initial learning rate is set to 0.001 and decays by 10% every 3 epochs.
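
A minimal sketch of this grid search follows; train_and_validate is a hypothetical stand-in that trains J-NGT once with the given hyperparameters and returns the validation MAE.

```python
# Grid search over window size T and hidden dimension m = n.
from itertools import product

def train_and_validate(T, m):
    # Placeholder: train J-NGT with window size T and hidden size m = n,
    # then return the validation MAE. Dummy score for illustration only.
    return abs(T - 15) * 0.1 + abs(m - 64) * 0.01

windows = [5, 10, 15, 20, 25]       # candidate window sizes T
dims = [16, 32, 64, 128]            # candidate hidden dimensions m = n

best_mae, best_cfg = float("inf"), None
for T, m in product(windows, dims):
    mae = train_and_validate(T, m)
    if mae < best_mae:
        best_mae, best_cfg = mae, (T, m)
print("best (T, m):", best_cfg)
```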

To assess the performance of J-NGT and the comparison models, we adopt three evaluation metrics: mean absolute error (MAE), root mean square error (RMSE), and R squared (R2). MAE and RMSE measure the error between the predicted and observed values, while R2 measures the goodness of fit of the model. The range of R2 is [0, 1]; the closer R2 is to 1, the higher the prediction accuracy of the model. MAE, RMSE, and R2 are defined as follows:

$$ \mathrm{MAE}=\frac{1}{N}{\sum}_{i=1}^N\mid {y}_t^i-{\hat{y}}_t^i\mid $$
(29)
$$ \mathrm{RMSE}=\sqrt{\frac{1}{N}{\sum}_{i=1}^N{\left({y}_t^i-{\hat{y}}_t^i\right)}^2} $$
(30)
$$ {\mathrm{R}}^2=1-\frac{\sum_{i=1}^N{\left({y}_t^i-{\hat{y}}_t^i\right)}^2}{\sum_{i=1}^N{\left({y}_t^i-\overline{y}\right)}^2} $$
(31)

where \( {\hat{y}}_t \) and yt are the predicted value and observed value at time t, \( \overline{y} \) is the average value of the observed values, and N represents the number of samples.
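
These three metrics translate directly into NumPy; a minimal implementation matching Eqs. (29)-(31):

```python
# MAE, RMSE, and R^2 over paired observed / predicted arrays.
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)       # total sum of squares
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
print(mae(y, y_hat), rmse(y, y_hat), r2(y, y_hat))
```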

5.4 Comparison and result analysis

In this section, we give the experimental results on three real-world datasets, as shown in Tables 2, 3 and 4. The best results of each dataset are marked. In addition, we show the fitting results with bar charts in Fig. 4 to clearly observe their differences.

Table 2 Prediction results on geo-sensory time series datasets with sensor series and exogenous series
Table 3 Prediction results on geo-sensory time series datasets with sensor series
Table 4 Prediction results on geo-sensory time series datasets with exogenous series
Fig. 4

Fitting results on three datasets

Table 2 displays the performance of each model on the three datasets with both sensor series and exogenous series. As seen in the table, J-NGT significantly outperforms all the comparison models on the three evaluation metrics. Here we discuss MAE. J-NGT achieves an MAE between 3.7% and 90.1% lower than the other models on the three datasets. The table shows that ARIMA performs worst: the MAE of J-NGT is 8.5449 on the air quality dataset, approximately 86.0% less than that of ARIMA (61.0411), since ARIMA models only the target series and ignores the relevant sensor series and exogenous series of the geo-sensory time series. SVR takes all the information of the geo-sensory time series into account, so its MAE is significantly lower than that of ARIMA. Although SVR improves prediction accuracy to a certain extent, it incurs considerable computational cost for large-scale multivariate time series prediction tasks. Because geo-sensory time series exhibit long-term dependencies, J-NGT achieves better performance than SVR by modelling much longer dependencies.

The attention-based RNNs, such as DA-RNN, DSTP, and hDS-RNN, employ various attention mechanisms to obtain spatial correlations and temporal attention mechanisms to capture temporal correlations. Although these models outperform SVR and ARIMA, they mix the sensor series and exogenous series when capturing spatial correlations; as a result, the MAE values of J-NGT are smaller than those of the attention-based RNNs. For example, J-NGT shows 39.0%, 47.7%, and 39.6% improvements in MAE over the above attention-based RNN models on the air quality dataset. These methods blend the information of the relevant sensor series and exogenous series and can hardly distinguish their contributions to the predictive value.

DAQFF employs one-dimensional convolutional neural networks to extract the local trend features and spatial correlation features of multiple stations, but its implicit features do not contain geospatial correlations. The results reveal that DAQFF and DA-RNN achieve comparable performance. GeoMAN outperforms the above models since it captures both intra-sensor and inter-sensor spatial correlations. Since J-NGT not only considers both kinds of spatial correlations but also captures the inter- and intra-sensor temporal correlations, it achieves better performance than GeoMAN; for instance, J-NGT shows a 9.6% improvement in MAE over GeoMAN on the air quality dataset. The IMV-LSTM network outperforms the attention-based RNN models (i.e., DA-RNN, DSTP, and hDS-RNN) by up to 46.1% since it models individual variables and can thus capture their different dynamics to make accurate predictions.

In summary, J-NGT outperforms the comparison models. This illustrates that capturing the inter- and intra-sensor spatial-temporal correlations provides more reliable input features for accurate prediction. The non-linear graph attention mechanism calculates the correlation weights between the target series and the other series (relevant sensor series and exogenous series), and the temporal attraction force mechanism sufficiently selects the historical information of the target series, exogenous series, and relevant sensor series. For visual comparison, Fig. 4 provides the R2 of all models; J-NGT achieves the best fitting effect across the three datasets.

5.5 Evaluation of the sub-dataset

To verify the importance of both the inter- and intra-sensor spatial-temporal correlations, we compare J-NGT with the comparison models on sub-datasets. We divide each dataset into two sub-datasets, one containing the sensor series and target series and the other containing the exogenous series and target series. J-NGT can be separated into two parts: J-NGT-sen, which captures inter-sensor spatial-temporal correlations on the sensor series sub-dataset, and J-NGT-exo, which captures intra-sensor spatial-temporal correlations on the exogenous series sub-dataset. Tables 3 and 4 compare the performance of the submodules and the comparison models on the six sub-datasets. We do not report results for ARIMA, DAQFF, and GeoMAN on the sub-datasets since ARIMA only considers the target series, while DAQFF and GeoMAN are designed for full geo-sensory time series.

In Tables 3 and 4, we observe that J-NGT-sen and J-NGT-exo exceed almost all comparison models on the sub-datasets of both the Air Quality and Weather datasets. On the sub-datasets of the PEMS08 dataset, J-NGT consistently achieves the best performance compared with SVR, LSTM, DA-RNN, DSTP, and hDS-RNN. The IMV-LSTM model achieves competitive results on the sub-datasets of the PEMS08 dataset, outperforming J-NGT-sen and J-NGT-exo. However, the full J-NGT does not suffer from this issue since it captures both inter-sensor and intra-sensor spatial-temporal correlations.

To show the necessity of capturing both inter-sensor and intra-sensor spatial-temporal correlations, we compare J-NGT with the comparison models on the sub-datasets and the full datasets. J-NGT achieves an MAE between 3.9% and 12.8% lower than J-NGT-sen and J-NGT-exo on the three datasets. For visual comparison, we provide the experimental results on MAE and R2, as depicted in Figs. 5, 6 and 7. From the data in the figures, it is apparent that J-NGT obtains the highest performance on the three geo-sensory time series datasets with sensor series and exogenous series.

Fig. 5

Prediction results on the Air Quality dataset

Fig. 6

Prediction results on the PEMS08 dataset

Fig. 7

Prediction results on the Weather dataset

5.6 Effects of different components

We conduct a detailed experiment to demonstrate the effectiveness of the different components. Specifically, we replace each component with a state-of-the-art alternative in our model framework. The J-NGT variants with replaced components are named as follows.

J-NGT/linear transformation

We employ a learnable linear transformation, i.e., a weight matrix WF, instead of the IMV-LSTM in J-NGT.

J-NGT/temporal attention

We replace the temporal attraction force mechanism with a temporal attention mechanism. The temporal attention mechanism is commonly utilized to capture long-term dependencies and has demonstrated outstanding performance.

Table 5 presents the prediction result of J-NGT and its variants. We highlight several observations from these results:

  1. (1)

    The best results on all datasets are obtained with a non-linear graph attention mechanism and temporal attraction force mechanism.

  2. (2)

    Replacing the non-linear graph attention mechanism with a linear transformation leads to a degradation of model performance on all datasets. This result suggests that the non-linear graph attention component plays a role in improving prediction accuracy.

  3. (3)

    The temporal attraction force mechanism has not previously been used to obtain temporal correlations. To verify its validity, we replace it with a temporal attention mechanism. As Table 5 shows, the performance of J-NGT/temporal attention drops on all datasets. This result shows that the temporal attraction force mechanism is effective in capturing temporal correlations.

Table 5 Prediction results of J-NGT and its variants

Taken together, these results suggest that J-NGT fully captures both inter- and intra-sensor spatial-temporal correlations and still achieves the best prediction performance.

5.7 Statistical test

To rigorously evaluate the performance of J-NGT, we perform a statistical test on the MAE values of the three datasets. Following previous work that uses a two-tailed t-test [46], we adopt the same procedure. We set the significance level α to 0.05. When the computed t statistic is larger than the critical value from the t-table, i.e., p ≤ 0.05, the hypothesis that mi < m0 cannot be rejected, indicating a significant difference.

On the Air Quality dataset, the average MAE of J-NGT is computed as \( \mu ={\sum}_{i=1}^k{m}_i/k=8.5449 \), where mi is the i-th MAE value and k = 5. The variance is computed as \( {\delta}^2={\sum}_{i=1}^k{\left({m}_i-\mu \right)}^2/\left(k-1\right)=2.1\times {10}^{-4} \). The test statistic is computed as \( t=\sqrt{k}\mid \mu -{m}_0\mid /\delta =3.608 \), where m0 = 8.5683 is the assumed maximum MAE value. In the same way, on PEMS08 and Weather, the t statistics are 4.124 (m0 = 20.4772) and 3.350 (m0 = 0.4001), respectively. The t statistics of all three datasets are greater than the critical value 2.776 (marked in black in Table 6) from the two-tailed t-test table, which indicates that the MAE of J-NGT is smaller than the assumed maximum MAE value with confidence 1 − α = 0.95.
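
The statistic can be checked directly from the numbers reported above; a minimal worked computation for the Air Quality case:

```python
# t = sqrt(k) * |mu - m0| / delta, with the reported Air Quality values.
import math

k, mu, var, m0 = 5, 8.5449, 2.1e-4, 8.5683
delta = math.sqrt(var)                     # sample standard deviation
t = math.sqrt(k) * abs(mu - m0) / delta
print(round(t, 3))   # about 3.61 (3.608 up to rounding), above 2.776 at alpha = 0.05
```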

Table 6 Two-tailed T-test table

From the above statistical tests, it can be seen that the performance of J-NGT on the three datasets is significantly better than that of the comparison models.

6 Conclusion

We propose a joint network of non-linear graph attention and temporal attraction force (J-NGT) for geo-sensory time series prediction. Specifically, we propose two graph attention mechanisms to capture both inter- and intra-sensor spatial-temporal correlations, which improves prediction accuracy. To investigate the effectiveness of J-NGT, we conduct three groups of experiments:

  1. (1)

    Compared with statistical models, machine learning models, and deep learning models on three real-world datasets, J-NGT outperforms all of them. In particular, the comparison with the GeoMAN and DAQFF models suggests that J-NGT enhances the ability to obtain inter- and intra-sensor spatial-temporal correlations.

  2. (2)

    The geo-sensory time series data are split into multivariate time series and sensor series, and we run our model on the sub-datasets. The results show that it is necessary to model both types of spatial-temporal correlations simultaneously.

  3. (3)

    The experiments replacing our proposed components in the full model validate the necessity of each component in J-NGT.

The advantages of J-NGT are summarized as follows:

  1. (1)

    Considering both the exogenous series and the sensor series helps overcome the limitation of insufficient spatial-temporal information.

  2. (2)

    We propose the non-linear graph attention mechanism to learn the inter- and intra-sensor spatial correlations. To capture the temporal correlations, we design the temporal attraction force mechanism to sufficiently select the historical information of the target series, exogenous series, and relevant sensor series. The joint network of non-linear graph attention and temporal attraction force can distinguish the contributions of the exogenous series and relevant sensor series to the predictive value.

There are several promising directions for future work. First, we will design a similarity measure algorithm to calculate the temporal distance between time points, replacing the fixed distance in the temporal attraction force mechanism. The temporal distance is a crucial hyperparameter: as it increases, gradients become more prone to exploding, so how to calculate the temporal distance is a significant problem. Second, we will employ a novel graph attention mechanism to recover consecutive missing values of geo-sensory time series. Currently, we fill missing values by linear interpolation; how to fill consecutive missing values using contextual information is an important problem for geo-sensory time series prediction.