Introduction

In recent years, the rapid advancement of sensing devices in intelligent transportation systems (ITS) has facilitated the collection of large volumes of traffic data. This data not only plays an important role in constructing traffic prediction models but also aids in understanding the movement patterns of vehicles or crowds [4, 18]. In the field of traffic prediction, origin-destination (OD) demand prediction has gradually emerged as an independent and significant research area. Unlike traditional traffic prediction, OD prediction focuses on forecasting traffic demand from specific origin locations to particular destinations using historical observations [36].

OD flow data is typically sourced from various channels, including sensors, smart card records, and taxi booking information [23]. Each of these channels provides valuable insights into different aspects of transportation behavior. Moreover, OD flows can be categorized into static flows and dynamic flows based on their inherent characteristics [8, 21]. Static flows refer to stable and consistent movement patterns that occur during specific time intervals, such as peak commuting hours between residential and commercial areas [35]. These flows follow regular routines and display predictable traffic demand patterns. In contrast, dynamic flows capture temporal fluctuations in population movement, reflecting changes in individual travel behavior influenced by factors like day of the week, time of day, and city events. Dynamic flows are more variable and therefore harder to model, because human mobility changes constantly. Hence, effective OD flow modeling requires the incorporation of both static and dynamic flow features.

Faced with the challenge of OD demand prediction, traditional statistical methods encounter limitations with nonlinear and complex OD data. As a result, researchers are increasingly turning towards deep learning models to address these spatial-temporal complexities. Convolutional neural networks (CNNs) are commonly employed for spatial dependencies, especially suitable for grid-like data structures [24]. Recurrent neural networks (RNNs), including variants like long short-term memory (LSTM), are favored for capturing temporal patterns and modeling long-term dependencies in sequential data [37]. Moreover, attention mechanisms play a crucial role in OD demand prediction [30], enabling models to focus on relevant spatial or temporal features while reducing irrelevant ones. By directing attention towards specific regions or time steps, models can effectively capture the most significant dependencies for OD demand prediction.

More recently, there has been a growing interest in graph convolutional neural networks (GCNs) because of their impressive capabilities in learning graph representations. By leveraging message-passing mechanisms and aggregating information from neighboring nodes, GCNs extract useful dependencies among nodes and encode complex spatial relationships. Several GCN-based prediction approaches, such as MTGNN [33], DMSTGCN [12], and STHAN [19], have emerged for spatial-temporal prediction. These models exploit the inherent graph structure of transportation networks and demonstrate promising prediction performance.

Although existing models have achieved some success in OD demand prediction, three major issues still need to be addressed:

  • Consideration of static and dynamic external factors: OD demand prediction is influenced by static external factors such as road structure and geographical features, as well as dynamic external factors like weather conditions and traffic accidents. However, existing research in this area has not adequately considered these factors and their simultaneous integration.

  • Effective modeling of spatial-temporal dynamics: Current OD demand prediction models often fail to effectively capture the inherent complex spatial-temporal dynamics of traffic systems. One of the main limitations lies in the oversimplified representation of spatial and temporal dependencies. Because the complex interactions between different regions are overlooked, these models fail to adapt to the dynamic nature of traffic patterns.

  • Complexity of temporal information and patterns: For OD demand prediction, static flow and dynamic flow represent different temporal information and patterns observed in the movement of people or vehicles between specific OD pairs. However, previous methods face challenges in modeling both types of flows, primarily due to the complexity involved in capturing the overall demand patterns.

Based on the above analysis, we propose ST-MEN, a spatial-temporal memory enhanced multi-level attention network to improve the accuracy of OD demand prediction. For issue 1, ST-MEN incorporates both static and dynamic external factors as attributes of road segments. These attributes are integrated into the model using a low-complexity external attention mechanism, which helps the model learn and focus on the relevant features associated with these factors. For issue 2, ST-MEN develops a dynamic spatial feature extraction module, which effectively captures the evolving dependencies among nodes while preserving important OD information. This module enables ST-MEN to flexibly adjust to changing demand patterns and more accurately capture the spatial-temporal dynamics. For issue 3, a continuous temporal feature extraction module is devised to capture multiple temporal patterns, including both discrete-time and continuous-time information, leveraging an enhanced memory network. In addition, we design an enhanced version of ST-MEN with multi-scale feature cascade fusion to further improve model performance. The contributions of this work are summarized as follows:

  • This paper proposes a novel OD demand prediction model ST-MEN to integrate both static and dynamic external factors within a spatial-temporal framework.

  • A dynamic spatial feature extraction module is developed to effectively capture the dynamic changes among nodes in the traffic network.

  • Complex temporal information and patterns are learned through the application of an enhanced memory network.

  • Multi-scale features cascade fusion is incorporated to enhance the performance of the proposed model.

  • Experiments conducted on real-world datasets demonstrate that ST-MEN outperforms the existing models.

The rest of this paper is structured as follows: “Related work” discusses some related works, “Preliminary” outlines the problem definition, “Methodology” details the proposed model ST-MEN, “Experiments” provides extensive experimental evaluations, and “Conclusion” concludes the paper.

Related work

As ride-hailing services gain popularity, OD demand prediction has become a hot topic among researchers. Various models have been developed to address this challenge, including classic ones like the autoregressive integrated moving average (ARIMA) [25] and support vector regression (SVR) [28]. However, these traditional models are incapable of handling complex node connections, leading to suboptimal performance in real-world scenarios. With the advancements in deep learning technology, researchers have increasingly turned to deep learning-based models. Examples include ST-ResNet [38], residual convolutional models [3], hybrid convolutional networks [34], generative adversarial networks [43], and quantum convolutional neural networks [9]. While these models have demonstrated superior performance compared to traditional ones, they are typically suitable only for standard grid data. When dealing with non-Euclidean structures such as traffic topology graphs, they often fail to produce satisfactory results.

In recent years, graph neural networks and graph attention mechanisms have been widely adopted to address non-Euclidean problems. Wang et al. [31] introduce a recurrent strategy and a global attention network to fulfill the real-time demands of autonomous driving. To predict renewable power generation, Damaševičius et al. [7] introduce an attention-based RNN method aided by decomposition techniques. Nayakanti et al. [22] develop a motion prediction framework leveraging simple but effective attention mechanisms. Bacanin et al. [2] present a forecasting model for cloud computing loads, combining decomposition techniques, an LSTM network enhanced with attention mechanisms, and a customized particle swarm optimization algorithm. Zhang et al. [39] devise a multi-head dual sparse self-attention model designed specifically for predicting remaining useful life. Predić et al. [26] introduce a decomposition-assisted attention RNN model optimized using a customized particle swarm optimization algorithm for forecasting cloud loads. To forecast the remaining useful life of aircraft engines, Zhao et al. [42] propose a multi-level integrated self-attention method.

For OD demand prediction research, some models simply treat the OD matrix as a two-dimensional image, resembling multi-resolution spatial-temporal deep learning approaches [16, 24]. While spectral and convolutional methods show improvement over traditional deep learning models, they still fail to accurately capture complex spatial-temporal correlations. With advancements in graph neural networks, many researchers have explored incorporating stations or regions as nodes in the graph [15]. GEML [32] combines GNNs and LSTM to handle spatial-temporal features. DNEAT [37] extracts temporal features at different time granularities. HMOD [41] is built on a dynamic graph representation learning model with a multi-level memory structure. STHAN [19] constructs a spatial-temporal heterogeneous graph incorporating multiple spatial and temporal relationships. GDCF [13] devises an encoder-decoder architecture and adopts a two-phase training mechanism to generate generic node representations. These methods leverage graph neural networks to better handle the spatial-temporal dependencies of non-Euclidean structured data, resulting in superior performance. However, existing models often overlook external factors and fail to model complex spatial-temporal dependencies effectively.

Preliminary

In this section, necessary definitions are provided and the problem of OD demand prediction is formalized (Fig. 1). Additionally, a comparison between classical self-attention and external attention is introduced.

Problem definition

Definition 1

(Dynamic graph) A dynamic graph is defined as \(G = ({V,E})\), where \(V=\{v_i\}_{i=1}^N\) represents the set of nodes deployed on road segments, and \(E=\{e_j\}_{j=1}^M\) represents the set of edges. Each edge \(e_j=(o_j,d_j,t_j, \tau _j, \mathbf {f_j})\) denotes a trip, where \(o_j\) and \(d_j\) signify the origin and destination of the trip, and \(t_j\) and \(\tau _j\) denote the departure time and travel time of the trip, respectively. \(\mathbf {f_j}\) involves feature information such as weather, wind speed, etc. \(G^t\) denotes the graph learned at time step t, including all trips that occurred before t.

Definition 2

(OD demand matrix) An OD demand matrix from time step \(t-\tau \) to t is denoted as \({Y^{t-\tau :t}} \in \mathbb {R}^{N\times N}\), where \(Y_{ij}\) represents the traffic flow from node \(v_i\) to \(v_j\) during time interval from \(t-\tau \) to t.

OD demand prediction problem. Given a dynamic graph \(G^t\) at time step t, the purpose of OD demand prediction is to learn a function \(\mathbf {g(.)}\) to predict the OD demand matrix from time step t to \(t+\tau \):

$$\begin{aligned} \mathcal {G}:[G^t] \xrightarrow {\mathbf {g(.)}} \hat{\textbf{Y}}^{t:t+\tau } \end{aligned}$$
(1)
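To make Definitions 1 and 2 concrete, the following minimal Python sketch aggregates a list of trips into an OD demand matrix. The names `Trip` and `od_matrix` are hypothetical, the feature vector \(\mathbf {f_j}\) is omitted, and counting a trip in the window of its departure time is an assumption made here for illustration.

```python
from collections import namedtuple

# Hypothetical container for an edge e_j = (o_j, d_j, t_j, tau_j); features f_j are omitted.
Trip = namedtuple("Trip", ["origin", "dest", "depart", "travel_time"])

def od_matrix(trips, num_nodes, start, end):
    """Count trips departing in [start, end) for every (origin, destination) pair."""
    Y = [[0] * num_nodes for _ in range(num_nodes)]
    for trip in trips:
        # Assumption: a trip belongs to the window that contains its departure time.
        if start <= trip.depart < end:
            Y[trip.origin][trip.dest] += 1
    return Y

trips = [
    Trip(0, 1, depart=5, travel_time=12),
    Trip(0, 1, depart=20, travel_time=9),
    Trip(2, 0, depart=25, travel_time=15),
    Trip(1, 2, depart=40, travel_time=7),  # falls outside the window used below
]
Y = od_matrix(trips, num_nodes=3, start=0, end=30)
print(Y)  # [[0, 2, 0], [0, 0, 0], [1, 0, 0]]
```

Here `Y[i][j]` plays the role of \(Y_{ij}\): the demand from node \(v_i\) to node \(v_j\) within the chosen interval.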
Fig. 1

The overall architecture of ST-MEN (Spatial-Temporal Memory Enhanced Multi-Level Attention Network) consists of several key components: the external feature extraction (EFE) block, the dynamic spatial feature extraction (DSFE) block, and the continuous temporal feature extraction (CTFE) block. An enhanced version called ST-MEN-F incorporates an enhanced temporal feature extraction (ETFE) component. The top part shows the overall architecture and the bottom part shows the details of each block

Fig. 2

Comparison of classical self-attention and external attention

Attention mechanism

This subsection compares the classic self-attention mechanism with the external attention mechanism [11], as illustrated in Fig. 2, to justify the choice of external attention for modeling external factors in “External feature extraction (EFE)”. In the self-attention mechanism, a feature map \(F \in \mathbb {R}^{N \times d}\) is taken as input, where N is the number of items and d is the number of feature dimensions. The input is then projected through linear transformations into three matrices: \(Q \in \mathbb {R}^{N\times d'}\), \(K \in \mathbb {R}^{N\times d'}\), and \(V \in \mathbb {R}^{N\times d}\). The self-attention mechanism can thus be represented as follows:

$$\begin{aligned} \begin{aligned}&A=(\alpha )_{i,j}={\text {softmax}}\left( Q K^T\right) \\&F_{\text{ out } }=A V \end{aligned} \end{aligned}$$
(2)

Here, \(A \in \mathbb {R}^{N\times N}\) refers to the attention matrix, \(\alpha _{i,j}\) denotes the similarity between elements i and j, and \(F_{\text {out}} \in \mathbb {R}^{N\times d}\) represents the output feature matrix. However, the high computational complexity of self-attention (\(O(dN^2)\)) poses a challenge for regular models. The core of self-attention lies in enhancing input features through linear transformation. Yet, using an \(N \times N \) self-attention matrix or an N-element self-value matrix in this transformation yields little difference. Moreover, self-attention typically only considers correlations among elements in a single dataset, while ignoring correlations among elements in different datasets. This limitation restricts its performance potential.

Methodology

In this section, we introduce the proposed spatial-temporal memory enhanced multi-level attention network ST-MEN in detail. In “External feature extraction (EFE)”, we introduce the external feature extraction (EFE) block, which models external factors for traffic data. In “Dynamic spatial feature extraction (DSFE)”, we design the dynamic spatial feature extraction (DSFE) block, which captures the dynamic spatial dependencies of traffic data. In “Enhanced temporal feature extraction (ETFE)”, we introduce the enhanced temporal feature extraction (ETFE) block, which augments temporal feature extraction at different scales to further improve prediction accuracy. In “Continuous temporal feature extraction (CTFE)”, the continuous temporal feature extraction (CTFE) block is introduced to capture the temporal dependencies of traffic data. The overall framework of the proposed ST-MEN and its variant are illustrated in “Overall architecture”. Finally, the loss function utilized in the model is presented in “Loss function”.

External feature extraction (EFE)

To comprehensively model the external factors influencing the OD demand, ST-MEN incorporates an external feature extraction module, which integrates both static and dynamic external factors. The selection and extraction of key external factors are based on the consensus reached by previous studies in traffic prediction. However, few methods consider both static and dynamic factors simultaneously in OD demand prediction. Static external factors, like road structure and geographical features [27], significantly influence travel patterns and demand. By capturing and modeling these static factors, the model gains a better understanding of their impact on OD demand. Moreover, dynamic factors, such as weather conditions and traffic incidents [40], are also considered. These factors evolve over time and affect travel demand. By integrating dynamic external factors, ST-MEN can adapt to changing conditions, resulting in more accurate and robust predictions.

Formally, EFE calculates the importance of features using the input data F and an external memory unit \(M \in \mathbb {R}^{S \times d}\). The standard form can be formulated as:

$$\begin{aligned} \begin{aligned}&A=(\alpha )_{i, j}={\text {Norm}}\left( F M^T\right) \\&F_e=A M \end{aligned} \end{aligned}$$
(3)

where \((\alpha )_{i, j}\) measures the similarity between the i-th row of the input F and the j-th row of the memory unit M. Note that M is a learnable matrix acting as the dataset memory, independent of input data. A is the attention map derived from learned dataset-level prior knowledge. It resembles self-attention mechanisms, providing both local and global information to the model and facilitating precise parameter sharing. During training, M is updated based on the similarities in A.

As depicted in Fig. 2b, EFE incorporates external attention by utilizing two separate memory units, \(M_k\in \mathbb {R}^{S\times d}\) and \(M_v\in \mathbb {R}^{S\times d}\), to adjust the computation of the initial attention. The integration of these memory units enhances the model’s capability to extract information from external features, leading to improved experimental results. This process can be formalized as:

$$\begin{aligned} \begin{aligned} A&={\text {Norm}}\left( F M_k^T\right) \\ F_e&=A M_v \end{aligned} \end{aligned}$$
(4)

The computational complexity of this process is O(dSN), where d and S are hyper-parameters. Therefore, the cost of EFE grows linearly with the number of input elements N. It can handle larger datasets and improve model performance without substantially raising computational costs.

For the normalization operation, Softmax is a prevalent technique in self-attention to ensure that the attention weights sum up to 1, i.e., \(\sum _j \alpha _{i,j}=1\). However, the matrix multiplication used to compute the attention map can be sensitive to the scaling of input data. To overcome this issue, the model adopts double normalization [10], which independently normalizes rows and columns, as represented in the following equation:

$$\begin{aligned} \begin{aligned} (\tilde{\alpha })_{i, j}&=F M_k^T \\ (\hat{\alpha })_{i, j}&=\exp \left( (\tilde{\alpha })_{i, j}\right) / \sum _k \exp \left( (\tilde{\alpha })_{k, j}\right) \\ (\alpha )_{i, j}&=(\hat{\alpha })_{i, j} / \sum _k (\hat{\alpha })_{i, k} \end{aligned} \end{aligned}$$
(5)

where \((\tilde{\alpha })_{i, j}\) represents the similarity between the i-th row of the input F and the j-th row of the memory unit \(M_k\), \((\hat{\alpha })_{i, j}\) is the result of normalizing each column with Softmax, and \(\alpha _{i, j}\) is the result after both the columns and the rows have been normalized.
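The following pure-Python sketch illustrates Eqs. 4 and 5 on toy values. It is a minimal illustration rather than the authors' implementation: the helper names are hypothetical, the memory units are fixed numbers instead of learned parameters, and the column maximum is subtracted before exponentiation purely for numerical stability (this does not change the Softmax result).

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of A (n x k) and B (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def double_normalize(scores):
    """Eq. 5: Softmax over each column, then L1-normalize each row."""
    col_softmax = []
    for col in transpose(scores):
        m = max(col)  # stability shift only
        exps = [math.exp(v - m) for v in col]
        total = sum(exps)
        col_softmax.append([e / total for e in exps])
    hat = transpose(col_softmax)
    return [[v / sum(row) for v in row] for row in hat]

def external_attention(F, M_k, M_v):
    """Eq. 4. F: N x d input; M_k, M_v: S x d memory units. Returns (A, F_e)."""
    A = double_normalize(matmul(F, transpose(M_k)))  # N x S attention map
    F_e = matmul(A, M_v)                             # N x d output features
    return A, F_e

# Toy values (illustrative only): N = 3 items, d = 2 features, S = 2 memory slots.
F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
M_k = [[0.5, -0.5], [0.2, 0.8]]
M_v = [[1.0, 2.0], [3.0, 4.0]]
A, F_e = external_attention(F, M_k, M_v)
```

Because the attention map is \(N \times S\) rather than \(N \times N\), the two matrix products cost on the order of dSN multiplications, which matches the linear complexity stated above.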

Fig. 3

Illustration of ResMLP layer

Dynamic spatial feature extraction (DSFE)

To efficiently capture time-varying spatial features, a DSFE module is devised, consisting of a linear layer, three parallel ResMLP layers, and an average pooling operation, as illustrated in Fig. 1b. Initially, the obtained feature map \(F_e \in \mathbb {R}^{N \times N}\) is taken as input and transformed into a set of \(N^2\) d-dimensional embeddings via the linear layer. These embeddings then pass through three parallel ResMLP layers, generating a set of \(N^2\) d-dimensional output embeddings. Subsequently, these output embeddings are fed into an average pooling layer to form a d-dimensional vector representing the feature map. Finally, this vector passes through a linear classifier to yield the final feature map as the output. Figure 3 depicts the details of each ResMLP layer, which uses the Affine operation instead of layer normalization. This choice avoids reliance on batch statistics and better preserves information from neighboring nodes compared to standard normalization. The Affine operation is defined as:

$$\begin{aligned} {\text {Aff}}_{\varvec{\alpha }, \varvec{\beta }}(X)=\textrm{Diag}(\varvec{\alpha })X+\varvec{\beta } \end{aligned}$$
(6)

where \(\varvec{\alpha }\) and \(\varvec{\beta }\) are learnable weight vectors used to rescale and shift input elements. For simplicity, the Affine operation, applied independently to each column of the matrix X, is represented as \(\varphi (X)\).
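As a small worked example of Eq. 6, the sketch below applies the Affine operation to a \(d \times n\) matrix stored as a list of rows. The function name `affine` and the numeric values are illustrative assumptions; in the model, the scale and shift vectors would be learned.

```python
def affine(X, alpha, beta):
    """Aff(X) = Diag(alpha) X + beta, applied to every column of X.

    X is a d x n matrix given as a list of d rows; alpha and beta have length d.
    Row r is scaled by alpha[r] and shifted by beta[r].
    """
    return [[a * x + b for x in row] for row, a, b in zip(X, alpha, beta)]

X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
out = affine(X, alpha=[2.0, 0.5], beta=[1.0, -1.0])
print(out)  # [[3.0, 5.0, 7.0], [1.0, 1.5, 2.0]]
```

Unlike layer normalization, no mean or variance of the input is computed, so the operation depends only on its own parameters.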

Overall, the ResMLP layer stacks a set of \(N^2 d\)-dimensional input features in a \(d \times N^2\) matrix X, and transforms them into a set of \(N^2 d\)-dimensional output features stored in matrix \(F_d\) through the following transformation:

$$\begin{aligned} \begin{aligned} Z&=X+\varphi ((W_1 \star \varphi (X)^{\top })^{\top }) \\ F_d&=\varphi (W_3 \star \tanh (W_2\star \varphi (\tanh (Z))))+\tanh (Z) \end{aligned} \end{aligned}$$
(7)

where symbol \(\star \) denotes the linear operation and tanh is the activation function. \(W_1\), \(W_2\) and \(W_3\) are the weight matrices of the ResMLP layer, with dimensions of \(N^2 \times N^2\), \(4 d \times d\), and \( d \times 4d\), respectively. The intermediate matrix Z, input matrix X, and output matrix \(F_d\) all have identical dimensions. Skip connections are utilized to preserve feature information from the previous layer and prevent excessive loss of information. The first sub-layer with skip connections facilitates information exchange across different feature maps, while the second sub-layer facilitates information exchange among nodes.
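The dimension bookkeeping of Eq. 7 can be checked with a short pure-Python sketch. This is an illustration under stated assumptions, not the authors' code: \(\star \) is taken to be an ordinary matrix product, each occurrence of \(\varphi \) is given its own (scale, shift) pair, and the helper names, the random weights, and the shorthand \(P = N^2\) are all hypothetical.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def tanh(A):
    return [[math.tanh(v) for v in row] for row in A]

def affine(X, alpha, beta):
    return [[a * x + b for x in row] for row, a, b in zip(X, alpha, beta)]

def resmlp_layer(X, W1, W2, W3, aff):
    """X: d x P matrix. W1: P x P, W2: 4d x d, W3: d x 4d.
    aff: four (alpha, beta) pairs, assumed to be one per phi in Eq. 7."""
    # First sub-layer: W1 acts along the P axis, followed by a skip connection.
    mixed = transpose(matmul(W1, transpose(affine(X, *aff[0]))))  # d x P
    Z = add(X, affine(mixed, *aff[1]))
    # Second sub-layer: d -> 4d -> d mapping on each column, with a skip connection.
    Zt = tanh(Z)
    hidden = tanh(matmul(W2, affine(Zt, *aff[2])))                # 4d x P
    return add(affine(matmul(W3, hidden), *aff[3]), Zt)           # d x P

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, P = 2, 4
X = rand_matrix(d, P)
aff = [([1.0] * d, [0.0] * d) for _ in range(4)]  # identity Affine for the demo
F_d = resmlp_layer(X, rand_matrix(P, P), rand_matrix(4 * d, d), rand_matrix(d, 4 * d), aff)
```

The output `F_d` has the same \(d \times P\) shape as `X`, which is what allows both skip connections to be simple element-wise additions.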

Fig. 4

Illustration of multi-scale features cascade fusion (MFCF) structure

Enhanced temporal feature extraction (ETFE)

To further improve the ability to capture multi-scale continuous and discrete temporal dependencies, we introduce the ETFE module to generate the aggregated feature map. The most important part of this module is the Multi-scale Features Cascade Fusion (MFCF) structure, whose overall architecture is illustrated in Fig. 4. Note that the bottom of Fig. 4 represents one stage of the multi-scale structure, which is composed of m such stages. The process begins by passing input feature maps of size \(N\times N\) through a stem module to extract smaller feature maps of size \(\frac{N}{8}\times \frac{N}{8}\). This stem module consists of two consecutive \(3\times 3\) convolutional layers, each with a stride of 2, followed by \(N_0\) blocks and a \(2\times 2\) convolutional layer with a stride of 2. After each \(3\times 3\) convolutional layer, a LayerNorm [1] layer and a GELU [14] unit are applied. This initial processing stage is then followed by multiple stages with identical structures, aiming to extract multi-scale features.
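The reduction from \(N\times N\) to \(\frac{N}{8}\times \frac{N}{8}\) follows from three stride-2 convolutions. The sketch below checks this with the standard convolution output-size formula. The padding values (1 for the \(3\times 3\) layers, 0 for the \(2\times 2\) layer) and the size-preserving behavior of the \(N_0\) blocks are assumptions, since the text does not specify them; the function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def stem_out(n):
    # Two 3x3 convolutions with stride 2 (padding 1 is an assumption).
    n = conv_out(n, kernel=3, stride=2, padding=1)
    n = conv_out(n, kernel=3, stride=2, padding=1)
    # The N_0 blocks are assumed to keep the spatial size unchanged.
    # Final 2x2 convolution with stride 2 (padding 0 is an assumption).
    n = conv_out(n, kernel=2, stride=2, padding=0)
    return n

print(stem_out(64))  # 8, i.e. 64 / 8
```

Each stride-2 layer halves the side length (64 to 32 to 16 to 8 in this example), giving the overall factor of 8.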

Following the stem module, cascade feature fusion stages with \(N_0\), \(N_1\), \(N_2\), and \(N_3\) blocks, respectively, are applied. Each block has a fundamental structure consisting of a token mixer, a dilated convolution, and an MLP layer, as shown in the top part of Fig. 4. The token mixer utilizes a lightweight \(7\times 7\) depth-wise convolution from ConvNeXt [20], which facilitates temporal feature interactions. Additionally, within each block, a \(7\times 7\) dilated depth-wise convolution is incorporated, along with three skip connections. This configuration effectively captures long-range dependencies among spatial features while minimizing parameter increase and computational overhead. The multi-scale features obtained from the final stage are then passed to the temporal feature generation module, where further processing and feature generation take place.
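To see why a dilated convolution captures long-range dependencies without extra parameters, the short sketch below computes the effective extent of a dilated kernel using the standard formula. The function name is hypothetical; the dilation rate of 3 is the value reported in the experimental configuration.

```python
def effective_kernel(kernel, dilation):
    """Side length covered by a dilated kernel: dilation * (kernel - 1) + 1."""
    return dilation * (kernel - 1) + 1

plain = effective_kernel(7, 1)    # ordinary 7x7 depth-wise convolution
dilated = effective_kernel(7, 3)  # 7x7 kernel with dilation rate 3
print(plain, dilated)  # 7 19
```

Both kernels hold \(7\times 7 = 49\) weights per channel, yet the dilated one spans a \(19\times 19\) window of the feature map.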

The ETFE module, with its MFCF structure, enables the extraction of multi-scale features, capturing both continuous and discrete temporal dependencies. This module incorporates three mechanisms: a multi-scale structure, cascade feature fusion, and a token mixer combined with dilated convolution. First, the multi-scale structure extracts features at different scales. By processing the input feature maps through multiple stages, each with its own set of blocks, the module captures information at various granularities. Second, with the cascade feature fusion stages consisting of multiple blocks, the module iteratively refines and combines features from different scales. This fusion mechanism helps to propagate information across scales, allowing the model to integrate and leverage temporal dependencies at different levels of abstraction. Third, the token mixer utilizes a lightweight depth-wise convolution to facilitate temporal feature interactions, while the dilated convolution, along with skip connections, helps to capture long-range dependencies among spatial features with minimal parameter increase and computational overhead.

Continuous temporal feature extraction (CTFE)

The CTFE plays a pivotal role within the ST-MEN model, focusing on capturing both continuous and discrete temporal dependencies. Illustrated in Fig. 1c, the CTFE module consists of three key components: temporal feature generation, temporal feature fusion, and temporal feature update units. It takes the aggregated feature map as input, which contains historical trips represented by edges with continuous timestamps. These timestamps hold crucial information about OD demand and serve as the foundation for node representation in the model. To effectively capture temporal dependencies, the model leverages a hierarchical memory structure inspired by [41]. In particular, the memory state maintained at each node is represented as:

$$\begin{aligned} H\in \mathbb {R}^{(D+1) \times d_H}=\left\{ {H}^0, {H}^1, \ldots , {H}^{\textrm{D}}\right\} \end{aligned}$$
(8)

where \({H}^0\) represents the memory of continuous time, while \({H}^{{d}}(d \ge 1)\) represents the memory of macro-discrete time that spans \(\Delta {T}_{{d}}\) time units. CTFE treats time as a continuous feature, updating it when an edge appears. Additionally, it maintains multiple discrete time features to display the state of each node. Next, the details of each module of CTFE will be elaborated upon.

Feature generation

The goal of CTFE is to ensure that node states are updated efficiently and timely. When memory updates occur, a new edge is created for both continuous time memory and macro-discrete time memory. At the end of each time unit, update information is triggered to compute these edges and update the node’s memory. For continuous time memory, this information includes the node state since the last update, a specially encoded time interval, and aggregated information from neighbor nodes. Meanwhile, for macro-discrete time memory, the information includes the most recent node state update, the OD matrix within specific time intervals, and aggregated information from neighboring nodes. For a specific node i, the feature generation is formulated as:

$$\begin{aligned} \begin{aligned}&\textbf{M}^{\prime }=\left\{ \begin{array}{ll} \left[ \textbf{H}^d\left\| \Theta \left( \textbf{H}^{\textrm{d}}, \mathcal {G}_{\textrm{t}}\right) \right\| \Omega _{\textrm{d}}\left( \textbf{H}^{\textrm{d}}, \mathcal {G}_{\textrm{t}}\right) \right] , &{} d=0 \\ \left[ \textbf{H}^d\left\| \textbf{Y}_{\textrm{t}-2^{\textrm{d}-1} \Delta \textrm{T}, \textrm{t}}^{\textrm{i}}\right\| \Omega _{\textrm{d}}\left( \textbf{H}^{\textrm{d}}, \mathcal {G}_{\textrm{t}}\right) \right] , &{} d \ne 0 \end{array}\right. \\&\textbf{M}_d^0=\textbf{W}^{m_1} \sigma \left( \textbf{W}^{m_2} \textbf{M}^{\prime }+\textbf{b}^{m_2}\right) +\textbf{b}^{m_1} \end{aligned} \end{aligned}$$
(9)

where \(\textbf{M}_d^0 \in \mathbb {R}^{d_M}\) describes the above information and \(\odot \) indicates element-wise product. \(\textbf{t}_i^{-}\) represents the update time of node i in the previous iteration, while \(\textbf{Y}_{\textrm{t}-2^{\textrm{d}-1} \Delta \textrm{T}, \textrm{t}}^{\textrm{i}}\) denotes the i-th row of the OD matrix. \(\textbf{W}^*\) and \(\textbf{b}^*\) are learnable parameters. \(\sigma \) represents the activation function.

Feature fusion

Previous message functions have typically aggregated information from specific tracks, such as continuous information at a particular granularity or discrete information. However, continuous messages may overlook certain global spatial-temporal patterns, while fine-grained messages may miss out on the spatial-temporal information present in discrete messages. To tackle this issue, a temporal feature fusion mechanism is designed to combine specific track information with other data, generating diverse forms of spatial-temporal information. The latest information stored in the corresponding storage memory can be represented as:

$$\begin{aligned} M \in \mathbb {R}^{(L+1) \times d_M}=\left\{ M^0 , M^1 ,\ldots , M^L\right\} \end{aligned}$$
(10)

where \({M}^0\) represents the output of the information function. When \(\textbf{M}^l\) is calculated, information from other time granularities is initially aggregated and then concatenated with \({M}^{l-1}\). The main objective of this operation is to integrate information from different time granularities while maintaining the corresponding time granularity details as much as possible. Next, the concatenation result is passed through a fully connected layer.

Feature update

The temporal feature update unit aims to update the current information while preserving historical information as much as possible. Inspired by [41], GRU [6] is employed to update the memory state of nodes. This process is formulated as:

$$\begin{aligned} {H}^d={\text {GRU}}\left( {H}^d, {M}_d^L\right) \end{aligned}$$
(11)

where \({M}_d^L\) denotes the message of the d-th memory at the L-th layer. As the message level increases, multi-scale temporal features are learned.
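For readers unfamiliar with the GRU update in Eq. 11, the following pure-Python sketch performs one step with the memory \({H}^d\) as the hidden state and the message \({M}_d^L\) as the input. It follows the common GRU formulation with update and reset gates; the parameter layout, the function names, and the all-zero toy parameters are assumptions for illustration and are not taken from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_update(h, m, params):
    """One GRU step. h: memory H^d (length d_H); m: message M_d^L (length d_M).

    params maps "update", "reset", "cand" to (W, U, b) with
    W: d_H x d_M, U: d_H x d_H, b: length d_H (hypothetical layout).
    """
    def gate(name, activation, state):
        W, U, b = params[name]
        pre = [wm + us + bias for wm, us, bias in zip(matvec(W, m), matvec(U, state), b)]
        return [activation(p) for p in pre]

    z = gate("update", sigmoid, h)               # share of the old memory that is kept
    r = gate("reset", sigmoid, h)                # share of the old memory fed to the candidate
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    cand = gate("cand", math.tanh, reset_h)      # candidate memory built from the message
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h, cand)]

def zero_params(d_H, d_M):
    return ([[0.0] * d_M for _ in range(d_H)], [[0.0] * d_H for _ in range(d_H)], [0.0] * d_H)

# Toy check: with all-zero parameters both gates equal 0.5 and the candidate is 0,
# so the memory is simply halved.
params = {name: zero_params(2, 3) for name in ("update", "reset", "cand")}
new_h = gru_update([2.0, -4.0], [1.0, 1.0, 1.0], params)
print(new_h)  # [1.0, -2.0]
```

The update gate `z` is what lets the unit retain historical information: the closer it is to 1, the more of the previous memory survives the update.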

Overall architecture

Applying a pyramid framework, we design the spatial-temporal memory enhanced multi-level attention network ST-MEN. In particular, given the topology graph of the road network \(G^t\) at a certain time t, ST-MEN first employs the external attention mechanism to model the external factors, such as weather and events, and obtains the feature representation \(F_e\). Then, in the DSFE module, a parallel residual MLP group followed by average pooling is utilized to extract deep dynamic spatial features \(F_d\), enhancing the feature learning of neighboring nodes within one or two hops. In the CTFE module, both the continuous and discrete temporal features \(F_c\) are captured and updated. Finally, the prediction results are obtained by generating the predicted OD demand matrix.

Based on the above framework, we further design an enhanced version of the proposed ST-MEN (i.e., ST-MEN-F), as shown in Fig. 1, which introduces an Enhanced Temporal Feature Extraction (ETFE) block to further augment the representation power of the CTFE block. Comparison results of the two frameworks are given in Table 2.

Loss function

The OD demand prediction task aims to minimize the difference between the predicted values and the ground truth. For this purpose, the \(\mathcal {L}\) loss function is employed as the objective function:

$$ \begin{aligned} \begin{aligned} \mathcal {L}=\frac{1}{|\textbf{Y}|} \sum _{y \in \textbf{Y}}\left( I(y, \hat{y})(y-\hat{y})^2\right) \\ I(y, \hat{y})=\left\{ \begin{array}{l} 0, y=0 \& \hat{y} \le 0 \\ 1, \text{ else } \end{array}\right. \end{aligned} \end{aligned}$$
(12)

where \(\hat{y}\) denotes the predicted value, while y represents the corresponding ground truth. Given the frequent occurrence of zeros in the OD matrix, predicting them as zero or negative numbers is deemed acceptable in real-world scenarios, so the indicator \(I(y, \hat{y})\) excludes such entries from the loss. However, it is crucial to differentiate between heavily negative values and slightly negative ones, as they may convey distinct implications. Directly training on these values could potentially result in inferior performance.
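A direct transcription of Eq. 12 into plain Python is sketched below for two small matrices. The function name `masked_mse` and the toy values are illustrative; the normalizer \(|\textbf{Y}|\) is taken to be the total number of entries, as written in the equation.

```python
def masked_mse(y_true, y_pred):
    """Eq. 12: squared error averaged over all entries, where entries with
    ground truth 0 and a non-positive prediction contribute nothing."""
    total, count = 0.0, 0
    for row_t, row_p in zip(y_true, y_pred):
        for y, y_hat in zip(row_t, row_p):
            count += 1
            if y == 0 and y_hat <= 0:
                continue  # indicator I(y, y_hat) = 0
            total += (y - y_hat) ** 2
    return total / count

Y = [[0.0, 2.0], [0.0, 1.0]]
Y_hat = [[-0.5, 1.0], [0.5, 1.0]]
loss = masked_mse(Y, Y_hat)
print(loss)  # (0 + 1 + 0.25 + 0) / 4 = 0.3125
```

In this example the entry with \(y=0\) and \(\hat{y}=-0.5\) is ignored, whereas the entry with \(y=0\) and \(\hat{y}=0.5\) is still penalized.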

Experiments

In this section, we first present the relevant datasets and experimental settings. Next, extensive experiments are conducted to verify the superiority of the proposed model.

Datasets

Experiments are conducted on two popular real-world datasets: New York Taxi and Cheng Du Taxi. The statistical data of the two datasets are summarized in Table 1.

  • New York Taxi: This dataset contains taxi traffic data in New York City from January to June 2019, covering 63 regions and 38,498,427 time slots. We select the first 139 days as the training set, the next 21 days as the validation set, and the following 21 days as the test set.

  • Cheng Du Taxi: This dataset contains the taxi trip records in Chengdu, China, spanning from August 3rd, 2014, to August 30th, 2014, covering 79 regions and 3,636,845 records. We allocate 19 days for the training set, 4 days for the validation set, and another 4 days for the testing set.
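The day-based splits described above can be sketched as a simple chronological partition. This is an illustrative sketch, not the authors' preprocessing code: the helper name `chronological_split` is hypothetical, and the New York calendar is assumed to run from January 1 to June 30, 2019 (181 days).

```python
from datetime import date, timedelta


def chronological_split(days, n_train, n_val, n_test):
    """Split an ordered list of days into consecutive train/val/test parts."""
    if n_train + n_val + n_test > len(days):
        raise ValueError("split sizes exceed the number of available days")
    train = days[:n_train]
    val = days[n_train:n_train + n_val]
    test = days[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


# Assumed calendar for the New York Taxi dataset: 1 January to 30 June 2019.
nyc_days = [date(2019, 1, 1) + timedelta(days=i) for i in range(181)]
train, val, test = chronological_split(nyc_days, 139, 21, 21)

print("train:", train[0], "to", train[-1])
print("val:  ", val[0], "to", val[-1])
print("test: ", test[0], "to", test[-1])
```

The same helper applies to the Cheng Du Taxi dataset by passing the split sizes 19, 4, and 4.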

Table 1 The details of the datasets

Experimental settings

Experimental configurations

The proposed model aims to utilize historical trip data to predict the OD demand matrix for the next time interval \(\tau \in \{30, 60, 90\}\) minutes. During training, we employ the Adam optimizer with a learning rate of \(1 \times 10^{-5}\) for both datasets. The model is trained for up to 500 epochs, with early stopping activated and a patience of 20. The layer depth L of the model is fixed at 2. Memory dimension \(d_H\) and message dimension \(d_M\) are set to 128. For the MFCF structure, we uniformly set the dilation rate r of the dilated convolution to 3 across all blocks. The total number of MFCF stages m is 4. All stages keep the same channel dimension (\(S = 128\)) while using different numbers of blocks (\(N_0=3,N_1=2,N_2=7,N_3=2\)). The model is implemented using the PyTorch framework and experiments are conducted on an NVIDIA GeForce RTX 3070 GPU.
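The interaction between the 500-epoch budget and the patience of 20 can be sketched as follows. This is a minimal illustration under stated assumptions: the monitored quantity is taken to be the validation loss (the paper does not specify it), the class `EarlyStopping` and function `run_training` are hypothetical names, and the loss curve is synthetic.

```python
class EarlyStopping:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def run_training(val_losses, max_epochs=500, patience=20):
    """Replay a sequence of validation losses and report where training ends."""
    stopper = EarlyStopping(patience)
    last_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if stopper.step(loss):
            break
    return last_epoch, stopper.best


# Synthetic curve (made up): improves for 30 epochs, then plateaus.
synthetic_losses = [1.0 / e for e in range(1, 31)] + [0.5] * 100
stopped_epoch, best = run_training(synthetic_losses)
print(f"stopped at epoch {stopped_epoch}, best validation loss {best:.4f}")
```

With this synthetic curve the last improvement occurs at epoch 30, so training halts 20 epochs later, well before the 500-epoch limit.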

Evaluation metrics

The performance of different methods is evaluated with the following metrics: mean absolute error (MAE) and root mean squared error (RMSE), defined as:

$$\begin{aligned} {\text {MAE}}= & {} \frac{1}{N} \sum _{i=1}^N|y_i-\hat{y}_i| \end{aligned}$$
(13)
$$\begin{aligned} {\text {RMSE}}= & {} \sqrt{\frac{1}{N} \sum _{i=1}^N(y_i-\hat{y}_i)^2} \end{aligned}$$
(14)

Here, N represents the batch size, and \(y_i\) and \(\hat{y}_i\) are the ground-truth and predicted values in a mini-batch, respectively.
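Eqs. (13) and (14) translate directly into code. The following sketch uses plain Python lists and toy values chosen for illustration; the function names `mae` and `rmse` are not taken from the paper's implementation.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, Eq. (13)."""
    n = len(y_true)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n


def rmse(y_true, y_pred):
    """Root mean squared error, Eq. (14)."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


# Toy values (hypothetical): absolute errors are 1, 0, 2, 0.
y_true = [0.0, 2.0, 3.0, 5.0]
y_pred = [1.0, 2.0, 1.0, 5.0]
mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(f"MAE = {mae_value:.4f}, RMSE = {rmse_value:.4f}")
```

Because RMSE squares each error before averaging, the single error of 2 dominates it, which is why RMSE is never smaller than MAE on the same predictions.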

Baselines

This paper compares the ST-MEN model with traditional statistics-based models as well as recently proposed deep neural network models.

  • HA: This conventional model takes the historical average of the OD demand matrix as the predicted value.

  • Linear Regression (LR): This model is a classic regression model that aims to identify the linear correlation between the input and output. The input of the model consists of the historical OD demand of a single OD pair in the latest four consecutive time slots.

  • XGBoost [5]: This model employs gradient-boosted trees to learn from past patterns, and its input is similar to that of linear regression (LR).

  • GEML [32]: This is an OD demand prediction model that utilizes snapshots and pre-defined neighborhoods. The geographical neighborhood in GEML is defined based on distance.

  • DNEAT [37]: This model is a snapshot-based OD demand prediction model that incorporates node-edge attention. Similar to GEML, DNEAT also defines neighborhood based on geographical distance.

  • TGN [29]: This model is a continuous-time dynamic graph representation learning framework.

  • HMOD [41]: This model is designed on the basis of a dynamic graph representation learning model, with a multi-level memory structure. It is the inspiration for this research.

  • STHAN [19]: This model constructs a spatio-temporal heterogeneous graph incorporating multiple spatial and temporal relationships, and utilizes meta-paths to characterize complex spatial relationships. To capture this heterogeneity, the model employs hierarchical attention, comprising node-level attention and meta-path-level attention.

  • COOL [17]: To capture a range of long-term transitional patterns, this model employs a unified self-attention decoder that combines sequential representations through multi-rank and multi-scale attention branches.
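The two simplest baselines in the list above, HA and the input construction for LR, can be sketched in a few lines. This is an illustrative sketch only: averaging over all supplied past time slots is an assumption (the paper does not state which history is averaged), and the helper names `historical_average` and `lag_features` are hypothetical.

```python
def historical_average(history):
    """Element-wise mean of past OD matrices (one nested list per time slot)."""
    n = len(history)
    rows, cols = len(history[0]), len(history[0][0])
    return [
        [sum(m[i][j] for m in history) / n for j in range(cols)]
        for i in range(rows)
    ]


def lag_features(series, n_lags=4):
    """Build (input, target) pairs for one OD pair.

    Each input holds the demand in the latest `n_lags` consecutive
    time slots and the target is the demand in the next slot.
    """
    return [
        (series[t - n_lags:t], series[t])
        for t in range(n_lags, len(series))
    ]


# Toy history (made up): three past 2x2 OD matrices.
history = [
    [[0, 2], [4, 0]],
    [[2, 2], [0, 0]],
    [[1, 5], [2, 0]],
]
ha = historical_average(history)
print("HA prediction:", ha)

# Toy demand series (made up) for a single OD pair.
pairs = lag_features([1, 2, 3, 4, 5, 6], n_lags=4)
print("LR samples:", pairs)
```

The pairs produced by `lag_features` are the kind of samples a linear regressor or gradient-boosted tree model would be fitted on, one model input per OD pair and time slot.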

Table 2 Performance comparison of different approaches on two real-world datasets

Main results and analysis

Table 2 presents the experimental results of different models tested on two real-world datasets. Evaluation metrics include MAE and RMSE, with time intervals \(\tau \in \{30, 60, 90\}\) min. The best results are highlighted in bold, while the second-best results are underlined. Compared to conventional statistical models and machine learning models (e.g., HA, LR, and XGBoost), deep learning-based approaches demonstrate superior performance due to their capacity to capture non-linear relationships and adapt to large-scale data.

The proposed models, ST-MEN and its enhanced version ST-MEN-F, achieve the best performance among deep learning-based approaches across the prediction tasks. Specifically, for 30-minute short-term prediction on the New York Taxi dataset, ST-MEN achieves MAE and RMSE scores of 0.6540 and 1.4244, while ST-MEN-F scores 0.5923 and 1.3172, respectively. Compared to the best baseline model HMOD, which records higher MAE and RMSE values of 0.7317 and 1.5960, ST-MEN-F shows an improvement of roughly 19% in MAE and 17% in RMSE. Similarly, for 60-minute mid-term prediction, ST-MEN and ST-MEN-F outperform HMOD, with a notable improvement of nearly 13% in both metrics. The trend continues for 90-minute long-term prediction, where ST-MEN and ST-MEN-F demonstrate enhancements of nearly 10% and 3% in MAE and RMSE metrics compared to HMOD.
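The relative improvements for the 30-minute New York Taxi setting can be checked from the scores quoted above. The sketch below assumes the usual definition of relative improvement, (baseline − proposed) / baseline; the paper does not state its formula explicitly, and the function name is hypothetical.

```python
def relative_improvement(baseline, proposed):
    """Percentage reduction of an error metric relative to the baseline."""
    return (baseline - proposed) / baseline * 100.0


# 30-minute New York Taxi scores as quoted in the text.
hmod = {"MAE": 0.7317, "RMSE": 1.5960}
st_men = {"MAE": 0.6540, "RMSE": 1.4244}
st_men_f = {"MAE": 0.5923, "RMSE": 1.3172}

gains = {
    name: {m: relative_improvement(hmod[m], scores[m]) for m in ("MAE", "RMSE")}
    for name, scores in (("ST-MEN", st_men), ("ST-MEN-F", st_men_f))
}

for name, per_metric in gains.items():
    for metric, value in per_metric.items():
        print(f"{name} vs HMOD, {metric}: {value:.2f}% lower")
```

Under this definition, ST-MEN-F reduces MAE by about 19.1% and RMSE by about 17.5% relative to HMOD, while ST-MEN reduces both by a little under 11%.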

The superior performance of ST-MEN and ST-MEN-F is not limited to the New York Taxi dataset but also extends to the Cheng Du Taxi dataset, where they outperform the baseline models. These results highlight the effectiveness of the proposed models in predicting taxi OD demand over various prediction horizons. Their consistent outperformance of the baseline models demonstrates their potential for accurate and reliable short-term, mid-term, and long-term OD demand forecasting. These improvements arise from the effective modeling of external influencing factors and the successful capture of spatial-temporal dynamics. Therefore, compared to other deep learning models, ST-MEN and ST-MEN-F are more sensitive to external factors and spatial-temporal dependencies, which improves the accuracy of OD demand prediction.

Fig. 5 MAE for ablation study

Fig. 6 RMSE for ablation study

Table 3 Ablation study for various variants across two datasets

Fig. 7 The effect of layer depth on the New York taxi dataset for 30/60/90 min

Fig. 8 The effect of layer depth on the Cheng Du taxi dataset for 30/60/90 min

Ablation study

To verify the effect of several key components on the proposed model, this paper conducts ablation studies on two datasets. The variants of ST-MEN are named as follows:

  • w/o EFE: It removes the EFE block from the basic ST-MEN framework.

  • w/o DSFE: It removes the DSFE block from the basic ST-MEN framework.

  • ST-MEN-F: It adds the ETFE block based on the basic ST-MEN framework.

Table 3 and Figs. 5 and 6 report the MAE and RMSE results of ST-MEN, ST-MEN-F, and the variants on the two real-world datasets. The experimental findings clearly indicate the effectiveness of the EFE block (w/o EFE) in enhancing the model’s performance by integrating both static and dynamic external factors. For instance, in the case of the Cheng Du Taxi dataset with a time interval of 30 min, the MAE and RMSE values decrease from 1.438 and 2.846 to 1.337 and 2.495, respectively. This integration contributes to the model’s improved robustness and stability when faced with various external influences.

Furthermore, the DSFE block (w/o DSFE) emerges as the most influential component in enhancing the model’s performance. For instance, in the case of the New York Taxi dataset with a time interval of 60 min, the MAE and RMSE values decrease from 0.778 and 1.805 to 0.677 and 1.653, respectively. This reduction demonstrates the efficacy of the proposed DSFE block in extracting and learning dynamic spatial features among nodes, thereby improving the overall prediction accuracy.

When both of these key modules are included simultaneously, the model consistently achieves the best results across all time intervals. This can be attributed to the model’s effective integration of external factors in combination with its ability to capture spatial-temporal dynamics. By considering both static and dynamic external factors and leveraging the spatial-temporal relationships, the model provides more accurate and reliable predictions. Moreover, it is observed that adding the ETFE block can further improve the performance of the framework. This is because the multi-scale features, cascade feature fusion, token mixer, and dilated convolution collectively enhance the model’s ability to understand and leverage temporal dependencies across different scales and time intervals. This leads to improved performance in capturing and modeling complex temporal patterns, resulting in more accurate predictions.

Parameter sensitivity

Figures 7 and 8 present the impact of layer depth L on different datasets. In the case of the New York Taxi dataset with a time interval of \(\tau =30\) minutes, the performance of ST-MEN is evaluated. For \(L=1\), ST-MEN achieves an MAE of 0.722 and an RMSE of 1.648. However, as the layer depth L is increased to 2, the performance of ST-MEN improves significantly, resulting in an MAE of 0.654 and an RMSE of 1.424. This improvement can be attributed to the deeper architecture of the model, which enhances its ability to learn representations and capture complex correlations and patterns within the data.

It is important to note that further increasing the number of layers beyond \(L=2\) does not lead to performance improvements and can even degrade the model’s performance. This phenomenon occurs due to the model’s tendency to overfit to the noise present in the training data, thereby hindering its generalization to new data.

The experimental findings on the Cheng Du Taxi dataset also exhibit a similar trend. Therefore, setting the layer depth L to 2 provides an optimal balance between model complexity and performance for both the New York Taxi and Cheng Du Taxi datasets. This choice allows ST-MEN to effectively capture the underlying patterns and correlations in the data without being overly influenced by noise, resulting in improved prediction accuracy.

Conclusion

This paper introduces a novel approach called spatial-temporal memory enhanced multi-level attention network for OD demand prediction (ST-MEN). The objective of ST-MEN is to comprehensively consider the influence of external factors on traffic conditions and enhance the prediction performance. In particular, ST-MEN integrates external feature extraction with dynamic spatial features, as well as continuous and discrete temporal feature extraction. This combination allows the model to capture a wide range of information related to external factors, spatial dependencies between nodes, and temporal dynamics. Furthermore, multi-scale feature cascade fusion is incorporated to enhance the performance of the proposed model. To validate the effectiveness of ST-MEN, extensive experiments are conducted on two real-world datasets. The results demonstrate that the proposed model outperforms existing methods in terms of OD demand prediction performance. The comprehensive consideration of external factors and the integration of various feature extraction techniques contribute to the success of ST-MEN in accurately predicting OD demand.

There are also some limitations in the proposed model, which will be the focus of future work to further enhance the capabilities and applicability of the model. Firstly, applying the model to other spatial-temporal forecasting tasks, such as arrival time estimation, needs further investigation. By leveraging the strengths of ST-MEN, such as its attribute-augmented spatial-temporal framework and dynamic feature extraction, we will investigate whether this model is applicable to other forecasting problems. Secondly, broader external factors like points of interest (POI) data can provide valuable insights into the dynamics of transportation systems. Incorporating these factors into the model may help capture the complex correlations between external factors and OD demand, leading to more accurate predictions.