1 Introduction

Traffic forecasting plays a critical role in intelligent transportation systems, helping transportation departments better manage and control traffic congestion. Generally represented as geo-located Multivariate Time Series (MTS), traffic data not only shows the typical characteristics of MTS, i.e., temporal dependency (Zuo et al. 2021), but also integrates the spatial information of the traffic network, i.e., the spatial dependency between the sensor nodes over the road network.

In recent years, by leveraging the spatial-temporal patterns in traffic data, many deep learning models based on recurrent neural networks (RNN) (Li et al. 2018), temporal convolutional networks (TCN) (Wu et al. 2019), graph convolutional networks (GCN) (Li et al. 2021), etc., have been applied to traffic forecasting tasks and achieved state-of-the-art performance. They all rely on the strong assumption that the data is complete or has been well preprocessed (Yu et al. 2018). However, since traffic data is generally collected from geo-located sensors, sensor failures or communication errors result in missing values in the collected data, thus deteriorating the performance of the forecasting model. We should remark that missing measures are usually marked as zero in traffic data (Li et al. 2021) and should be distinguished from non-missing measures that genuinely take zero values. A typical example (Tian et al. 2018) comes from traffic flow data: when no vehicles are detected during the night, the traffic measures are marked as zero instead of being considered missing. This is commonly observed in real-life traffic data, for which the missing rate evolves periodically during the day (Lopez 2018).

The missing values can either be ignored in the learning model when calculating the loss function (Wang et al. 2020) or be considered before or during the training process (Cui et al. 2020b). Ignoring the missing values, especially when the missing ratio is high (Cui et al. 2020b), prevents the model from benefiting from the rich data information for better performance. When considering the missing values in traffic data, most work (Cirstea et al. 2019) conducts data imputation during the preprocessing step, then imports the completed data into the training step, i.e., two-step processing. Recent work tends to jointly consider the missing values and the forecasting modeling during the training step (i.e., one-step processing) and reports better performance than two-step processing (Che et al. 2018; Cui et al. 2020a, b; Tian et al. 2018; Tang et al. 2020). However, the above-mentioned work suffers from three major issues. First, missing and zero values are usually treated as the same, leading to unnecessary, even harmful data imputations that contradict the raw data information. Second, most of the work (Che et al. 2018; Cui et al. 2020a; Tian et al. 2018; Tang et al. 2020) considers missing values from the temporal aspect only, ignoring the rich information from the spatial perspective. Third, these methods are generally designed for basic missing-value scenarios, such as random missing values or temporal block missing values, but lack the power to handle the complex scenarios shown in Fig. 1. In the real world, the missing values in traffic data occur in both long-range (e.g., device power-off) and short-range (e.g., device errors) settings, on partial (e.g., local sensor errors) and entire transportation networks (e.g., control center errors). Therefore, a holistic approach is required for handling the various types of missing values together in complex scenarios.

Fig. 1

Missing measures of traffic speed data from METR-LA dataset (Li et al. 2018). left) long-range missing on entire network (i.e., Spatio-temporal block) and partial network (i.e., temporal block); right) short-range random missing on partial network (i.e., temporal values) and entire network (i.e., Spatio-temporal vectors)

To handle both the Spatio-temporal patterns and the complex missing-value scenarios in traffic data, we propose Graph Convolutional Networks for Traffic Forecasting with Missing Values (GCN-M). The graph neural network-based structure allows jointly modeling the Spatio-temporal patterns and the missing values in a one-step process. We construct local statistical features from spatial and temporal perspectives for handling short-range missing values. This is further enhanced by a memory module that extracts global historical features for processing long-range missing blocks. The combined local-global features allow not only distinguishing the missing measures from the inherent zero values but also enriching the traffic embeddings, thus generating dynamic traffic graphs to model the dynamic spatial interactions between traffic nodes. The missing values on partial and entire networks can then be considered from spatial and temporal perspectives.

We summarize the paper’s main contributions as follows:

  • Complex missing value modeling: We study the complex scenario where missing traffic values occur on both short & long ranges and on partial & entire transportation networks.

  • Spatio-temporal memory module: We propose a memory module that can be used by GCN-M to learn both local Spatio-temporal features and global historical patterns in traffic data for handling the complex missing values.

  • Dynamic graph modeling: We propose a dynamic graph convolution module that models the dynamic spatial interactions. The dynamic graph is characterized by the learned local-global features at each timestamp, which not only offset the missing values’ impact but also help learn the graph.

  • Joint model optimization: We jointly model the Spatio-temporal patterns and missing values in one-step processing, which allows processing missing values specifically for traffic forecasting tasks, thus bringing better model performance than two-step processing.

  • Extensive experiments on real-life data: The experiments are carried out on two real-life traffic datasets. We provide detailed evaluations against 12 baselines, which show the effectiveness of GCN-M over the state of the art.

The rest of this paper starts with a review of the most related work. Then, we formulate the problem addressed in the paper. Later, we present our proposal GCN-M in detail, followed by experiments on real-life datasets and the conclusion.

2 Related works

We start with defining the notions used in the paper:

Definition 1

(One-step processing). For one-step processing models, the missing values and the traffic forecasting are jointly modeled in one single step.

Definition 2

(Two-step processing). The two-step processing models first handle the missing values in a preprocessing step, then apply a forecasting model on the completed data.

2.1 Graph convolutional networks for traffic forecasting

Graph Convolutional Network (GCN) is a special kind of Convolutional Neural Network (CNN) generalized to graph-structured data. Most GCN-related work focuses on graph representation, which learns node embeddings by integrating the features of each node's local neighbors based on the given graph structure, i.e., the adjacency matrix. Traffic data shows strong dependencies between the spatial nodes, for which GCN is naturally suitable. Various works (Li et al. 2018; Wu et al. 2020; Yu et al. 2018; Wang et al. 2020) empowered by GCN achieved remarkable performance on traffic forecasting tasks, relying on spatial and temporal completion of the data, or calculating the loss function on non-zero entries only, i.e., computing the loss only on entries that contain valid sensor readings. However, these techniques may introduce deviations when modeling the Spatio-temporal relations between the sensor nodes. In other words, since non-missing measures are required to characterize the dynamic graph at each timestamp, missing values may hinder traffic graph learning (Li et al. 2021), especially dynamic graph learning (Guo et al. 2021).

2.2 Missing value processing

The simplest solution for processing missing values in MTS is data imputation, such as statistical imputation (e.g., mean, median), EM-based imputation (García-Laencina et al. 2010), K-nearest neighborhood (Batista et al. 2002), and matrix factorization (Dong et al. 2022). It is generally believed that those methods fail to model the temporal dynamics of a time series (Tang et al. 2020); in other words, they cannot handle long-range missing values. Recent generative models (Yoon et al. 2019; Dong et al. 2022) show reliable performance for long-range time series imputation. However, isolating the imputation model from the forecasting model leads to two-step processing and may generate sub-optimal results (Cirstea et al. 2019; Wells et al. 2013; Che et al. 2018). To handle this issue, recent studies (Che et al. 2018; Tang et al. 2020; Wang et al. 2021; Zhong et al. 2021) jointly model the missing values and the forecasting task in one-step processing. For instance, GRU-D (Che et al. 2018) considers nearby temporal statistical features to do imputation inside GRUs, whereas LSTM-I (Cui et al. 2020a) infers missing values at the current time step from preceding LSTM cell states and hidden states, and SGMN (Cui et al. 2020b) improves the state transition process via a graph Markov process. As those methods are limited to short-range missing contexts, they are further enhanced by LGnet (Tang et al. 2020) with global temporal dynamics to handle the long-range missing issue, and by LSTM-M (Tian et al. 2018) with multi-scale modeling to better explore historical information. However, the above-mentioned models handle missing values by focusing on the temporal aspect without considering the complex Spatio-temporal features of traffic data. Specifically, the strong spatial connections between the sensor nodes should provide more information to handle the missing values. Moreover, one-step processing models are generally designed for single-step forecasting without considering multi-step settings. Table 1 compares the methods for traffic forecasting with missing values.

Table 1 Existing methods for traffic forecasting with missing values

3 Problem formulation

We aim to predict future traffic data by leveraging historical traffic data. Traffic data can be represented as a multivariate time series on a traffic network. Let the traffic network be \(\mathcal {G=\{V, E\}}\), where \({\mathcal {V}}=\{v_1, \ldots , v_N\}\) is a set of N traffic sensor nodes and \({\mathcal {E}}=\{e_1, \ldots , e_E\}\) is a set of E edges connecting the nodes. Each node contains F features representing traffic flow, speed, occupancy, etc. We use \({\mathcal {X}}\)=\(\{\textbf{X}_t\}_{t=1}^{\tau }\in {\mathcal {R}}^{N \times F \times \tau }\) to denote all the feature values of all the nodes over \(\tau \) time slices, and \(\textbf{X}_{t}=({\textbf {x}}_{t}^{1},\ldots ,{\textbf {x}}_{t}^{N}) \in {\mathcal {R}}^{N \times F}\) to denote the observations at time t, where \({\textbf {x}}_t^{i}\in {\mathcal {R}}^{F}\) is the i-th variable of \(\textbf{X}_t\). We define a mask sequence \({\mathcal {M}}\)=\(\{\textbf{M}_t\}_{t=1}^{\tau }\in {\mathcal {R}}^{N \times F \times \tau }\), \(\textbf{M}_{t}=({\textbf {m}}_{t}^{1},\ldots ,{\textbf {m}}_{t}^{N}) \in {\mathcal {R}}^{N \times F}\), where \({\textbf {m}}_t^{i} \in \{0,1\}^{F}\) denotes the features' missing status for the i-th variable. To simplify, we adopt \(x_t^i \in {\mathcal {R}}\) and \(m_t^i \in {\mathcal {R}}\) to denote respectively the observation and mask value of one single feature for the i-th variable of \(\textbf{X}_t\). We take \(m_t^i=0\) if \(x_t^i\) is missing, otherwise \(m_t^i=1\).

We aim to build a model f, which can take an incomplete traffic sequence {\({\mathcal {X}}\), \({\mathcal {M}}\)} and the traffic network \({\mathcal {G}}\) as input, to predict the traffic data for the next \(T_{p}\) time steps \(\textbf{Y}=\{y_{\tau +1}, \ldots , y_{\tau +T_{p}}\} \in {\mathcal {R}}^{N\times T_{p}}\).
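To make the formulation concrete, the following minimal sketch (in PyTorch, which we use for the implementation) illustrates the tensor shapes involved; the sizes, the random data, and the function name f are purely illustrative:

```python
import torch

# Illustrative sizes: N nodes, F features, tau input steps, T_p forecasting steps
N, F, tau, T_p = 207, 2, 12, 12

M = (torch.rand(N, F, tau) > 0.2).float()  # mask: 1 = observed, 0 = missing
X = torch.rand(N, F, tau) * M              # missing measures are stored as zero

# A model f takes the incomplete sequence {X, M} and the traffic network G and
# predicts the next T_p steps:  Y_hat = f(X, M, G), with Y_hat of shape (N, T_p)
```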

4 Proposal: GCN-M

Traffic data is collected under complex urban conditions. Apart from the Spatio-temporal patterns in the traffic data, we also consider the scenarios of complex missing values. We design a solution that models the local Spatio-temporal features and global historical patterns in a dynamic manner. The complex missing values are considered when building the forecasting model, i.e., one-step processing.

4.1 Model architecture

The global structure of GCN-M is shown in Fig. 2, integrating a Multi-scale Memory Network module, an Output Forecasting module, and \(\textit{l}\) Spatio-Temporal (ST) blocks. Each ST block integrates three key components: Temporal Convolution, Dynamic Graph Construction, and Dynamic Graph Convolution. The input traffic observations \({\mathcal {X}}\in {\mathcal {R}}^{N \times F \times \tau }\) and the mask sequence \({\mathcal {M}}\in {\mathcal {R}}^{N \times F \times \tau }\) are fed into the multi-scale memory network to extract the local statistical features and global historical patterns, thus enriching the traffic embeddings. On the one hand, the enriched embeddings \({\mathcal {H}}_i\) at each ST block are used to mark the dynamic traffic status, thus generating dynamic graphs by combining both static node embeddings and predefined graph information. On the other hand, the learned dynamic graphs are combined with the temporal convolution module via a dynamic graph convolution to capture the temporal and spatial dependencies in the traffic embeddings. We adopt residual connections between the input and output of each ST block to avoid the gradient vanishing problem. The output forecasting module takes skip connections on the output of the final ST block and on the hidden states after each temporal convolution for the final prediction.

Fig. 2

Main architecture of GCN-M

4.2 Multi-scale memory network

To extract the local statistical features and global historical patterns and form an enriched embedding, we adopt the concept of memory network, first proposed in (Weston et al. 2015) with a primary application to Question-Answer (QA) systems. As shown in Fig. 3, the main idea of our memory network is to learn from historical memory components which conserve the long-range multi-scale patterns, i.e., recent, daily-periodic, and weekly-periodic dependencies. The scale range depends on the data characteristics. Specifically, we first extract local Spatio-temporal features as keys to query the memory components; the weighted historical long-range patterns are then combined with the local statistical features to eliminate the side effect of the missing values. Finally, the local-global features are output as the enriched traffic embeddings.

Fig. 3

Memory module enriches traffic embeddings with multi-scale global features

4.2.1 Local spatio-temporal features

We first extract the Spatio-temporal features using the contextual information from the observed parts of the time series. Unlike prior studies (Che et al. 2018), we consider both temporal and spatial aspects when generating the following statistical features at every timestamp:

Empirical Temporal Mean: The mean of previous observations reflects the recent traffic state and serves as a contextual knowledge of \(x_{t}^{i}\). Therefore, for a missing value \(x_{t}^{i}\in {\mathcal {R}}\), we construct its temporal mean using L past samples \(x_*^{i}\) before time t:

$$\begin{aligned} \bar{x}_t^i=\sum _{l=t-L}^{t-1}m_{l}^{i}x_{l}^{i}/\sum _{l=t-L}^{t-1}m_{l}^{i} \end{aligned}$$
(1)

Last Temporal Observation: We adopt the assumption in (Che et al. 2018) that any missing value inherits, more or less, the information from the last non-missing observation. In other words, the temporal neighbor stays close to the current missing value. We use \(\dot{x}_{t}^{i}\) to denote the last temporal observation of \(x_{t}^{i}\); their temporal distance is defined as \(\dot{\delta }_{t}^{i}\).

Empirical Spatial Mean: Another contextual knowledge of \(x_{t}^{i}\) is from the nearby nodes, which reflects the current local traffic situation. For each missing value \(x_t^i\), we construct its empirical spatial mean using S nearby samples \(x_t^*\) of the sensor node i:

$$\begin{aligned} \bar{\bar{x}}_{t}^{i}=\sum _{s=1}^{S}m_{t}^{s}x_{t}^{s}/\sum _{s=1}^{S}m_{t}^{s} \end{aligned}$$
(2)

Nearest Spatial Observation: Typically, the state of a graph node remains relatively similar to that of its neighbors, especially in a traffic graph where nearby nodes share similar traffic situations. We define \(\ddot{x}_{t}^{i}\) as the nearest spatial observation of \(x_{t}^{i}\); their spatial distance is denoted as \(\ddot{\delta }_{t}^{i}\).

Generally, when \(\dot{\delta }_{t}^{i}\) or \(\ddot{\delta }_{t}^{i}\) is smaller, we tend to trust \(\dot{x}_{t}^{i}\) or \(\ddot{x}_{t}^{i}\) more. When the spatial/temporal distance becomes larger, the spatial/temporal mean would be more representative. Under this assumption, we model the temporal and spatial decay rate \(\gamma \) as

$$\begin{aligned} \gamma _{t}(\dot{\delta }_{t}^{i}) = \exp \{-\max (0,\, w^{i}\dot{\delta }_{t}^{i} + b^{i})\} \end{aligned}$$
(3)
$$\begin{aligned} \gamma _{s}(\ddot{\delta }_{t}^{i}) = \exp \{-\max (0,\, w_{t}\ddot{\delta }_{t}^{i} + b_{t})\} \end{aligned}$$
(4)

where \(w^i\), \(w_t\), \(b^i\) and \(b_t\) are model parameters that we train jointly with other parameters of the traffic forecasting model. We chose the exponentiated negative rectifier (Che et al. 2018) so that the decay rates \(\gamma _{t}\) and \(\gamma _{s}\) decrease monotonically in the range between 0 and 1. Considering the trainable decays, our proposed model incorporates the spatial/temporal estimations to define the local features of \(x_{t}^{i}\):

$$\begin{aligned} z_{t}^{i}= m_{t}^{i}x_{t}^{i} + (1-m_{t}^{i})( \gamma _{t}\dot{x}_{t}^{i} + \gamma _{s}\ddot{x}_{t}^{i} + (1-\gamma _{t})\bar{x}_{t}^{i} + (1-\gamma _{s})\bar{\bar{x}}_{t}^{i} ) \end{aligned}$$
(5)

Therefore, for \(\textbf{X}_t\in {\mathcal {R}}^{N \times F}\), we can get its local features \(\textbf{Z}_t\in {\mathcal {R}}^{N \times F}\).
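As a minimal sketch of Eqs. (1)-(5) for a single feature per node, the snippet below assumes that the last temporal observation `last_obs` with its distance `d_t`, the nearest spatial observation `near_obs` with its distance `d_s`, and the empirical spatial mean `spat_mean` of Eq. (2) are precomputed; the scalar decay parameters are shared across nodes for brevity, a simplification rather than the exact parameterization:

```python
import torch
import torch.nn as nn

class LocalFeatures(nn.Module):
    """Sketch of Eqs. (1)-(5) for one feature per node; x, m have shape (N, tau)."""
    def __init__(self):
        super().__init__()
        # decay parameters of Eqs. (3)-(4); shared scalars here for brevity
        self.w_t = nn.Parameter(torch.zeros(1)); self.b_t = nn.Parameter(torch.zeros(1))
        self.w_s = nn.Parameter(torch.zeros(1)); self.b_s = nn.Parameter(torch.zeros(1))

    def forward(self, x, m, t, L, last_obs, d_t, near_obs, d_s, spat_mean):
        # Eq. (1): empirical temporal mean over the L past samples (observed only)
        num = (m[:, t - L:t] * x[:, t - L:t]).sum(dim=1)
        den = m[:, t - L:t].sum(dim=1).clamp(min=1.0)   # guard against all-missing
        temp_mean = num / den
        # Eqs. (3)-(4): exponentiated negative rectifier, decay rates in (0, 1]
        gamma_t = torch.exp(-torch.relu(self.w_t * d_t + self.b_t))
        gamma_s = torch.exp(-torch.relu(self.w_s * d_s + self.b_s))
        # Eq. (5): keep observed values; blend the four estimates where missing
        est = (gamma_t * last_obs + gamma_s * near_obs
               + (1 - gamma_t) * temp_mean + (1 - gamma_s) * spat_mean)
        return m[:, t] * x[:, t] + (1 - m[:, t]) * est
```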

4.2.2 Multi-scale memory construction

Global historical patterns play a critical role in building an enriched traffic embedding. The historical observations at multiple scales (e.g., hourly, daily, weekly) can be embedded into memory as complementary information for the local features \(\textbf{Z}_{t}\in {\mathcal {R}}^{N\times F}\). The main idea is to use the local features to query similar historical patterns in the memory and output a weighted feature representation for the current timestamp. In this manner, the enriched multi-scale historical and local features allow not only eliminating the side effect of missing values but also improving the current feature embeddings. At time t, the query \(q_{t}\) of \(\textbf{X}_{t}\) can be embedded from the local features \(\textbf{Z}_{t}\in {\mathcal {R}}^{N\times F}\):

$$\begin{aligned} q_{t}= \textbf{Z}_{t}W_{q} + b_{q} \in {\mathcal {R}}^{N \times d} \end{aligned}$$
(6)

where \(W_{q}\in {\mathcal {R}}^{F\times d}\), \(b_{q}\in {\mathcal {R}}^{N\times d}\) are parameters, d is the embedding dimension.

The input memory components are the temporal segments of multiple scales:

  • The recent (e.g., hourly) segment is: \(X_h\) = \(\{\textbf{X}_{i}\}_{i=t-n_{h}\tau }^{t-1} \in {\mathcal {R}}^{N \times F \times n_{h}\tau }\), with \(n_{h}\) recent periods (e.g., hours) before t, each period containing \(\tau \) observations.

  • The daily-periodic segment is: \(X_d\) = \(\{\textbf{X}_i\}\in {\mathcal {R}}^{N \times F \times n_{d}\tau }\) with \(i\in \) \([t-n_{d}T_{d}-\tau /2:t-n_{d}T_{d}+\tau /2]\) \(\Vert \) \([t-(n_{d}-1)T_{d}-\tau /2:t-(n_{d}-1)T_{d}+\tau /2]\) \(\Vert \) ...\(\Vert \) \([t-T_{d}-\tau /2:t-T_{d}+\tau /2]\), we store \(\tau \) samples around time t for each of the past \(n_d\) days. \(T_d\) denotes the sample number during one day, and \(\Vert \) indicates the concatenation operation.

  • The weekly-periodic segment is: \(X_w\) = \(\{\textbf{X}_i\}\in {\mathcal {R}}^{N \times F \times n_{w}\tau }\) with \(i\in \) \([t-n_{w}T_{w}-\tau /2:t-n_{w}T_{w}+\tau /2]\) \(\Vert \) \([t-(n_{w}-1)T_{w}-\tau /2:t-(n_{w}-1)T_{w}+\tau /2]\) \(\Vert \) ...\(\Vert \) \([t-T_{w}-\tau /2:t-T_{w}+\tau /2]\), we store \(\tau \) samples around time t for each of the past \(n_w\) weeks. \(T_w\) denotes the sample number during one week, and \(\Vert \) indicates the concatenation operation.

The input set of \(\{\textbf{X}_i\}\) = \([X_h \Vert X_d \Vert X_w] \in {\mathcal {R}}^{N \times F \times (n_d+n_w+n_h)\tau }\) are embedded into the input memory vectors \(\{m_i\}\) and output memory vectors \(\{c_i\}\):

$$\begin{aligned} m_{i}= \textbf{X}_{i}W_{m} + b_{m} \in {\mathcal {R}}^{N \times d} \end{aligned}$$
(7)
$$\begin{aligned} c_{i}= \textbf{X}_{i}W_{c} + b_{c} \in {\mathcal {R}}^{N \times d} \end{aligned}$$
(8)

where \(W_{m},W_{c}\in {\mathcal {R}}^{F\times d}\), \(b_{m},b_{c}\in {\mathcal {R}}^{N\times d}\) are parameters.

In the embedding space, we compute the attention score between the query \(q_{t}\) and each memory \(m_{i}\) by taking the inner product followed by a Softmax:

$$\begin{aligned} p_{t,i}= Softmax(q_{t}^{T}m_{i}) \end{aligned}$$
(9)

The attention score represents the similarity of each historical observation to the query. A pattern with a higher attention score is more similar to the context of the targeted missing values. As shown in Fig. 3, the response vector from memory is then a sum over the output memory vectors, weighted by the attention scores from the input:

$$\begin{aligned} o_{t}= \textstyle \sum _{i=1}^{(n_d+n_w+n_h)\tau } c_{i} p_{t,i} \in {\mathcal {R}}^{N\times d} \end{aligned}$$
(10)

We can finally integrate both local Spatio-temporal and global multi-scale features and output the enriched traffic embeddings:

$$\begin{aligned} h_{t}= (q_{t} \Vert o_{t})W_{h} + b_{h} \in {\mathcal {R}}^{N\times d} \end{aligned}$$
(11)

where \(W_{h}\in {\mathcal {R}}^{2d\times d}\), \(b_{h}\in {\mathcal {R}}^{d}\) are parameters, and \(\Vert \) denotes the concatenation operation. Therefore, for input \({\mathcal {X}}=\{\textbf{X}_t\}_{t=1}^{\tau }\in {\mathcal {R}}^{N \times F \times \tau }\), we can get its enriched traffic embeddings \({\mathcal {H}}=\{h_t\}_{t=1}^{\tau }\in {\mathcal {R}}^{N \times d \times \tau }\).
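A minimal sketch of Eqs. (6)-(11) is given below, treating the concatenated multi-scale segments as a single memory tensor stacked along a time axis; the biases \(b_q\), \(b_m\), \(b_c\), \(b_h\) are folded into the nn.Linear layers:

```python
import torch
import torch.nn as nn

class MemoryModule(nn.Module):
    """Sketch of Eqs. (6)-(11): query local features against multi-scale memory."""
    def __init__(self, F, d):
        super().__init__()
        self.W_q = nn.Linear(F, d)      # Eq. (6), query embedding
        self.W_m = nn.Linear(F, d)      # Eq. (7), input memory embedding
        self.W_c = nn.Linear(F, d)      # Eq. (8), output memory embedding
        self.W_h = nn.Linear(2 * d, d)  # Eq. (11), final projection

    def forward(self, Z_t, X_mem):
        # Z_t: (N, F) local features; X_mem: (T_mem, N, F) concatenated segments
        q = self.W_q(Z_t)                                        # (N, d)
        m = self.W_m(X_mem)                                      # (T_mem, N, d)
        c = self.W_c(X_mem)                                      # (T_mem, N, d)
        # Eq. (9): per-node inner product with each memory slot, then Softmax
        p = torch.softmax((q.unsqueeze(0) * m).sum(-1), dim=0)   # (T_mem, N)
        # Eq. (10): response as the attention-weighted sum of output memories
        o = (p.unsqueeze(-1) * c).sum(dim=0)                     # (N, d)
        # Eq. (11): concatenate query and response, project to the embedding
        return self.W_h(torch.cat([q, o], dim=-1))               # (N, d)

# Usage sketch: h_t = MemoryModule(F=2, d=32)(Z_t, X_mem)
```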

4.3 Dynamic graph construction

A predefined graph is usually constructed from the distance or the connectivity between the spatial nodes. However, recent studies (Wang et al. 2020; Wu et al. 2020; Li et al. 2021) show that cross-region dependencies do exist between nodes that are not physically connected but share similar patterns. Learning dynamic graphs can thus yield better performance than learning static graphs or adopting predefined ones. Considering the missing values in traffic data, instead of using the raw traffic observations to mark the dynamic traffic status (Li et al. 2021; Han et al. 2021b), we construct dynamic graphs (i.e., adjacency matrices) with the enriched traffic embeddings \({\mathcal {H}}_i\) at each ST block, which integrate both local and global multi-scale patterns at each time step. This allows capturing the spatial relationships between traffic nodes robustly. As shown in Fig. 4, the main idea is to generate dynamic filters from the predefined graph \({\mathcal {G}}\) and the traffic embeddings \({\mathcal {H}}_i\in {\mathcal {R}}^{N \times d \times \tau _{i}}\) (\(\tau _{i}\) is the sequence length at the i-th ST block), which are applied to randomly initialized static node embeddings to construct dynamic adjacency matrices. In more detail, the core steps in Fig. 4 are as follows:

Fig. 4

Dynamic graph construction from the enriched traffic embeddings

[Dynamic Filter Generation] Given \({\mathcal {H}}_{i}=\{h_{t}\}\in {\mathcal {R}}^{N \times d \times \tau _{i}}\), the traffic embedding \(h_{t}\) at time t is first combined with the predefined adjacency matrix \(A_{{\mathcal {G}}}\in {\mathcal {R}}^{N\times N}\) to generate dynamic graph filters via a diffusion convolution layer as proposed in Li et al. (2018):

$$\begin{aligned} {\mathcal {F}}_{t}= \textstyle \sum _{k=0}^{K} P_{k} h_{t} W_{k}\in {\mathcal {R}}^{N\times d} \end{aligned}$$
(12)

where K denotes the diffusion step, \(P_{k}=\left( A_{{\mathcal {G}}}/rowsum(A_{{\mathcal {G}}})\right) ^{k}\) is the k-th power of the transition matrix (Wu et al. 2019), and \(W_{k}\in {\mathcal {R}}^{d\times d}\) is the model parameter matrix.

[Hybrid Node Embedding Construction] Considering both the source and target traffic node, we initialize two random node embeddings \(E^1,E^2 \in {\mathcal {R}}^{N\times d}\), representing the static node features (Wang et al. 2020) which are not reflected in the observations but learnable during training. Thus, two dynamic filters are applied over the static node embeddings:

$$\begin{aligned} \begin{gathered} \hat{E}_{t}^1 = tanh(\alpha ({\mathcal {F}}_{t}^{1} \odot E^1)) \in {\mathcal {R}}^{N\times d}\\ \hat{E}_{t}^2 = tanh(\alpha ({\mathcal {F}}_{t}^{2} \odot E^2)) \in {\mathcal {R}}^{N\times d} \end{gathered} \end{aligned}$$
(13)

where \(\odot \) denotes the Hadamard product (Wu et al. 2019). \(\hat{E}_{t}^1\) and \(\hat{E}_{t}^2\) are hybrid node embeddings combining both the static and dynamic settings of the traffic data.

[Graph Construction] As mentioned in a previous study (Wu et al. 2020), in multivariate time series forecasting, we expect that a change in one node's condition (e.g., traffic flow) causes a change in another node's condition, so the learned relationship is supposed to be uni-directional. We construct the graph by extracting uni-directional relationships between traffic nodes. The dynamic adjacency matrix is constructed from the hybrid embeddings:

$$\begin{aligned} A_{t} = ReLU\left( tanh(\alpha ({\hat{E}_{t}^1\hat{E}_{t}^2}^T - {\hat{E}_{t}^2\hat{E}_{t}^1}^T))\right) \in {\mathcal {R}}^{N\times N} \end{aligned}$$
(14)

Therefore, we can construct the dynamic graphs \(A_{{\mathcal {D}}_{i}}=\{A_{t}\}\in {\mathcal {R}}^{N \times N \times \tau _{i}}\) for the enriched traffic embeddings \({\mathcal {H}}_i\in {\mathcal {R}}^{N \times d \times \tau _{i}}\) at the i-th ST block. As the computation and memory cost grows quadratically with the graph size, in practice, it is possible to adopt a sampling approach (Wu et al. 2020), which only calculates pairwise relationships among a subset of nodes.
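The following sketch illustrates Eqs. (12)-(14) for one timestamp; for brevity, a single dynamic filter is shared between the two node embeddings, whereas the model generates one filter per embedding (\({\mathcal {F}}_{t}^{1}\), \({\mathcal {F}}_{t}^{2}\)), and the initialization scale is arbitrary:

```python
import torch
import torch.nn as nn

class DynamicGraph(nn.Module):
    """Sketch of Eqs. (12)-(14): dynamic adjacency from enriched embeddings."""
    def __init__(self, N, d, K=2, alpha=3.0):
        super().__init__()
        self.K, self.alpha = K, alpha
        self.W_k = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(d, d)) for _ in range(K + 1)])
        self.E1 = nn.Parameter(torch.randn(N, d))   # static source-node embedding
        self.E2 = nn.Parameter(torch.randn(N, d))   # static target-node embedding

    def forward(self, h_t, A_g):
        # Eq. (12): diffusion convolution over the predefined graph
        P = A_g / A_g.sum(dim=1, keepdim=True).clamp(min=1e-8)  # transition matrix
        F_t, P_k = 0.0, torch.eye(A_g.size(0))
        for k in range(self.K + 1):
            F_t = F_t + P_k @ h_t @ self.W_k[k]
            P_k = P @ P_k                                       # next power P^k
        # Eq. (13): hybrid node embeddings via the Hadamard product
        E1_t = torch.tanh(self.alpha * (F_t * self.E1))
        E2_t = torch.tanh(self.alpha * (F_t * self.E2))
        # Eq. (14): uni-directional dynamic adjacency at time t
        return torch.relu(torch.tanh(self.alpha * (E1_t @ E2_t.T - E2_t @ E1_t.T)))
```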

4.4 Temporal convolution module

The temporal convolution network (TCN) (Lea et al. 2017) consists of multiple dilated convolution layers, which allows extracting high-level temporal trends. Compared to RNN-based approaches, dilated causal convolution networks are capable of handling long-range sequences in a parallel manner. The output of the last layer is a representation that captures the temporal dynamics in history. As shown in Fig. 5, considering the temporal dynamics in traffic data, we adopt the temporal convolution module of (Wu et al. 2019) with a gating mechanism over the enriched traffic embeddings \({\mathcal {H}}_i\). One dilated convolution block is followed by a tangent hyperbolic activation function to output the temporal features. The other block is followed by a sigmoid activation function acting as a gate to determine the ratio of information that can pass to the next module. In particular, the sigmoid gate controls which inputs of the current states are relevant for discovering the compositional structure and dynamic variances in the time series. Applying the sigmoid nonlinearity on the input states differs from other well-known architectures (e.g., LSTM or GRU), which ignore the compositional structure features in time series (Yu et al. 2018).

Fig. 5

Temporal convolution module with dynamic graph convolution

Consider the enriched traffic embeddings \({\mathcal {H}}_i=\{h_{t}\}\in {\mathcal {R}}^{N \times d \times \tau _{i} }\) and a filter \({\mathcal {F}}\in {\mathcal {R}}^{1 \times \textrm{K}}\), where \(\textrm{K}\) is the temporal filter size (\(\textrm{K}=2\) by default). The dilated causal convolution operation of \({\mathcal {H}}_i\) with \({\mathcal {F}}\) at time t is represented as:

$$\begin{aligned} {\mathcal {H}}_{i} \star {\mathcal {F}}(t) = \textstyle \sum _{s=0}^{\textrm{K}-1}{\mathcal {F}}(s)\, {\mathcal {H}}_{i}(t-{\textbf {d}} \times s) \in {\mathcal {R}}^{N \times d\times \tau _{i+1}} \end{aligned}$$
(15)

where \(\star \) is the convolution operator, \({\textbf {d}}\) is the dilation factor, d is the embedding dimension size, and \(\tau _{i+1}\) is the new sequence length after the convolution operation, which equals one at the last layer. Figure 5 shows a three-layer dilated convolution block with \(\textrm{K}=2\) and \({\textbf {d}}\in [1,2,4]\). Considering the gating mechanism, we define the output of the temporal convolution module as:

$$\begin{aligned} {\textbf {h}}_i = tanh(W_{{\mathcal {F}}^{1}} \star {\mathcal {H}}_i) \odot \sigma (W_{{\mathcal {F}}^{2}} \star {\mathcal {H}}_i) \in {\mathcal {R}}^{N \times d \times \tau _{i+1}} \end{aligned}$$
(16)

where \(W_{{\mathcal {F}}^{1}}\), \(W_{{\mathcal {F}}^{2}}\) are learnable parameters of convolution filters, \(\odot \) denotes the element-wise multiplication operator, \(\sigma (\cdot )\) is the sigmoid function.
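A minimal sketch of the gated dilated convolution of Eqs. (15)-(16) follows; the batch dimension B and the use of Conv2d with a (1, K) kernel to leave the node axis untouched are implementation choices, not prescribed by the equations:

```python
import torch
import torch.nn as nn

class GatedTCN(nn.Module):
    """Sketch of Eqs. (15)-(16): gated dilated causal convolution on (B, d, N, tau)."""
    def __init__(self, d, kernel=2, dilation=1):
        super().__init__()
        # temporal-only convolutions: kernel (1, K) keeps the node axis untouched
        self.filt = nn.Conv2d(d, d, kernel_size=(1, kernel), dilation=(1, dilation))
        self.gate = nn.Conv2d(d, d, kernel_size=(1, kernel), dilation=(1, dilation))

    def forward(self, H):
        # no padding, so the sequence shrinks by dilation * (K - 1): causal output
        return torch.tanh(self.filt(H)) * torch.sigmoid(self.gate(H))

# Stacking GatedTCN(d=32, dilation=1) and GatedTCN(d=32, dilation=2) mirrors the
# two dilated layers with d in [1, 2] used per ST block in the experiments.
```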

A classic temporal convolution module stacks the temporal features at each time step t, so an upper layer contains richer information than a lower one. The gating mechanism filters the temporal features of the lower layers by weighting features at different time steps, but without considering the spatial node interactions at each time step. Moreover, the spatial interactions in traffic data are dynamic by nature (Wu et al. 2020). To this end, a gating mechanism from a dynamic spatial aspect is envisaged to better capture the Spatio-temporal patterns.

4.5 Dynamic graph convolution

Spatial interactions between the traffic nodes can be used to improve traffic forecasting performance. Since these interactions are dynamic, we consider a dynamic version of graph convolution, conducted on different graphs at different timestamps. Different from previous work (Li et al. 2021), which uses raw traffic observations to mark the dynamic traffic status, we adopt the enriched traffic embeddings, which account for the missing values, to generate robust dynamic graphs.

As shown in Fig. 5, we apply the dynamic graph convolution on \({\textbf {h}}_i\), i.e., the output of the temporal convolution module, to further select the features at each time step from the spatial perspective. As mentioned in Sect. 4.3, the dynamic graphs \(A_{{\mathcal {D}}_{i}}\in {\mathcal {R}}^{N \times N \times \tau _{i}}\) are generated from the enriched traffic embeddings \({\mathcal {H}}_i\in {\mathcal {R}}^{N \times d \times \tau _{i}}\) at the i-th ST block. \(A_t\) reflects the spatial relationships between nodes at time t. The temporal features \({\textbf {h}}_i(t)\) aggregate spatial information according to the adjacency matrix \(A_t\). Inspired by DCRNN (Li et al. 2018), we consider the traffic situation as the diffusion procedure on the graph. The graph convolution will generate the aggregated spatial information at each time step:

$$\begin{aligned} {\mathcal {H}}'_{i}(t) = \textstyle \sum _{k=0}^{K} \left( A_{{\mathcal {D}}_{i}}(t) \right) ^{k} {\textbf {h}}_{i}(t) W_{k}\in {\mathcal {R}}^{N\times d} \end{aligned}$$
(17)

where K denotes the diffusion step, and \(W_k\) is the learnable parameter matrix. We adopt the residual connection (He et al. 2016) between the input and output of each ST block to avoid the gradient vanishing issue in the model’s training. Therefore, the input of the \((i+1)^{th}\) ST block is defined as:

$$\begin{aligned} {\mathcal {H}}_{i+1}(t) = {\mathcal {H}}_{i}(t) + {\mathcal {H}}'_{i}(t) \end{aligned}$$
(18)
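A minimal sketch of Eqs. (17)-(18) at a single time step, with the adjacency powers computed iteratively (any normalization of \(A_t\) is omitted here):

```python
import torch

def dynamic_graph_conv(h_t, A_t, W_k):
    """Sketch of Eq. (17): diffusion over the dynamic adjacency at one time step.
    h_t: (N, d) temporal features; A_t: (N, N) dynamic graph; W_k: list of (d, d)."""
    out, A_pow = 0.0, torch.eye(A_t.size(0))
    for W in W_k:                       # k = 0 .. K
        out = out + A_pow @ h_t @ W
        A_pow = A_t @ A_pow             # next power A_t^k
    return out

# Eq. (18): residual connection between the input and output of the ST block
# H_next_t = H_i_t + dynamic_graph_conv(h_i_t, A_t, W_k)
```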

4.6 Output forecasting module

The outputs \({\textbf {h}}_{i} \in {\mathcal {R}}^{N \times d \times \tau _{i+1} }\) of the intermediate temporal convolution modules and \({\mathcal {H}}_{l}\in {\mathcal {R}}^{N \times d \times 1 }\) of the last ST block are considered for the final prediction; they represent the hidden states at various Spatio-temporal levels. We add skip connections on each of the hidden states, which are essentially \(1 \times \tau _{i+1}\) standard convolutions (\(\tau _{i+1}\) denotes the sequence length at the output of the i-th ST block). The concatenated output features are defined as follows:

$$\begin{aligned} O = ({\textbf {h}}_{0} W_{s}^{0} + b_{s}^{0})\Vert \ldots \Vert ({\textbf {h}}_{i} W_{s}^{i} + b_{s}^{i}) \Vert \ldots \Vert ({\textbf {h}}_{l-1} W_{s}^{l-1} + b_{s}^{l-1}) \Vert ({\mathcal {H}}_{l} W_{s}^{l} + b_{s}^{l}) \end{aligned}$$
(19)

where \(O \in {\mathcal {R}}^{N\times (l+1)d}\), \(W_{s}^{i}\), \(b_{s}^{i}\) are learnable parameters for the convolutions. Two fully-connected layers are added to project the concatenated features into the desired output dimension:

$$\begin{aligned} \hat{{\textbf {Y}}} = (ReLU(OW_{fc}^{1} + b_{fc}^{1})) W_{fc}^{2} + b_{fc}^{2} \in {\mathcal {R}}^{N\times T_{p}} \end{aligned}$$
(20)

where \(W_{fc}^{1}\), \(W_{fc}^{2}\), \(b_{fc}^{1}\), \(b_{fc}^{2}\) are learnable parameters for the fully-connected layers, N is the node number, \(T_p\) denotes the forecasting steps.
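A minimal sketch of Eqs. (19)-(20); the skip states are assumed to be already projected to a common length by their \(1 \times \tau _{i+1}\) convolutions, and the hidden width of the first fully-connected layer is an arbitrary choice:

```python
import torch
import torch.nn as nn

class OutputModule(nn.Module):
    """Sketch of Eqs. (19)-(20): concatenate skip states, then two FC layers."""
    def __init__(self, d, n_states, T_p, hidden=128):  # hidden width: assumption
        super().__init__()
        self.fc1 = nn.Linear(n_states * d, hidden)
        self.fc2 = nn.Linear(hidden, T_p)

    def forward(self, states):
        # states: list of l+1 tensors of shape (N, d), one per skip connection
        O = torch.cat(states, dim=-1)               # Eq. (19): (N, (l+1) d)
        return self.fc2(torch.relu(self.fc1(O)))    # Eq. (20): (N, T_p)
```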

Given the ground truth \(\textbf{Y}\in {\mathcal {R}}^{N\times T_{p}}\) and the predictions \(\hat{{\textbf {Y}}}\in {\mathcal {R}}^{N\times T_{p}}\), we use mean absolute error (MAE) as our model’s loss function for training:

$$\begin{aligned} L = \frac{1}{NT_{p}}\textstyle \sum _{n=1}^{N}\sum _{t=1}^{T_{p}} \left| \hat{{\textbf {Y}}}_{t}^{n} - {\textbf {Y}}_{t}^{n} \right| \end{aligned}$$
(21)
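In code, Eq. (21) is a plain mean absolute error; a masked variant that excludes the inherent zero values of the targets, as done at evaluation time in Sect. 5.1, is sketched alongside:

```python
import torch

def mae_loss(y_hat, y):
    # Eq. (21): mean absolute error over all nodes and forecasting steps
    return torch.abs(y_hat - y).mean()

def masked_mae(y_hat, y):
    # evaluation variant: ignore entries whose target is an inherent zero
    mask = (y != 0).float()
    return (torch.abs(y_hat - y) * mask).sum() / mask.sum().clamp(min=1.0)
```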

5 Experiments

In this section, we demonstrate the effectiveness of GCN-M on real-life traffic datasets. The experiments were designed to answer the following research questions (RQs):

RQ 1:

Performance on raw benchmark datasets: How well does our model perform on traffic datasets with a few missing values or without?

RQ 2:

Complex scenarios of missing values: How successful is our model at forecasting traffic data considering the complex missing values scenarios?

RQ 3:

Dynamic graph modeling: How does our method perform on dynamic graph modeling considering the missing values?

RQ 4:

One-step processing VS two-step processing: How will our method perform when adopting distinct missing-value processing strategies?

5.1 Experimental settings

[Datasets] We base our experiments on the public traffic datasets PEMS-BAY and METR-LA released by Li et al. (2018), which are widely used in the literature. PEMS-BAY records six months of traffic speed on 325 sensors in the California Bay Area. METR-LA records four months of traffic speed on 207 sensors on the highways of Los Angeles County. Both datasets contain some zero and/or missing values, though PEMS-BAY has been pre-processed by the domain experts of the data provider (Caltrans 2015) to interpolate most of the missing values. Following Li et al. (2018), the datasets are split with 70% for training, 10% for validation, and 20% for testing. In order to validate the model in complex scenarios of missing values, we introduce complex missing values into the datasets (see details in Sect. 5.4). In practice, the model should forecast future values from input data with missing values. Therefore, in the testing set, we mask out observations from the input sequence \({\mathcal {X}}\) (i.e., inject missing values) but maintain the complete information for the target \(\textbf{Y}\). We use the recent \(\tau \)= 12 timestamps as input to predict the next \(T_p\) timestamps. Considering that the missing values are marked as zeros, we scale the input by dividing it by the maximum speed of the training set instead of applying Z-score normalization. This avoids changing the zero values and facilitates the computation process. Table 2 shows the summary statistics of the datasets.

Table 2 Summary statistics of PEMS-BAY and METR-LA

[Evaluation metrics] The forecasting accuracy of all tested models is evaluated by three metrics: mean absolute error (MAE), root mean square error (RMSE) and mean absolute percentage error (MAPE).

$$\begin{aligned} \begin{gathered} MAE({\textbf {Y}}, \hat{{\textbf {Y}}}) = \frac{1}{NT_{p}}\textstyle \sum _{n=1}^{N}\textstyle \sum _{t=1}^{T_{p}} |\hat{{\textbf {Y}}}_{t}^{n} - {\textbf {Y}}_{t}^{n} |\\ RMSE({\textbf {Y}}, \hat{{\textbf {Y}}}) = \sqrt{\frac{1}{NT_{p}}\textstyle \sum _{n=1}^{N}\textstyle \sum _{t=1}^{T_{p}} |\hat{{\textbf {Y}}}_{t}^{n} - {\textbf {Y}}_{t}^{n} |^{2} }\\ MAPE({\textbf {Y}}, \hat{{\textbf {Y}}}) = \frac{1}{NT_{p}}\textstyle \sum _{n=1}^{N}\textstyle \sum _{t=1}^{T_{p}} |\frac{\hat{{\textbf {Y}}}_{t}^{n} - {\textbf {Y}}_{t}^{n}}{ {\textbf {Y}}_{t}^{n}} |\end{gathered} \end{aligned}$$
(22)

where N denotes the number of nodes and \(T_p\) represents the forecasting steps.

When evaluating each model’s performance on the testing set, we mask out the inherent zero values in the prediction targets when computing the metrics.

We conduct statistical tests to assess the statistical significance of the differences between the models. In order to compare the forecasters over multiple datasets (Ismail Fawaz et al. 2019), we adopted the critical difference diagrams recommended by Demšar (2006) and used the Friedman test (Friedman 1940) to reject the null hypothesis (i.e., check whether there are significant differences at all). We followed the pairwise post-hoc analysis recommended by Benavoli et al. (2016) and adapted the critical difference diagrams (Demšar 2006) with the change that all forecasters are compared with pairwise Wilcoxon signed-rank test (Wilcoxon 1992). Additionally, we formed cliques using Holm’s alpha (5%) correction (Holm 1979) rather than the post-hoc Nemenyi test originally used in Demšar (2006).
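A minimal sketch of this pairwise test procedure is given below, assuming scipy and statsmodels are available; producing the critical difference diagram itself, which additionally requires the average ranks, is omitted:

```python
from itertools import combinations
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

def pairwise_wilcoxon_holm(scores, alpha=0.05):
    """scores: dict mapping model name -> list of errors, aligned across settings.
    Returns each model pair with its Holm-corrected p-value and rejection flag."""
    pairs = list(combinations(sorted(scores), 2))
    pvals = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
    reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method='holm')
    return {pair: (bool(r), float(p)) for pair, r, p in zip(pairs, reject, p_adj)}
```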

[Execution and Parameter Settings] The proposed model is implemented in PyTorch 1.6.0 and is trained using the Adam optimizer with a learning rate of 0.001. All the models are tested on a single Tesla V100 GPU with 32 GB of memory. In the multi-scale memory module, L and S are set to 12 and 5, respectively. \(n_h\), \(n_d\), \(n_w\) are all set to 2. We apply four ST blocks, in which the Temporal Convolution module contains two dilated layers with dilation factors \({\textbf {d}} \in \left[ 1,2\right] \). The embedding dimension d is set to 32.

5.2 Baseline approaches

We only compare with the baseline models whose source code is publicly available. We follow the default parameter settings described in each paper for training each model. According to the strategy for handling missing values, the baseline models can be organized into two categories:

  1. Ignore the missing values when optimizing the model, i.e., consider missing values in the input sequence as actual zero values and mask out the missing values when computing the loss error:

    • DCRNN (Li et al. 2018): Based on the predefined graphs, DCRNN integrates GRU with dual directional diffusion convolution.

    • STGCN (Yu et al. 2018): Based on the predefined graphs, STGCN combines graph convolution into 1D convolutions.

    • Graph WaveNet (Wu et al. 2019): Graph WaveNet learns an adaptive graph and integrates diffusion graph convolutions with temporal convolutions.

    • MTGNN (Wu et al. 2020): MTGNN learns an adaptive graph and integrates mix-hop propagation layers in the graph convolution module. Moreover, it designed the dilated inception layers in temporal convolutions.

    • AGCRN (Bai et al. 2020): AGCRN learns an adaptive graph and integrates with recurrent graph convolutions with node adaptive parameter learning.

    • GTS (Shang et al. 2020): GTS learns a probabilistic graph which is combined with the recurrent graph convolutions to do traffic forecasting.

      Note: In practice, any model can ignore the missing values in its optimization process. We list here some classic models and the most recent models designed specifically for traffic forecasting.

  2. Jointly model the missing values and the forecasting task, i.e., one-step processing models:

    • GRU (Chung et al. 2014): Gated Recurrent Unit (GRU) can be considered as a basic structure for traffic forecasting.

    • GRU-I (Che et al. 2018): A variation of GRU, which infers the missing values with the predictions from previous steps.

    • GRU-D (Che et al. 2018): Based on GRU, GRU-D helps improve the prediction performance by incorporating the missing patterns, including the masking information and time intervals between missing and observed values.

    • LSTM-I (Cui et al. 2020b): Based on LSTM, LSTM-I is similar to GRU-I for inferring the missing values.

    • LSTM-M (Tian et al. 2018): Based on LSTM, LSTM-M is designed for traffic forecasting on data with short-period and long-period missing values.

    • SGMN (Cui et al. 2020b): Based on a graph Markov process, SGMN does traffic forecasting on data with random missing values by incorporating a spectral graph convolution.

5.3 RQ 1: performance on raw benchmark datasets

Recently, many traffic forecasting models (Jiang and Luo 2022) have been proposed, achieving remarkable performance on the benchmark datasets PEMS-BAY and METR-LA. Our objective is not to beat all the models in terms of forecasting accuracy, but to validate our proposal for jointly modeling missing values and forecasting. Therefore, it is essential to know how GCN-M performs in a primary setting, i.e., on the original datasets with few or no missing values. We pick three classic models (DCRNN (Li et al. 2018), STGCN (Yu et al. 2018) and Graph WaveNet (Wu et al. 2019)) and three of the most recent models (MTGNN (Wu et al. 2020), AGCRN (Bai et al. 2020) and GTS (Shang et al. 2020)), which focus on the Spatio-temporal modeling of traffic data and generally ignore the missing values when training the model. Additionally, we consider the group of works (Che et al. 2018; Cui et al. 2020b; Tian et al. 2018) which are specifically designed for modeling the missing values in the forecasting model, i.e., one-step processing models.

Table 3 Performance comparison on raw PEMS-BAY dataset
Table 4 Performance comparison on raw METR-LA dataset
Fig. 6

Critical difference diagrams showing rankings by various evaluation metrics and the statistical difference comparison of 13 forecasting models on PEMS-BAY and METR-LA datasets under multiple forecasting horizons

Tables 3 and 4 show the performance comparison on the raw PEMS-BAY and METR-LA datasets, respectively. It should be noted that the original datasets already contain missing values (0.0031% missing in PEMS-BAY, 8.11% missing in METR-LA). We train the models for single-step (horizon = 1) and multi-step (horizon = 3, 6, 12) forecasting and report the evaluation errors on each horizon step. We observe from the results that no model achieves clearly better performance than the others. However, the first group of works (e.g., DCRNN) performs better than the one-step processing models, which is not surprising as they incorporate advanced graph models (e.g., mix-hop propagation (Wu et al. 2020)) and training techniques (e.g., curriculum learning (Wu et al. 2020)) to improve the Spatio-temporal forecasting performance. Surprisingly, among the one-step processing models, GRU-D (Che et al. 2018) shows much worse performance than the others, probably because it was designed for health care applications, whose data is more stable than dynamic traffic data. LSTM-M (Tian et al. 2018) and SGMN (Cui et al. 2020b), designed for traffic forecasting with missing values, show relatively good performance on PEMS-BAY, especially for single-step forecasting. However, they do not show a clear advantage over the first group of works. The one-step processing models are generally designed for single-step forecasting; their performance gap with the first group of works becomes larger under the multi-step forecasting setting.

We present in Fig. 6 the critical difference diagrams (Demšar 2006), which show the average rankings and visualize the statistical difference between the forecasting models; a thick horizontal line shows a group (i.e., clique) of models that are not significantly different in terms of the evaluation metrics. From Fig. 6, we observe that even though GCN-M belongs to the one-step processing models, its performance remains close to the first group of works: GCN-M is not significantly different from MTGNN, GTS, and Graph WaveNet on any of the three evaluation metrics. Moreover, the advanced graph models and training techniques of recent work (Wu et al. 2020) could be adopted to further improve the performance of GCN-M.

5.4 RQ 2: complex scenarios of missing values

In this section, we demonstrate the power of GCN-M in handling complex scenarios of missing values for the purpose of traffic forecasting.

As mentioned previously in Fig. 1, several scenarios of missing values appear in real-life traffic datasets (e.g., METR-LA): short-range or long-range missing; partial or entire network missing. The results in Tables 3 and 4 did not show the superiority of GCN-M over other models on the original datasets with a low missing rate. To test the model's capability of handling complex missing values, we designed three scenarios with various missing rates (10%, 20%, and 40%) and removed observations from the datasets accordingly. We use \(\hat{x}_{i} \in {\mathcal {R}}^{n \times f \times t}\) to represent each of the observation tensors to be removed from \({\mathcal {X}} \in {\mathcal {R}}^{N \times F \times \tau }\); a local mask tensor \(\hat{m}_{i} \in {\mathcal {R}}^{n \times f \times t} \) can then be defined accordingly, annotating the locations of the missing values in the original dataset. All the local mask tensors constitute the global mask sequence \({\mathcal {M}} = \{\hat{m}_{i}\}\), which allows injecting missing values at a given missing rate. We designed the following scenarios (an illustrative injection procedure is sketched after the list):

  • Short-range missing: we randomly set \(n\in [1,\ldots ,N]\), \(f=F\), \(t=1\)

  • Long-range missing: we randomly set \(n\in [1,\ldots ,N]\), \(f=F\), \(t=\tau \)

  • Mix-range missing: we randomly set \(n\in [1,\ldots ,N]\), \(f=F\), \(t \in [1,\ldots ,\tau ]\)
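The sketch below illustrates this injection procedure; the exact sampling used in our experiments may differ in detail (e.g., how node subsets and block positions are drawn):

```python
import torch

def inject_missing(X, M, rate, scenario, tau=12, seed=0):
    """Illustrative sketch of the mask-injection protocol.
    X, M: (N, F, T) tensors; scenario: 'short' | 'long' | 'mix'."""
    g = torch.Generator().manual_seed(seed)
    N, F, T = X.shape
    target, removed = int(rate * N * T), 0
    while removed < target:
        n = int(torch.randint(1, N + 1, (1,), generator=g))   # random node subset
        nodes = torch.randperm(N, generator=g)[:n]
        if scenario == 'short':
            t_len = 1
        elif scenario == 'long':
            t_len = tau
        else:                                                 # 'mix'
            t_len = int(torch.randint(1, tau + 1, (1,), generator=g))
        t0 = int(torch.randint(0, T - t_len + 1, (1,), generator=g))
        M[nodes, :, t0:t0 + t_len] = 0    # all F features masked (f = F)
        X[nodes, :, t0:t0 + t_len] = 0    # missing measures stored as zero
        removed += n * t_len
    return X, M
```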

Table 5 Performance comparison on the incomplete PEMS-BAY dataset with various random settings on missing values
Table 6 Performance comparison on the incomplete METR-LA dataset with various random settings on missing values

In Tables 5 and 6, we show the performance comparison on the PEMS-BAY and METR-LA datasets under the various missing-value scenarios. We highlight the best results among the one-step processing models (underlined values) and among all the models (bold values). Globally, GCN-M shows the best performance under all the settings when compared with the other one-step processing models. The graph-based model SGMN (Cui et al. 2020b) performs much worse than the other one-step processing models under the long-range and mix-range missing settings, indicating that it applies only to simple missing scenarios, i.e., short-range random missing. GCN-M does not always show superiority compared with the first group of works, especially in the short-range missing scenario, where MTGNN and GTS usually show good performance. MTGNN typically performs better than GCN-M when the missing rate is low (10%), except under the mix-range missing scenario on PEMS-BAY. We can draw a conclusion from this observation: a robust Spatio-temporal forecasting model can offset the impact of the missing values to some extent, as it thoroughly exploits the information from the observed measures. GCN-M becomes the best forecasting model when the missing rate gets higher, as the missing values then become a more critical factor for the forecasting model than the Spatio-temporal pattern modeling.

Compared to the short-range missing scenario, GCN-M shows a more robust performance under the long-range and mix-range missing scenarios, where the recent temporal values and the nearby nodes' values are not always observed. The multi-scale memory block in GCN-M enriches the traffic embedding at each timestamp, thus making the model robust in these two complex scenarios. The memory block searches for the periodic global patterns in historical data and the valuable local features from nearby nodes or recent observations at each timestamp. When nearby node values are unobserved, GCN-M favors more recent observations, and vice versa. As the zero values usually show periodicity while missing values show contingency (Caltrans 2015), the memory module with the periodic historical patterns can distinguish the inherent zero values from the missing values. The current node readings combined with the historical patterns eliminate the effect of the missing values but conserve that of the zero values.

Fig. 7

Critical difference diagrams showing rankings by various evaluation metrics and the statistical difference comparison of 13 forecasting models on PEMS-BAY and METR-LA datasets with various missing rates under mix-range missing scenario

In Fig. 7, we show the critical difference diagrams (Demšar 2006) on the PEMS-BAY and METR-LA datasets under the mix-range missing scenario. A thick horizontal line shows a group of models that are not significantly different in terms of the evaluation metrics. We observe that GCN-M is significantly different from the other model groups on all the evaluation metrics, which validates the model's performance for processing missing values under complex scenarios.

Fig. 8

Parameter effects: we report the model errors of GCN-M on the METR-LA dataset considering (a-c) L observed samples before the current timestamp for constructing the Empirical Temporal Mean in Eq. (1); (d-f) S observed samples near the current node for constructing the Empirical Spatial Mean in Eq. (2)

In Fig. 8, we show the effects of the memory module's parameters L and S on the model's performance. The two parameters represent the search ranges of the local temporal and spatial features, respectively. We report the model's evaluation errors under various missing rates. From the results in Fig. 8, we observe that the model's performance degrades as the search range becomes more extensive. This can be explained by the fact that the mean over a larger spatial range, or over less recent observations, has a weaker dependency with the current timestamp, thus weakening the information enrichment of the traffic embedding. In real-life datasets, the parameters can be set to small values, for instance considering the local features of the last hour (L = 12) and the five nearest sensor nodes (S = 5).

5.5 RQ 3: dynamic graph modeling

In a dynamic traffic system, the spatial dependency can be considered as a dynamic system status which evolves over time (Han et al. 2021b). The traffic observations at each timestamp are typically adopted to characterize the dynamic traffic status and help learn the dynamic graphs (Li et al. 2021). However, due to missed observations, the traffic status at certain timestamps cannot be characterized, which affects the dynamic graph learning process.

This issue can be handled by the enriched traffic embeddings proposed in GCN-M, which consider the local statistical features and global historical patterns, thus avoiding the deviation introduced by the missing values and helping learn the dynamic graphs. To validate the performance of the learned dynamic graphs, we designed the following variants of our GCN-M model:

  • GCN-M-obs: instead of using the enriched traffic embeddings, the raw traffic observations (Li et al. 2021) are adopted to construct dynamic graphs.

  • GCN-M-adp: instead of learning dynamic graphs and applying dynamic convolution, an adaptive static graph (Wu et al. 2019) is learned to do the graph convolution.

  • GCN-M-pre: instead of learning graphs from the traffic embedding or observations, the predefined graphs (Li et al. 2018) calculated with the directed distances between traffic nodes are adopted for doing graph convolution.

  • GCN-M-com: combine both predefined and learned static graphs (Wu et al. 2019) to do the graph convolution.

We show in Table 7 the performance comparison of the model variants for spatial graph modeling. We report the model errors at multiple horizons, considering the complex scenario of mix-range missing values with a missing rate of 40% on both the PEMS-BAY and METR-LA datasets. The results in Table 7 suggest that the dynamic graphs learned from the enriched traffic embeddings perform best compared to the other variants. In contrast, the model obtains the worst performance when learning the dynamic graphs from the raw observations, mainly because the missing values hinder the graph learning process from inferring the dynamic traffic status. GCN-M-obs performs even worse than GCN-M-adp, in which the static graph is learned from the entire set of observations, eliminating the effect of local missing values.

Table 7 Performance comparison on graph module in mix-range missing value scenario with missing rate = 40%

5.6 RQ 4: one-step VS two-step processing

In this section, we show the performance comparison when adopting different missing-value processing strategies for traffic forecasting. As a one-step processing model, GCN-M jointly models the Spatio-temporal patterns and missing values for traffic forecasting. The two-step processing models handle the missing values in a preprocessing step, then apply a forecasting model to the completed data.

To compare the processing strategies fairly, we use GCN-M as a base model to test the two-step processing approaches. A preprocessing step is adopted to replace the GCN-M memory module designed to handle missing values.

We consider the following imputation methods to fill in missing values and apply GCN-M to the completed data.

  • MEAN (García-Laencina et al. 2010): This approach replaces missing values with the mean of observed measures based on the respective feature of the input sequence.

  • KNN (Batista et al. 2002): This method replaces missing values with the mean of k-nearest temporal neighbors. We linearly interpolate the missing values with k = 2, considering the previous and next non-empty values.

  • MICE (Van Buuren and Groothuis-Oudshoorn 2011): MICE is the multiple imputation method that fills the missing values from the conditional distributions by Markov chain Monte Carlo (MCMC) techniques.

We show in Table 8 the performance comparison between the one-step processing model (i.e., GCN-M) and the two-step processing models (i.e., GCN-M variants). We conducted the experiments under the complex scenario of missing values (i.e., mix-range missing) with missing rates varying from 10 to 40%. Table 8 shows that the preprocessing step, with any of the imputation techniques, leads to worse performance than the one-step setting. Even though the authors of Shleifer et al. (2019) showed that imputation techniques in the preprocessing step improve the model's performance under simple scenarios of missing values (e.g., random short-range missing), the results in Table 8 indicate that they are not applicable to complex scenarios of missing values. On the one hand, the complex missing values (e.g., over short & long ranges, on partial & entire networks) challenge the imputation task to consider both local and global Spatio-temporal patterns; on the other hand, the patterns of missing values modeled in the preprocessing step are totally isolated from the forecasting model, and thus may not be valuable for the forecasting task.

Table 8 Performance comparison between one-step and two-step processing models under mix-range missing value scenario with various missing rates

5.7 Discussions

Our approach has several advantages. First, starting from real-world data, GCN-M considers the complex scenarios of missing values in traffic data. Different from previous work (Che et al. 2018; Cui et al. 2020b; Tian et al. 2018), which considers only part of the real-life scenarios (either short-range or long-range missing, on partial or entire networks), GCN-M considers the complex mix-missing-value context covering the various real-life scenarios.

Second, GCN-M is capable of handling such complex missing-value scenarios with a multi-scale memory module, which combines local Spatio-temporal features (for short-range missing values, on partial and entire networks) and global historical patterns (for long-range missing values) to generate the enriched traffic embeddings. The embeddings allow distinguishing the inherent zero values from the missing values. In this way, GCN-M jointly models the Spatio-temporal patterns and missing values in one-step processing, which generally brings better model performance than two-step processing (Cui et al. 2020b).

Third, GCN-M allows generating reliable dynamic graphs from the enriched traffic embeddings, which opens a path for learning robust dynamic graphs under missing value settings. Moreover, the generated dynamic graphs can cooperate with various advanced graph convolution modules (Wu et al. 2020) to improve the model’s performance further.

Last but not least, even though GCN-M is designed for traffic forecasting, it is applicable to wider application domains sharing similar Spatio-temporal characteristics and missing-value scenarios, such as crowd flow forecasting (Xie et al. 2020) and weather and air pollution forecasting (Han et al. 2021a; El Hafyani et al. 2022; Abboud et al. 2021). The Spatio-temporal patterns in those data and the missing values caused by sensor issues or control center errors form research problems similar to that of this paper.

However, GCN-M does have a limitation in terms of computational efficiency. Table 9 shows the per-epoch training time comparison on the full datasets between GCN-M and the baseline models. The one-step processing baseline models are much more efficient than the other models, essentially because of their simple structure, which does not integrate costly graph convolution modules. GCN-M still performs better than DCRNN, but worse than the other forecasting models. This is mainly caused by two factors: 1) generating the enriched traffic embeddings requires a huge computation cost for the attention score's calculation in the memory module; 2) generating the dynamic graphs for graph convolution requires learning a large number of parameters, thus increasing the computation cost. Possible solutions might be to reduce the time complexity of calculating the attention with the ProbSparse Attention adopted in Zhong et al. (2021), and to apply a more efficient dynamic graph convolution, such as graph tensor decomposition (Han et al. 2021b) and node sampling when generating the graphs (Wu et al. 2020).

Table 9 Model efficiency: training time per epoch (s)

6 Conclusion

In this paper, we propose GCN-M, a graph convolutional network-based model for handling complex missing values in traffic forecasting. We studied the complex scenario where missing traffic values occur over both short & long ranges and on partial & entire transportation networks. The enriched traffic embeddings learned by a Spatio-temporal memory module allow handling the complex missing values and constructing dynamic traffic graphs that improve the model's performance. A joint model optimization considers missing values and traffic forecasting in one-step processing. We compare GCN-M with the one-step processing models, which are specifically designed for processing incomplete traffic data, and with the recent advanced traffic forecasting models. The extensive experiments on two benchmark traffic datasets with 12 baselines demonstrate that GCN-M shows a clear advantage under various scenarios of complex missing values compared to the advanced traffic forecasting models, while maintaining comparable performance on complete traffic datasets. These experiments also provide an up-to-date comparison of traffic forecasting models, with or without missing values. In future work, we will explore the aforementioned optimizations to reduce the computational cost. From a longer-term perspective, one could consider noisy data or external events that may impact the predictions.