Abstract
Traffic forecasting has attracted widespread attention recently. In reality, traffic data usually contains missing values due to sensor or communication errors. The spatiotemporal nature of traffic data brings additional challenges for processing such missing values, for which the classic techniques (e.g., data imputation) are limited: (1) on the temporal axis, the values can be randomly or consecutively missing; (2) on the spatial axis, the missing values can happen on one single sensor or on multiple sensors simultaneously. Recent models powered by Graph Neural Networks have achieved satisfying performance on traffic forecasting tasks. However, few of them are applicable to such a complex missing-value context. To this end, we propose GCNM, a Graph Convolutional Network model with the ability to handle complex missing values in the spatiotemporal context. In particular, we jointly model the missing-value processing and traffic forecasting tasks, considering both local spatiotemporal features and global historical patterns in an attention-based memory network. We also propose a dynamic graph learning module based on the learned local-global features. Experimental results on real-life datasets show the reliability of our proposed method.
1 Introduction
Traffic forecasting has played a critical role in intelligent transportation systems, which helps the transportation department better manage and control traffic congestion. Generally represented by geolocated Multivariate Time Series (MTS), traffic data not only shows the typical characteristics of MTS, i.e., temporal dependency (Zuo et al. 2021), but also integrates the spatial information of the traffic network, i.e., the spatial dependency between the sensor traffic nodes over the road network.
In recent years, by leveraging the spatial-temporal patterns in traffic data, many deep learning models based on recurrent neural networks (RNN) (Li et al. 2018), temporal convolutional networks (TCN) (Wu et al. 2019), graph convolutional networks (GCN) (Li et al. 2021), etc., have been applied to traffic forecasting tasks and achieved state-of-the-art performance. They all rely on a strong assumption that the data is complete or has been well preprocessed (Yu et al. 2018). However, since traffic data is generally collected from geolocated sensors, sensor failures or communication errors will result in missing values in the collected data, thus deteriorating the performance of the forecasting model. We should remark that missing measures are usually marked as zero in traffic data (Li et al. 2021), which should be distinguished from non-missing measures that genuinely have zero values. A typical example (Tian et al. 2018) comes from traffic flow data: when no vehicles are detected during the night, the traffic measures are marked as zero instead of being considered as missing. This can be commonly observed in real-life traffic data, for which the missing rate evolves periodically during the day (Lopez 2018).
The missing values can either be ignored in the learning model when calculating the loss function (Wang et al. 2020) or be considered before or during the training process (Cui et al. 2020b). Ignoring the missing values, especially when the missing ratio is high (Cui et al. 2020b), hinders the model from benefiting from the rich data information for better performance. When considering the missing values in traffic data, most work (Cirstea et al. 2019) conducts data imputation during the preprocessing step, then imports the completed data into the training step, i.e., two-step processing. Recent work tends to jointly consider the missing values and the forecasting modeling during the training step (i.e., one-step processing) and reported better performance than two-step processing (Che et al. 2018; Cui et al. 2020a, b; Tian et al. 2018; Tang et al. 2020). However, the above-mentioned work suffers from three major issues. First, missing and zero values are usually considered to be the same, leading to unnecessary, even harmful, data imputations that contradict the raw data information. Second, most of the work (Che et al. 2018; Cui et al. 2020a; Tian et al. 2018; Tang et al. 2020) considers missing values from the temporal aspect, ignoring the rich information from the spatial perspective. Third, they are generally designed for processing missing values in some basic scenarios, such as random missing values or temporal block missing values, but lack power for the complex scenarios shown in Fig. 1. In the real world, missing values in traffic data occur over both long ranges (e.g., device power-off) and short ranges (e.g., device errors), and on a partial (e.g., local sensor errors) or the entire transportation network (e.g., control center errors). Therefore, a holistic approach is required for handling various types of missing values together in complex scenarios.
To handle both the spatiotemporal patterns and the complex missing-value scenarios in traffic data, we propose Graph Convolutional Networks for Traffic Forecasting with Missing Values (GCNM). The graph neural network-based structure allows jointly modeling the spatiotemporal patterns and the missing values in a one-step process. We construct local statistical features from spatial and temporal perspectives for handling short-range missing values. This is further enhanced by a memory module that extracts global historical features for processing long-range missing blocks. The combined local-global features allow not only distinguishing the missing measures from the inherent zero values but also enriching the traffic embeddings, thus generating dynamic traffic graphs to model the dynamic spatial interactions between traffic nodes. Missing values on a partial or the entire network can then be considered from both spatial and temporal perspectives.
We summarize the paper’s main contributions as follows:

Complex missing value modeling: We study the complex scenario where missing traffic values occur over both short and long ranges and on both partial and entire transportation networks.

Spatiotemporal memory module: We propose a memory module that can be used by GCNM to learn both local spatiotemporal features and global historical patterns in traffic data for handling the complex missing values.

Dynamic graph modeling: We propose a dynamic graph convolution module that models the dynamic spatial interactions. The dynamic graph is characterized by the learned local-global features at each timestamp, which not only offset the missing values' impact but also help learn the graph.

Joint model optimization: We jointly model the spatiotemporal patterns and missing values in one-step processing, which allows processing missing values specifically for the traffic forecasting task, thus bringing better model performance than two-step processing.

Extensive experiments on real-life data: The experiments are carried out on two real-life traffic datasets. We provide detailed evaluations against 12 baselines, which show the effectiveness of GCNM over the state of the art.
The rest of this paper starts with a review of the most related work. Then, we formulate the problem addressed in the paper. Later, we present our proposed GCNM in detail, followed by the experiments on real-life datasets and the conclusion.
2 Related works
We start with defining the notions used in the paper:
Definition 1
(One-step processing). For one-step processing models, the missing values and the traffic forecasting are jointly modeled in one single step.
Definition 2
(Two-step processing). Two-step processing models first handle the missing values in a preprocessing step, then apply a forecasting model on the completed data.
2.1 Graph convolutional networks for traffic forecasting
Graph Convolutional Networks (GCN) are a special kind of Convolutional Neural Network (CNN) generalized to graph-structured data. Most GCN-related work focuses on graph representation, which learns node embeddings by integrating the features from a node's local neighbors based on the given graph structure, i.e., the adjacency matrix. Traffic data shows strong dependencies between the spatial nodes, for which GCN is naturally suitable. Various work (Li et al. 2018; Wu et al. 2020; Yu et al. 2018; Wang et al. 2020) empowered by GCN achieved remarkable performance on traffic forecasting tasks, relying on spatial and temporal completion of the data, or calculating the loss function only on non-zero entries, i.e., entries that contain valid sensor readings. However, these techniques may introduce deviations when modeling the spatiotemporal relations between the sensor nodes. In other words, where non-missing measures are required to characterize the dynamic graph at each timestamp, missing values may hinder traffic graph learning (Li et al. 2021), especially dynamic graph learning (Guo et al. 2021).
2.2 Missing value processing
The simplest solution for processing missing values in MTS would be data imputation, such as statistical imputation (e.g., mean, median), EM-based imputation (García-Laencina et al. 2010), K-nearest neighbors (Batista et al. 2002), and matrix factorization (Dong et al. 2022). It is generally believed that those methods fail to model the temporal dynamics in a time series (Tang et al. 2020). In other words, they are not applicable for handling long-range missing values. Recent generative models (Yoon et al. 2019; Dong et al. 2022) show reliable performance for long-range time series imputation. However, isolating the imputation model from the forecasting model leads to two-step processing, which may generate suboptimal results (Cirstea et al. 2019; Wells et al. 2013; Che et al. 2018). To handle this issue, recent studies (Che et al. 2018; Tang et al. 2020; Wang et al. 2021; Zhong et al. 2021) jointly model the missing values and the forecasting task in one-step processing. For instance, GRU-D (Che et al. 2018) considers the nearby temporal statistical features to do imputations inside GRUs, whereas LSTM-I (Cui et al. 2020a) infers missing values at the current time step from preceding LSTM cell states and hidden states, and SGMN (Cui et al. 2020b) improved the state transition process via a Graph Markov Process. Limited to short-period missing contexts, those methods are further enhanced by LGnet (Tang et al. 2020) with global temporal dynamics to handle the long-range missing issue, and by LSTM-M (Tian et al. 2018) with multiscale modeling to better explore historical information. However, the above-mentioned models handle missing values by focusing on the temporal aspect without considering the complex spatiotemporal features in traffic data. Specifically, the strong spatial connections between the sensor nodes should provide us with more information to handle the missing values.
Moreover, one-step processing models are generally designed for single-step forecasting without considering multi-step settings. Table 1 shows the method comparison for traffic forecasting with missing values.
3 Problem formulation
We aim to predict future traffic data by leveraging historical traffic data. Traffic data can be represented as a multivariate time series on a traffic network. Let the traffic network be \(\mathcal {G=\{V, E\}}\), where \({\mathcal {V}}=\{v_1, \ldots , v_N\}\) is a set of N traffic sensor nodes and \({\mathcal {E}}=\{e_1, \ldots , e_E\}\) is a set of E edges connecting the nodes. Each node contains F features representing traffic flow, speed, occupancy, etc. We use \({\mathcal {X}}\)=\(\{\textbf{X}_t\}_{t=1}^{\tau }\in {\mathcal {R}}^{N \times F \times \tau }\) to denote all the feature values of all the nodes over \(\tau \) time slices; \(\textbf{X}_{t}=({\textbf {x}}_{t}^{1},\ldots ,{\textbf {x}}_{t}^{N}) \in {\mathcal {R}}^{N \times F}\) denotes the observations at time t, where \({\textbf {x}}_t^{i}\in {\mathcal {R}}^{F}\) is the ith variable of \(\textbf{X}_t\). We define a mask sequence \({\mathcal {M}}\)=\(\{\textbf{M}_t\}_{t=1}^{\tau }\in {\mathcal {R}}^{N \times F \times \tau }\), \(\textbf{M}_{t}=({\textbf {m}}_{t}^{1},\ldots ,{\textbf {m}}_{t}^{N}) \in {\mathcal {R}}^{N \times F}\). \({\textbf {m}}_t^{i} \in \{0,1\}^{F}\) denotes the features' missing status for the ith variable. To simplify, we adopt \(x_t^i \in {\mathcal {R}}\) and \(m_t^i \in {\mathcal {R}}\) to denote respectively the observation and mask value of one single feature for the ith variable of \(\textbf{X}_t\). We take \(m_t^i=0\) if \(x_t^i\) is missing, otherwise \(m_t^i=1\).
We aim to build a model f, which can take an incomplete traffic sequence {\({\mathcal {X}}\), \({\mathcal {M}}\)} and the traffic network \({\mathcal {G}}\) as input, to predict the traffic data for the next \(T_{p}\) time steps \(\textbf{Y}=\{y_{\tau +1}, \ldots , y_{\tau +T_{p}}\} \in {\mathcal {R}}^{N\times T_{p}}\).
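The mask construction above can be sketched in a few lines. This is our illustration, not the paper's code: we assume, for demonstration only, that the collection pipeline marks missing entries as NaN before masking, whereas the raw traffic data marks them as zero.

```python
import numpy as np

def build_mask(X):
    """Build the mask sequence M from observations X of shape (N, F, tau).

    m = 1 for observed entries, m = 0 for missing ones; missing entries are
    then re-marked as zero, matching the zero-marking convention of the data.
    """
    M = (~np.isnan(X)).astype(float)
    X_filled = np.nan_to_num(X, nan=0.0)
    return X_filled, M

# N = 1 node, F = 1 feature, tau = 3 time slices; the middle value is missing
X = np.array([[[1.0, np.nan, 3.0]]])
X_filled, M = build_mask(X)
# M == [[[1., 0., 1.]]] and X_filled == [[[1., 0., 3.]]]
```

Note that this convention cannot by itself distinguish inherent zeros from missing zeros, which is exactly the ambiguity the local-global features in Sect. 4.2 address.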
4 Proposal: GCNM
Traffic data is collected under complex urban conditions. Apart from the spatiotemporal patterns in the traffic data, we also consider scenarios with complex missing values. We design a solution that models the local spatiotemporal features and global historical patterns in a dynamic manner. The complex missing values are considered when building the forecasting model, i.e., one-step processing.
4.1 Model architecture
The global structure of GCNM is shown in Fig. 2, integrating a Multiscale Memory Network module, an Output Forecasting module, and \(\textit{l}\) Spatio-Temporal (ST) blocks. Each ST block integrates three key components: Temporal Convolution, Dynamic Graph Construction, and Dynamic Graph Convolution. The input traffic observations \({\mathcal {X}}\in {\mathcal {R}}^{N \times F \times \tau }\) and the mask sequence \({\mathcal {M}}\in {\mathcal {R}}^{N \times F \times \tau }\) are fed into the multiscale memory network to extract the local statistical features and global historical patterns, thus enriching the traffic embeddings. On the one hand, the enriched embeddings \({\mathcal {H}}_i\) at each ST block are used to mark the dynamic traffic status, thus generating dynamic graphs by combining both static node embeddings and predefined graph information. On the other hand, the learned dynamic graphs are combined with the temporal convolution module via a dynamic graph convolution to capture temporal and spatial dependencies in the traffic embeddings. We adopt residual connections between the input and output of each ST block to avoid the gradient vanishing problem. The output forecasting module takes the skip connections on the output of the final ST block and the hidden states after each temporal convolution for the final prediction.
4.2 Multiscale memory network
To extract the local statistical features and global historical patterns and form an enriched embedding, we adopt the concept of a memory network, which was first proposed in (Weston et al. 2015) with its primary application in Question-Answer (QA) systems. As shown in Fig. 3, the main idea of our memory network is to learn from historical memory components which conserve the long-range multiscale patterns, i.e., recent, daily-periodic, and weekly-periodic dependencies. The scale range depends on the data characteristics. Specifically, we first extract local spatiotemporal features as keys to query the memory components; the weighted historical long-range patterns are then combined with the local statistical features to eliminate the side effects of the missing values. Finally, the local-global features are output as the enriched traffic embeddings.
4.2.1 Local spatiotemporal features
We first extract the spatiotemporal features using the contextual information from the observed parts of the time series. Unlike prior studies (Che et al. 2018), we consider both temporal and spatial aspects for generating the following statistical features at every timestamp:
Empirical Temporal Mean: The mean of previous observations reflects the recent traffic state and serves as contextual knowledge for \(x_{t}^{i}\). Therefore, for a missing value \(x_{t}^{i}\in {\mathcal {R}}\), we construct its temporal mean using the L past samples \(x_*^{i}\) before time t:
Last Temporal Observation: We adopt the assumption in (Che et al. 2018) that any missing value inherits, more or less, the information from the last non-missing observation. In other words, the temporal neighbor stays close to the current missing value. We use \(\dot{x}_{t}^{i}\) to denote the last temporal observation of \(x_{t}^{i}\); their temporal distance is defined as \(\dot{\delta }_{t}^{i}\).
Empirical Spatial Mean: Another source of contextual knowledge for \(x_{t}^{i}\) comes from the nearby nodes, which reflect the current local traffic situation. For each missing value \(x_t^i\), we construct its empirical spatial mean using S nearby samples \(x_t^*\) around sensor node i:
Nearest Spatial Observation: Typically, the state of a graph node remains relatively similar to its neighbors, especially in a traffic graph where nearby nodes share similar traffic situations. We define \(\ddot{x}_{t}^{i}\) as the nearest spatial observation of \(x_{t}^{i}\); their spatial distance is denoted as \(\ddot{\delta }_{t}^{i}\).
Generally, when \(\dot{\delta }_{t}^{i}\) or \(\ddot{\delta }_{t}^{i}\) is smaller, we tend to trust \(\dot{x}_{t}^{i}\) or \(\ddot{x}_{t}^{i}\) more. When the spatial/temporal distance becomes larger, the spatial/temporal mean would be more representative. Under this assumption, we model the temporal and spatial decay rate \(\gamma \) as
where \(w^i\), \(w_t\), \(b^i\) and \(b_t\) are model parameters that we train jointly with other parameters of the traffic forecasting model. We chose the exponentiated negative rectifier (Che et al. 2018) so that the decay rates \(\gamma _{t}\) and \(\gamma _{s}\) decrease monotonically in the range between 0 and 1. Considering the trainable decays, our proposed model incorporates the spatial/temporal estimations to define the local features of \(x_{t}^{i}\):
Therefore, for \(\textbf{X}_t\in {\mathcal {R}}^{N \times F}\), we can get its local features \(\textbf{Z}_t\in {\mathcal {R}}^{N \times F}\).
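The decay-weighted blend described above can be sketched for the temporal branch (the spatial branch is symmetric). This is a minimal illustration, not the paper's implementation: the decay parameters `w` and `b` are trainable in the model, but fixed to illustrative values here.

```python
import numpy as np

def decay(delta, w=0.5, b=0.0):
    """Exponentiated negative rectifier: monotonically decreasing in (0, 1]."""
    return np.exp(-np.maximum(0.0, w * delta + b))

def local_temporal_feature(last_obs, delta, past_window):
    """Blend the last observation with the empirical temporal mean.

    A small temporal distance delta gives a decay close to 1, so the
    estimate trusts the last observation; a large delta shifts the
    estimate toward the empirical mean over the L past samples.
    """
    gamma_t = decay(delta)
    mean_t = np.mean(past_window)
    return gamma_t * last_obs + (1.0 - gamma_t) * mean_t

# Example: last observed speed 60.0, one step ago, recent window of 3 samples
z = local_temporal_feature(last_obs=60.0, delta=1.0, past_window=[55.0, 58.0, 61.0])
# z lies between the window mean (58.0) and the last observation (60.0)
```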
4.2.2 Multiscale memory construction
Global historical patterns play a critical role in building an enriched traffic embedding. The historical observations in multiple scales (e.g., hourly, daily, weekly) can be embedded into memory as complement information for the local features \(\textbf{Z}_{t}\in {\mathcal {R}}^{N\times F}\). The main idea is to adopt local features to query similar historical patterns in the memory and output a weighted feature representation for the current timestamp. In this manner, the enriched multiscale historical and local features allow not only eliminating the side effect of missing values but also improving the current feature embeddings. At time t, the query \(q_{t}\) of \(\textbf{X}_{t}\) can be embedded from the local features \(\textbf{Z}_{t}\in {\mathcal {R}}^{N\times F}\):
where \(W_{q}\in {\mathcal {R}}^{F\times d}\), \(b_{q}\in {\mathcal {R}}^{N\times d}\) are parameters, d is the embedding dimension.
The input memory components are the temporal segments of multiple scales:

The recent (e.g., hourly) segment is: \(X_h = \{\textbf{X}_{i}\}_{i=t-n_{h}\tau }^{t-1} \in {\mathcal {R}}^{N \times F \times n_{h}\tau }\), with \(n_{h}\) recent periods (e.g., hours) before t; each period contains \(\tau \) observations.

The daily-periodic segment is: \(X_d\) = \(\{\textbf{X}_i\}\in {\mathcal {R}}^{N \times F \times n_{d}\tau }\) with \(i\in \) \([t-n_{d}T_{d}-\tau /2:t-n_{d}T_{d}+\tau /2]\) \(\Vert \) \([t-(n_{d}-1)T_{d}-\tau /2:t-(n_{d}-1)T_{d}+\tau /2]\) \(\Vert \) ...\(\Vert \) \([t-T_{d}-\tau /2:t-T_{d}+\tau /2]\); we store \(\tau \) samples around time t for each of the past \(n_d\) days. \(T_d\) denotes the number of samples during one day, and \(\Vert \) indicates the concatenation operation.

The weekly-periodic segment is: \(X_w\) = \(\{\textbf{X}_i\}\in {\mathcal {R}}^{N \times F \times n_{w}\tau }\) with \(i\in \) \([t-n_{w}T_{w}-\tau /2:t-n_{w}T_{w}+\tau /2]\) \(\Vert \) \([t-(n_{w}-1)T_{w}-\tau /2:t-(n_{w}-1)T_{w}+\tau /2]\) \(\Vert \) ...\(\Vert \) \([t-T_{w}-\tau /2:t-T_{w}+\tau /2]\); we store \(\tau \) samples around time t for each of the past \(n_w\) weeks. \(T_w\) denotes the number of samples during one week, and \(\Vert \) indicates the concatenation operation.
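Our reading of the segment definitions above can be sketched as index arithmetic. The values below (5-minute sampling, so \(T_d = 288\)) are illustrative, not taken from the paper.

```python
def periodic_indices(t, tau, n, T):
    """Indices of tau samples centred on t - k*T, for k = n, ..., 1.

    T is the number of samples per period (day or week); n is the number
    of past periods stored in memory.
    """
    idx = []
    for k in range(n, 0, -1):
        c = t - k * T               # same clock time, k periods earlier
        idx.extend(range(c - tau // 2, c + tau // 2))
    return idx

def recent_indices(t, tau, n_h):
    """The n_h * tau samples immediately preceding t."""
    return list(range(t - n_h * tau, t))

# Example: tau = 12, n_d = 2 past days, T_d = 288 samples/day, at t = 1000
daily = periodic_indices(1000, 12, 2, 288)
recent = recent_indices(1000, 12, 2)
assert len(daily) == 2 * 12 and len(recent) == 2 * 12
```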
The input set \(\{\textbf{X}_i\}\) = \([X_h \Vert X_d \Vert X_w] \in {\mathcal {R}}^{N \times F \times (n_d+n_w+n_h)\tau }\) is embedded into the input memory vectors \(\{m_i\}\) and output memory vectors \(\{c_i\}\):
where \(W_{m},W_{c}\in {\mathcal {R}}^{F\times d}\), \(b_{m},b_{c}\in {\mathcal {R}}^{N\times d}\) are parameters.
In the embedding space, we compute the attention score between the query \(q_{t}\) and each memory \(m_{i}\) by taking the inner product followed by a Softmax:
The attention score represents the similarity of each historical observation to the query. Any pattern with a higher attention score is more similar to the context of the targeted missing values. As shown in Fig. 3, the response vector from memory is then a sum over the output memory vectors, weighted by the attention scores from the input:
We can finally integrate both the local spatiotemporal and global multiscale features and output the enriched traffic embeddings:
where \(W_{h}\in {\mathcal {R}}^{2d\times d}\), \(b_{h}\in {\mathcal {R}}^{d}\) are parameters, and \(\Vert \) denotes the concatenation operation. Therefore, for input \({\mathcal {X}}=\{\textbf{X}_t\}_{t=1}^{\tau }\in {\mathcal {R}}^{N \times F \times \tau }\), we can get its enriched traffic embeddings \({\mathcal {H}}=\{h_t\}_{t=1}^{\tau }\in {\mathcal {R}}^{N \times d \times \tau }\).
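The memory read-out described in this subsection can be sketched with plain numpy: an inner product between the query and the input memory vectors, a Softmax, and a weighted sum over the output memory vectors. Shapes and values are illustrative only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_readout(q, M_in, M_out):
    """Attention-weighted response from the memory.

    q: (d,) query embedded from the local features.
    M_in, M_out: (num_memories, d) input/output memory vectors.
    """
    scores = softmax(M_in @ q)   # inner product followed by Softmax
    return scores @ M_out        # weighted sum over output memory vectors

rng = np.random.default_rng(0)
q = rng.normal(size=4)
M_in = rng.normal(size=(6, 4))
M_out = rng.normal(size=(6, 4))
r = memory_readout(q, M_in, M_out)   # response vector of dimension d = 4
```

The response `r` is then concatenated with the local features and projected by \(W_h\), \(b_h\) to form the enriched embedding.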
4.3 Dynamic graph construction
A predefined graph is usually constructed with the distance or connectivity between the spatial nodes. However, recent studies (Wang et al. 2020; Wu et al. 2020; Li et al. 2021) show that cross-region dependence does exist between nodes that are not physically connected but share similar patterns. Learning dynamic graphs should therefore show better performance than learning static graphs or adopting predefined graphs. Considering the missing values in traffic data, instead of using the raw traffic observations to mark the dynamic traffic status (Li et al. 2021; Han et al. 2021b), we construct dynamic graphs (i.e., adjacency matrices) from the enriched traffic embeddings \({\mathcal {H}}_i\) at each ST block, which integrate both local and global multiscale patterns at each time step. This allows capturing the spatial relationships between traffic nodes robustly. As shown in Fig. 4, the main idea is to generate dynamic filters from the predefined graph \({\mathcal {G}}\) and the traffic embeddings \({\mathcal {H}}_i\in {\mathcal {R}}^{N \times d \times \tau _{i}}\) (\(\tau _{i}\) is the sequence length at the ith ST block), which are applied to randomly initialized static node embeddings to construct dynamic adjacency matrices. In more detail, the core steps in Fig. 4 are as follows:
[Dynamic Filter Generation] Given \({\mathcal {H}}_{i}=\{h_{t}\}\in {\mathcal {R}}^{N \times d \times \tau _{i}}\), the traffic embedding \(h_{t}\) at time t is first combined with the predefined adjacency matrix \(A_{{\mathcal {G}}}\in {\mathcal {R}}^{N\times N}\) to generate dynamic graph filters via a diffusion convolution layer as proposed in Li et al. (2018):
where K denotes the diffusion step, \(P_{k}\)= \(A_{{\mathcal {G}}}\)/\(rowsum(A_{{\mathcal {G}}})\) represents the power series of the transition matrix (Wu et al. 2019), and \(W_{k}\in {\mathcal {R}}^{d\times d}\) is the model parameter matrix.
[Hybrid Node Embedding Construction] Considering both the source and target traffic nodes, we initialize two random node embeddings \(E^1,E^2 \in {\mathcal {R}}^{N\times d}\), representing the static node features (Wang et al. 2020) that are not reflected in the observations but learnable during training. Two dynamic filters are then applied over the static node embeddings:
where \(\odot \) denotes the Hadamard product (Wu et al. 2019). \(\hat{E}_{t}^{1}\) and \(\hat{E}_{t}^{2}\) are hybrid node embeddings combining both static and dynamic settings of the traffic data.
[Graph Construction] As mentioned in a previous study (Wu et al. 2020), in multivariate time series forecasting, we expect that a change in one node's condition causes a change in another node's condition, such as traffic flow; the learned relationship is therefore supposed to be unidirectional. We construct the graph by extracting unidirectional relationships between traffic nodes. The dynamic adjacency matrix is constructed from the hybrid embeddings:
Therefore, we can construct the dynamic graphs \(A_{{\mathcal {D}}_{i}}=\{A_{t}\}\in {\mathcal {R}}^{N \times N \times \tau _{i}}\) for the enriched traffic embeddings \({\mathcal {H}}_i\in {\mathcal {R}}^{N \times d \times \tau _{i}}\) at the ith ST block. As the computation and memory cost grows quadratically with the graph size, in practice it is possible to adopt a sampling approach (Wu et al. 2020), which only calculates pairwise relationships among a subset of nodes.
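A hedged sketch of the graph-construction step: the exact output equation is not reproduced here, so we follow the unidirectional construction of Wu et al. (2020) as one plausible instantiation, with the dynamic filters entering via the Hadamard product described above. All tensors are illustrative random values.

```python
import numpy as np

def dynamic_adjacency(E1, E2, F1, F2, alpha=1.0):
    """Construct a unidirectional dynamic adjacency matrix at one timestamp.

    E1, E2: (N, d) static source/target node embeddings.
    F1, F2: (N, d) dynamic filters generated from the traffic embeddings.
    """
    E1_hat = E1 * F1                  # Hadamard product -> hybrid embedding
    E2_hat = E2 * F2
    # Antisymmetric score, then ReLU: at most one direction per node pair
    raw = np.tanh(alpha * (E1_hat @ E2_hat.T - E2_hat @ E1_hat.T))
    return np.maximum(raw, 0.0)

rng = np.random.default_rng(1)
N, d = 5, 8
E1, E2 = rng.normal(size=(N, d)), rng.normal(size=(N, d))
F1, F2 = rng.normal(size=(N, d)), rng.normal(size=(N, d))
A_t = dynamic_adjacency(E1, E2, F1, F2)
# Unidirectional by construction: A_t[i, j] > 0 implies A_t[j, i] == 0
```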
4.4 Temporal convolution module
The temporal convolution network (TCN) (Lea et al. 2017) consists of multiple dilated convolution layers, which allows extracting high-level temporal trends. Compared to RNN-based approaches, dilated causal convolution networks are capable of handling long-range sequences in a parallel manner. The output of the last layer is a representation that captures the temporal dynamics in history. As shown in Fig. 5, considering the temporal dynamics in traffic data, we adopt the temporal convolution module of Wu et al. (2019) with a gating mechanism over the enriched traffic embeddings \({\mathcal {H}}_i\). One dilated convolution block is followed by a hyperbolic tangent activation function to output the temporal features. The other block is followed by a sigmoid activation function acting as a gate to determine the ratio of information that can pass to the next module. In particular, the sigmoid gate controls which inputs of the current states are relevant for discovering the compositional structure and dynamic variances in the time series. Applying the sigmoid nonlinearity on the input states differs from other well-known architectures (e.g., LSTM or GRU), which ignore the compositional structure features in time series (Yu et al. 2018).
Given the enriched traffic embeddings \({\mathcal {H}}_i=\{h_{t}\}\in {\mathcal {R}}^{N \times d \times \tau _{i} }\) and a filter \({\mathcal {F}}\in {\mathcal {R}}^{1 \times \textrm{K}}\), where \(\textrm{K}\) is the temporal filter size (\(\textrm{K}=2\) by default), the dilated causal convolution operation of \({\mathcal {H}}_i\) with \({\mathcal {F}}\) at time t is represented as:
where \(\star \) is the convolution operator, \({\textbf {d}}\) is the dilation factor, d is the embedding dimension, and \(\tau _{i+1}\) is the new sequence length after the convolution operation, which equals one on the last layer. Figure 5 shows a three-layer dilated convolution block with \(\textrm{K}=2\) and \({\textbf {d}}\in [1,2,4]\). Considering the gating mechanism, we define the output of the temporal convolution module as:
where \(W_{{\mathcal {F}}^{1}}\) and \(W_{{\mathcal {F}}^{2}}\) are learnable convolution filter parameters, \(\odot \) denotes the element-wise multiplication operator, and \(\sigma (\cdot )\) is the sigmoid function.
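One gated dilated causal convolution step can be sketched for a single scalar channel (the model applies it per node and per embedding dimension). This is an illustrative toy with \(\textrm{K}=2\) taps, not the paper's implementation.

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """1-D causal convolution with K = 2 taps.

    Each output at position t mixes x[t] with x[t - dilation], so the
    output length shrinks by `dilation` relative to the input.
    """
    return np.array([w[0] * x[t - dilation] + w[1] * x[t]
                     for t in range(dilation, len(x))])

def gated_tcn(x, w_filter, w_gate, dilation=1):
    """tanh feature branch gated element-wise by a sigmoid branch."""
    f = np.tanh(dilated_causal_conv(x, w_filter, dilation))
    g = 1.0 / (1.0 + np.exp(-dilated_causal_conv(x, w_gate, dilation)))
    return f * g

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
out = gated_tcn(x, w_filter=np.array([0.5, 0.5]),
                w_gate=np.array([0.1, 0.1]), dilation=2)
assert out.shape == (3,)   # sequence shrinks by the dilation factor
```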
A classic temporal convolution module stacks the temporal features at each time step t; therefore, the upper layers contain richer information than the lower layers. The gating mechanism filters the temporal features on the lower layers by weighting features at different time steps, without considering the spatial node interactions at each time step. Moreover, the spatial interactions in traffic data always show a dynamic nature (Wu et al. 2020). To this end, a gating mechanism from a dynamic spatial aspect is envisaged to better capture the spatiotemporal patterns.
4.5 Dynamic graph convolution
Spatial interactions between the traffic nodes can be used to improve traffic forecasting performance. The dynamic nature of these interactions leads us to consider a dynamic version of graph convolution, conducted on different graphs at different timestamps. Different from previous work (Li et al. 2021), which uses raw traffic observations to mark the dynamic traffic status, we adopt the enriched traffic embeddings, which account for the missing-value issues, to generate robust dynamic graphs.
As shown in Fig. 5, we apply the dynamic graph convolution on \({\textbf {h}}_i\), i.e., the output of the temporal convolution module, to further select the features at each time step from the spatial perspective. As mentioned in Sect. 4.3, the dynamic graphs \(A_{{\mathcal {D}}_{i}}\in {\mathcal {R}}^{N \times N \times \tau _{i}}\) are generated from the enriched traffic embeddings \({\mathcal {H}}_i\in {\mathcal {R}}^{N \times d \times \tau _{i}}\) at the ith ST block. \(A_t\) reflects the spatial relationships between nodes at time t. The temporal features \({\textbf {h}}_i(t)\) aggregate spatial information according to the adjacency matrix \(A_t\). Inspired by DCRNN (Li et al. 2018), we model the traffic situation as a diffusion procedure on the graph. The graph convolution generates the aggregated spatial information at each time step:
where K denotes the diffusion step, and \(W_k\) is the learnable parameter matrix. We adopt the residual connection (He et al. 2016) between the input and output of each ST block to avoid the gradient vanishing issue in the model’s training. Therefore, the input of the \((i+1)^{th}\) ST block is defined as:
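The per-timestamp diffusion aggregation described above can be sketched as a sum over powers of the row-normalised transition matrix. This is our minimal reading of the diffusion convolution, with random illustrative weights \(W_k\).

```python
import numpy as np

def diffusion_conv(h_t, A_t, W_list):
    """Aggregate node features over K diffusion steps on the dynamic graph.

    h_t: (N, d) temporal features at time t.
    A_t: (N, N) dynamic adjacency at time t.
    W_list: K+1 weight matrices, one per diffusion power (k = 0..K).
    """
    # Transition matrix: row-normalised adjacency (guard against empty rows)
    P = A_t / np.maximum(A_t.sum(axis=1, keepdims=True), 1e-8)
    out = np.zeros((h_t.shape[0], W_list[0].shape[1]))
    P_k = np.eye(A_t.shape[0])            # P^0 = identity
    for W_k in W_list:
        out += P_k @ h_t @ W_k
        P_k = P_k @ P                     # advance to the next diffusion step
    return out

rng = np.random.default_rng(2)
N, d, K = 4, 8, 2
h_t = rng.normal(size=(N, d))
A_t = np.abs(rng.normal(size=(N, N)))
out = diffusion_conv(h_t, A_t, [rng.normal(size=(d, d)) for _ in range(K + 1)])
```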
4.6 Output forecasting module
The outputs \({\textbf {h}}_{i} \in {\mathcal {R}}^{N \times d \times \tau _{i+1} }\) of the middle temporal convolution modules and \({\mathcal {H}}_{l}\in {\mathcal {R}}^{N \times d \times 1 }\) of the last ST block are considered for the final prediction; they represent the hidden states at various spatiotemporal levels. We add skip connections on each of the hidden states, which are essentially \(1 \times \tau _{i+1}\) standard convolutions (\(\tau _{i+1}\) denotes the sequence length at the output of the ith ST block). The concatenated output features are defined as follows:
where \(O \in {\mathcal {R}}^{N\times (l+1)d}\), and \(W_{s}^{i}\), \(b_{s}^{i}\) are learnable parameters of the convolutions. Two fully-connected layers are added to project the concatenated features into the desired output dimension:
where \(W_{fc}^{1}\), \(W_{fc}^{2}\), \(b_{fc}^{1}\), \(b_{fc}^{2}\) are learnable parameters of the fully-connected layers, N is the number of nodes, and \(T_p\) denotes the forecasting steps.
Given the ground truth \(\textbf{Y}\in {\mathcal {R}}^{N\times T_{p}}\) and the predictions \(\hat{{\textbf {Y}}}\in {\mathcal {R}}^{N\times T_{p}}\), we use mean absolute error (MAE) as our model’s loss function for training:
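The MAE objective stated above is simply the mean absolute deviation over all N nodes and \(T_p\) forecasting steps; a minimal sketch:

```python
import numpy as np

def mae_loss(Y, Y_hat):
    """Mean absolute error between ground truth and predictions,
    averaged over all nodes and forecasting steps."""
    return np.mean(np.abs(Y - Y_hat))

Y = np.array([[1.0, 2.0], [3.0, 4.0]])       # ground truth, (N=2, T_p=2)
Y_hat = np.array([[1.5, 2.0], [2.0, 4.0]])   # predictions
assert mae_loss(Y, Y_hat) == 0.375           # (0.5 + 0 + 1 + 0) / 4
```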
5 Experiments
In this section, we demonstrate the effectiveness of GCNM on real-life traffic datasets. The experiments were designed to answer the following research questions (RQs):
RQ 1: Performance on raw benchmark datasets: How well does our model perform on traffic datasets with few or no missing values?

RQ 2: Complex scenarios of missing values: How successful is our model at forecasting traffic data under complex missing-value scenarios?

RQ 3: Dynamic graph modeling: How does our method perform on dynamic graph modeling in the presence of missing values?

RQ 4: One-step processing vs. two-step processing: How does our method perform when adopting distinct missing-value processing strategies?
5.1 Experimental settings
[Datasets] We base our experiments on the public traffic datasets PEMS-BAY and METR-LA released by Li et al. (2018), which are widely used in the literature. PEMS-BAY records six months of traffic speed on 325 sensors in the California Bay Area. METR-LA records four months of traffic flow on 207 sensors on the highways of Los Angeles County. Both datasets contain some zero and/or missing values, though PEMS-BAY has been preprocessed by the domain experts from the data provider (Caltrans 2015) to interpolate most of the missing values. Following Li et al. (2018), the datasets are split with 70% for training, 10% for validation, and 20% for testing. In order to validate the model in complex scenarios of missing values, we introduce complex missing values into the datasets (see details in Sect. 5.4). In practice, the model should forecast future values from input data containing missing values. Therefore, in the testing set, we mask out observations from the input sequence \({\mathcal {X}}\) (i.e., inject missing values) but maintain the complete information for the target \(\textbf{Y}\). We use the recent \(\tau = 12\) timestamps as input to predict the next \(T_p\) timestamps. Considering that the missing values are marked as zeros, we scale the input by dividing it by the maximum speed of the training set instead of applying Z-score normalization. This avoids changing the zero values and facilitates the computation process. Table 2 shows the summary statistics of the datasets.
[Evaluation metrics] The forecasting accuracy of all tested models is evaluated by three metrics: mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE), defined as
\(\mathrm{MAE} = \frac{1}{N T_p} \sum_{i=1}^{N} \sum_{t=1}^{T_p} \left| y_{i}^{t} - \hat{y}_{i}^{t} \right|\), \(\mathrm{RMSE} = \sqrt{\frac{1}{N T_p} \sum_{i=1}^{N} \sum_{t=1}^{T_p} \left( y_{i}^{t} - \hat{y}_{i}^{t} \right)^{2}}\), \(\mathrm{MAPE} = \frac{100\%}{N T_p} \sum_{i=1}^{N} \sum_{t=1}^{T_p} \left| \frac{y_{i}^{t} - \hat{y}_{i}^{t}}{y_{i}^{t}} \right|\),
where \(y_{i}^{t}\) and \(\hat{y}_{i}^{t}\) denote the observed and predicted values of node i at forecasting step t, N denotes the number of nodes, and \(T_p\) represents the number of forecasting steps.
When evaluating each model’s performance on the testing set, we mask out the inherent zero values in the prediction targets when computing the metrics.
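The masked evaluation described above can be sketched as follows; `masked_metrics` is an illustrative helper (not the paper's released code) that excludes zero-valued targets from the metric computation:

```python
import numpy as np

def masked_metrics(y_true, y_pred):
    """Compute MAE/RMSE/MAPE over non-zero targets only, mirroring the
    masking of inherent zero values in the prediction targets."""
    mask = y_true != 0
    err = y_pred[mask] - y_true[mask]
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    mape = np.abs(err / y_true[mask]).mean() * 100
    return mae, rmse, mape

y_true = np.array([0.0, 50.0, 60.0])   # the leading zero is excluded
y_pred = np.array([10.0, 45.0, 66.0])
mae, rmse, mape = masked_metrics(y_true, y_pred)
```

Only the two non-zero targets contribute to the errors; the prediction at the masked position is ignored entirely.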
We conduct statistical tests to assess the statistical significance of the differences between the models. In order to compare the forecasters over multiple datasets (Ismail Fawaz et al. 2019), we adopted the critical difference diagrams recommended by Demšar (2006) and used the Friedman test (Friedman 1940) to reject the null hypothesis (i.e., to check whether there are significant differences at all). We followed the pairwise post-hoc analysis recommended by Benavoli et al. (2016) and adapted the critical difference diagrams (Demšar 2006), with the change that all forecasters are compared with the pairwise Wilcoxon signed-rank test (Wilcoxon 1992). Additionally, we formed cliques using Holm's alpha (5%) correction (Holm 1979) rather than the post-hoc Nemenyi test originally used in Demšar (2006).
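A minimal sketch of this testing pipeline with SciPy, using fabricated scores for three hypothetical forecasters over eight datasets; the Friedman test serves as the omnibus test, and the pairwise Wilcoxon signed-rank tests are corrected with Holm's step-down procedure:

```python
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

# Fabricated MAE scores of three hypothetical forecasters on eight datasets
# (for demonstration only; lower is better).
scores = {
    "A": [3.10, 2.90, 4.00, 3.50, 3.20, 2.70, 3.80, 3.40],
    "B": [3.30, 3.05, 4.40, 3.72, 3.31, 2.95, 4.01, 3.57],
    "C": [4.02, 3.71, 4.93, 4.36, 4.15, 3.52, 4.70, 4.28],
}

# Omnibus Friedman test: are there any significant differences at all?
stat, p_friedman = friedmanchisquare(*scores.values())

# Pairwise Wilcoxon signed-rank tests, sorted by ascending p-value.
p_vals = sorted(
    (wilcoxon(scores[a], scores[b]).pvalue, (a, b))
    for a, b in combinations(scores, 2)
)

# Holm's step-down correction at alpha = 5%.
alpha, significant = 0.05, []
for i, (p, pair) in enumerate(p_vals):
    if p < alpha / (len(p_vals) - i):
        significant.append(pair)
    else:
        break  # Holm stops at the first non-rejected hypothesis
```

With these fabricated scores, every forecaster is consistently ranked across datasets, so the omnibus test rejects and all three pairwise comparisons survive Holm's correction.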
[Execution and Parameter Settings] The proposed model is implemented in PyTorch 1.6.0 and trained using the Adam optimizer with a learning rate of 0.001. All the models are tested on a single Tesla V100 GPU with 32 GB of memory. In the multi-scale memory module, L and S are set to 12 and 5, respectively. \(n_h\), \(n_d\), \(n_w\) are all set to 2. We apply four ST blocks, in which the Temporal Convolution module contains two dilated layers with dilation factors \({\textbf {d}} \in \left[ 1,2\right] \). The embedding dimension d is set to 32.
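The two dilated temporal layers above can be sketched as follows in PyTorch; the module structure and tensor shapes are illustrative assumptions, not the exact released implementation:

```python
import torch
import torch.nn as nn

class TemporalConv(nn.Module):
    """Sketch of a temporal convolution module with two dilated layers
    (dilation factors 1 and 2), following the settings described above.
    Layer names and sizes are illustrative, not the paper's exact code."""
    def __init__(self, d_model=32, kernel_size=2, dilations=(1, 2)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(d_model, d_model, (1, kernel_size), dilation=(1, d))
            for d in dilations
        )

    def forward(self, x):  # x: (batch, d_model, nodes, time)
        for conv in self.layers:
            x = torch.relu(conv(x))
        return x

x = torch.randn(8, 32, 207, 12)   # a METR-LA-like batch: 207 nodes, 12 steps
y = TemporalConv()(x)
# Each layer shrinks the time axis by (kernel_size - 1) * dilation:
# 12 -> 11 (dilation 1) -> 9 (dilation 2).
```

Stacking dilations this way grows the temporal receptive field without pooling, which is why only a few layers suffice for a 12-step input.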
5.2 Baseline approaches
We only compare with the baseline models whose source code is publicly available. We follow the default parameter settings described in each paper for training each model. According to the strategy for handling missing values, the baseline models can be organized into two categories:

1.
Ignore the missing values when optimizing the model, i.e., consider missing values in the input sequence as actual zero values and mask out the missing values when computing the loss error:

DCRNN (Li et al. 2018): Based on the predefined graphs, DCRNN integrates GRU with dual-directional diffusion convolution.

STGCN (Yu et al. 2018): Based on the predefined graphs, STGCN combines graph convolutions with 1D temporal convolutions.

Graph WaveNet (Wu et al. 2019): Graph WaveNet learns an adaptive graph and integrates diffusion graph convolutions with temporal convolutions.

MTGNN (Wu et al. 2020): MTGNN learns an adaptive graph and integrates mix-hop propagation layers in the graph convolution module. Moreover, it designs dilated inception layers for the temporal convolutions.

AGCRN (Bai et al. 2020): AGCRN learns an adaptive graph and integrates it with recurrent graph convolutions via node-adaptive parameter learning.

GTS (Shang et al. 2020): GTS learns a probabilistic graph, which is combined with recurrent graph convolutions for traffic forecasting.
 Note:

In practice, any model can ignore the missing values in their optimization process. We list here some classic models and the most recent models designed specifically for traffic forecasting.


2.
Jointly model the missing values and forecasting task, i.e., onestep processing models:

GRU (Chung et al. 2014): Gated Recurrent Unit (GRU) can be considered as a basic structure for traffic forecasting.

GRU-I (Che et al. 2018): A variation of GRU, which infers the missing values with the predictions from previous steps.

GRU-D (Che et al. 2018): Based on GRU, GRU-D helps improve the prediction performance by incorporating the missing patterns, including the masking information and time intervals between missing and observed values.

LSTM-I (Cui et al. 2020b): Based on LSTM, LSTM-I is similar to GRU-I for inferring the missing values.

LSTM-M (Tian et al. 2018): Based on LSTM, LSTM-M is designed for traffic forecasting on data with short-period and long-period missing values.

SGMN (Cui et al. 2020b): Based on the graph Markov process, SGMN performs traffic forecasting on data with random missing values by incorporating a spectral graph convolution.

5.3 RQ 1: performance on raw benchmark datasets
Recently, many traffic forecasting models (Jiang and Luo 2022) have been proposed, achieving remarkable performance on the benchmark datasets PEMS-BAY and METR-LA. Our objective is not to beat all the models in terms of forecasting accuracy, but to validate our proposal for jointly modeling missing values and forecasting. Therefore, it is essential to know how GCNM performs in a primary setting, i.e., on the original datasets with few or no missing values. We pick three classic models (DCRNN (Li et al. 2018), STGCN (Yu et al. 2018) and Graph WaveNet (Wu et al. 2019)) and three recent models (MTGNN (Wu et al. 2020), AGCRN (Bai et al. 2020) and GTS (Shang et al. 2020)), which focus on the spatiotemporal modeling of traffic data and generally ignore the missing values when training the model. Additionally, we consider the group of works (Che et al. 2018; Cui et al. 2020b; Tian et al. 2018) that are specifically designed for modeling the missing values inside the forecasting model, i.e., one-step processing models.
Tables 3 and 4 show the performance comparison on the raw PEMS-BAY and METR-LA datasets, respectively. It should be noted that the original datasets already contain missing values (0.0031% missing in PEMS-BAY, 8.11% in METR-LA). We train the models for single-step (horizon = 1) and multi-step (horizon = 3, 6, 12) forecasting, and report the evaluation errors at each horizon step. We observe from the results that no model achieves evidently better performance than the others. However, the first group of works (e.g., DCRNN) performs better than the one-step processing models, which is not surprising as they incorporate advanced graph models (e.g., mix-hop propagation (Wu et al. 2020)) and training techniques (e.g., curriculum learning (Wu et al. 2020)) to improve the spatiotemporal forecasting performance. Surprisingly, among the one-step processing models, GRU-D (Che et al. 2018) shows much worse performance than the others, probably because it was designed for health care applications, whose data is more stable than dynamic traffic data. LSTM-M (Tian et al. 2018) and SGMN (Cui et al. 2020b), designed for traffic forecasting with missing values, show relatively good performance on PEMS-BAY, especially for single-step forecasting. However, they do not show a clear advantage over the first group of works. The one-step processing models are generally designed for single-step forecasting; their performance gap with the first group of works becomes larger under a multi-step forecasting setting.
We present in Fig. 6 the critical difference diagrams (Demšar 2006), which show the average rankings and visualize the statistical differences between the forecasting models; a thick horizontal line marks a group (i.e., clique) of models that are not significantly different in terms of the evaluation metrics. From Fig. 6, we observe that even though GCNM belongs to the one-step processing models, its performance remains close to the first group of works. GCNM is not significantly different from MTGNN, GTS, and Graph WaveNet on any of the three evaluation metrics. Moreover, the advanced graph models and training techniques from recent work (Wu et al. 2020) could be incorporated to improve the performance of GCNM further.
5.4 RQ 2: complex scenarios of missing values
In this section, we demonstrate the power of GCNM in handling complex scenarios of missing values for the purpose of traffic forecasting.
As mentioned previously in Fig. 1, there are several scenarios of missing values in real-life traffic datasets (e.g., METR-LA): short-range or long-range missing; partial or entire network missing. The results in Tables 3 and 4 did not show the superiority of GCNM over other models on the original datasets with a low missing rate. To test the model's capability of handling complex missing values, we designed three scenarios with various missing rates (10%, 20%, and 40%) and removed observations from the datasets accordingly. We use \(\hat{x}_{i} \in {\mathcal {R}}^{n \times f \times t}\) to represent each of the observation tensors to be removed from \({\mathcal {X}} \in {\mathcal {R}}^{N \times F \times \tau }\); a local mask tensor \(\hat{m}_{i} \in {\mathcal {R}}^{n \times f \times t} \) can be defined accordingly, annotating the locations of missing values in the original dataset. All the local mask tensors constitute the global mask sequence \({\mathcal {M}} = \{\hat{m}_{i}\}\), which allows injecting missing values with a given missing rate. Then, we designed the following scenarios:

Short-range missing: we randomly set \(n\in [1,\ldots ,N]\), \(f=F\), \(t=1\)

Long-range missing: we randomly set \(n\in [1,\ldots ,N]\), \(f=F\), \(t=\tau \)

Mix-range missing: we randomly set \(n\in [1,\ldots ,N]\), \(f=F\), \(t \in [1,\ldots ,\tau ]\)
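The three injection scenarios can be sketched as follows; `inject_missing` is an illustrative helper that repeatedly samples local mask tensors until the target missing rate is reached (the sampling details are assumptions, not the paper's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_missing(x, scenario, missing_rate):
    """Zero out observations of x (N nodes, F features, tau steps) until
    roughly `missing_rate` of the entries are masked. scenario is one of
    'short', 'long', or 'mix' (illustrative sketch)."""
    N, F, tau = x.shape
    mask = np.ones_like(x)                      # 1 = observed, 0 = missing
    while mask.mean() > 1.0 - missing_rate:
        n = rng.integers(1, N + 1)              # number of affected nodes
        nodes = rng.choice(N, size=n, replace=False)
        if scenario == "short":
            t = 1                               # a single timestamp
        elif scenario == "long":
            t = tau                             # the whole input window
        else:                                   # mix-range
            t = rng.integers(1, tau + 1)
        start = rng.integers(0, tau - t + 1)
        mask[nodes, :, start:start + t] = 0     # local mask tensor m_i
    return x * mask, mask

x = np.abs(rng.normal(size=(20, 1, 12)))        # toy positive readings
x_miss, mask = inject_missing(x, "mix", 0.4)
```

Because the masked entries are set to zero rather than removed, the result matches the convention that missing measures are marked as zero in the traffic data.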
In Tables 5 and 6, we show the performance comparison on the PEMS-BAY and METR-LA datasets under the various missing-value scenarios. We highlight the best results among the one-step processing models (underlined values) and among all the models (bold values). Globally, GCNM shows the best performance under all the settings when compared with the other one-step processing models. The graph-based model SGMN (Cui et al. 2020b) performs much worse than the other one-step processing models under the long-range and mix-range missing settings, indicating that it applies only to simple missing scenarios, i.e., short-range random missing. GCNM does not always show superiority compared with the first group of works, especially in the short-range missing scenario, where MTGNN and GTS usually show good performance. MTGNN typically performs better than GCNM when the missing rate is low (10%), except under the mix-range missing scenario of PEMS-BAY. We can draw a conclusion from this observation: a robust spatiotemporal forecasting model can offset the impact of the missing values to some extent, as it allows exploring the information thoroughly from the observed measures. GCNM becomes the best forecasting model when the missing rate gets higher, as the missing values then become a more critical factor impacting the forecasting model than spatiotemporal pattern modeling.
Compared to the short-range missing scenario, GCNM shows more robust performance under the long-range and mix-range missing scenarios, where the recent temporal values and the nearby nodes' values are not always observed. The multi-scale memory block in GCNM enriches the traffic embedding at each timestamp, thus making the model robust in these two complex scenarios. The memory block searches for the periodic global patterns from historical data and the valuable local features from nearby nodes or recent observations at each timestamp. When nearby node values are unobserved, GCNM favors more recent observations, and vice versa. As the zero values usually show periodicity while missing values show contingency (Caltrans 2015), the memory module with the periodic historical patterns can distinguish the inherent zero values from the missing values. The current node readings combined with the historical patterns eliminate the effect of missing values while conserving that of zero values.
In Fig. 7, we show the critical difference diagrams (Demšar 2006) on the PEMS-BAY and METR-LA datasets under the mix-range missing scenario. A thick horizontal line marks a group of models that are not significantly different in terms of the evaluation metrics. We observe that GCNM is significantly different from the other model groups on all the evaluation metrics, which validates the model's performance for processing missing values under complex scenarios.
In Fig. 8, we show the effects of the memory module's parameters L and S on the model's performance. The two parameters define the search range of the local temporal and spatial features, respectively. We report the model's evaluation errors under various missing rates. From the results in Fig. 8, we observe that the model's performance degrades as the search range grows. This can be explained by the fact that averaging over a larger spatial neighborhood and relying on less recent observations weakens the information dependency with the current timestamp, thus diluting the information enrichment of the traffic embedding. In real-life datasets, the parameters can be set to small values, e.g., considering local features from the last hour (L = 12) and the five nearest sensor nodes (S = 5).
5.5 RQ 3: dynamic graph modeling
In a dynamic traffic system, the spatial dependency can be considered as a dynamic system status that evolves over time (Han et al. 2021b). The traffic observations at each timestamp are usually adopted to characterize the dynamic traffic status and help learn the dynamic graphs (Li et al. 2021). However, due to missing observations, the traffic status at certain timestamps cannot be characterized, which hinders the dynamic graph learning process.
This issue can be handled by the enriched traffic embeddings proposed in GCNM. They consider the local spatiotemporal features and global historical patterns, which avoids the deviation introduced by the missing values and helps learn the dynamic graphs. To validate the performance of the learned dynamic graphs, we designed the following variants of our GCNM model:

GCNM-obs: instead of using the enriched traffic embeddings, the raw traffic observations (Li et al. 2021) are adopted to construct the dynamic graphs.

GCNM-adp: instead of learning dynamic graphs and applying dynamic convolution, an adaptive static graph (Wu et al. 2019) is learned for the graph convolution.

GCNM-pre: instead of learning graphs from the traffic embeddings or observations, predefined graphs (Li et al. 2018), calculated with the directed distances between traffic nodes, are adopted for the graph convolution.

GCNM-com: combines both the predefined and learned static graphs (Wu et al. 2019) for the graph convolution.
We show in Table 7 the performance comparison of the various model variants for spatial graph modeling. We report the model errors at multiple horizons, considering the complex scenario of mix-range missing values with a missing rate of 40% on both the PEMS-BAY and METR-LA datasets. The results in Table 7 suggest that the dynamic graphs learned from the enriched traffic embeddings perform the best compared to the other variants. In contrast, the model obtains the worst performance when learning the dynamic graphs from the raw observations, mainly because the missing values hinder the graph learning process in inferring the dynamic traffic status. GCNM-obs performs even worse than GCNM-adp, in which the static graph is learned from the entire set of observations, eliminating the effect of local missing values.
5.6 RQ 4: one-step vs. two-step processing
In this section, we show the performance comparison when adopting different missing-value processing strategies for traffic forecasting. As a one-step processing model, GCNM jointly models the spatiotemporal patterns and missing values for traffic forecasting. The two-step processing models handle the missing values in a preprocessing step, then apply a forecasting model to the completed data.
To compare the processing strategies fairly, we use GCNM as a base model to test the two-step processing approaches: a preprocessing step replaces the GCNM memory module, which is designed to handle missing values.
We consider the following imputation methods to fill in missing values and apply GCNM to the completed data.

MEAN (García-Laencina et al. 2010): This approach replaces missing values with the mean of the observed measures of the respective feature of the input sequence.

KNN (Batista et al. 2002): This method replaces missing values with the mean of the k-nearest temporal neighbors. We linearly interpolate the missing values with k = 2, considering the previous and next non-empty values.

MICE (Van Buuren and Groothuis-Oudshoorn 2011): MICE is a multiple imputation method that fills the missing values from conditional distributions using Markov chain Monte Carlo (MCMC) techniques.
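The MEAN and KNN (k = 2, i.e., linear interpolation) baselines can be sketched as follows; both helpers are illustrative (MICE would typically rely on a dedicated library, e.g., scikit-learn's `IterativeImputer`):

```python
import numpy as np

def impute_mean(x, mask):
    """MEAN: replace missing entries with the mean of the observed values
    of the same feature. x and mask have shape (features, time)."""
    out = x.copy()
    for f in range(x.shape[0]):
        obs = mask[f] == 1
        if obs.any():
            out[f, ~obs] = x[f, obs].mean()
    return out

def impute_linear(x, mask):
    """KNN with k = 2 as described above: linear interpolation between the
    previous and next observed values along the time axis."""
    out = x.copy()
    t = np.arange(x.shape[1])
    for f in range(x.shape[0]):
        obs = mask[f] == 1
        if obs.any():
            out[f] = np.interp(t, t[obs], x[f, obs])
    return out

x = np.array([[10.0, 0.0, 0.0, 40.0]])   # zeros mark missing values
m = np.array([[1, 0, 0, 1]])             # 1 = observed, 0 = missing
```

On this toy row, MEAN fills both gaps with the observed average, whereas linear interpolation produces a smooth ramp between the two observed endpoints.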
We show in Table 8 the performance comparison between the one-step processing (i.e., GCNM) and two-step processing (i.e., GCNM variants) models. We conducted the experiments under the complex scenario of missing values (i.e., mix-range missing) with missing rates varying from 10 to 40%. Table 8 shows that the preprocessing step with different imputation techniques leads to worse performance than the one-step setting. Even though the authors of Shleifer et al. (2019) showed that imputation techniques in the preprocessing step improve the model's performance under simple scenarios of missing values (e.g., random short-range missing), the results in Table 8 indicate that they are not suitable for complex scenarios of missing values. On the one hand, the complex missing values (e.g., over short & long ranges, on partial & entire variables) challenge the imputation task in considering local and global spatiotemporal patterns; on the other hand, the patterns of missing values modeled in the preprocessing step are totally isolated from the forecasting model, and thus may not be valuable for the forecasting task.
5.7 Discussions
Our approach has several advantages. First, starting from real-world data, GCNM considers the complex scenarios of missing values in traffic data. Unlike previous works (Che et al. 2018; Cui et al. 2020b; Tian et al. 2018), which consider only part of the real-life scenarios (either short-range or long-range missing, under partial or entire network missing settings), GCNM addresses the complex mixed missing-value context covering the various real-life scenarios.
Second, GCNM is capable of handling such complex missing-value scenarios with a multi-scale memory module, which combines local spatiotemporal features (short-range missing, partial and entire network missing) and global historical patterns (long-range missing) to generate the enriched traffic embeddings. The embeddings allow distinguishing the inherent zero values from the missing values. In this way, GCNM jointly models the spatiotemporal patterns and missing values in one-step processing, which generally yields better model performance than two-step processing (Cui et al. 2020b).
Third, GCNM allows generating reliable dynamic graphs from the enriched traffic embeddings, which opens a path for learning robust dynamic graphs under missing value settings. Moreover, the generated dynamic graphs can cooperate with various advanced graph convolution modules (Wu et al. 2020) to improve the model’s performance further.
Last but not least, even though GCNM is designed for traffic forecasting, it is applicable to wider application domains sharing similar spatiotemporal characteristics and missing-value scenarios, such as crowd flow forecasting (Xie et al. 2020) and weather and air pollution forecasting (Han et al. 2021a; El Hafyani et al. 2022; Abboud et al. 2021). The spatiotemporal patterns in those data and the missing values caused by sensor issues or control center errors form research problems similar to the one addressed in this paper.
However, GCNM does have a limitation in terms of computational efficiency. Table 9 shows the per-epoch training time comparison on the full datasets between GCNM and the baseline models. The one-step processing baseline models are much more efficient than the other models, basically because of their simple structure without the costly graph convolution modules. GCNM still trains faster than DCRNN, but slower than the other forecasting models. This is mainly caused by two factors: (1) generating the enriched traffic embeddings requires a large computation cost for calculating the attention scores in the memory module; (2) generating the dynamic graphs for graph convolution requires learning a large number of parameters, thus increasing the computation cost. Possible solutions might be to reduce the time complexity of the attention calculation with the ProbSparse attention proposed in Zhou et al. (2021), and to apply more efficient dynamic graph convolutions such as graph tensor decomposition (Han et al. 2021b) and node sampling when generating the graphs (Wu et al. 2020).
6 Conclusion
In this paper, we propose GCNM, a graph convolutional network-based model for handling complex missing values in traffic forecasting. We studied the complex scenario where missing traffic values occur over both short & long ranges and on partial & entire transportation networks. The enriched traffic embeddings learned by a spatiotemporal memory module allow handling the complex missing values and constructing dynamic traffic graphs to improve the model's performance. A joint model optimization is applied to consider missing values and traffic forecasting in one-step processing. We compare GCNM with the one-step processing models, which are specifically designed for processing incomplete traffic data, and with the recent advanced traffic forecasting models. The extensive experiments on two benchmark traffic datasets with 12 baselines demonstrate that GCNM shows a clear advantage under various scenarios of complex missing values compared to the advanced traffic forecasting models, while maintaining comparable performance on complete traffic datasets. These experiments also provide an up-to-date comparison of traffic forecasting models, both with and without missing values. In future work, we will explore the aforementioned optimizations to reduce computational costs. From a longer-term perspective, one can consider noisy data or external events that may impact the predictions.
Notes
The source code is publicly available at https://github.com/JingweiZuo/GCNM
References
Abboud M, El Hafyani H, Zuo J, et al (2021) Microenvironment recognition in the context of environmental crowdsensing. In: Workshops of the EDBT/ICDT joint conference, EDBT/ICDTWS
Bai L, Yao L, Li C, et al (2020) Adaptive graph convolutional recurrent network for traffic forecasting. Adv Neural Inf Process Syst (NeurIPS) 33
Batista GE, Monard MC et al (2002) A study of knearest neighbour as an imputation method. His 87(251–260):48
Benavoli A, Corani G, Mangili F (2016) Should we really use posthoc tests based on meanranks? J Mach Learn Res (JMLR) 17(1):152–161
Caltrans (2015) An introduction to the caltrans performance measurement system (pems). https://pems.dot.ca.gov/PeMS_Intro_User_Guide_v5.pdf
Che Z, Purushotham S, Cho K et al (2018) Recurrent neural networks for multivariate time series with missing values. Sci Rep 8(1):1–12
Chung J, Gulcehre C, Cho K, et al (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. In: NIPS 2014 workshop on deep learning
Cirstea RG, Yang B, Guo C (2019) Graph attention recurrent neural networks for correlated time series forecasting. MileTS19@KDD
Cui Z, Ke R, Pu Z et al (2020a) Stacked bidirectional and unidirectional LSTM recurrent neural network for forecasting networkwide traffic state with missing values. Transp Res Part C Emerg Technol 118(102):674
Cui Z, Lin L, Pu Z et al (2020b) Graph Markov network for traffic forecasting with missing data. Transp Res Part C Emerg Technol 117(102):671
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res (JMLR) 7(1):1–30
Dong H, Ding F, Tan H et al (2022) Laplacian integration of graph convolutional network with tensor completion for traffic prediction with missing data in intercity highway network. Physica A 586(126):474
El Hafyani H, Abboud M, Zuo J, et al (2022) Learning the micro-environment from rich trajectories in the context of mobile crowd sensing. Geoinformatica. https://doi.org/10.1007/s10707-022-00471-4
Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat 11(1):86–92
García-Laencina PJ, Sancho-Gómez JL, Figueiras-Vidal AR (2010) Pattern classification with missing data: a review. Neural Comput Appl 19(2):263–282
Guo S, Lin Y, Wan H, et al (2021) Learning dynamics and heterogeneity of spatialtemporal graph data for traffic forecasting. IEEE Trans Knowl Data Eng (TKDE)
Han J, Liu H, Zhu H, et al (2021a) Joint air quality and weather prediction based on multiadversarial spatiotemporal networks. In: Proceedings of the 35th AAAI conference on artificial intelligence (AAAI)
Han L, Du B, Sun L, et al (2021b) Dynamic and multifaceted spatiotemporal deep learning for traffic speed forecasting. In: Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, pp 547–555
He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 65–70
Ismail Fawaz H, Forestier G, Weber J et al (2019) Deep learning for time series classification: a review. Data Min Knowl Discov 33(4):917–963
Jiang W, Luo J (2022) Graph neural network for traffic forecasting: a survey. Expert Syst Appl 117921
Lea C, Flynn MD, Vidal R, et al (2017) Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 156–165
Li Y, Yu R, Shahabi C, et al (2018) Diffusion convolutional recurrent neural network: datadriven traffic forecasting. In: International conference on learning representations (ICLR)
Li F, Feng J, Yan H, et al (2021) Dynamic graph convolutional recurrent network for traffic prediction: benchmark and solution. ACM Trans Knowl Discov Data (TKDD)
Lopez AL (2018) Traffic state estimation and prediction in freeways and urban networks. Ph.D. thesis, Université Grenoble Alpes
Shang C, Chen J, Bi J (2020) Discrete graph structure learning for forecasting multiple time series. In: International conference on learning representations (ICLR)
Shleifer S, McCreery C, Chitters V (2019) Incrementally improving graph wavenet performance on traffic prediction. arXiv preprint arXiv:1912.07390
Tang X, Yao H, Sun Y, et al (2020) Joint modeling of local and global temporal dynamics for multivariate time series forecasting with missing values. In: Proceedings of the 34th AAAI conference on artificial intelligence (AAAI), pp 5956–5963
Tian Y, Zhang K, Li J et al (2018) LSTMbased traffic flow prediction with missing data. Neurocomputing 318:297–305
Van Buuren S, Groothuis-Oudshoorn K (2011) mice: multivariate imputation by chained equations in R. J Stat Softw 45:1–67
Wang X, Ma Y, Wang Y et al (2020) Traffic flow prediction via spatial temporal graph neural network. Proc Web Conf 2020:1082–1092
Wang S, Gao M, Wang Z et al (2021) Finegrained spatialtemporal representation learning with missing data completion for traffic flow prediction. In: International conference on collaborative computing: networking. Springer, Applications and Worksharing, pp 138–155
Wells BJ, Chagin KM, Nowacki AS, et al (2013) Strategies for handling missing data in electronic health record derived data. EGEMS 1(3)
Weston J, Chopra S, Bordes A (2015) Memory networks. In: International conference on learning representations (ICLR)
Wilcoxon F (1992) Individual comparisons by ranking methods. In: Breakthroughs in statistics. Springer, pp 196–202
Wu Z, Pan S, Long G, et al (2019) Graph wavenet for deep spatialtemporal graph modeling. In: Proceedings of the 28th international joint conference on artificial intelligence (IJCAI), pp 1907–1913
Wu Z, Pan S, Long G, et al (2020) Connecting the dots: multivariate time series forecasting with graph neural networks. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp 753–763
Xie P, Li T, Liu J et al (2020) Urban flow prediction from spatiotemporal data using machine learning: a survey. Inf Fus 59:1–12
Yoon J, Jarrett D, Van der Schaar M (2019) Timeseries generative adversarial networks. Adv Neural Inf Process Syst (NeurIPS) 32
Yu B, Yin H, Zhu Z (2018) Spatiotemporal graph convolutional networks: a deep learning framework for traffic forecasting. In: Proceedings of the 27th international joint conference on artificial intelligence (IJCAI)
Zhong W, Suo Q, Jia X, et al (2021) Heterogeneous spatiotemporal graph convolution network for traffic forecasting with missing values. In: 2021 IEEE 41st international conference on distributed computing systems (ICDCS), IEEE, pp 707–717
Zhou H, Zhang S, Peng J, et al (2021) Informer: Beyond efficient transformer for long sequence timeseries forecasting. In: Proceedings of the 35th AAAI conference on artificial intelligence (AAAI), pp 11,106–11,115
Zuo J, Zeitouni K, Taher Y (2021) Smate: Semisupervised spatiotemporal representation learning on multivariate time series. In: 2021 IEEE international conference on data mining (ICDM), IEEE, pp 1565–1570
Acknowledgements
This research was supported by the DATAIA convergence institute as part of the Programme d'Investissement d'Avenir (ANR-17-CONV-0003), operated by DAVID Lab, University of Versailles Saint-Quentin, and by the MASTER project, which has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 777695. The authors would like to thank as well the publication support from the Technology Innovation Institute.
Additional information
Responsible editor: Albrecht Zimmermann and Peggy Cellier.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This research has been developed for the most part in the context of the main author's Ph.D. at DAVID Lab, UVSQ, Université Paris-Saclay.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zuo, J., Zeitouni, K., Taher, Y. et al. Graph convolutional networks for traffic forecasting with missing values. Data Min Knowl Disc 37, 913–947 (2023). https://doi.org/10.1007/s10618-022-00903-7
Keywords
 Traffic forecasting
 Missing values
 Graph convolutional networks
 Memory networks
 Neural networks
 Deep learning