Introduction

Accurate short-term traffic forecasting is considered key to efficient network and traffic management and to traveler information systems since, at least conceptually, it enables an early response to anticipated traffic conditions and the initiation of proper mitigation strategies that prevent congestion, reduce trip travel times and improve the citizens’ experience (Jiang and Luo 2021; Kumar and Raubal 2021). The growing need for traffic forecasts embedded in real-time Connected and Cooperative Intelligent Transportation Systems (C-ITS) has recently increased interest in traffic forecasting, a research area that has been blooming in recent years. At the same time, novel Information and Communication Technologies (ICT) and sensors (e.g. 5G networks, smartphones equipped with numerous sensors, connected vehicles and the installation of plentiful on-road and in-vehicle sensors) have led to the growing availability of traffic-related datasets that allow the problem to be examined from a new perspective (Vlahogianni and Barmpounakis 2017; Deo and Trivedi 2018; Mantouka et al. 2021; Yin et al. 2021a; Fafoutellis et al. 2022). Data availability, in combination with the rise of Deep Learning and Big Data analysis algorithms and the tremendous increase in the computational power of modern computers, offers excellent potential to ITS developers and researchers.

Deep Learning methods are commonly acknowledged as having the best performance in terms of forecasting errors compared to previous approaches, which is also mentioned as their main advantage (Wang et al. 2019). The main reason is their potential, at least theoretically, to approximate almost any function, regardless of its degree of non-linearity, and to model the underlying complex temporal and spatial relations, such as those in network-wide traffic data (Ye et al. 2022). Additionally, Deep Learning models can extract features from large-scale raw data automatically and, thus, the difficult tasks of feature engineering and selection, which require much effort and, especially, domain knowledge, are not always necessary (Liu et al. 2018; Ye et al. 2022).

Although there has been a massive growth in the research interest and the number of deep learning applications in short-term traffic forecasting, the volume of the produced research work is disproportionate to the exploitation of this type of model in real-world conditions. This fact has raised reasonable concerns about the usability and actionability of deep learning in traffic forecasting, as well as the direction in which researchers are moving and the approach they follow (Laña et al. 2021; Manibardo et al. 2021). Domain knowledge, especially in terms of road network representations, outcomes that are aligned with traffic flow theory and modeling of the spatial and temporal relations, is rarely systematically assessed with regard to its importance to the accuracy of traffic forecasts and the model’s actionability. These aspects can also facilitate and speed up the learning process, improve transferability and explainability and, most importantly, reduce the model’s complexity and the computational resources required, and enable the scale-up of those algorithms to the city level (Laña et al. 2021).

In recent literature, various innovative approaches for traffic forecasting have been proposed, most of them relying on deep learning. Indicatively, the literature concerning the application of deep learning techniques in traffic forecasting is reviewed in (Wang et al. 2019; Boukerche and Wang 2020; Tedjopurnomo et al. 2020; Lee et al. 2021; Yin et al. 2021a), while some limitations of these methods and relevant future challenges are also discussed in (Lana et al. 2018; Manibardo et al. 2021). However, although some notable attempts were made in a similar direction, e.g. (Pavlyuk 2019; Ye et al. 2022), a systematic and critical analysis of how the spatial and temporal relations are modeled, and of the means by which this information can be encoded and transmitted to the models, is missing. In this paper, innovative road network representations and spatiotemporal relation modeling strategies are presented. Emphasis is given to the comparison of alternative road network representations and their relation to the modeling techniques. In addition, their contribution to alleviating the aforementioned limitations of Deep Learning and to revealing the spatial and temporal relations of the road network is reviewed.

The remainder of this paper is structured as follows: In the "Network Representations in Short-Term Traffic Forecasting" section, the unique characteristics of the three main classes of road network representations are presented and compared with each other. In the "Open Issues" section, prevailing open issues are discussed, along with some proposed future directions. Finally, the "Conclusions" section includes the most important conclusions of this work.

Network Representations in Short-Term Traffic Forecasting

A Historical Perspective

In contrast to most time series data, traffic data evolve with space and time as highlighted in research for the past 30 years. The first spatiotemporal representations either in autoregressive models or neural networks were linear in nature and introduced the influence of upstream data, which varies during different periods of the day, to increase the accuracy of the model (Tebaldi et al. 2002; Stathopoulos and Karlaftis 2003; Vlahogianni et al. 2005). Since then, a series of papers showed evidence that the influence of traffic patterns in different locations of the road network on the target locations is complex, highly non-linear and varying over time and, thus, difficult to model (Vlahogianni et al. 2004, 2014; Ermagun and Levinson 2018; Yin et al. 2021a; Lee et al. 2021). The existence of dependencies between the traffic conditions (traffic flow or volume, speed, etc.) at different locations is also supported by traffic flow theory for both signalized and unsignalized road networks (Vlahogianni et al. 2008, 2014; Pavlyuk 2019).

Interestingly, a clear-cut literature finding is that the spatiotemporal dependencies are not limited by the connectivity and the proximity of the locations in space and time: the traffic states of road sections that are close to the target section are not necessarily the most correlated with its traffic state, and the same applies to the traffic states of distant time steps, which are sometimes more correlated with the predicted time step than more recent ones (Ma et al. 2015; Do et al. 2019a, b; Yin et al. 2021a; Jiang and Luo 2021). The selection of the most relevant historical observations for the forecasting task remains a challenging, yet vital, task for modeling, understanding and decision-making.

Until recently, the effect of the spatiotemporal analysis on traffic forecasting was mainly limited to the feature selection process. Apart from the simplest approach of using the direct upstream and downstream links or higher-order neighboring links, researchers have also considered more sophisticated strategies. The first one is based on the distance between the locations: it was assumed that closer locations would have similar traffic patterns, especially in the short term. The distance could also be defined as the travel time between the locations, instead of its actual (Euclidean) value (Pavlyuk 2019). Alternatively, a correlation coefficient was estimated between the time series of traffic conditions of different locations, in order to detect the most correlated ones. A variety of metrics and methods can be used for this purpose, with the most popular being Pearson’s correlation, cross-correlation, mutual information and custom metrics proposed by the corresponding authors (Ermagun and Levinson 2018; Fafoutellis et al. 2020). The latter methods allowed researchers to conclude that not only nearby, but also distant locations can have high (sometimes higher) correlation coefficients. From a traffic theory perspective, this, intuitively, depends on the size and geometry of the road network, as well as the time resolution of the examined data.

These techniques were very popular before the emergence of Machine Learning and, especially, Deep Learning, and were well suited to models that cannot cope with a very large and complex input space. Although more sophisticated inputs and methods have since been proposed, they are still utilized. Departing from simple feature selection, a meaningful and accurate representation of the road network, which refers to detecting and modeling the spatial and temporal relations between different locations, as well as to how they will be incorporated into the model’s input space and learning process, is an efficient way to pass vital information to the forecasting model and improve its interpretability, while also decreasing the required complexity. Although most early studies assume that it is sufficient to deal with the traffic variables at different locations of the road network as independent features, it has now become widely understood that there do exist significant spatial and temporal relations, which, in addition, are dynamic, complex and nonlinear (Lee et al. 2021). A location can be a loop detector or a camera monitoring a lane or a road section, an intersection, or an entire city region. While various approaches have been proposed for modeling the temporal dimension of the problem (i.e. statistical and data-driven time series analysis methods), capturing the spatial relations between different locations is a more complex and under-researched task.

Deep Learning Road Network Representations

In theory, Deep Neural Networks are able to learn any relation between the input data, regardless of its complexity or the size of the input space. In practice, however, the performance of the model heavily depends on the representation of the input data and the amount of supplementary information it provides about the spatial relations (Manibardo et al. 2021). This valuable information enhances the performance of the model, as well as its interpretability. For example, a grid or image representation implies the proximity of the locations, as well as the geometry of the road network. In traffic forecasting, an accurate and meaningful representation can also reduce uncertainty and is considered as important as the modeling technique that is used (Barredo Arrieta et al. 2020; Lee et al. 2021).

However, in numerous recent studies, a data-intensive approach is followed by feeding the model with all the available raw information, without any prior analysis or feature selection, which adds complexity to the model and increases the dimensionality of the input space, undermining the model’s performance. The complex deep learning structure required to successfully handle such input spaces will eventually lack the properties of actionability and interpretability and would be very demanding in terms of training time and computational resources. In these cases, the complex deep learning structure may achieve good prediction accuracy, but only by just implying the existence of correlation, and disregarding any causal features that could be useful in traffic management.

In recent years, a large variety of road network and input data representations have been proposed in the research area of traffic forecasting, depending on the task of the prediction (single point or network level) and the kind of input that is compatible with the corresponding prediction model. In general, they can be classified into three categories: stacked vectors, grid- (or image-) based and graph.

Stacked Vector

The first class of representations is the stacked vector, where the road network data are organized into a single vector. More precisely, the time series of the measurements of each location (e.g. loop detector, road section, intersection, region, etc.), which can already be considered as vectors, are simply stacked in a vector of vectors, which can also be thought of as a two-dimensional matrix of dimensions (number of locations) × (number of timesteps). This representation still remains the most popular one and was already proposed by the initial network-wide traffic forecasting works (Lee et al. 2021).

There is no predefined way of stacking the vectors of the input data into a single vector, but it depends on the researcher’s or practitioner’s intuition, domain knowledge or personal preference. The order in which each location’s vector appears plays an important role, especially in cases where a model that takes account of locality and/or proximity is exploited, such as a Convolutional Neural Network (Modi et al. 2022). Thus, although this kind of representation is simple, the user should not disregard organizing the input data in a suitable way.

When the road network and the locations for which data are available have a simple geometry, the order in which the input data are organized is rather straightforward. For example, when a circular road network or a corridor is represented (Fig. 1), the vectors of each location can be stacked in a clockwise or connection/proximity order, respectively. According to Fig. 1, the input data would be represented as follows:

$$X=\left\{V\left({L}_{1}\right),V\left({L}_{2}\right), \dots , V\left({L}_{6}\right)\right\},\tag{1}$$

where \({L}_{i}\) is the location and \(V\left({L}_{i}\right)\) is the value of the traffic variable considered, e.g. mean speed or flow.
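As a minimal sketch of the stacking described above, the snippet below builds the (number of locations) × (number of timesteps) matrix with NumPy. The location names, the readings and the corridor ordering are made-up assumptions used only for illustration.

```python
import numpy as np

# Hypothetical readings: one time series (e.g. mean speed in km/h) per location.
# The values are invented for illustration only.
series = {
    "L1": [52.0, 50.5, 48.0, 47.5],
    "L2": [49.0, 47.0, 45.5, 44.0],
    "L3": [55.0, 54.0, 53.5, 52.0],
}

# The stacking order is a modelling choice; here it follows an assumed corridor order.
order = ["L1", "L2", "L3"]

# Stack the per-location vectors into a (locations x timesteps) matrix.
X = np.stack([np.asarray(series[name], dtype=float) for name in order])

# Flattened form, for models that expect a single input vector.
x_flat = X.reshape(-1)

print(X.shape)       # (3, 4)
print(x_flat.shape)  # (12,)
```

Changing `order` changes which rows are neighbours in `X`, which is exactly the choice that matters for locality-aware models.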

Fig. 1 Examples of road networks that can be represented as a stacked vector

The main disadvantage of this method is that there is no effective way to represent even slightly more complex road network geometries. For example, if an additional location that does not follow the same pattern is added to the above examples, as shown in Fig. 2, the stacked vector representation is not easily applied. Moreover, when a more complex network with a large number of locations should be modeled, it is unclear which locations are close or adjacent and in which order they should appear. In such cases, the location vectors can also be stacked randomly and, thus, the spatial relations between the locations are not at all provided to the prediction model. To mitigate the effect of this issue, it is recommended to follow a feature selection strategy to reduce the dimensionality of the input space and utilize only the locations that are most correlated with the target one. A very naïve and straightforward approach for feature selection is by calculating a correlation metric (e.g. Pearson’s correlation) between the target location and all other locations and including in the input data only a subset of the most correlated locations with the target location (Ermagun and Levinson 2018). Of course, more complex approaches, taking into account proximity or other properties can also be followed (Cai et al. 2015; Ryu et al. 2018).
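The correlation-based feature selection mentioned above could look roughly like the following sketch. The function name `select_top_k_correlated`, the synthetic signals and the choice of ranking by absolute Pearson coefficient are assumptions made for illustration, not a procedure prescribed by the cited works.

```python
import numpy as np

def select_top_k_correlated(X, target_idx, k):
    """Return indices of the k locations most correlated with the target.

    X is a (locations x timesteps) array. Strength is measured by the absolute
    Pearson coefficient, and the target itself is excluded from the ranking.
    """
    corr = np.corrcoef(X)[target_idx]      # Pearson r between the target and every row
    strength = np.abs(corr)                # new array, so corr is left untouched
    strength[target_idx] = -np.inf         # never select the target itself
    ranked = np.argsort(strength)[::-1]    # strongest first
    return ranked[:k], corr

# Synthetic example: four locations observed over 200 time steps.
rng = np.random.default_rng(0)
t = np.arange(200)
base = np.sin(2 * np.pi * t / 50)
X = np.stack([
    base,                                      # 0: target location
    base + 0.05 * rng.standard_normal(200),    # 1: near copy of the target
    rng.standard_normal(200),                  # 2: unrelated noise
    -base + 0.05 * rng.standard_normal(200),   # 3: strongly anti-correlated
])

selected, corr = select_top_k_correlated(X, target_idx=0, k=2)
print(sorted(selected.tolist()))
```

Ranking by the absolute value keeps strongly anti-correlated locations, which can be as informative as positively correlated ones; dropping `np.abs` would be an equally defensible choice.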

Fig. 2 Examples of low efficiency of the stacked vector representation

Stacked vector representation has been very popular, due to its simplicity and flexibility (Ye et al. 2022). However, it can only pass a limited amount of information about the spatial dependencies and the road network’s geometry to the prediction model. Finally, it can be used for any road network and is compatible with all prediction models, although it is not very suitable for road networks with complex geometry and relationships between the locations.

Grid or Image

The second representation method is the grid or image-like representation. The data are organized into a two-dimensional grid, which is a very intuitive choice for two-dimensional, Euclidean, spatial data from locations with latitude and longitude. More specifically, a square grid covering the extent of the road network is defined and a value that represents the traffic conditions inside it is assigned to each square of the grid, which makes the result directly analogous to a greyscale image. Thus, the input data can be passed directly to the prediction model, without any modification. As Convolutional Neural Networks (CNN) are the most appropriate technique for handling image data, any hybrid Deep Learning model is compatible with the grid representation, as long as it includes a CNN layer. Other techniques, e.g. statistical and Machine Learning models, cannot be used in this case (Lee et al. 2021).

An example of grid representation is given in Fig. 3. The value of each pixel (square) of the grid is equal to the value of the respective traffic variable measured at the road section that lies inside it, and is sometimes represented by a color scale. Moreover, one may also witness in Fig. 3 the first drawback of this kind of representation, which is its inefficiency. First, the majority of the pixels do not usually match any road section, so their values are zero (black color). Consequently, the model is fed with a larger input space, which increases the complexity of the computations needed and the time they require, while it only contains a relatively low amount of useful information. The latter issue is even more noteworthy when the input data consist of measurements from single points (e.g. loop detectors) that occupy only one pixel and not road sections as in the example of Fig. 3.

Fig. 3 Grid representation of indicative road network

Second, as is also obvious in Fig. 3, each road section of the network is contained in more than one square, but the corresponding traffic variable value is the same along its entire length. As a result, the same value is passed multiple times to the model (all pixels have the same value), which increases the input’s size, without increasing the amount of useful information correspondingly.

Another issue that should be considered is the size of the pixels or, equivalently, the resolution of the grid. A higher resolution (more pixels with smaller dimensions) may provide more detailed information about the traffic conditions, but it also intensifies the two drawbacks described earlier: the smaller the pixels, the more likely it is that a road section occupies several pixels and the more pixels will remain empty. On the other hand, when a lower resolution is selected, a significant number of pixels may contain two or more different locations, e.g. road sections, as shown in Fig. 4. In this case, as each pixel can have only one representative value, the average over all road sections in the pixel may be calculated. This is usually undesirable, as the two or more sections may not be strongly correlated (e.g. when heading in opposite directions) and, through this aggregation, a significant amount of information may be lost.
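The resolution trade-off can be illustrated with a small sketch that averages point measurements into a square grid at two resolutions. The coordinates, speeds, extent and the helper name `rasterize` are hypothetical; empty cells are set to 0, following the "black pixel" convention described earlier.

```python
import numpy as np

def rasterize(xs, ys, values, extent, n_cells):
    """Average point measurements into an n_cells x n_cells grid.

    extent = (x_min, x_max, y_min, y_max). Cells without any measurement are 0.
    Returns the grid of averages and the number of measurements per cell.
    """
    x_min, x_max, y_min, y_max = extent
    # Map coordinates to integer cell indices; clip points lying on the upper edge.
    col = np.clip(((xs - x_min) / (x_max - x_min) * n_cells).astype(int), 0, n_cells - 1)
    row = np.clip(((ys - y_min) / (y_max - y_min) * n_cells).astype(int), 0, n_cells - 1)
    total = np.zeros((n_cells, n_cells))
    count = np.zeros((n_cells, n_cells))
    np.add.at(total, (row, col), values)   # accumulate values per cell
    np.add.at(count, (row, col), 1)        # count measurements per cell
    grid = np.divide(total, count, out=np.zeros_like(total), where=count > 0)
    return grid, count

# Four hypothetical measurement points in a 10 x 10 area (arbitrary units).
xs = np.array([0.5, 1.5, 2.6, 8.5])
ys = np.array([0.5, 1.5, 2.7, 8.5])
speeds = np.array([40.0, 30.0, 50.0, 60.0])

for n_cells in (10, 2):
    grid, count = rasterize(xs, ys, speeds, extent=(0, 10, 0, 10), n_cells=n_cells)
    empty_share = float(np.mean(count == 0))   # fraction of pixels carrying no data
    merged_cells = int(np.sum(count > 1))      # pixels that average several locations
    print(n_cells, round(empty_share, 2), merged_cells)
```

With the fine grid almost all pixels are empty, while with the coarse grid three distinct locations collapse into one averaged pixel, mirroring the two opposing drawbacks discussed above.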

Fig. 4 Road network represented as a low-resolution grid

Despite the aforementioned issues, the image or grid representation has been very popular so far because it very accurately depicts the road network’s geometry and the relations and proximity of the locations. However, its dependence on CNNs brings with it all their drawbacks, the most important being that CNNs only take into account relationships between pixels that are close to each other in the Euclidean domain (local dependencies), which may not be sufficient for road networks, where dependencies between distant locations are often stronger (Ermagun and Levinson 2018). Finally, the grid representation, although it can be used to extract spatial information, does not express all the properties of a road network, which is physically organized as a graph (Ye et al. 2022).

Graph

The third class of representations that is examined in this paper is the graph. Compared to images, graphs can be used to express more complex relations between the input data from different locations of the road network, which cannot be explained only by (Euclidean) proximity information, stemming from the connectivity of the sections of the road network, the impact of intersections and traffic lights and traffic/congestion patterns of distant locations.

In general, a graph is a mathematical structure that is used to model pairwise relationships between different objects and can be represented as \(G=\left(V,E\right)\), where \(V=\{{v}_{1},{v}_{2},\dots , {v}_{n}\}\) is the set of vertices or nodes and \(E\) is the set of edges, consisting of pairs of nodes that are connected to each other, \(({v}_{i},{v}_{j})\), with \(1\le i,j\le n\). An edge \(({v}_{i},{v}_{j})\) may be directed, i.e. connecting the nodes asymmetrically with direction from \({v}_{i}\) to \({v}_{j}\), or undirected, i.e. connecting the two nodes symmetrically in both directions. A graph consisting exclusively of undirected edges is also called undirected; otherwise, it is called directed.

The most efficient way of representing a graph is with an adjacency matrix \(A\in {\mathbb{R}}^{|V|\times |V|}\). The simplest definition of the adjacency matrix is the following: \(A=({a}_{ij})\), where \({a}_{ij}=1\) if \(({v}_{i},{v}_{j})\) is an edge of \(G\) and \({a}_{ij}=0\) otherwise. Moreover, a weight can be assigned to each edge, usually representing the strength of the connection or a cost, depending on the specific case. In this case, the values of the elements of the adjacency matrix are equal to the value of the corresponding weight.
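The binary and weighted definitions above can be sketched in a few lines of NumPy. The helper `adjacency_matrix`, the 0-based node ids and the example weights (link lengths in km) are assumptions introduced only for this illustration.

```python
import numpy as np

def adjacency_matrix(n_nodes, edges, directed=False):
    """Build A = (a_ij) from (i, j) or (i, j, weight) tuples with 0-based node ids."""
    A = np.zeros((n_nodes, n_nodes))
    for edge in edges:
        i, j = edge[0], edge[1]
        w = edge[2] if len(edge) == 3 else 1.0   # unweighted edges default to 1
        A[i, j] = w
        if not directed:
            A[j, i] = w                          # undirected edges are symmetric
    return A

# Hypothetical 4-node corridor v1 - v2 - v3 - v4 (ids 0..3), undirected and unweighted.
A_binary = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])

# Same topology as a directed, weighted graph (weights are made-up link lengths in km).
A_weighted = adjacency_matrix(4, [(0, 1, 0.8), (1, 2, 1.2), (2, 3, 0.5)], directed=True)

print(A_binary)
print(A_weighted)
```

For an undirected graph the matrix is symmetric, whereas the directed variant only fills the entry that follows the edge direction.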

Graph convolutional neural networks (GCNN), as well as their variation Spatio-Temporal GCNN, are the only modeling techniques that can handle input data organized in a graph. In addition to the adjacency matrix, the input also includes a feature vector for each node, e.g. the time series of the traffic variable measurements.
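To make the role of the two inputs concrete, the sketch below applies a single graph-convolution step using the widely used symmetric normalisation \(\mathrm{ReLU}({\widetilde{D}}^{-1/2}(A+I){\widetilde{D}}^{-1/2}HW)\). This particular propagation rule is an assumption chosen for illustration and is not necessarily the variant used in the reviewed papers; the weight matrix is random rather than learned.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d = A_hat.sum(axis=1)                        # degrees (at least 1 due to self-loops)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt     # symmetric normalisation
    return np.maximum(A_norm @ H @ W, 0.0)       # ReLU activation

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)           # hypothetical 3-node corridor
H = rng.random((3, 4))                           # node features: 4 past observations per node
W = rng.standard_normal((4, 2))                  # random stand-in for learned weights

out = gcn_layer(A, H, W)
print(out.shape)  # (3, 2)
```

Each output row mixes a node's own features with those of its neighbours as defined by `A`, which is why the definition of the adjacency matrix matters so much for the model.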

In the traffic forecasting literature, several approaches to representing the road network as a graph have been proposed. Firstly, depending on the type of locations that are exploited and on data availability, the nodes of the graph can be defined as the intersections of the road network (which are the “physical” nodes as well), the loop detectors that are installed in the network, or its road sections (Jiang and Luo 2021). Furthermore, the connections between them may be their physical ones or connections that express some kind of similarity or statistical relationship between the nodes, which can also be weighted (Ye et al. 2022).

Figure 5 displays various ways to depict the same road network. Among them, the representation where intersections are utilized as nodes (Fig. 5b), closely resembles the actual network visually. In this representation, the adjacency matrix includes information about the traffic variable measurements at the road sections (weights of the adjacency matrix), which are the edges of the graph. However, this representation is not ideal as the most important input, which is also the output of the model, i.e. traffic conditions, should typically correspond to the nodes and not the edges of the graph, according to the architecture of the Graph Convolutional Networks.

Fig. 5 Examples of graph representations considering different node types

The two other representations are quite similar to each other. In the first one, each node of the graph corresponds to a road section of the network (Fig. 5c), while in the second (Fig. 5d) to the exact point the measurements refer to (where a loop detector or other sensor is installed). Often the two approaches may result in the same graph, except for the case that two or more detectors are installed at the same road section, which is very possible for long road sections. In this case, for the road section graph, an average value of all the corresponding detectors' measurements should be calculated as representative for the section, which may not be desirable, as it decreases the level of detail that the input data have.

The most important aspect for defining the spatial relations between the locations of the road network, which are key to enhancing the prediction model’s performance, is the adjacency matrix, which contains their pairwise relationships (connectivity or traffic conditions pattern similarity) (Ye et al. 2022). In recent literature, a variety of approaches have been proposed to the definition of the adjacency matrix, which are presented below:

Physical Connectivity Matrix

This type of adjacency matrix reflects the actual connectivity of the road network, e.g. consecutive road sections. This approach is quite intuitive and the values of the elements of the matrix are 1 if the corresponding nodes are connected and 0 otherwise. In the examples of Fig. 5, the connections are determined based on the connectivity of the nodes. Although this representation is simple and suitable for smaller networks, in complex networks it is not always clear which nodes are directly connected, especially when only a relatively small part of the network is covered by sensors. An example of the latter is given in Fig. 6; the road sections with loop detectors are far apart and cannot be considered adjacent. In this case, one of the following approaches should be considered.

Fig. 6 Network that cannot be effectively represented with a physical connectivity matrix

Distance-Based Matrix

In this approach, two nodes are considered connected (adjacency matrix value equal to 1) if the distance between them is below a threshold chosen by the model developer. Alternatively, the order of neighboring can be used and compared to a threshold \(m\): if a node \({v}_{i}\) is reachable from node \({v}_{j}\) within \(m\) steps, based on the natural connectivity of the nodes, or, equivalently, \({v}_{i}\) is a neighbor of \({v}_{j}\) of order at most \(m\), the two nodes are considered adjacent. In order to provide the model with more detailed information, weights that are equal to the distance or the neighbor order, respectively, can be assigned to the corresponding edges.
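Both variants can be sketched as follows, assuming a precomputed pairwise distance matrix `D` and a physical connectivity matrix `A_phys`. The function names, the distances (in km) and the thresholds are hypothetical values chosen for illustration.

```python
import numpy as np

def distance_adjacency(D, threshold, weighted=False):
    """Connect nodes whose pairwise distance is below `threshold` (no self-loops)."""
    mask = (D < threshold) & ~np.eye(D.shape[0], dtype=bool)
    # Optionally keep the distance itself as the edge weight.
    return np.where(mask, D, 0.0) if weighted else mask.astype(float)

def hop_adjacency(A, m):
    """Connect nodes that are reachable within m steps on the physical graph A."""
    n = A.shape[0]
    step = ((A > 0) | np.eye(n, dtype=bool)).astype(int)   # stay in place or move one hop
    reach = np.eye(n, dtype=int)
    for _ in range(m):
        reach = ((reach @ step) > 0).astype(int)           # extend reachability by one hop
    return (reach.astype(bool) & ~np.eye(n, dtype=bool)).astype(float)

# Hypothetical pairwise distances (km) between four sensors along a corridor.
D = np.array([[0.0, 0.4, 1.1, 2.0],
              [0.4, 0.0, 0.7, 1.6],
              [1.1, 0.7, 0.0, 0.9],
              [2.0, 1.6, 0.9, 0.0]])
A_phys = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]], dtype=float)

A_dist = distance_adjacency(D, threshold=1.0)
A_hop2 = hop_adjacency(A_phys, m=2)
print(A_dist)
print(A_hop2)
```

In this toy corridor a 1 km threshold reproduces the physical connectivity, while the two-hop rule additionally links second-order neighbours.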

Similarity/Correlation-Based Matrix

The two representations above suffer from two main disadvantages: first, they are static, i.e. they remain the same over time and during different periods, and, second, they take into account only local dependencies, i.e. they assume that the traffic conditions at a location are related to and affected exclusively by nearby locations, which is not accurate (Jiang and Luo 2021; Zheng et al. 2021). In contrast, when using a correlation-based matrix, a statistical metric of the similarity between the time series of the traffic conditions of each pair of nodes (e.g. Pearson or Spearman correlation) or a similar metric from Information theory (e.g. mutual information) is estimated. If the correlation is significant (higher than a threshold), which implies that the two nodes behave similarly in terms of the emergence of certain traffic patterns at the same time of day, the two nodes are considered adjacent (Ermagun and Levinson 2019; Ryu et al. 2018). Of course, the value of the correlation metric can be utilized as the edge’s weight. In general, this approach can theoretically represent more complex spatiotemporal relations, as it captures the dependencies between pairs of distant and nearby nodes in the same way. In addition, it is dynamic: the connectivity of the nodes can change over time, depending on the similarity of traffic conditions, so the corresponding prediction model is fed with an adjacency matrix that is not fixed.
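A possible sketch of a thresholded, time-varying correlation matrix is given below. It uses the absolute Pearson coefficient computed over non-overlapping time windows; the function names, the window length, the threshold and the synthetic signals are assumptions for illustration only.

```python
import numpy as np

def correlation_adjacency(X, threshold, weighted=False):
    """Adjacency from |Pearson r| between the rows of X (nodes x timesteps)."""
    R = np.corrcoef(X)
    np.fill_diagonal(R, 0.0)                    # ignore self-correlation
    mask = np.abs(R) > threshold
    # Optionally keep the correlation value as the edge weight.
    return np.where(mask, R, 0.0) if weighted else mask.astype(float)

def windowed_adjacency(X, window, threshold):
    """One adjacency matrix per non-overlapping time window (e.g. per period of day)."""
    n_windows = X.shape[1] // window
    return [correlation_adjacency(X[:, w * window:(w + 1) * window], threshold)
            for w in range(n_windows)]

# Synthetic example: node 0 resembles node 1 in the first window and node 2 in the second.
rng = np.random.default_rng(1)
t = np.arange(100)
s = np.sin(2 * np.pi * t / 25)
node0 = np.concatenate([s, s])
node1 = np.concatenate([s + 0.05 * rng.standard_normal(100), rng.standard_normal(100)])
node2 = np.concatenate([rng.standard_normal(100), s + 0.05 * rng.standard_normal(100)])
X = np.stack([node0, node1, node2])

mats = windowed_adjacency(X, window=100, threshold=0.8)
for A_t in mats:
    print(A_t)
```

The list `mats` holds a different adjacency matrix for each window, which is the sense in which the representation is dynamic.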

Combined Methods

In this case, a function that includes the distance of two nodes, the existence of a physical connection and a correlation metric is used to estimate the weights of the adjacency matrix. Examples of this approach from recent literature are presented in the next section.
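One way such a combining function might look is sketched below. The Gaussian distance kernel, the linear blend and its equal coefficients are purely illustrative assumptions and are not taken from any of the reviewed papers; the input matrices are made up.

```python
import numpy as np

def combined_adjacency(D, A_phys, R, sigma=1.0, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Blend distance, physical connectivity and correlation into one weight matrix.

    The Gaussian kernel and the linear blend are illustrative assumptions,
    not a formula from the literature discussed in the text.
    """
    a, b, c = weights
    proximity = np.exp(-(D / sigma) ** 2)        # 1 at zero distance, decays with distance
    W = a * proximity + b * A_phys + c * np.abs(R)
    np.fill_diagonal(W, 0.0)                     # no self-loops
    return W

# Hypothetical inputs for four nodes: distances (km), physical links, correlations.
D = np.array([[0.0, 0.4, 1.1, 2.0],
              [0.4, 0.0, 0.7, 1.6],
              [1.1, 0.7, 0.0, 0.9],
              [2.0, 1.6, 0.9, 0.0]])
A_phys = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]], dtype=float)
R = np.array([[1.0, 0.9, 0.2, 0.7],
              [0.9, 1.0, 0.3, 0.1],
              [0.2, 0.3, 1.0, 0.8],
              [0.7, 0.1, 0.8, 1.0]])

W = combined_adjacency(D, A_phys, R)
print(np.round(W, 2))
```

Note that nodes 0 and 3 still receive a non-zero weight through their correlation, even though they are neither close nor physically connected.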

The graph representation is currently the state-of-the-art and the most popular in traffic forecasting, because it is a simple and intuitive, yet efficient way to represent any network. In addition, by incorporating novel correlation concepts, the spatial and temporal relations, which are vital for the interpretability and actionability of the model, are extracted. The graph representation is compatible with GCNNs and their variations, as well as hybrid Deep Learning models. An example of how the same road network would be represented according to each of the three representation methods is provided in Fig. 7 below, where \({V}_{i}\) denotes the traffic variable value in section \(i\).

Fig. 7 Comparison between the three representation methods

Spatiotemporal Representations in Recent Literature

In this section, a selection of the most significant and interesting papers of recent literature, in terms of representation approach, is presented and discussed. Emphasis is given to the details of the method of modeling the spatial and temporal relations between the road network’s locations. An overview of these papers is given in Tables 1, 2 and 3, for stacked vector, grid and graph representations, respectively. The modeling technique that was exploited in each paper is mentioned in the second column (if it is based on a well-known model, the latter is mentioned in parentheses). In the next column, the technique used to determine the spatial and temporal correlations or other feature selection strategies are described. Finally, some implementation details are presented, namely the traffic variable that is predicted, the input data resolution (separated by commas if more than one dataset of different resolutions is exploited) and the performance of the model, in terms of the Mean Absolute Percentage Error (MAPE) or, in case MAPE is not estimated, the Mean Absolute Error (MAE). As most papers conduct more than one experiment or exploit more than one dataset, the error values are presented either separated by commas, when they refer to different datasets, or as a range of values, when they refer to one- and multi-step forecasting (the first value corresponds to one-step and the last to the longest-step forecast).
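For reference, the two error metrics used in the tables can be computed as in the following minimal sketch, which uses their standard definitions; the observed and predicted speeds are made-up numbers.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error, in the units of the traffic variable."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in percent; undefined if an observed value is 0."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

# Hypothetical observed and predicted speeds (km/h).
y_true = np.array([50.0, 40.0, 20.0])
y_pred = np.array([45.0, 44.0, 22.0])

print(round(mae(y_true, y_pred), 2))   # 3.67
print(round(mape(y_true, y_pred), 2))  # 10.0
```

Because MAPE is relative to the observed value, it cannot be computed for zero observations (e.g. zero flow at night), which is one reason some works report MAE instead.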

Table 1 Overview of significant recent literature works using stacked vector representation
Table 2 Overview of significant recent literature works using grid/image representation
Table 3 Overview of significant recent literature works using graph representation

From Tables 1, 2, and 3, one may observe the popularity of deep learning models in recent literature. It is also clear that, apart from the hybrid deep learning structures, such as those including GCNN, RNN or CNN layers, a significant number of works adopt the exploitation of an Attention mechanism, which assigns learnable weights to the input data based on their influence on the expected output, to detect the most important spatial and temporal features (Fang et al. 2022). Besides, evidence from recent literature shows that an Attention mechanism can significantly improve a model’s performance, especially for long-term (multistep) forecasting (Yin et al. 2021a).
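The weighting idea behind an Attention mechanism can be sketched with scaled dot-product attention, one common formulation that is assumed here for illustration. In a trained model the query, keys and values would be produced by learned projections; below they are random stand-ins.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def attend(query, keys, values):
    """Scaled dot-product attention over the rows of `keys` and `values`."""
    scores = keys @ query / np.sqrt(query.shape[0])   # one relevance score per location
    weights = softmax(scores)                          # non-negative and summing to 1
    return weights @ values, weights                   # weighted mix of the location features

rng = np.random.default_rng(0)
n_locations, d = 5, 8
keys = rng.standard_normal((n_locations, d))     # stand-in for learned location encodings
values = rng.standard_normal((n_locations, d))   # stand-in for location features
query = rng.standard_normal(d)                   # stand-in for the target's encoding

context, weights = attend(query, keys, values)
print(np.round(weights, 3))
```

The vector `weights` plays the role of the learnable importance assigned to each input location; inspecting it is also what gives attention-based models some degree of interpretability.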

Regarding the accuracy achieved by different models, the MAPE metric ranges around 10% and can be even lower in some cases. The values presented in the above Tables cannot be directly compared to each other, as they refer to different experimental setups and also different target variables (e.g. speed, traffic volume). Moreover, according to recent literature, forecasting accuracy is site-specific, meaning that it depends on the specific dataset that is used, the size and geometry of the corresponding road networks and the length of the prediction horizon; thus, models evaluated on different datasets and forecasting tasks cannot be directly compared to each other (Manibardo et al. 2021). When using the same benchmarking datasets, it is evident that, in general, the graph representation and the GCNN-based structures are more effective in modeling bigger and more complex road networks, compared to simpler representations and structures. This is also reflected in the comparisons between the aforementioned models and simpler baselines in most research papers listed above, indicatively (Chen et al. 2020; Zhang et al. 2020a, b; Ye et al. 2021). Yin et al. (2021a), who benchmark different forecasting models, spanning from classical statistical methods to state-of-the-art Deep Learning, using several public datasets, reach the same conclusion. However, simpler representation and modeling methods perform equally well on simpler tasks (e.g. less complex network, fewer locations) and their deployment should also be considered, taking into account other advantages of these models, such as interpretability, actionability, and simplicity (Nair and Dekusar 2020; Manibardo et al. 2021).

Finally, regarding the prediction horizon, it is evident that multi-step forecasting is associated with significantly higher error values when compared to one-step-ahead forecasting, which is mostly due to the way the models are trained, i.e. for one-step forecasting. More specifically, for performing multi-step (or long-term) forecasting, the one-step model is used recursively to provide the expected values for each future step, taking as input the forecasted values of the previous ones. That way, each additional step’s forecast (after the first step) incorporates the forecasting errors of all the previous steps and, thus, naturally, the performance gradually deteriorates with each step.
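The recursive scheme described above can be illustrated with a minimal sketch. It is not taken from any of the reviewed works: the one-step model is assumed to be a simple linear autoregression, the speed series is synthetic, and the function names (fit_one_step_model, recursive_forecast, mae_per_step) are hypothetical. It only shows how a one-step model is reused for several steps ahead and how the error can be measured separately for each step.

```python
import numpy as np

def fit_one_step_model(series):
    """Least-squares fit of x[t+1] = a * x[t] + b (illustrative one-step model)."""
    a, b = np.polyfit(series[:-1], series[1:], 1)
    return a, b

def recursive_forecast(last_value, a, b, steps):
    """Apply the one-step model repeatedly, feeding each forecast back as input."""
    forecasts = []
    current = last_value
    for _ in range(steps):
        current = a * current + b  # after the first step, the input is itself a forecast
        forecasts.append(current)
    return np.array(forecasts)

def mae_per_step(series, a, b, first_origin, horizon):
    """Mean absolute error at each step ahead, averaged over all forecast origins."""
    origins = range(first_origin, len(series) - horizon + 1)
    abs_err = np.zeros(horizon)
    for o in origins:
        pred = recursive_forecast(series[o - 1], a, b, horizon)
        abs_err += np.abs(pred - series[o:o + horizon])
    return abs_err / len(origins)

# Synthetic speed series (assumed data): a cycle of 288 five-minute steps plus noise.
rng = np.random.default_rng(0)
t = np.arange(600)
speed = 60 + 10 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 2, size=t.size)

a, b = fit_one_step_model(speed[:500])
errors = mae_per_step(speed, a, b, first_origin=500, horizon=12)
print(np.round(errors, 2))  # one MAE value per step ahead
```

In a setup of this kind, the per-step error typically increases with the number of steps ahead, since every forecast after the first is computed from earlier forecasts rather than from observations.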

Furthermore, for stacked vector representation, most studies also include a feature selection process, which is vital for reducing the input space’s dimensionality and the model’s complexity, as well as for introducing the spatial–temporal relations to the model. The simplest approaches include the estimation of a statistical metric to determine the most significant input, such as Pearson and Partial correlation (Li et al. 2019a, b). However, such metrics are subject to several assumptions, such as normality of the distributions of the data, linearity of their relations and independence, which are not usually met for traffic data. Thus, the emerging dependencies may not be accurate. Another equally straightforward approach is to select features based on the distance or the neighboring order. However, since distant locations may be equally or even more correlated than closer ones, the suitability of this method is also questionable. Approaches combining both methods are also popular in recent literature. For example, in Modi et al. (2022), the authors use a combination of distance and statistical correlation measures to determine the most relevant sensors, while Cai et al. (2015) also include the neighboring order.
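A correlation-based selection step of the kind described above could look as follows. This is a minimal sketch under stated assumptions, not the procedure of any cited study: the function name select_by_pearson, the synthetic sensor series and the choice of keeping a fixed number of top-ranked sensors are all illustrative.

```python
import numpy as np

def select_by_pearson(target, candidates, top_k):
    """Return indices of the top_k candidate series ranked by |Pearson r| with target."""
    scores = np.array([abs(np.corrcoef(target, c)[0, 1]) for c in candidates])
    ranked = np.argsort(scores)[::-1]  # highest absolute correlation first
    return ranked[:top_k], scores

# Synthetic example (assumed data): three candidate sensors, one row per sensor.
rng = np.random.default_rng(1)
target = rng.normal(size=500)
candidates = np.stack([
    rng.normal(size=500),                  # unrelated sensor
    target + 0.1 * rng.normal(size=500),   # strongly related sensor
    target + 1.0 * rng.normal(size=500),   # weakly related sensor
])
selected, scores = select_by_pearson(target, candidates, top_k=2)
print(selected, np.round(scores, 2))
```

Because the Pearson coefficient measures only linear association, a ranking of this kind can miss non-linear dependencies, which is the limitation noted in the paragraph above.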

To overcome the limitations of the aforementioned approaches, metrics from the area of Information theory, such as Mutual Information, have also been exploited (Ryu et al. 2018), while Lin et al. (2022) estimate the maximum information coefficient between the target location’s time series and lagged versions of the time series of the remaining locations. On the other hand, Cheng et al. (2021) perform K-Means clustering to detect locations with similar traffic patterns. In contrast to the Attention mechanism, the latter correlation metrics and feature selection approaches are not learnable (not learned during the training phase of the model), but are predetermined. For this reason, selecting the most appropriate method is a very important task.
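As an illustration of an information-theoretic criterion, the sketch below estimates Mutual Information between a target series and lagged versions of another series. It is an assumption-laden simplification: it uses a plain histogram estimator rather than the maximum information coefficient of Lin et al. (2022), the number of bins is arbitrary, and the function names (mutual_information, best_lag) and the data are hypothetical.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based estimate of the mutual information (in nats) between x and y."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))

def best_lag(target, other, max_lag, bins=16):
    """Lag (in time steps) at which the past of `other` shares most information with `target`."""
    scores = [
        mutual_information(target[lag:], other[:-lag], bins)
        for lag in range(1, max_lag + 1)
    ]
    return int(np.argmax(scores)) + 1, scores

# Synthetic example (assumed data): the target repeats the upstream series 3 steps later.
rng = np.random.default_rng(2)
upstream = rng.normal(size=4000)
target = np.roll(upstream, 3) + 0.1 * rng.normal(size=4000)
lag, scores = best_lag(target, upstream, max_lag=6)
print(lag, np.round(scores, 2))
```

Unlike a linear correlation coefficient, this estimate also responds to non-linear dependence, but it is sensitive to the number of bins and to the sample size, so the values should be read as relative scores rather than exact quantities.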

A typical example of image representation is presented by Yu et al. (2017), where a color scale is used to represent traffic conditions. The grid values are assigned as the mean value of the corresponding road sections that lie inside them. Ranjan et al. (2020) use a similar representation and image data that are retrieved from a map service’s website. To increase the efficiency of the classic image representation, Dai et al. (2019) estimate the Spearman correlation between the traffic conditions of all pairs of locations and develop an algorithm to rearrange the grid, so that the most correlated locations are placed closer, while Ma et al. (2017) model a single road section as an image, with the x-axis corresponding to the evolution of traffic conditions over time and the y-axis to the traffic conditions across the road section. Zhang et al. (2019) also use a grid where the x-axis corresponds to time, but the y-axis contains the traffic measurements of different locations. Additionally, the authors propose a feature selection algorithm, based on the Pearson correlation between the road network’s locations. In Guo et al. (2021) the road network and the evolution of traffic conditions are represented as a 3-dimensional grid, where the x- and y-axes correspond to the coordinates of the locations and the z-axis to time. Moreover, a value is assigned to each pixel of the grid, corresponding to the location’s speed at the specific timestamp.

For the graph representations, it should additionally be noted that a variety of node types and hybrid Deep Learning models are utilized with graph road network representations. Among the papers examined, which can be considered indicative of recent literature, intersection nodes are the least common, mainly due to their lower compatibility with GCNN, as mentioned before. Road section and detector nodes, which are very similar in most cases, seem to be equally popular and the choice between the two depends mainly on the available dataset. The most usual modeling approach is a combination of GCNN and a model from the Recurrent Neural Networks family (LSTM or GRU), to take account of both the spatial and temporal relations.

The spatial relations, which are expressed through the adjacency matrix, are most of the time based on the physical connectivity, the distance (or the neighboring order or travel time) between the nodes or a correlation metric. Zhang et al. (2020a, b) use a combined similarity and neighboring order-based adjacency matrix, while Bai et al. (2021) use a connectivity-based one, and Wang et al. (2022a, b) construct an affinity matrix based on travel times. The more sophisticated approaches, which include a function that combines the latter aspects, lead to more accurate detection of the dependencies. For example, Zhang et al. (2021a, b) use a matrix that includes physical proximity information fused with cosine similarity and graph betweenness metrics. Furthermore, in Zhang et al. (2020a, b) a Structure Learning Convolution (SLC) framework is proposed, which is able to learn the adjacency matrix during the model’s training phase, given the nodes of the graph. Yu (2022) follows a similar approach, using a learnable adjacency matrix, based on the outcomes of an attention mechanism. In Leiser and Yildirimoglu (2021) and Wang et al. (2022a, b), the authors utilize two different clustering methods in order to enrich the spatiotemporal dependencies information.
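To make the notion of a distance-based adjacency matrix concrete, the following sketch builds one with a thresholded Gaussian kernel and then applies the symmetric normalization commonly used before graph convolution layers. The kernel form, the parameters sigma and threshold, the function names and the toy distances are assumptions for illustration; they do not reproduce the construction of any specific paper cited above.

```python
import numpy as np

def gaussian_kernel_adjacency(dist, sigma, threshold):
    """Weighted adjacency A_ij = exp(-d_ij^2 / sigma^2), set to zero below `threshold`."""
    dist = np.asarray(dist, dtype=float)
    weights = np.exp(-(dist ** 2) / (sigma ** 2))
    weights[weights < threshold] = 0.0   # drop weak (distant) connections
    np.fill_diagonal(weights, 0.0)       # self-loops are added at normalization time
    return weights

def normalize_with_self_loops(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Toy pairwise distances (assumed, e.g. in km) between three detectors.
dist = np.array([
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 4.0],
    [5.0, 4.0, 0.0],
])
adj = gaussian_kernel_adjacency(dist, sigma=2.0, threshold=0.1)
norm_adj = normalize_with_self_loops(adj)
print(np.round(adj, 3))
print(np.round(norm_adj, 3))
```

The same two functions could be fed with travel times or with one minus a correlation score instead of physical distances, which corresponds to the alternative definitions of spatial relations listed in the paragraph above.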

Lastly, the typical temporal resolution for traffic data is 5 min, but higher resolutions are becoming increasingly popular. Forecasting with a shorter-term horizon of 1–3 min poses a significant scientific challenge and is also deemed important for near real-time traffic management incorporated into Intelligent Transportation Systems. In general, data with higher resolution contain higher variability as well, even between consecutive timesteps, which makes the forecasting task harder and more complicated. In this case, an effective representation and an accurate spatiotemporal analysis are crucial to achieving decent forecasting accuracy. On the other hand, data with a lower temporal resolution are smoother due to aggregating traffic measurements of a relatively longer horizon; as a result, simpler and less sophisticated models may perform very satisfactorily.

Open Issues

Despite the emergence of very innovative works and the application of the most sophisticated forecasting and analysis methods, the field of traffic forecasting is far from being considered saturated and various challenges remain to be addressed in the future. One of the most important of them is the vision for an inclusive, actionable, interpretable, trustworthy and responsive traffic forecasting framework that would be able to operate in real time and provide network-wide predictions (Laña et al. 2021). Such a system has the potential to be the core of any future traffic management system and enable efficient and effective management of large, multimodal road networks. Network representations would play a vital role in the above concept; besides, their importance in many aspects of forecasting has already been highlighted. However, there are still significant open issues that should be addressed and are discussed below.

Network-Wide Forecasting

Developing a modeling framework that would provide predictions for an entire road network at once remains under-researched, although it can significantly increase the actionability of any model. In recent literature, there are very few network-wide approaches (Cui et al. 2020); most models provide output for a single location, usually referred to as the target location, rather than for all locations simultaneously, and/or only exploit a relatively small part of the road network as input. The reason for this is that there are several limitations to network-wide traffic prediction, including the following:

  • Data availability and quality: Network-wide traffic prediction requires access to large amounts of traffic data, whose quality and reliability can vary greatly, limiting the accuracy of traffic predictions.

  • Computational requirements: Deep Learning traffic prediction models can be computationally expensive, especially when they are applied at such a large scale, which limits the ability to deploy these algorithms in real time.

  • Model complexity: Traffic prediction models can be highly complex, especially when they incorporate additional factors, such as road geometry and traffic flow dynamics. These models can be difficult to interpret and validate, which prevents them from being applied effectively in practice.

Along with meaningful spatial representations, researchers are developing algorithms and methods that can address these challenges and enable the deployment of network-wide traffic prediction, including new concepts of Computational Science that emerged relatively recently. The most important of them is Edge Computing, in which the computational tasks are performed at the edge of the network, where the data are generated, rather than in a centralized cloud or data center. The main advantages of Edge Computing over traditional cloud computing are (Shi et al. 2016; Cao et al. 2020):

  • Latency and bandwidth reduction: By processing the data closer to where they are generated, reduced amounts of data need to be transmitted to a centralized location and, thus, the related latency and bandwidth requirements are reduced, while the efficiency of the network is improved.

  • Privacy and security: Processing the data locally and transmitting a reduced amount of them lowers the risk of leaking sensitive data.

By processing traffic data at the edge of the road network, Edge Computing can enable real-time, low-latency traffic forecasting that can provide valuable information to traffic management authorities. Thanks to the combination of local data processing, real-time analytics and deployment of machine learning models on edge devices with optimized decentralized coordination (edge nodes exchange necessary information and data without the need for a central entity), there is no need to transmit large amounts of data for centralized processing, which empowers transportation systems to respond quickly to changing traffic conditions and enables timely decision-making for traffic management, route optimization, and congestion mitigation. For example, Edge Computing can be used to process large amounts of traffic data generated by sensors, cameras, GPS devices and other sources, to provide near real-time predictions of traffic conditions.

Based on the principles of Edge Computing, Federated Learning is a machine learning paradigm that allows multiple participants, such as devices or edge nodes, to collaboratively learn a model without sharing their raw data with a central entity. Instead, the participants train their local models based on their own data and then exchange model updates with each other in a decentralized manner, allowing the overall model to be improved through the collective contributions of all participants (Bonawitz et al. 2019). This approach has the advantages of scalability, i.e. the load of the computational task is distributed across the participating devices, enabling scaling up to problems that would be infeasible to solve with a centralized approach, and robustness, as the system would be able to continue to operate, even if some participants experience any kind of failure (Li et al. 2020).
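The training loop implied by this paradigm can be sketched with a sample-weighted averaging of locally trained models, in the spirit of federated averaging. Everything below is an illustrative assumption: the local model is a linear regression trained by gradient descent, the three participants hold synthetic data, and the function names are hypothetical. Where the averaging step runs (on a coordinating node or replicated among peers) is deliberately left open; only model parameters, never raw data, are passed to it.

```python
import numpy as np

def local_update(weights, x, y, lr=0.05, epochs=5):
    """A few gradient-descent steps of linear regression on one participant's own data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2.0 * x.T @ (x @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_average(local_weights, sample_counts):
    """Average of the local models, weighted by each participant's number of samples."""
    counts = np.asarray(sample_counts, dtype=float)
    stacked = np.stack(local_weights)
    return (stacked * counts[:, None]).sum(axis=0) / counts.sum()

def run_rounds(clients, dim, rounds=100):
    """Alternate local training and parameter averaging; raw data never leave a client."""
    global_w = np.zeros(dim)
    for _ in range(rounds):
        updates = [local_update(global_w, x, y) for x, y in clients]
        global_w = federated_average(updates, [len(y) for _, y in clients])
    return global_w

# Synthetic participants (assumed data): same underlying relation, different sample sizes.
rng = np.random.default_rng(5)
true_w = np.array([2.0, -1.0])
clients = []
for n in (50, 100, 200):
    x = rng.normal(size=(n, 2))
    y = x @ true_w + 0.1 * rng.normal(size=n)
    clients.append((x, y))

global_w = run_rounds(clients, dim=2)
print(np.round(global_w, 2))
```

Weighting by sample count is one simple way of defining each participant's contribution to the overall model; other weightings, for instance based on the relations between the participating locations, would only change federated_average.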

Additionally, federated learning can enable the deployment of traffic forecasting algorithms that would be impractical or impossible to run in a centralized cloud or data center due to their computational requirements or the large amounts of data they generate. For example, distributing the data analysis and model training tasks to various devices close to where the data are collected would make possible the scalable and efficient deployment of sophisticated deep learning algorithms that are able to learn from network-wide traffic data in real time and provide highly accurate predictions of future traffic conditions. These devices constitute the nodes of a real-world graph, so representing the relationships between them becomes even more vital for both the training of the local models and, most importantly, for defining the contribution of each of them to the overall model.

For the time being, the exploitation of Federated Learning in traffic forecasting is at a very early stage, with only a few works taking advantage of its potential (Xia et al. 2022). Liu et al. (2020) propose the “Federated GRU”, which uses a safe data aggregation mechanism, based on a federated averaging algorithm, which prevents the sharing of private data among, e.g., different organizations. On the other hand, Zeng et al. (2021) developed a framework for traffic forecasting using data from different traffic stations, without the need for sharing the data between them. The method depends on partitioning the data into different clusters, based on a hierarchical clustering approach. Moreover, Zhou et al. (2022) have exploited federated learning for vehicle trajectory prediction to preserve the drivers’ privacy. In the near future, with the penetration of connected and automated vehicles in traffic and the wider development of C-ITS, it is expected that massive amounts of network-wide data would be made available, making technologies such as Edge computing and Federated learning an essential part of real-time traffic forecasting and management.

Explainability and Spatiotemporal Analysis

In general, interpretability and explainability refer to a model's transparency, which implies that the data or algorithm and the mechanism that provides the outcomes are accessible to some extent (Miller 2019). Models such as linear regression, decision trees, and rule-based models are considered easy to interpret; linear models offer explanations for predictions generated by the signs and magnitude of the coefficients, while decision trees and rule-based models, on the other hand, have a certain degree of interpretability due to their reliance on decision rules. Tree-like models, in particular, can provide immediate information on the most relevant attributes of a specific rule because of their hierarchical structure.

Explainability aims to make complex machine learning and deep learning models explainable using dedicated tools and methods after their development. It is a broader concept that encompasses the development of AI systems that are understandable, fair, accountable, trustworthy, and transparent, allowing the end-user to comprehend the "what," "why," and "how" of the models (Gunning et al. 2019). Explainability aims to provide customized and relevant information to different stakeholders, taking into account their goals, privacy, and adaptability to human understandability (Barredo Arrieta et al. 2020).

Explainability and identification of the spatial and temporal relations between the locations of the road network, clearly and transparently, are very important for traffic management purposes (Vlahogianni et al. 2014). Specifically, it is essential to (Barredo Arrieta et al. 2020):

  • Justify the decision-making process and increase the trust in the specific model, which is necessary for the compliance of the network users.

  • Extract new scientific knowledge regarding the network’s mechanics.

  • Find ways to improve the prediction model’s performance and its transferability.

Inducing knowledge on the spatial and temporal relations of the forecasting process, via the road network and input data representation, enhances the model’s performance, but also increases its actionability, as it allows the usage of less complex and more interpretable models.

Moreover, understanding the spatial and temporal relations is important for coping with non-recurrent conditions: when extreme conditions (heavy congestion) emerge, e.g., due to an accident, a Deep Learning prediction model would not be able to predict the evolution of the phenomenon and the locations that would be affected, because these models are dependent on the input training data, which most probably would not include a sufficient number of non-recurrent events emerging at all the locations of the road network. Thus, they would not be able to predict future conditions that are not observed in the input data. As a result, a network management authority would not be able to timely implement corrective measures to prevent the spread of congestion. On the other hand, if the spatiotemporal dependencies have been identified, either before the development of the model (for the representation of the input data) or after (for interpreting the results), it would be clear which locations are going to be directly affected and the corresponding measures would be enforced.

Extracting the spatiotemporal relations and interpreting the outcomes of Deep Learning models is not a straightforward task and is usually performed post hoc, using model agnostic methods, such as LIME (Local Interpretable Model-Agnostic Explanations), SHAP (SHapley Additive exPlanations) and Partial Dependence Plots (Molnar 2019). However, these methods do not have a strong mathematical foundation and depend entirely on the available data; consequently, they are very vulnerable to noisy datasets and may provide unreliable outcomes. Moreover, they only imply statistical and not causal relations. To address this issue, researchers should take into consideration the knowledge coming from traffic flow theory concerning traffic spatiotemporal propagation and congestion dynamics, so that the noisy information can be translated into important and potentially causal features that can reduce the dimensionality of the forecasting problem and improve its reliability. Along the same line, the emerging field of causal machine learning proposes a variety of methods to examine and quantify the causal relationships in the available data (Zhao and Liu 2023). The exploitation of such methods for short-term traffic forecasting remains to be researched.

Finally, in recent literature, it is very common to observe researchers developing a very sophisticated model, evaluating its performance and indicating its superiority compared to baselines, but, at the same time, not elaborating on analyzing, understanding and presenting the spatial and temporal dependencies (Manibardo et al. 2021; Yin et al. 2021a). However, in order for a forecasting process to be actionable in real-world traffic management scenarios, it should be assessed not only based on the values of error metrics, but also on the statistical properties of the error and the error bias that affect its trustworthiness as well (Karlaftis and Vlahogianni 2011; Vlahogianni and Karlaftis 2013). Understanding the effect of bias on the model’s trustworthiness in the presence of extensive network-level spatiotemporal information is at an early stage, especially in relation to real-world applications.

Efficiency and Scalability to Multimodal Environments

Demand prediction for public transport, as well as for other modes (e.g. taxi, ride-hailing services, bicycles, etc.), is a variation of the traffic forecasting problem that can be addressed similarly. An inclusive approach, considering all modes, would revolutionize traffic management and decision-making at a city level, and provide authorities with a tool that would enable the optimization of traffic conditions across the entire road network. The above is possible with the extension of the concept of multi-task prediction in traffic forecasting. Multi-task prediction is a novel machine-learning technique where multiple related tasks are learned and predicted simultaneously (by the same model), in contrast to single-task prediction, where each task is learned and predicted separately (Jiang and Luo 2021). In multi-task prediction, the model gets as input shared representations for the tasks, allowing for knowledge transfer between the tasks and improving the overall performance. This is particularly useful when there is limited data available for each task and the tasks are related, as the shared representations can help the model generalize better to new data. Multi-task prediction has already been applied in various domains with great success (Kendall et al. 2018; Liu et al. 2019).

In traffic forecasting, multi-task prediction can be associated with multimodal prediction, i.e. predicting the traffic demand of different modes, e.g. volume of private cars, passengers of buses, subway, trains, etc. As is widely assumed, and also indicated by recent works, there are expected to be very significant correlations between road traffic and the demand for public transport, which can be utilized to increase the predictability of each individual variable (Fafoutellis and Vlahogianni 2023). Moreover, this type of prediction would have a very significant impact and implications on traffic management, and is vital for model actionability, as it constitutes an integrated, multimodal tool for city-level traffic management. As the availability of such multi-source data increases, the popularity of multi-task traffic prediction is expected to increase as well. Such a model would require more than one input dataset and network representation (one for each mode/variable), or, equivalently, a multi-layer graph structure. To develop a model of satisfyingly low complexity that would be efficient, despite having such a large input space, an accurate representation of the input spaces and the interrelations between them is necessary.

There are just a few examples of how multi-task prediction can be applied in the traffic forecasting domain to improve accuracy and performance, and the meaningful correlations that emerge between the heterogeneous datasets have not been adequately explored yet. Firstly, multi-task forecasting can be understood as the exploitation and prediction of more than one traffic variable, along with other external features, by the same model. Du et al. (2019) use speed, flow, density and travel time at different locations of the network, while Liao et al. (2022) also include weather and event features to predict taxi demand. Despite their positive aspects, these approaches are not multimodal and, thus, their contribution to traffic management is limited. On the other hand, Liu and Chen (2022) develop a multimodal model that exploits taxi and metro demand data. The input space is represented as a stacked vector. Finally, Liang et al. (2022) use two graphs, for ride-hailing and subway demands, and proceed to identify correlations between the heterogeneous nodes based on their demand patterns and geographical proximity. These works are in the proposed direction; however, they do not include road traffic information and, moreover, they do not carry out a detailed analysis of the dependencies between different modes, their significance and their spatial and temporal features.

Conclusions

In this paper, we provided a literature review of the most prominent network representation methods for deep learning models utilized for short-term traffic forecasting and a comparative analysis focused on the stacked vector, the image (or grid) and the graph representations. Although the stacked vector representation is the simplest one and is suitable for problems with few locations, the grid and, especially, the graph representations allow the modeling of more complex spatial relations, based on the connectivity, proximity and similarity of the traffic patterns of the locations of the road network. Further analysis of the published works on short-term traffic forecasting indicated that, although simpler representations and modeling techniques can be effective in road networks with few locations, they are outperformed by the graph representation and the corresponding modeling techniques in the case of bigger and more complex ones. In addition, an accurate and meaningful representation of the road network can lead to reduced dimensionality of the input space and, thus, to a prediction model of lower complexity. Such a representation also provides insights into the spatial and temporal dependencies between the traffic conditions at different locations of the network, which are vital for interpreting the results of the model and traffic management.

Based on the findings of recent literature and the analysis conducted in this paper, some future directions of research in traffic forecasting are proposed. To increase the actionability of the models, researchers should focus on less complex and more interpretable modeling techniques. Moreover, the development of multi-task prediction models for different traffic variables, e.g. multimodal demand, as well as the analysis of the relations between them will be a significant breakthrough with many implications in traffic management. The same applies also to efficient, network-wide modeling approaches, enabled by innovative concepts of Computational Science, such as Federated Learning, which still remain a quite under-researched topic, concerning their application in the area of traffic forecasting. Finally, methods for transparently and accurately extracting causal spatial and temporal relations between the locations of the road network and describing their nature and effects are deemed necessary for decision-making under extreme and non-recurrent conditions, but would also enhance the trustworthiness and actionability of the model.