1 Introduction

Recently, vehicle trajectory prediction has garnered significant attention due to its critical applications in autonomous driving [1, 2]. However, predicting the trajectories of surrounding vehicles is not trivial because of the inherent uncertainty and variability in their motion patterns [3].

Building on advances in deep learning, pioneering work in vehicle trajectory prediction has addressed some of these challenges. Variational Autoencoders (VAE) [4], Generative Adversarial Networks (GAN) [5], and Graph Neural Networks (GNN) [6] have been used to learn trajectory representations and generate multiple possible trajectory samples, capturing the multimodal nature of future motion. These techniques model the complex relationships between vehicles and capture social interactions, leading to more accurate trajectory predictions.

Despite significant progress, existing vehicle trajectory prediction methods struggle with interpretability, especially concerning long-term historical data and nearby vehicle information. Questions about which parts of historical trajectories or nearby vehicle positions influence future motion and how to quantify this influence remain unanswered. To address this, we introduce a spatial-temporal attention mechanism in our STS-GAN model. This approach matches the prediction accuracy of state-of-the-art techniques and enhances interpretability by highlighting the influence of historical trajectories and nearby vehicles through attention weights. Our main contributions are:

  1) Proposing a spatial-temporal attention-guided social GAN model for vehicle trajectory prediction;

  2) Developing a temporal attention mechanism to identify the importance of historical trajectories at different times for predicting future behavior;

  3) Designing a spatial attention mechanism to quantify the influence of nearby vehicles on the trajectory prediction of the target vehicle.

2 Methods

The overall network architecture is shown in Fig. 1. To capture the importance of different vehicle locations for prediction, we follow [7] and define a \(3 \times 13\) spatial grid around the predicted vehicle (Fig. 1).

Fig. 1. Overview of the proposed STS-GAN.

LSTM Encoder. We first use a single-layer fully connected (FC) network to embed the position of each vehicle \(x^i_t\), obtaining the vector \(e^{e,i}_{t}\). Then, the LSTM encoder processes these embedding vectors for each vehicle i over time steps \(t = 1, ..., h\).

Temporal Attention. The hidden states of vehicle v in the LSTM encoder are denoted as \(H_t^{e,v}=\{h^{e,v}_{t-h},...,h_j^{e,v},...,h_t^{e,v}\}\). The temporal attention weights \(A_t^v=\{\alpha ^v_{t-h},...,\alpha ^v_j,...,\alpha ^v_t\}\) are then computed as follows:

$$\begin{aligned} A_t^v=softmax(tanh(W_\alpha H_t^{e,v})). \end{aligned}$$
(1)

Here \(W_\alpha \) is a learnable weight matrix. Next, the hidden states \(H_t^{e,v}\) are combined with the temporal attention weights \(A_t^v\) through a weighted sum:

$$\begin{aligned} \mathcal {H}_t^v=H_t^{e,v}(A_t^v)^{\top }=\sum _{j=t-h}^{t} {\alpha _j^v h_j^{e,v}}. \end{aligned}$$
(2)
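As a minimal sketch of Eqs. (1) and (2), the temporal attention step can be written in plain Python. The function names, the toy hidden states, and the assumption that \(W_\alpha \) has shape \(1 \times d\) (so that each hidden state receives one scalar score) are illustrative choices, not details given in the text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(hidden_states, w_alpha):
    """Sketch of Eqs. (1)-(2).

    hidden_states: list of encoder states h_j (each a list of d floats),
                   ordered from t-h (oldest) to t (most recent).
    w_alpha:       list of d floats (assumed 1 x d weight matrix).
    Returns the attention weights alpha_j and the weighted sum H_t^v.
    """
    # Eq. (1): one tanh score per time step, then softmax over time.
    scores = [
        math.tanh(sum(w * x for w, x in zip(w_alpha, h)))
        for h in hidden_states
    ]
    alphas = softmax(scores)
    # Eq. (2): attention-weighted sum of the hidden states.
    d = len(hidden_states[0])
    context = [
        sum(a * h[k] for a, h in zip(alphas, hidden_states))
        for k in range(d)
    ]
    return alphas, context


# Toy input: three time steps, hidden size d = 2 (hypothetical values).
H = [[0.1, -0.2], [0.4, 0.3], [0.9, 0.8]]
W_alpha = [1.0, 1.0]
alphas, context = temporal_attention(H, W_alpha)
```

In this toy input the scores increase with time, so the largest weight falls on the most recent step; with trained weights the distribution is whatever the model learns.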

Spatial Attention. The grid at time step t is denoted as \(G_t = \{G_t^1, ..., G_t^N\}\), where N is the total number of grid cells. The content of each cell \(G_t^n\) is defined as follows:

$$\begin{aligned} G_t^n = \left\{ \begin{aligned} \mathcal {H}_t^v, \quad & \textrm{if} \; \textrm{any} \; \textrm{vehicle} \; v \; \textrm{locates} \; \textrm{at} \; \textrm{grid} \; \textrm{cell} \; n\\ \textbf{0} \in \mathbb {R}^{d \times 1}, \quad & \textrm{otherwise} \end{aligned} \right. \end{aligned}$$
(3)

The spatial attention weights for all vehicles at time step t, denoted as \(B_t=\{\beta _t^1,...,\beta _t^n,...,\beta _t^N\}\), are calculated as follows:

$$\begin{aligned} B_t=softmax(tanh(W_\beta G_t)), \end{aligned}$$
(4)

where \(W_\beta \) is a learnable weight matrix. Finally, we combine the historical information from all occupied grid cells as follows:

$$\begin{aligned} {V}_t= G_t (B_t)^{\top }= \sum _{n=1}^{N} {\beta ^n_t G_t^n}. \end{aligned}$$
(5)
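Eqs. (3) to (5) can be sketched in the same way. The snippet below is an illustration under stated assumptions: cells are indexed row by row over the \(3 \times 13\) grid, \(W_\beta \) is taken to have shape \(1 \times d\), and the cell indices and feature values in the toy example are hypothetical.

```python
import math

GRID_ROWS, GRID_COLS = 3, 13  # grid size from the text


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def build_grid(vehicle_features, d):
    """Eq. (3): occupied cells hold H_t^v, empty cells hold a zero vector.

    vehicle_features: dict mapping a 0-based cell index n to the
                      temporal-attention output of the vehicle in that cell.
    """
    n_cells = GRID_ROWS * GRID_COLS
    return [list(vehicle_features.get(n, [0.0] * d)) for n in range(n_cells)]


def spatial_attention(grid, w_beta):
    """Eqs. (4)-(5): softmax(tanh(W_beta G_t)) and the weighted sum V_t."""
    scores = [
        math.tanh(sum(w * g for w, g in zip(w_beta, cell)))
        for cell in grid
    ]
    betas = softmax(scores)
    d = len(grid[0])
    v_t = [sum(b * cell[k] for b, cell in zip(betas, grid)) for k in range(d)]
    return betas, v_t


# Toy example: two occupied cells, hidden size d = 2 (hypothetical values).
features = {19: [0.5, 0.5], 20: [0.2, -0.1]}
grid = build_grid(features, d=2)
betas, v_t = spatial_attention(grid, w_beta=[1.0, 1.0])
```

Under this literal reading of Eq. (4), empty cells have a score of tanh(0) = 0 and therefore still receive a small nonzero weight after the softmax, but they contribute nothing to \(V_t\) because their feature vector is zero.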

LSTM Decoder. After concatenating the spatial-temporal feature vectors of the nearby vehicles with their social context vectors, we use an LSTM layer followed by an FC layer to predict the future trajectory.

Discriminator. The discriminator encodes a trajectory, either predicted or actual, with an LSTM and then scores it through an FC layer followed by a sigmoid activation:

$$\begin{aligned} h^{D,i}_{t+1}=LSTM(h^{D,i}_{t},x^{D,i}_{t};W_{D,encoder}), \end{aligned}$$
(6)
$$\begin{aligned} s^{D,i}_{t+1}=Sigmoid(FC(\boldsymbol{h}^{D,i}_{t+1};W_{D})). \end{aligned}$$
(7)

3 Datasets and Experiments Setup

STS-GAN is trained and evaluated on the Next Generation Simulation (NGSIM) [8] US-101 and I-80 datasets. Each dataset contains 45 min of vehicle trajectories split into 15-min segments, giving six segments in total. These segments are further divided into training, validation, and test sets in a ratio of approximately 0.7 : 0.1 : 0.2, resulting in 5,922,867 training entries, 859,769 validation entries, and 1,505,756 test entries.

The Average Displacement Error (ADE) and the Final Displacement Error (FDE) are used as the performance metrics to evaluate prediction accuracy, defined as:

$$\begin{aligned} \begin{aligned} \text {ADE} & = \frac{\sum _{i=1}^{n} \sum _{T=t+1}^{t+p} ||x_T^i-\hat{x}_T^i||}{np}, \\ \text {FDE} & = \frac{\sum _{i=1}^{n} ||x_{t+p}^i-\hat{x}_{t+p}^i|| }{n}, \end{aligned} \end{aligned}$$
(8)

where n is the number of predicted samples, p is the number of predicted time steps, and \(\hat{x}^i\) and \(x^i\) are the predicted and true trajectories of sample i, respectively. The batch size is set to 128, the optimizer is Adam with a learning rate of 0.001, and the model is trained for 10 epochs.
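The two metrics in Eq. (8) can be illustrated with a short Python sketch. It assumes that each position is a 2-D point and that the norm is Euclidean, which the text does not state explicitly; the function names and the toy trajectories are hypothetical.

```python
import math


def displacement(a, b):
    """Euclidean distance between two (x, y) points (assumed norm)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def ade_fde(true_trajs, pred_trajs):
    """Compute ADE and FDE as in Eq. (8).

    true_trajs, pred_trajs: lists of n trajectories, each a list of
    p (x, y) points covering time steps t+1, ..., t+p.
    """
    n = len(true_trajs)
    p = len(true_trajs[0])
    total_error = 0.0
    final_error = 0.0
    for truth, pred in zip(true_trajs, pred_trajs):
        errors = [displacement(a, b) for a, b in zip(truth, pred)]
        total_error += sum(errors)   # all predicted steps count toward ADE
        final_error += errors[-1]    # only the last step counts toward FDE
    return total_error / (n * p), final_error / n


# Toy data: n = 2 samples, p = 2 predicted steps (hypothetical values).
truth = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (0.0, 2.0)]]
pred = [[(0.0, 0.0), (1.0, 1.0)], [(3.0, 4.0), (0.0, 2.0)]]
ade, fde = ade_fde(truth, pred)
print(ade, fde)  # 1.5 0.5
```

Here the per-step errors are 0 and 1 for the first sample and 5 and 0 for the second, so ADE = 6 / 4 = 1.5 and FDE = (1 + 0) / 2 = 0.5.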

Table 1. Performance Metrics (ADE/FDE) Comparison with Other Methods

To verify the effectiveness of STS-GAN in vehicle trajectory prediction, we compare it with several state-of-the-art methods. Additionally, to validate the network structure and the proposed spatial-temporal attention mechanism, we design ablation experiments. Specifically, we evaluate: 1) CS-LSTM [7], an LSTM encoder-decoder model using a convolutional social pooling layer; 2) STA-LSTM [9], a trajectory prediction model that incorporates spatial-temporal attention mechanisms in LSTM networks; 3) ST-GAN, a GAN-based network with spatial-temporal attention mechanisms, but without convolutional social pooling; 4) SS-GAN, a GAN-based network that incorporates convolutional social pooling and spatial attention, but without temporal attention; and 5) TS-GAN, a GAN-based network that incorporates convolutional social pooling and temporal attention, but without spatial attention.

4 Results and Analysis

Table 1 compares the ADE/FDE of different models over prediction horizons from 1 to 5 s. STS-GAN outperforms the other models across both short-term and long-term predictions. Specifically, CS-LSTM performs the worst, which we attribute to the absence of attention mechanisms. STA-LSTM, despite incorporating spatial-temporal attention mechanisms, lacks social pooling and the generative adversarial mechanism, leading to lower predictive accuracy. ST-GAN, the ablation without social pooling, is less accurate than STS-GAN, emphasizing the importance of modeling social interactions. SS-GAN, the ablation without temporal attention, is slightly less accurate than STS-GAN, suggesting that the temporal attention mechanism brings only a limited improvement. TS-GAN, the ablation without spatial attention, is also slightly less accurate than STS-GAN but still outperforms CS-LSTM, which has no attention mechanism.

We calculate the average temporal attention weights for the last 15 time steps (from \(t - 14\) to t) within each 15-min segment. Figure 2 displays these weights only from time \(t-5\) to t, because the weights before \(t-5\) are much smaller. The results reveal that the weight is highest at the current time step t, indicating that the future trajectory of the target vehicle is primarily influenced by its most recent trajectory and those of nearby vehicles. This finding aligns with human intuition.

Fig. 2. The average weights of the six adjacent time steps calculated using the six subsets of data, with weights for moments before \(t-5\) ignored.

We further analyze the spatial attention mechanism and observe that, within the grid, the spatial attention weight of the predicted vehicle itself is the highest. Combined with the earlier analysis of temporal attention, this suggests that the future trajectory of the predicted vehicle is largely influenced by its own driving state.

To illustrate the distribution of attention weights of nearby vehicles, we select two typical scenarios. We then normalize and plot the remaining attention weights on a 3\(\times \)13 grid, excluding those of the predicted vehicle. Fig. 3(a) depicts a common driving scenario where the predicted vehicle primarily focuses on the vehicles ahead in the same lane, with relatively high weights (e.g., \(28.4\%\), \(16.3\%\), \(17.9\%\)), while the weights in other grid cells are relatively low. Notably, the weight of the grid cell directly in front of the predicted vehicle is low, possibly because the following distance kept for driving safety is typically large, so the cell directly ahead is often unoccupied. Fig. 3(b) illustrates the spatial weight distribution in a left lane-changing scenario. Unlike the common driving scenario, the predicted vehicle does not focus as much on the vehicle ahead in the same lane but instead pays more attention to vehicles in the target lane, both in front and behind. This observation aligns with human driving experience, where drivers assess lane change opportunities by observing the behavior of vehicles in the target lane.

Fig. 3. The distributions of spatial attention weights in two driving scenarios.
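The renormalization used for this visualization, in which the predicted vehicle's own weight is removed and the remaining weights are rescaled, can be sketched as follows. The row-major cell layout, the position of the predicted vehicle at the center cell, and the toy weights are assumptions made for illustration only.

```python
def renormalize_without_target(betas, target_index):
    """Drop the target cell's weight and rescale the rest to sum to 1."""
    rest = [0.0 if n == target_index else b for n, b in enumerate(betas)]
    total = sum(rest)
    if total == 0.0:
        return rest  # nothing left to normalize
    return [b / total for b in rest]


def as_grid(values, rows=3, cols=13):
    """Reshape a flat list of cell weights into rows x cols (row-major)."""
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


# Toy weights over 39 cells (hypothetical values that sum to 1).
betas = [0.1 / 36] * 39
betas[19] = 0.6   # predicted vehicle, assumed at row 1, column 6
betas[6] = 0.2    # a vehicle in an adjacent lane
betas[32] = 0.1   # a vehicle in the other adjacent lane

weights = renormalize_without_target(betas, target_index=19)
heatmap = as_grid(weights)
```

After rescaling, the cell that held 0.2 of the original attention holds 0.2 / 0.4 = 0.5 of the attention assigned to nearby vehicles, which is the kind of percentage a plot such as Fig. 3 would report.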

5 Conclusions

This paper presents STS-GAN, a spatial-temporal attention-guided social GAN model for vehicle trajectory prediction. The temporal attention mechanism highlights significant time points in historical trajectories, while the spatial attention mechanism measures the influence of nearby vehicles. Key findings from ablation experiments and comparisons with state-of-the-art models include: 1) STS-GAN achieves state-of-the-art prediction accuracy; 2) recent historical trajectory segments are sufficient for accurate predictions; and 3) although the accuracy of STS-GAN is similar to that of ST-GAN and SS-GAN, it offers better interpretability through its spatial-temporal attention weights.