With the rapid development of artificial intelligence, intelligent autonomous moving systems have become a hot research topic, and the accompanying driving safety issues have attracted public attention. Studying pedestrians’ motion behaviors is therefore of great significance for reducing collision accidents and protecting pedestrians’ safety. However, the uncertainty of future trajectories and the large variation in scene layout make forecasting pedestrians’ trajectories highly challenging.

The core of trajectory prediction is to learn pedestrians’ motion behaviors [1, 2] from given observed trajectories and to predict all possible future trajectories. To predict future trajectories accurately, researchers have mainly adopted model-driven or data-driven methods. Commonly used model-driven methods include the Markov model [3, 4] and the Kalman filter [5, 6]. For example, Schneider and Gavrila [7] combined a Kalman filter with a constant-velocity model to predict pedestrians’ future trajectories. Mathew et al. [8] proposed a hybrid prediction method based on the hidden Markov model, which clusters trajectories based on observed trajectories. However, because future trajectories are complex and non-linear, model-driven methods struggle to capture the dynamic changes and long-term dependence of trajectories, so their predictions are not accurate enough.

Data-driven methods [9, 10] are effective at handling the dynamic changes and long-term dependence in trajectories. Previous methods mainly rely on Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) to model the dynamic features of the trajectory. Among the RNN-based methods, Lee et al. [11] proposed an RNN-based model that captures the dynamic changes in motion through adaptive learning of network parameters. Bartoli et al. [12] adopted the Long Short-Term Memory (LSTM) network, whose gate mechanism and parameter sharing across time steps alleviate the long-term dependence issue. RNN-based methods can model the features of the trajectory, but their recurrent structure forces them to process data sequentially, which leads to inefficient data processing and vanishing gradients [13]. Among the CNN-based methods, Chen et al. [14] proposed a convolutional embedding model that captures the relative order of positions through one-dimensional convolution and predicts the next position from trajectory data. Zamboni et al. [15] proposed a convolutional model for pedestrian trajectory prediction that uses 2D convolution. CNN-based methods can model the trajectory sequence effectively, but the limited receptive field of the CNN hampers the extraction of long-term dependence.

To solve the above problems, we adopt the Transformer framework, which was first proposed in 2017 [16] and soon became popular in Natural Language Processing (NLP) tasks such as machine translation [17], speech recognition [18], and question answering [19]. The Transformer has strong semantic and task-level feature extraction abilities and can be computed in parallel, which overcomes the shortcomings of the sequential structure of RNNs and their variants. It also excels at capturing long-term features thanks to its multi-head attention module. On the task of trajectory prediction, Giuliari et al. [20] used the vanilla Transformer to model pedestrians’ trajectories and achieved satisfactory prediction performance. Yu et al. [21] proposed a spatio-temporal crowd trajectory prediction framework that models interactions in space and time using only attention mechanisms. These methods exploit the powerful feature extraction ability of the Transformer and achieve good results in trajectory prediction. However, they use the Transformer to extract only one kind of feature and ignore the multi-feature fusion ability of its cross-attention module. We also noticed that researchers tend to focus on modeling interactions with other agents while ignoring the static contextual information of the scene (road infrastructure). Yet, as shown in Fig. 1, the contextual information of the scene is as important as, or even more important than, the dynamic information of other agents [22].

Fig. 1 Pedestrian’s movement is restricted by the environment

To address such limitations, we propose a Multi-Granularity Scenarios Understanding (MGSU) framework, which extracts trajectory features and fuses them with scene information at different granularities. Because the network is based on the attention mechanism, it can capture long-term dependencies. To make the fusion more efficient, we introduce a novel scene-fusion Transformer, whose encoder extracts scene features and whose decoder fuses them with trajectory features. The scene-fusion Transformer adopts a sparse attention mechanism, and its decoder generates the whole output sequence in one pass, as in Informer [23], which effectively avoids the accumulation of errors. A lightweight semantic segmentation network, ESPNet [24], is introduced to extract the semantic features of the scene. To better utilize the scene information, an inverse reinforcement learning (IRL) approach [25] is introduced to generate the optimal path strategy based on the semantic features. Concretely, the main contributions of this paper can be summarized as follows:

  1. We propose a Multi-Granularity Scenarios Understanding (MGSU) framework, which can effectively model the interaction between pedestrians’ trajectories and the scene and generate multiple feasible predictions of future trajectories. MGSU gradually integrates scene information and trajectory information in stages of different granularity. We also introduce ESPNet and inverse reinforcement learning to explore the impact of the scene layout on future trajectories more comprehensively.

  2. To better fuse pedestrians’ trajectories and the scene, we present a novel and efficient scene-fusion Transformer. It adopts a sparse attention mechanism and a generative decoder that outputs the predicted future trajectories in one pass, which effectively avoids error accumulation and improves efficiency.

  3. We evaluate MGSU on the SDD and NuScenes datasets, and the results show that our approach can understand the scene layout with high accuracy.

The rest of this paper is organized as follows. “Related work” summarizes the methods for trajectory prediction and the methodology related to our work. “Method” describes the proposed MGSU in detail. “Experiments” elaborates on experiments for MGSU and discusses results with trajectory visualization and quantitative evaluation. Our conclusions are presented in “Conclusion”.

Related work

Trajectory prediction Trajectory prediction methods can be mainly divided into model-driven [8, 26, 27, 28] and data-driven [9, 29, 30, 31] methods. The former rely on an explicit model of the target’s motion over time. Keller and Gavrila [8] proposed a method based on a linear dynamic model to predict future trajectories over a short horizon. To overcome the limitations of the linear dynamic model, Karasev et al. [26] proposed another model-based approach that predicts future trajectories by modeling the behavior of the target as a Markov process. Malviya and Kala [28] presented a trajectory prediction method based on a particle filter to track humans using a monocular camera with a limited field of view. Data-driven methods, in contrast, do not model target behavior explicitly; they rely mainly on trajectory datasets covering multiple scenarios and attempt to learn the behavior of targets from the data. Alahi et al. [29] proposed Social LSTM, which predicts pedestrians’ trajectories by exploiting interactions between pedestrians on roadways. However, the model is computationally expensive because the social pooling operation needs to consider the interactions between all pedestrians in the scene. Gupta et al. [30] proposed Social-GAN to overcome the limitations of Social LSTM by introducing generative adversarial networks and a global pooling mechanism. RNNs and their variants have become an important part of many recent trajectory prediction models [32, 33] because of their powerful processing capability for time series data. However, RNNs and their variants cannot be computed in parallel because of their sequential structure, and they extract long-term dependence poorly. Therefore, our method mainly uses the attention mechanism to overcome these shortcomings.

In previous work, researchers tend to focus on modeling the interaction with other agents and ignore the static context information of the scene (road infrastructure). However, the context information of a scene is as important as, or even more important than, the dynamic information of other agents. Therefore, we use ESPNet to extract semantic features of the scene and introduce IRL to generate the optimal path strategy from the semantic features and historical trajectories. The scene information and trajectory information are then closely combined to predict the future trajectory accurately.

Scene understanding Semantic segmentation networks greatly help trajectory prediction by identifying feasible regions. In terms of scene understanding, Visin et al. [34] and Bell et al. [35] used RNNs to pass information along each row or column of the scene, but in a single RNN layer each pixel position can only obtain information from the same row or column. Liang et al. [36] proposed a variant of LSTM to exploit the context in the scene, but it suffers from an expensive computational cost. More recently, researchers have turned to CNN-based methods to understand scenes [37, 38, 39]. Ronneberger et al. [37] proposed the U-Net semantic segmentation network, which relies on data augmentation to use the available annotated samples efficiently. DUC [40], DeepLabv3 [41] and PSPNet [42] used dilated convolution to preserve the spatial size of feature maps. Orhan and Bastanlar [43] proposed a CNN-based semantic segmentation model that utilizes equirectangular convolutions to handle distortions in panoramic images. These methods can describe the semantic information of the scene precisely, but their heavy computational overhead results in slow inference, which is not suitable for real-time trajectory prediction. Therefore, we introduce a lightweight semantic segmentation network, ESPNet, which has an extremely high inference speed and segments scene images quickly and efficiently. Further, the IRL method is introduced to generate the optimal path strategy from the semantic information of the scene and pedestrians’ observed trajectories, which helps the network understand the scene in depth.

Transformers The Transformer is a deep learning architecture proposed by Google in 2017. It has achieved great success in the field of NLP [44, 45, 46]. Because of its attention mechanism and its excellent performance in NLP, researchers have taken great interest in applying it to trajectory prediction. Giuliari et al. [20] used the vanilla Transformer without considering any complex interaction information and achieved satisfactory results. Yu et al. [21] proposed the STAR architecture to model the interaction information in space and time. Achaji et al. [47] introduced PReTR, which utilizes a decomposed spatio-temporal attention module to extract features from multi-agent scenarios. Yao et al. [48] proposed an end-to-end Transformer network with a self-correcting scheme to enhance model robustness. These methods make use of the powerful feature extraction ability of the Transformer and perform well in trajectory prediction. However, they use the Transformer to extract only one class of features and ignore the multi-feature fusion ability of the cross-attention module. In this work, we improve the vanilla Transformer and propose an efficient scene-fusion Transformer, which fuses trajectory features with scene information.

Fig. 2 The overall framework of MGSU


Method

We focus on fusing scene information with trajectory information to improve trajectory prediction performance. The framework of the proposed MGSU is illustrated in Fig. 2. MGSU makes full use of the scene by gradually integrating scene information at different granularities to model the interaction between trajectory and scene. First, the coarse-grained fusion module extracts the semantic information of the scene image and fuses it with the trajectory information through the cross-attention module, which outputs a coarse-grained motion representation. Then, this representation is fed into the IRL module, which generates the optimal path strategy through grid-based policy sampling and outputs multiple accurate scene paths that provide the scope of future paths at a fine-grained level. Finally, the scene paths and observed trajectories are fused in the fine-grained feature fusion module to generate multiple future trajectories. The coarse-grained and fine-grained fusion differ mainly in the precision of the information they represent and in the way they fuse it. Coarse-grained fusion combines trajectory features with semantic features through cross-attention; the semantic features provide the categories of the identified objects and the scope of future trajectories at a coarse-grained level. Fine-grained fusion combines trajectory features with path features through a scene-fusion Transformer; the path features provide the scope of future paths at a fine-grained level. Table 1 presents all control parameters used in this paper. Details of the different modules are described as follows.

Table 1 Description of control parameters

Coarse-grained fusion module

The semantic information of the scene is related to the object categories in the scene. The model can therefore learn which areas are passable for pedestrians (such as roads and crosswalks) from the semantic information, which reduces the uncertainty about objects in the scene. The semantic information provides a basis for predicting the future trajectory according to the category of the identified object. However, this basis can only provide the scope of future trajectories at a coarse-grained level. As illustrated in Fig. 2, the coarse-grained fusion module uses the cross-attention mechanism to fuse observed trajectories with the corresponding scene images and outputs a coarse-grained motion mixture representation. This module includes a semantic segmentation network, a fully connected layer, and two cross-attention modules. ESPNet is used to extract semantic features from the input scene S. Meanwhile, a fully connected layer maps the observed trajectory \(T_{{\mathrm{obs}}}\) to a high-dimensional feature space to facilitate feature extraction. Afterward, the two cross-attention modules calculate the attention of the scene to the trajectory \(A_{\mathrm{s}}\) and the attention of the trajectory to the scene \(A_{\mathrm{t}}\). The two attentions are concatenated to generate the coarse-grained motion representation \(C_{\mathrm{h}}\), as follows:

$$\begin{aligned}&S_{\mathrm{e}}=\hbox {ESPNet}\left( S,W_{\mathrm{e}}\right) , \end{aligned}$$
$$\begin{aligned}&T_{\mathrm{fo}}= FC\left( T_{{\mathrm{obs}}},W_{\mathrm{f}}\right) , \end{aligned}$$
$$\begin{aligned}&A_{\mathrm{t}}=\frac{\hbox {Softmax}\left( T_{\mathrm{fo}}\cdot S_{\mathrm{e}}^{\mathrm{T}}\right) }{\sqrt{d_{\mathrm{e}}}}S_{\mathrm{e}}, \end{aligned}$$
$$\begin{aligned}&A_{\mathrm{s}}=\frac{\hbox {Softmax}\left( S_{\mathrm{e}} \cdot T_{\mathrm{fo}}^{\mathrm{T}}\right) }{\sqrt{d_{\mathrm{t}}}}T_{\mathrm{fo}}, \end{aligned}$$
$$\begin{aligned}&C_{\mathrm{h}}=\hbox {Cat}\left( A_{\mathrm{t}},A_{\mathrm{s}}\right) , \end{aligned}$$

where \(W_{\mathrm{e}}\) and \(W_{\mathrm{f}}\) denote the parameters of ESPNet and the fully connected layer, respectively. \(S_{\mathrm{e}}\) denotes the scene semantic features output by ESPNet. \(T_{\mathrm{fo}}\) denotes the trajectory features output by the fully connected layer. \(\hbox {T}\) denotes matrix transposition. \(d_{\mathrm{e}}\) and \(d_{\mathrm{t}}\) denote the dimensions of \(S_{\mathrm{e}}\) and \(T_{\mathrm{fo}}\), respectively. \(\hbox {Softmax}()\) denotes the softmax activation function, and \(\hbox {Cat}()\) denotes concatenation.
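The two cross-attention steps and the concatenation can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `cross_attention`, `coarse_fusion`) are hypothetical, the toy feature lists stand in for the ESPNet and fully connected outputs, both feature sets are assumed to share one feature dimension, the \(1/\sqrt{d}\) scaling is placed inside the softmax as in standard scaled dot-product attention (the equations as printed apply it outside), and concatenation is assumed to run along the token axis.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context):
    """Each query row attends over all context rows (keys = values = context)."""
    d = len(context[0])
    out = []
    for q in queries:
        scores = [sum(qi * ci for qi, ci in zip(q, c)) / math.sqrt(d) for c in context]
        weights = softmax(scores)
        out.append([sum(w * c[j] for w, c in zip(weights, context)) for j in range(d)])
    return out

def coarse_fusion(t_fo, s_e):
    """Return C_h = Cat(A_t, A_s) for trajectory features t_fo and scene features s_e."""
    a_t = cross_attention(t_fo, s_e)  # trajectory attends to scene
    a_s = cross_attention(s_e, t_fo)  # scene attends to trajectory
    return a_t + a_s                  # assumption: concatenate along the token axis

# Toy inputs: 3 trajectory tokens and 4 scene tokens, feature dimension 2.
t_fo = [[0.1, 0.2], [0.3, 0.1], [0.5, 0.4]]
s_e = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
c_h = coarse_fusion(t_fo, s_e)
```

Each row of `a_t` is a convex combination of scene feature rows, so the trajectory tokens end up expressed in terms of the scene regions they attend to most, and vice versa for `a_s`.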

Inverse reinforcement learning module

The inverse reinforcement learning module is introduced to generate a concrete scene representation from the coarse-grained motion representation. The core of IRL is to infer the reward function from expert demonstrations and then to generate the optimal strategy according to that reward function. This module takes the coarse-grained motion representation as input and outputs multiple scene paths.

Firstly, the path reward map \(r_{\mathrm{path}}\) and goal reward map \(r_{\mathrm{goal}}\) are generated according to coarse-grained motion representation as follows:

$$\begin{aligned}&r_{\mathrm{path}}=\text {MLP}_{\mathrm{path}}(C_{\mathrm{h}}), \end{aligned}$$
$$\begin{aligned}&r_{\mathrm{goal}}=\text {MLP}_{\mathrm{goal}}(C_{\mathrm{h}}), \end{aligned}$$

where \(\text {MLP}_{\mathrm{path}}\) and \(\text {MLP}_{\mathrm{goal}}\) denote two multi-layer perceptrons with the same structure, which provide the reward values for reinforcement learning. \(r_{\mathrm{path}}\) provides rewards for action choices, and \(r_{\mathrm{goal}}\) provides a reward for terminating a path.

Afterward, we obtain the maximum entropy strategy \(\pi _{\theta }\left( a\mid s \right) \), which represents the probability of taking action a in state s, through the approximate iteration shown in Algorithm 1. Here, V(s) denotes the state logarithm function, Q(s, a) denotes the state-action logarithm function, N is the total number of iterations, \(T\left( s,a\right) \) denotes the cross product of s and a, and \(S_{p}\) and \(S_{g}\) denote the states of \(r_{\mathrm{path}}\) and \(r_{\mathrm{goal}}\), respectively, which have the same dimensions as the 2D grid. The probability values around each state of the reward map give the probabilities of the different actions available to the target. The target then chooses the action with the highest moving probability.

Algorithm 1 Approximate iteration for the maximum entropy strategy
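Since the listing of Algorithm 1 is not reproduced in this text, the following is only a generic sketch of soft (maximum entropy) value iteration on a small grid, written in plain Python. Several details are assumptions rather than the authors' procedure: the four grid moves plus a dedicated terminating action `END`, the convention that a move from state s earns \(r_{\mathrm{path}}(s)\) while terminating earns \(r_{\mathrm{goal}}(s)\), and the names `maxent_policy` and `logsumexp`.

```python
import math

MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
END = len(MOVES)                            # index of the terminating action

def logsumexp(values):
    """Stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def maxent_policy(r_path, r_goal, n_iters):
    """Soft value iteration; returns pi[i][j][a] = exp(Q(s, a) - V(s)). Needs n_iters >= 1."""
    rows, cols = len(r_path), len(r_path[0])
    # Initialise V with the goal reward: at first only terminating is "known".
    v = [[r_goal[i][j] for j in range(cols)] for i in range(rows)]
    q = None
    for _ in range(n_iters):
        q = [[None] * cols for _ in range(rows)]
        for i in range(rows):
            for j in range(cols):
                q_s = []
                for di, dj in MOVES:
                    ni, nj = i + di, j + dj
                    if 0 <= ni < rows and 0 <= nj < cols:
                        q_s.append(r_path[i][j] + v[ni][nj])
                    else:
                        q_s.append(-math.inf)  # moves leaving the grid are forbidden
                q_s.append(r_goal[i][j])       # terminate the path in this cell
                q[i][j] = q_s
        # Soft maximum over actions replaces the hard max of ordinary value iteration.
        v = [[logsumexp(q[i][j]) for j in range(cols)] for i in range(rows)]
    return [[[math.exp(qa - v[i][j]) for qa in q[i][j]] for j in range(cols)]
            for i in range(rows)]

# Toy 3x3 grid: uniform step cost, terminating is only attractive in the corner (2, 2).
r_path = [[-2.0] * 3 for _ in range(3)]
r_goal = [[-5.0] * 3 for _ in range(3)]
r_goal[2][2] = 0.0
policy = maxent_policy(r_path, r_goal, n_iters=10)
```

In this toy setting the policy at the corner cell prefers terminating, and the neighbouring cell (2, 1) prefers moving right over terminating early, which mirrors how the two reward maps jointly shape the sampled paths.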

Finally, the Gumbel-Softmax trick is introduced to sample the scene paths: Gumbel noise is added to the log-probabilities, and the argmax gives the selected action. The ith scene path \(P_{(i)}\) is obtained as follows:

$$\begin{aligned}&\hbox {noise}=\hbox {Gumbel}(\log (\Sigma _{a}\pi _{\theta }(a,s))), \end{aligned}$$
$$\begin{aligned}&a = \hbox {argmax}(\log (\Sigma _{a}\pi _{\theta }(a,s))+\hbox {noise}), \end{aligned}$$

where \(\hbox {noise}\) denotes Gumbel noise, and a denotes the final action choice.
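The sampling step can be illustrated with the Gumbel-max form of the trick, sketched below in plain Python. This is an assumption-laden illustration: the helper names `gumbel_noise` and `sample_action` are hypothetical, the noise is added to the per-action log-probability \(\log \pi _{\theta }(a\mid s)\), and the toy probability vector is made up for the example.

```python
import math
import random

def gumbel_noise(rng):
    """Draw one standard Gumbel sample, g = -log(-log(u)) with u ~ Uniform(0, 1)."""
    u = rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep both logarithms finite
    return -math.log(-math.log(u))

def sample_action(action_probs, rng):
    """Gumbel-max sampling: argmax over log-probability plus Gumbel noise."""
    scores = [
        math.log(p) + gumbel_noise(rng) if p > 0.0 else -math.inf
        for p in action_probs
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy policy over four actions for one state; the last action is impossible.
rng = random.Random(0)
probs = [0.7, 0.2, 0.1, 0.0]
counts = [0, 0, 0, 0]
for _ in range(2000):
    counts[sample_action(probs, rng)] += 1
```

Taking the argmax of noisy log-probabilities draws actions in proportion to their probabilities, so repeating the procedure from the initial grid state yields several distinct scene paths rather than a single greedy one.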

Fine-grained fusion module

The path information is obtained from IRL through grid-based policy sampling. The resulting optimal path policies can explore the various passable future paths in the scene and provide the scope of future paths at a fine-grained level. This module uses a scene-fusion Transformer to enhance the network’s understanding of the scene based on the scene paths and to output multiple feasible future trajectories. The scene-fusion Transformer integrates, at fine granularity, the multiple scene paths generated by the inverse reinforcement learning module with the observed trajectories. Below, we describe the fine-grained fusion module in terms of the scene-fusion Transformer, feature extraction, feature fusion, and output.

Fig. 3 Overall structure of the vanilla Transformer

Scene-fusion Transformer

Figure 3 shows the architecture of the vanilla Transformer. We improve the vanilla Transformer model based on scene fusion to make it more suitable for trajectory prediction tasks. Considering the importance of real-time performance for trajectory prediction, we use a sparse self-attention mechanism to extract features, which reduces the computational complexity from \(O\left( L^{2}\right) \) to \(O\left( L\hbox {log}L\right) \), where L denotes the length of the input sequence. This improves computational efficiency while keeping the performance on par with full attention. In addition, we adopt a parallel decoding strategy that predicts future trajectories directly instead of auto-regressively, which improves inference speed and reduces error accumulation. Since we adopt a non-autoregressive training strategy, we remove the mask in the decoder. The overall structure of the scene-fusion Transformer is shown in Fig. 4.

Feature extraction

The scene path information \(P_{(i)}\) output by the IRL module is used for feature extraction in the scene-fusion Transformer encoder. Features of the observed trajectory \(T_{{\mathrm{obs}}}\) are extracted by the fully connected layer.

After the scene path information is input into the encoder, it is multiplied by the linear transformation weights \(W_{Q}^p\), \(W_{K}^p\) and \(W_{V}^p\) to output the three matrices \(Q_{p}\), \(K_{p}\) and \(V_{p}\), respectively. We adopt the query sparsity measurement from [23], which defines the i-th query’s attention on all keys as a probability \(p\left( k\mid q_{i}\right) \). If this probability is close to the uniform distribution \(q\left( k\mid q_{i}\right) \), the self-attention output for that query is nearly a plain average of the values and contributes little beyond the residual input. Therefore, the dissimilarity between the distributions p and q can be used to distinguish which queries are “important.” The Kullback–Leibler divergence measures this dissimilarity, and the i-th query’s sparsity measurement is defined as \(M\left( q_{i}\mid K\right) \). The sparse matrix of \(Q_{p}\) is \(\bar{Q}\), which has the same size as \(Q_{p}\) and contains only the Top-u queries under the sparsity measurement \(M\left( Q_{p},K_{p}\right) \). The attention \(A_{p}\) of the scene path is calculated by the multi-head sparse attention module as follows:

$$\begin{aligned}&Q_{p}=W_{Q}^{p}P_{(i)},K_{p}=W_{K}^{p}P_{(i)},V_{p}=W_{V}^{p}P_{(i)}, \end{aligned}$$
$$\begin{aligned}&A_{p}=\frac{\hbox {Softmax}(\bar{Q}\cdot K_{p}^{\mathrm{T}})}{\sqrt{d_{k}}}V_{p}, \end{aligned}$$

where \(d_{k}\) denotes the dimension of \(K_{p}\) and T denotes matrix transposition.
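The query selection can be made concrete with the sparsity measurement used by Informer [23], \(M(q_{i},K)=\ln \Sigma _{j}e^{q_{i}k_{j}^{\mathrm{T}}/\sqrt{d}}-\frac{1}{L_{K}}\Sigma _{j}q_{i}k_{j}^{\mathrm{T}}/\sqrt{d}\). The single-head sketch below is a simplification under stated assumptions: the names `sparsity_scores` and `sparse_attention` are hypothetical, the \(1/\sqrt{d}\) scaling sits inside the softmax as in standard attention, and the measurement is computed exactly on all keys, which still costs \(O(L^{2})\); Informer reaches \(O(L\hbox {log}L)\) by estimating it on a sampled subset of keys.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def sparsity_scores(q, k):
    """M(q_i, K): log-sum-exp of the scaled scores minus their arithmetic mean."""
    d = len(k[0])
    result = []
    for qi in q:
        s = [dot(qi, kj) / math.sqrt(d) for kj in k]
        m = max(s)
        lse = m + math.log(sum(math.exp(x - m) for x in s))
        result.append(lse - sum(s) / len(s))
    return result

def sparse_attention(q, k, v, u):
    """Keep the Top-u queries in Q-bar, zero the rest, then apply attention."""
    m_scores = sparsity_scores(q, k)
    ranked = sorted(range(len(q)), key=lambda i: m_scores[i], reverse=True)
    top = set(ranked[:u])
    q_bar = [qi if i in top else [0.0] * len(qi) for i, qi in enumerate(q)]
    d = len(k[0])
    out = []
    for qi in q_bar:
        w = softmax([dot(qi, kj) / math.sqrt(d) for kj in k])
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out, sorted(top)

# Toy example: one peaked query, one all-zero query, one nearly uniform query.
q = [[4.0, 0.0], [0.0, 0.0], [0.1, 0.1]]
k = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
a_p, selected = sparse_attention(q, k, v, u=1)
```

A zeroed query produces uniform attention weights, so its output row is simply the mean of the values. This is why a \(\bar{Q}\) of the same size as \(Q_{p}\) with only the Top-u rows filled is consistent with the description above.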

The output of the encoder \(E_{p}\) is obtained through a fully connected layer with residual connections, as follows:

$$\begin{aligned} E_{p}=\hbox {ResBlock}(\text {MLP}(A_{p})+P_{(i)}), \end{aligned}$$

where \(\hbox {ResBlock}()\) denotes the residual connection, \(\text {MLP}()\) denotes the fully connected layer.

The trajectory information \(T_{\mathrm{in}}=\{T_{{\mathrm{obs}}},T_{0}\}\), where \(T_{0}\) is a placeholder for the future trajectory filled with zeros, is processed by a linear layer, as follows:

$$\begin{aligned} T_{\mathrm{li}}=\hbox {Linear}(T_{\mathrm{in}},W_{\mathrm{l}}), \end{aligned}$$

where \(W_{\mathrm{l}}\) denotes the parameters of the linear layer.
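The construction of the decoder input can be sketched as follows. The sizes are illustrative assumptions only (8 observed steps, 12 future steps, 2-D positions, a 4-dimensional embedding), as are the helper names `build_decoder_input` and `linear` and the hand-picked weights.

```python
def build_decoder_input(t_obs, pred_len):
    """Return T_in = {T_obs, T_0}: observed points followed by zero placeholders."""
    dim = len(t_obs[0])
    t_zero = [[0.0] * dim for _ in range(pred_len)]
    return [list(point) for point in t_obs] + t_zero

def linear(points, weight, bias):
    """Apply y = W x + b to every point; weight has shape (out_dim, in_dim)."""
    return [
        [sum(w * x for w, x in zip(row, point)) + b for row, b in zip(weight, bias)]
        for point in points
    ]

# Hypothetical sizes: 8 observed steps, 12 future steps, 2-D positions, 4-D embedding.
t_obs = [(0.1 * t, 0.05 * t) for t in range(8)]
t_in = build_decoder_input(t_obs, pred_len=12)
w_l = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
b_l = [0.0, 0.0, 0.0, 0.0]
t_li = linear(t_in, w_l, b_l)
```

Because the future slots are present from the start, the decoder can fill all of them in one forward pass instead of feeding each prediction back as the next input.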

Fig. 4 Overall structure of the scene-fusion Transformer

Feature fusion and output

In the decoder stage, the trajectory features \(T_{\mathrm{li}}\) are fed into the decoder. The attention parameters of the observed trajectories are calculated as follows:

$$\begin{aligned}&Q_{\mathrm{t}}=W_{Q}^{t}T_{\mathrm{li}},K_{\mathrm{t}}=W_{K}^{t}T_{\mathrm{li}}, V_{\mathrm{t}}=W_{V}^{t}T_{\mathrm{li}}, \end{aligned}$$
$$\begin{aligned}&A_{\mathrm{t}}=\frac{\hbox {Softmax}(\bar{Q}\cdot K_{\mathrm{t}}^{\mathrm{T}})}{\sqrt{d_{k}}}V_{\mathrm{t}}, \end{aligned}$$

where \(W_{Q}^{t}\), \(W_{K}^{t}\) and \(W_{V}^{t}\) denote different weight matrices, \(\bar{Q}\) denotes the sparse matrix of \(Q_{\mathrm{t}}\), T denotes matrix transposition, and \(d_{k}\) denotes the dimension of \(K_{\mathrm{t}}\).

Through the above calculations, the feature output of the scene path \(E_{p}\) and the feature attention of the observed trajectory \(A_{\mathrm{t}}\) are obtained. Afterward, the cross-attention mechanism is used to integrate them and the predicted trajectory \(T_{p}\) is obtained through a fully connected layer, as follows:

$$\begin{aligned}&A_{\mathrm{cross}}=\frac{\hbox {Softmax}(A_{\mathrm{t}} \cdot E_{p}^{\mathrm{T}})}{\sqrt{d_{\mathrm{e}}}}E_{p}, \end{aligned}$$
$$\begin{aligned}&T_{p}=\text {MLP}(A_{\mathrm{cross}},W_{m}), \end{aligned}$$

where \(A_{\mathrm{cross}}\) denotes the cross-attention of \(A_{\mathrm{t}}\) and \(E_{p}\), T denotes the transposed matrix, \(d_{\mathrm{e}}\) denotes the dimension of the \(E_{p}\), \(\text {MLP}()\) is the Multi-layer Perceptron, \(W_{m}\) denotes the weight matrix of \(\text {MLP}()\).

Table 2 Ablation study of the MGSU on the SDD dataset



Experiments

Stanford drone dataset The Stanford drone dataset (SDD) consists of the tracks of pedestrians, bicycles, skateboarders and vehicles captured by drones in 60 different scenes at Stanford University. It provides a bird’s eye view of each scene and the locations of the tracked agents in the pixel coordinates of the scene. SDD contains multiple scene elements, such as roads, sidewalks, buildings, parking lots, terrain, and leaves. Roads and sidewalks come in different configurations, including roundabouts and intersections. We use the evaluation setting defined in the TrajNet benchmark, which splits the dataset by scene. The training, validation, and test sets therefore contain different scenes out of the 60 in total. This allows us to evaluate our model in unseen scenes for which no trajectory data was available during training.

NuScenes NuScenes is a large-scale autonomous driving dataset released by the autonomous driving company NuTonomy. It covers a total of 1000 different scenes with different road layouts, each with a recording length of 20 s. All data was captured using on-board cameras and LiDAR sensors. The official split is used to generate the training and test sets for evaluation.

Evaluation indicators

Two error metrics are used to evaluate the trajectory prediction performance as follows:

  1. Average Displacement Error (ADE): the mean Euclidean (L2) distance between the predicted and ground-truth trajectories over all time steps t, as follows:

    $$\begin{aligned} \hbox {ADE}=\frac{1}{N}\Sigma _{t=1}^{N}\left\| Y_{\mathrm{t}}^{\mathrm{GT}}-Y_{\mathrm{t}}^{\mathrm{pred}} \right\| _{2}, \end{aligned}$$

    where \(Y_{\mathrm{t}}^{\mathrm{GT}}\) and \(Y_{\mathrm{t}}^{\mathrm{pred}}\) represent the ground-truth and predicted trajectories, respectively, and N represents the total number of time steps.

  2. Final Displacement Error (FDE): the Euclidean (L2) distance between the ground-truth and predicted trajectories at the last time step n (that is, n = N), as follows:

    $$\begin{aligned} \hbox {FDE}=\left\| Y_{n}^{\mathrm{GT}}-Y_{n}^{\mathrm{pred}} \right\| _{2}. \end{aligned}$$
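Both metrics, together with a best-of-K variant, can be computed as in the sketch below. The function names `displacement_errors` and `min_over_k` are illustrative, and taking the ADE and FDE minima independently over the K candidates is an assumed convention for the best-of-K variant rather than something stated in the text.

```python
import math

def displacement_errors(gt, pred):
    """Return (ADE, FDE) for one ground-truth and one predicted trajectory."""
    dists = [math.dist(g, p) for g, p in zip(gt, pred)]
    return sum(dists) / len(dists), dists[-1]

def min_over_k(gt, candidates):
    """Best-of-K errors: smallest ADE and smallest FDE among K predictions."""
    errors = [displacement_errors(gt, c) for c in candidates]
    return min(e[0] for e in errors), min(e[1] for e in errors)

# Toy example with three future time steps and two candidate predictions.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred_a = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
pred_b = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5)]
ade_a, fde_a = displacement_errors(gt, pred_a)
min_ade, min_fde = min_over_k(gt, [pred_a, pred_b])
```

For `pred_a` the per-step distances are 0, 1 and 2, giving an ADE of 1 and an FDE of 2; the second candidate is closer at every step, so it determines both best-of-K values.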

Experimental details

Samples of SDD and NuScenes are generated following [49]. For SDD, 3.2-second observed trajectories and 4.8-second ground-truth trajectories are used. For NuScenes, 2-second observed trajectories and 6-second ground-truth trajectories are used. The input scene image is centered at the position of the last observation, and the size of the image \((s_{i})\) is set to \(200\times 200\) pixels. In the coarse-grained fusion module, we use only the encoder of ESPNet, which is pre-trained on ADE20k. The size of \(\hbox {FC}\) \((s_{{\mathrm{fc}}})\) is set to 128. In the IRL module, the dimension of the 2D grid \((d_{{\mathrm{grid}}})\) is [25, 25], the initial state is set to [12, 12], and the size of the scene feature \((s_{{\mathrm{sf}}})\) is set to 64. In the scene-fusion Transformer of the fine-grained fusion module, the embedding size of the model \((e_{{\mathrm{model}}})\) is set to 512, the number of layers \((N_{\mathrm{l}})\) is set to 6, the number of heads of the multi-head attention \((\hbox {heads})\) is set to 8, and the dropout rate of the network is set to 0.01. ADEmin20 and FDEmin20 are used as evaluation metrics on SDD, and ADEmin10 and FDEmin10 are used as evaluation metrics on NuScenes. All experiments are implemented in the PyTorch framework on Ubuntu and run on an NVIDIA 2080 graphics card. The model is trained for 300 epochs with the Adam optimizer, the batch size \((\text {bs})\) is set to 32, and the learning rate \((\text {lr})\) is set to 0.0001.
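For reference, the settings listed above can be gathered into a single configuration object. The sketch below is illustrative only: the class name `MGSUConfig` and its field names are hypothetical, and only the values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MGSUConfig:
    """Hyperparameters reported in the experimental details (field names are illustrative)."""
    image_size: int = 200          # s_i, side length of the scene crop in pixels
    fc_size: int = 128             # s_fc, output size of the fully connected layer
    grid_dim: tuple = (25, 25)     # d_grid, 2D grid used by the IRL module
    init_state: tuple = (12, 12)   # initial grid state
    scene_feature_size: int = 64   # s_sf
    model_dim: int = 512           # e_model, Transformer embedding size
    num_layers: int = 6            # N_l
    num_heads: int = 8             # heads of the multi-head attention
    dropout: float = 0.01
    epochs: int = 300
    batch_size: int = 32           # bs
    learning_rate: float = 1e-4    # lr, used with the Adam optimizer

cfg = MGSUConfig()
head_dim = cfg.model_dim // cfg.num_heads
```

Two consistency checks follow directly from these values: the embedding size divides evenly into 8 heads of width 64, and the initial state [12, 12] is the central cell of the 25 by 25 grid, which matches the scene image being centered on the last observed position.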

Ablation experiments

The ablation study is performed on SDD to analyze the impact of each module. The scene-fusion Transformer that deals only with trajectory data is denoted as MGSU-A. Based on MGSU-A, MGSU-B introduces the ESPNet and cross-attention modules. Finally, MGSU-C adds the IRL module to MGSU-B and performs fine-grained fusion. Table 2 reports the results of the ablation study. Detailed discussions are presented as follows:

MGSU-A: To analyze the influence of the scene information on trajectory prediction, the scene-fusion Transformer is used to process the observed trajectory without considering any scene information. The results show that the prediction performance is not satisfactory when the model considers only the observed trajectory and ignores the scene information (ADE/FDE of 20.19/32.93). Such high ADE/FDE values indicate that observed trajectories alone are not sufficient for predicting agents’ future trajectories, because other factors that affect future movement trends, such as agents’ interactions and the influence of the scene, are ignored.

MGSU-B: This variant considers the context information of the scene. The ESPNet and cross-attention modules are added to perform coarse-grained fusion of the scene context with the observed trajectories. ESPNet is used to explore the semantic meaning of the objects contained in the scene, such as roads, trees, and buildings, so that the predicted trajectories fall in the feasible regions of the scene. After concatenating the outputs of the scene-fusion Transformer and ESPNet, the ADE/FDE values decrease by 4.3 and 3.75, respectively. Afterward, the cross-attention module is used to fuse the trajectory and scene information. Specifically, the attention mechanism lets the trajectory pay more attention to important areas of the scene, such as sidewalks and roads, so that the trajectory information integrates better with the scene information. As a result, the ADE/FDE values decrease further by 2.48 and 4.86, respectively.

MGSU-C: This variant introduces the IRL module to generate scene paths and uses the fine-grained fusion module to fuse the trajectory and path information. Concretely, the output of the coarse-grained fusion module is fed into the IRL module to concretize the fused information. After the IRL module outputs several feasible scene paths, the fine-grained fusion module fuses them with the observed trajectories to generate the final prediction results. Compared with MGSU-B, the ADE/FDE values of MGSU-C decrease by 4.17 and 8.45, respectively. This improvement verifies the effect of the fine-grained fusion module.

Evaluation of the scene-fusion Transformer

In this work, a novel and efficient scene-fusion Transformer is proposed to improve the fusion efficiency. Specifically, the encoder processes the scene features, and the decoder extracts the trajectory features and fuses them with the scene features to generate the prediction results. Since the traditional attention module has a high computational complexity, the sparse attention mechanism is introduced to reduce the computational complexity from \(O(L^2)\) to \(O(L\cdot \hbox {log}L)\), where L represents the length of the trajectory. Meanwhile, the decoder uses generative decoding, so the future trajectory is obtained in one step instead of being generated step by step, which reduces the time complexity of prediction from O(N) to O(1). Figures 5 and 6 compare the scene-fusion Transformer with the vanilla Transformer in terms of training speed and prediction accuracy using the same model parameters.

Fig. 5 Comparison of training speed between the scene-fusion Transformer and the vanilla Transformer

Fig. 6 Comparison of prediction accuracy between the scene-fusion Transformer and the vanilla Transformer

Training speed: Figure 5 compares the training speed of the two models when the number of training epochs is set to 10, 30, and 80, respectively. The scene-fusion Transformer trains considerably faster than the vanilla Transformer, with training speed increased by 73.3\(\%\), 75.2\(\%\), and 75.4\(\%\), respectively. This improvement reflects the efficiency advantage of the proposed architecture.

Prediction accuracy: Figure 6 compares the ADE of the two models over 90 training epochs. The two models have roughly the same ADE at the beginning. Then, around epochs 30–70, the ADE of the vanilla Transformer decreases only slowly and levels off at around 14, whereas the ADE of the scene-fusion Transformer keeps decreasing, so the gap between the two models gradually widens. Finally, during epochs 70–90, the scene-fusion Transformer stabilizes at a final ADE of 10.03, an improvement of 27.5\(\%\) in prediction accuracy.
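The reported improvement can be related to the two final ADE values. Assuming the 27.5\(\%\) figure is a relative reduction in ADE (the text does not define it explicitly), the implied final ADE of the vanilla Transformer follows from the reported value of 10.03:

```latex
\text{improvement} = \frac{\mathrm{ADE}_{\text{vanilla}} - \mathrm{ADE}_{\text{ours}}}{\mathrm{ADE}_{\text{vanilla}}}
\quad\Longrightarrow\quad
\mathrm{ADE}_{\text{vanilla}} = \frac{10.03}{1 - 0.275} \approx 13.83 .
```

This agrees with the vanilla Transformer leveling off at an ADE of around 14.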

Evaluation of the fusion methods in the coarse-grained feature fusion

This section discusses different fusion methods used in the coarse-grained fusion stage. The observed trajectory \(T_{{\mathrm{obs}}}\) and the corresponding scene image S are fed into the network through a fully connected layer and ESPNet, respectively. We then compare the performance of three fusion methods on the SDD dataset: concatenation, addition, and cross-attention. The fusion results are fed directly into the fine-grained fusion module, bypassing the IRL, to better isolate the effect of the fusion method. As presented in Table 3, concatenation performs slightly better than simple addition, while cross-attention fusion achieves the best prediction accuracy.
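The three fusion methods compared here can be sketched on toy features. This is a minimal illustration, not the paper's network. It assumes trajectory features of shape (T, d) and scene features of shape (P, d), it mean-pools the scene into one global vector for addition and concatenation, and it omits the learned query/key/value projections that a real cross-attention layer would have. All function names are hypothetical.

```python
import math


def mean_pool(scene):
    """Average the P scene feature vectors into one global d-dim vector."""
    d = len(scene[0])
    return [sum(p[j] for p in scene) / len(scene) for j in range(d)]


def fuse_add(traj, scene):
    """Addition: add the global scene vector to every trajectory step."""
    g = mean_pool(scene)
    return [[t[j] + g[j] for j in range(len(g))] for t in traj]


def fuse_concat(traj, scene):
    """Concatenation: append the global scene vector to every step."""
    g = mean_pool(scene)
    return [list(t) + g for t in traj]


def fuse_cross_attention(traj, scene):
    """Cross-attention: each trajectory step (query) takes a softmax-weighted
    average of the scene vectors (keys and values), using scaled dot products.
    Learned projections are omitted in this sketch."""
    d = len(scene[0])
    fused = []
    for q in traj:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in scene]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        fused.append([sum(w * v[j] for w, v in zip(weights, scene))
                      for j in range(d)])
    return fused


if __name__ == "__main__":
    traj = [[1.0, 0.0], [0.0, 1.0]]                # T = 2 steps, d = 2
    scene = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]   # P = 3 regions, d = 2
    print(fuse_add(traj, scene))
    print(fuse_concat(traj, scene))
    print(fuse_cross_attention(traj, scene))
```

Addition and concatenation give every time step the same pooled scene vector, whereas cross-attention lets each step weight scene regions differently. This is one plausible reading of why cross-attention performs best in Table 3.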

Table 3 Fusion evaluation experiment on different fusion methods
Table 4 Comparison with the baseline models on the SDD dataset

Quantitative analysis

In this section, the quantitative analysis is performed by comparing MGSU with state-of-the-art methods on the SDD and NuScenes datasets. The methods used for comparison are briefly introduced as follows:

SGAN: SGAN [30] uses an encoder–decoder network to learn pedestrian movement patterns in an adversarial way. The social pooling operation is used to capture pedestrians’ social interactions.

SoPhie: SoPhie [50] combines a social attention mechanism with physical attention to help the model learn where to look in a large scene and extract the most salient parts of the image related to the path. It also uses a GAN to generate more realistic samples and capture the uncertainty of future paths by modeling their distribution.

P2TIRL: P2TIRL [49] proposes an attention-based trajectory generator that generates future trajectories based on a sequence of states sampled from the MaxEnt policy. It reformulates the MaxEnt IRL to learn a policy that jointly infers plausible goals and paths to those goals on a coarse 2-D grid defined over the scene.

SimAug: SimAug [51] learns robust representations by augmenting simulated training data, allowing the representations to generalize better to unseen real-world test data. Its key idea is to combine features from the most difficult camera views with adversarial features from the original views.

PECNet: PECNet [52] presents a pedestrian endpoint conditioned trajectory prediction network that can predict rich and diverse multi-modal socially compliant trajectories across a variety of scenes.

IRLSOT: IRLSOT [25] proposes inverse reinforcement learning for scene-oriented trajectory prediction to better forecast pedestrians’ future trajectories in rare or complex environments.

Physics oracle: Physics oracle [53] is an extension of simple, explainable models based on classical physics. The current velocity, acceleration and yaw rate of the trajectory are used for prediction.

CoverNet: CoverNet [53] is a method for multimodal probabilistic trajectory prediction for urban driving. It frames the trajectory prediction problem as classification over a set of distinct trajectories.

SGDNet-ED: SGDNet-ED [54] proposes a recurrent trajectory prediction network, SGNet, that estimates and uses goals at multiple time scales.

MTP: MTP [55] proposes a multi-modal modeling method for vehicle motion prediction. It uses raster grid images to encode the context of each vehicle participant, and uses a CNN to generate several possible trajectories and the corresponding probabilities.

Trajectron++: Trajectron++ [2] presents a generative multi-agent trajectory forecasting approach that addresses the desiderata for an open, generally applicable and extensible framework.

Multipath: Multipath [56] utilizes a fixed set of future-state sequence anchors that correspond to modes of the trajectory distribution. It predicts a discrete distribution over the anchors and, for each anchor, regresses offsets from the anchor waypoints together with uncertainties, yielding a Gaussian mixture at each time step.

Table 5 Comparison with the baseline models on the NuScenes dataset

Table 4 reports the comparison results between our method and SGAN, SoPhie, P2TIRL, SimAug, PECNet, and IRLSOT on the SDD. The comparisons verify the effectiveness of our method in pedestrian trajectory prediction by fusing scene information. Among these methods, SGAN uses only trajectory information, so its prediction performance is the lowest. SoPhie and CF-VAE take the scene information into account and use a CNN to extract scene features, so their performance is superior to that of SGAN. In contrast to the above methods, P2TIRL employs the VGG network and reinforcement learning to further process scene information, which improves the prediction performance. SimAug uses multi-view simulation data to enhance the representation of the prediction model; by adding simulated training data to learn a robust representation, it generalizes better to unseen test data. PECNet assists long-range multi-modal trajectory prediction by inferring distant trajectory endpoints. A novel social pooling layer is proposed to enable PECNet to consider social interactions, which improves its trajectory prediction performance. IRLSOT exploits an IRL framework to explore complex scenes and utilizes a novel Scene Based Attention block to fuse scene and trajectory information. As a result, it achieves the second-best performance. Our method combines the semantic segmentation network, cross-attention fusion, IRL, and a scene-fusion Transformer to fuse the scene and trajectory at different granularities. This fuller understanding of the scene context yields the best performance in terms of mean displacement error.

Table 5 compares our method with Physics-oracle, CoverNet, MTP, SGDNet-ED, Multipath and Trajectron++ on the NuScenes. Since Physics-oracle is a simple model based on classical physics, it struggles to accurately capture the dynamic changes of the agent trajectory. CoverNet, MTP and SGDNet-ED take the scene information into account, resulting in a certain improvement in prediction accuracy. Trajectron++ can efficiently incorporate high-dimensional data by encoding semantic maps and proposes a general approach to incorporate dynamic constraints into learning-based multi-agent trajectory prediction methods. Hence, the prediction accuracy is further improved. Multipath can predict the parameter distribution of the agent trajectory in the real world by considering the scene information, which further improves the prediction performance. Our method fuses information at different granularities to fully understand the scene and also achieves the best performance on this dataset.
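For reference, the displacement errors used throughout these comparisons can be computed as in the following minimal sketch. It uses the standard definitions: ADE is the mean Euclidean distance over all predicted time steps, and FDE is the distance at the final step. The best-of-K variant is an assumption about how multi-modal predictions might be scored and is not taken from the text. Function names are hypothetical.

```python
import math


def ade_fde(pred, gt):
    """Return (ADE, FDE) for one predicted trajectory.

    pred, gt: equal-length sequences of (x, y) positions.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    dists = [math.hypot(px - gx, py - gy)
             for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists), dists[-1]


def best_of_k(preds, gt):
    """Assumed multi-modal scoring: lowest ADE and lowest FDE over K samples."""
    scores = [ade_fde(p, gt) for p in preds]
    return min(a for a, _ in scores), min(f for _, f in scores)


if __name__ == "__main__":
    gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    pred = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
    print(ade_fde(pred, gt))            # per-step distances are 0, 1, 2
    print(best_of_k([pred, gt], gt))    # the second sample matches exactly
```

Both metrics are in the units of the input coordinates, so values are only comparable across methods evaluated in the same coordinate frame.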

Qualitative analysis

Qualitative analysis is conducted on SDD and NuScenes to evaluate the trajectory prediction performance after fusing the scene information. Figure 7 shows the predicted trajectories on the SDD. The first row shows the observed trajectories (denoted by white lines) and corresponding scenes. The second row shows the scene paths generated by IRL. The third row illustrates the predicted trajectories (the red and black lines denote predicted and ground-truth trajectories, respectively). Pedestrians’ movements are often affected by the scene environment. After integrating scene information, MGSU can accurately infer the pedestrians’ moving directions and potential destinations, thus generating feasible future trajectories consistent with the path constraints. As shown in Fig. 7a, MGSU can precisely predict the future trajectory of the target in the case of a straight path. In the case of a curved road, as shown in Fig. 7b, MGSU can generate scene paths with a curvature similar to that of the road, thus constraining the predicted future trajectory. In Fig. 7c, MGSU successfully recognizes the stationary pedestrian. Figure 7d, e shows the case of forked roads. MGSU can generate multiple feasible paths according to the observed trajectories and corresponding scene images. Specifically, as shown in the middle sub-graph of Fig. 7e, our model generates two feasible paths (leading to the upper left and upper right) at the intersection, resulting in a multi-modal distribution of the predicted future trajectory.

Fig. 7 Qualitative analysis of MGSU on the SDD dataset

Figure 8 shows the trajectory prediction of vehicles on the NuScenes. The first row shows the input, including the vehicle trajectories and scene layouts. The second row shows the scene paths generated by IRL. The third row illustrates the predicted trajectories (the white, red, and black lines denote observed, predicted, and ground-truth trajectories, respectively). After fusing the observed trajectories and corresponding scene layouts, MGSU can generate reasonable paths and precisely infer the moving directions along the lanes. Therefore, the model can forecast future trajectories that are consistent with the scene layout constraints. Figure 8a, b shows the trajectory prediction performance on straight roads, which are commonly observed on the highway. MGSU achieves accurate predictions in these cases. In Fig. 8c, MGSU precisely identifies the stationary vehicle. In the case of curves on the highway, as shown in Fig. 8d, e, MGSU predicts scene paths that share similar motion trends with the ground-truth trajectories. Our model then accurately predicts vehicles’ future trajectories in such challenging scenarios, benefiting from its understanding of the road environment. Moreover, Fig. 8e shows that multiple feasible moving trends can be inferred from the scene layout, which makes the generated results more realistic.

Fig. 8 Qualitative analysis of MGSU on the NuScenes dataset


A multi-granularity fusion architecture named MGSU is proposed to perform trajectory prediction based on the understanding of the scene. It consists of three modules: the coarse-granularity feature fusion module, the inverse reinforcement learning module, and the fine-granularity feature fusion module. With these modules, the scene and trajectory information are gradually fused from coarse granularity to fine granularity. A novel scene-fusion Transformer is presented to better integrate scene information and improve efficiency. Its encoder explores the scene context, and the trajectory information is encoded by a linear layer. A sparse cross-attention mechanism is used to fuse the scene and trajectory information with high efficiency. The decoder predicts future trajectories in a generative manner to avoid error accumulation. Quantitative and qualitative evaluations of MGSU are conducted on the public SDD and NuScenes datasets. Results show that the trajectory prediction performance of MGSU improves after fusing the scene information, and that it adapts better to various complex environments.

We believe that MGSU can be used in real-world applications such as service robots or self-driving cars. For example, it can be used to forecast pedestrians’ crossing intentions [57] and provide decision information for intelligent vehicles. Our future work will focus on integrating pedestrians’ trajectories with their actions to perform long-term action prediction.