Abstract
Understanding agents’ motion behaviors in complex scenes is crucial for intelligent autonomous moving systems (such as delivery robots and self-driving cars). The task is challenging due to the inherent uncertainty of future trajectories and the large variation in scene layouts. However, most recent approaches ignore or underutilize the scenario information. In this work, a Multi-Granularity Scenarios Understanding framework, MGSU, is proposed to explore the scene layout at different granularities. MGSU can be divided into three modules: (1) a coarse-grained fusion module uses cross-attention to fuse the observed trajectory with the semantic information of the scene; (2) an inverse reinforcement learning module generates the optimal path strategy through grid-based policy sampling and outputs multiple scene paths; (3) a fine-grained fusion module integrates the observed trajectory with the scene paths to generate multiple future trajectories. To fully exploit the scene information and improve efficiency, we present a novel scene-fusion Transformer, whose encoder extracts scene features and whose decoder fuses scene and trajectory features to generate future trajectories. Compared with current state-of-the-art methods, our method decreases the ADE errors by 4.3% and 3.3% on SDD and NuScenes, respectively, by gradually integrating scene information of different granularities. The visualized trajectories demonstrate that our method can accurately predict future trajectories after fusing scene information.
Introduction
With the rapid development of artificial intelligence, intelligent autonomous moving systems have become a hot research topic, and the accompanying driving safety issues have also attracted public attention. However, the uncertainty of future trajectories and the large variation in scene layouts pose great challenges to forecasting pedestrians’ trajectories. Therefore, it is of great significance to study pedestrians’ motion behaviors in order to reduce the occurrence of collision accidents and protect their safety.
The core of trajectory prediction is to learn pedestrians’ motion behaviors [1, 2] from given observed trajectories and predict all possible future trajectories. To accurately predict future trajectories, researchers have mainly adopted model-driven or data-driven methods. Commonly used model-driven methods include the Markov model [3, 4] and the Kalman filter [5, 6]. For example, Schneider and Gavrila [7] combined a Kalman filter with a constant-velocity model to predict pedestrians’ future trajectories. Mathew et al. [8] proposed a hybrid prediction method based on the hidden Markov model, which clusters trajectories based on observed trajectories. However, due to the complexity and nonlinearity of future trajectories, it is difficult for model-based methods to accurately capture the dynamic changes and long-term dependencies of trajectories, so their predictions are not accurate enough.
Data-driven methods [9, 10] are effective at handling the dynamic changes and long-term dependencies of trajectories. Previous methods are mainly based on Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) to model the dynamic features of the trajectory. Among the RNN-based methods, Lee et al. [11] proposed an RNN-based model that captures the dynamic changes in motion through adaptive learning of network parameters. Bartoli et al. [12] adopted the Long Short-Term Memory (LSTM) network to alleviate the long-term dependence issue of trajectories by using the gate mechanism and time-step parameter sharing. RNN-based methods can model the features of the trajectory; however, their recurrent structure forces the data to be processed sequentially, resulting in inefficient data processing and vanishing-gradient problems [13]. Among the CNN-based methods, Chen et al. [14] proposed a convolutional embedding model that captures the relative order of positions through one-dimensional convolution and predicts the next position from trajectory data. Zamboni et al. [15] proposed a new convolutional model for pedestrian trajectory prediction that uses 2D convolution. CNN-based methods can effectively model trajectory sequences, but due to the limited receptive field of CNNs, they do not extract long-term dependencies well.
To solve the above problems, we adopt the Transformer framework, which was first proposed in 2017 [16] and soon became popular in Natural Language Processing (NLP) tasks such as machine translation [17], speech recognition [18], and question answering [19]. Transformer has strong abilities for semantic feature extraction and comprehensive task feature extraction, and it supports parallel computation, overcoming the shortcomings of the sequential structure of RNNs and their variants. Transformer also excels at capturing long-term features thanks to its multi-head attention module. For trajectory prediction, Giuliari et al. [20] used the vanilla Transformer to model pedestrians’ trajectories and achieved satisfactory prediction performance. Yu et al. [21] proposed a framework for spatio-temporal crowd trajectory prediction that models the interactions in space and time with attention mechanisms alone. These methods exploit the powerful feature extraction ability of Transformer and achieve good results in trajectory prediction. However, they only use Transformer to extract one kind of feature, ignoring the multi-feature fusion ability of the cross-attention module in the network. We also noticed that researchers tend to focus more on modeling interactions with other agents, ignoring the static contextual information (road infrastructure) of the scene. However, as shown in Fig. 1, the contextual information of the scene has the same (or even greater) importance as the dynamic information of other agents [22].
To address such limitations, we propose a Multi-Granularity Scenarios Understanding (MGSU) framework, which extracts trajectory features and fuses them with scene information at different granularities. The network can capture long-term dependencies because it is based on the attention mechanism. To make the fusion more efficient, we introduce a novel scene-fusion Transformer, whose encoder extracts scene features that are then fused with trajectory features in the decoder. The scene-fusion Transformer adopts a sparse attention mechanism, and its decoder produces generative output like Informer [23], which effectively avoids the accumulation of errors. A lightweight semantic segmentation network, ESPNet [24], is introduced to extract the semantic features of the scene. To better utilize the scene information, an inverse reinforcement learning (IRL) approach [25] is introduced to generate the optimal path strategy based on the semantic features. Concretely, the main contributions of this paper can be summarized as follows:

(1)
We propose a Multi-Granularity Scenarios Understanding framework (MGSU), which can effectively model the interaction between pedestrians’ trajectories and the scene and generate multiple feasible predictions of future trajectories. MGSU gradually integrates scene information and trajectory information in stages of different granularity. We also introduce ESPNet and inverse reinforcement learning to explore more comprehensively the impact of the scene layout on future trajectories.

(2)
To better fuse pedestrians’ trajectories and the scene, a novel and efficient scene-fusion Transformer is presented. It adopts a sparse attention mechanism and sets the decoder to generate the future predicted trajectories in one shot, which effectively avoids error accumulation and improves efficiency.

(3)
We evaluate MGSU on the SDD and NuScenes datasets, and the results show that our approach can understand the scene layout with high accuracy.
The rest of this paper is organized as follows. “Related work” summarizes trajectory prediction methods and the methodology related to our work. “Methods” describes the proposed MGSU in detail. “Experiments” presents experiments on MGSU and discusses the results with trajectory visualization and quantitative evaluation. Our conclusions are given in “Conclusion”.
Related work
Trajectory prediction Trajectory prediction methods can be mainly divided into model-driven [8, 26,27,28] and data-driven [9, 29,30,31] methods. For the former, there is an explicit model describing the target motion over time. Keller and Gavrila [8] proposed a method based on a linear dynamic model to predict future trajectories over a short horizon. To overcome the limitations of the linear dynamic model, Karasev et al. [26] proposed another model-based approach that predicts future trajectories by modeling the behavior of the target as a Markov process. Malviya and Kala [28] presented a trajectory prediction method based on a particle filter to track humans using a limited-field-of-view monocular camera. On the other hand, data-driven methods involve no explicit modeling of target behavior; they rely mainly on trajectory datasets covering multiple scenarios and attempt to learn the behavior of targets from the data. Alahi et al. [29] proposed Social LSTM, which predicts pedestrians’ trajectories by exploiting interactions between pedestrians on roadways. However, the model is computationally expensive, because the social pooling operation must consider the interactions between all pedestrians in the scene. Gupta et al. [30] proposed Social GAN to overcome the limitations of Social LSTM by introducing generative adversarial networks and a global pooling mechanism. Among these methods, RNNs and their variants have become an important part of many recent trajectory prediction models [32, 33] due to their powerful processing capability for time-series data. However, RNNs and their variants cannot be computed in parallel because of their sequential structure, and they extract long-term dependencies poorly. Therefore, our method mainly uses the attention mechanism to overcome these shortcomings. In previous work, researchers tended to focus more on modeling the interaction with other agents and to ignore the static context information (road infrastructure) of the scene. However, the context information of a scenario is just as important as (or even more important than) the dynamic information of other agents. Therefore, we use ESPNet to extract semantic features of the scene, and we introduce IRL to generate the optimal path strategy from the semantic features and observed trajectories. The scene information and trajectory information are then closely combined to accurately predict the future trajectory.
Scene understanding The semantic segmentation network greatly helps trajectory prediction by providing feasible regions. For scene understanding, Visin et al. [34] and Bell et al. [35] used RNNs to pass information along each row or column of the scene, but in a single RNN layer each pixel position could only obtain information from the same row or column. Liang et al. [36] proposed a variant of LSTM to exploit the context in the scene, but it suffers from expensive computational cost. Currently, researchers turn to CNN-based methods to understand scenes [37,38,39]. Ronneberger et al. [37] proposed the U-Net semantic segmentation network, which relies on data augmentation to efficiently utilize the available annotated samples. DUC [40], DeepLabv3 [41] and PSPNet [42] used dilated convolution to preserve the spatial size of feature maps. Orhan and Bastanlar [43] proposed a CNN-based semantic segmentation model that utilizes equirectangular convolutions to handle distortions in panoramic images. These methods can precisely describe the semantic information of the scene, but their heavy computational overheads result in slow inference, which is unsuitable for real-time trajectory prediction. Therefore, we introduce a lightweight semantic segmentation network, ESPNet, which has an extremely high inference speed and can segment scene images quickly and efficiently. Further, the IRL method is introduced to generate the optimal path strategy from the semantic information of the scene and pedestrians’ observed trajectories, helping the network understand the scene deeply.
Transformers Transformer is a deep learning architecture proposed by Google in 2017 that has achieved great success in the field of NLP [44,45,46]. Due to its unique attention mechanism and excellent NLP performance, researchers have shown great interest in applying it to trajectory prediction. Giuliari et al. [20] used the vanilla Transformer without considering any complex interaction information and achieved satisfactory results. Yu et al. [21] proposed the STAR architecture to model interaction information in space and time. Achaji et al. [47] introduced PreTR, which utilizes a decomposed spatio-temporal attention module to extract features from multi-agent scenarios. Yao et al. [48] proposed an end-to-end transformer network with a self-correcting scheme to enhance model robustness. These methods make use of the powerful feature extraction ability of Transformer and perform well in trajectory prediction. However, they only use Transformer to extract one class of features and ignore the multi-feature fusion ability of the cross-attention module. In this work, we improve the vanilla Transformer and propose an efficient scene-fusion Transformer, which can simultaneously fuse trajectory features with scene information.
Methods
We focus on the fusion of scene information and trajectory information to improve trajectory prediction performance. The framework of the proposed MGSU is illustrated in Fig. 2. MGSU fully utilizes the scene information by gradually integrating scene information of different granularities to model the interaction between trajectory and scene. Firstly, the coarse-grained fusion module extracts the semantic information of the scene image and fuses it with the trajectory information through the cross-attention module, outputting a motion representation at a coarse-grained level. Then, this coarse-grained motion representation is fed into the IRL module to generate the optimal path strategy through grid-based policy sampling. The IRL module outputs multiple accurate scene paths that delimit the scope of future paths at a fine-grained level. Finally, the scene paths and observed trajectories are fused in the fine-grained feature fusion module to generate multiple future trajectories. The difference between coarse-grained and fine-grained fusion lies mainly in the precision of the information and in the way it is fused: coarse-grained fusion combines trajectory and semantic features, which convey the category of each identified object and the scope of future trajectories at a coarse level, through cross-attention; fine-grained fusion combines trajectory and path features, which delimit the scope of future paths at a fine level, through the scene-fusion Transformer. Table 1 presents all control parameters used in this paper. Details of the different modules are described below.
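As a reading aid, the data flow of the three modules can be sketched in PyTorch-style pseudocode. The module boundaries follow the paper, but all class names and tensor shapes below are illustrative assumptions, not the released implementation.

```python
import torch.nn as nn

class MGSU(nn.Module):
    """Illustrative three-stage pipeline; module internals are stand-ins."""
    def __init__(self, coarse_fusion, irl_module, fine_fusion):
        super().__init__()
        self.coarse_fusion = coarse_fusion  # ESPNet + cross-attention
        self.irl_module = irl_module        # reward maps + policy sampling
        self.fine_fusion = fine_fusion      # scene-fusion Transformer

    def forward(self, scene_img, traj_obs):
        # (1) coarse-grained fusion of scene semantics and observed trajectory
        c_h = self.coarse_fusion(scene_img, traj_obs)   # motion representation C_h
        # (2) IRL: reward maps -> maximum-entropy policy -> K sampled scene paths
        scene_paths = self.irl_module(c_h)              # [K, T_path, 2]
        # (3) fine-grained fusion: scene paths + observed trajectory -> K futures
        return self.fine_fusion(scene_paths, traj_obs)  # [K, T_pred, 2]
```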
Coarse-grained fusion module
The scene semantic information is related to the object categories in the scene. Therefore, the model can learn information about passable areas (like roads and crosswalks) from the semantic information to reduce the uncertainty of objects in the scene. The semantic information provides a basis for predicting the future trajectory according to the category of the identified object; however, this basis only delimits the scope of future trajectories at a coarse-grained level. As illustrated in Fig. 2, the coarse-grained fusion module uses the cross-attention mechanism to fuse observed trajectories with the corresponding scene images and outputs a coarse-grained motion mixture representation. This module includes a semantic segmentation network, a fully connected layer, and two cross-attention modules. ESPNet is used to extract semantic features from the input scene S. Meanwhile, a fully connected layer maps the observed trajectory \(T_{{\mathrm{obs}}}\) to a high-dimensional feature space to facilitate feature extraction. Afterward, the two cross-attention modules compute the attention of the scene to the trajectory \(A_{\mathrm{s}}\) and the attention of the trajectory to the scene \(A_{\mathrm{t}}\). The two attentions are concatenated to generate the coarse-grained motion representation \(C_{\mathrm{h}}\).
Here, \(W_{\mathrm{e}}\) and \(W_{\mathrm{f}}\) denote the parameters of ESPNet and the fully connected layer, respectively; \(S_{\mathrm{e}}\) denotes the scene semantic features output by ESPNet; \(T_{\mathrm{fo}}\) denotes the trajectory features output by the fully connected layer; T denotes the transposed matrix; \(d_{\mathrm{e}}\) and \(d_{\mathrm{t}}\) denote the dimensions of \(S_{\mathrm{e}}\) and \(T_{\mathrm{fo}}\), respectively; \(\hbox {Softmax}()\) denotes the activation function; and \(\hbox {Cat}()\) denotes concatenation.
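A minimal sketch of this bidirectional cross-attention follows. The scaled dot-product form, the query/key roles, and the assumption that both feature sequences have the same length are ours, consistent with, but not copied from, the symbols defined above.

```python
import torch
import torch.nn.functional as F

def coarse_grained_fusion(s_e, t_fo):
    """Bidirectional cross-attention fusion (sketch).

    s_e:  [L, d] scene semantic features from ESPNet (assumed already
          flattened and projected to the same sequence length L)
    t_fo: [L, d] trajectory features from the fully connected layer
    The query/key assignment below is an assumption consistent with the
    definitions of A_s, A_t, d_e, d_t, Softmax() and Cat() in the text.
    """
    d = s_e.size(-1)
    a_s = F.softmax(t_fo @ s_e.T / d ** 0.5, dim=-1) @ s_e   # scene -> trajectory
    a_t = F.softmax(s_e @ t_fo.T / d ** 0.5, dim=-1) @ t_fo  # trajectory -> scene
    return torch.cat([a_s, a_t], dim=-1)                     # C_h: [L, 2d]

c_h = coarse_grained_fusion(torch.randn(8, 128), torch.randn(8, 128))
```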
Inverse reinforcement learning module
The inverse reinforcement learning module is introduced to generate a concrete scene representation based on the coarse-grained motion representation. The core of IRL is to recover the reward function from expert demonstrations and to generate the optimal strategy according to that reward function. This module takes the coarse-grained motion representation as input and outputs multiple scene paths.
Firstly, the path reward map \(r_{\mathrm{path}}\) and the goal reward map \(r_{\mathrm{goal}}\) are generated from the coarse-grained motion representation by two multi-layer perceptrons with the same structure, \(r_{\mathrm{path}}=\text {MLP}_{\mathrm{path}}(C_{\mathrm{h}})\) and \(r_{\mathrm{goal}}=\text {MLP}_{\mathrm{goal}}(C_{\mathrm{h}})\), which provide reward values for reinforcement learning: \(r_{\mathrm{path}}\) provides rewards for action choices, and \(r_{\mathrm{goal}}\) provides the reward for terminating a path.
Afterward, to obtain the maximum entropy strategy \(\pi _{\theta }\left( a\mid s \right) \), which represents the probability of taking action a in state s, we use approximate iteration as shown in Algorithm 1, where V(s) denotes the state log-partition function, Q(s, a) denotes the state–action log-partition function, N is the total number of iterations, \(T\left( s,a\right) \) denotes the transition from state s under action a, and \(S_{p}\) and \(S_{g}\) denote the states of \(r_{\mathrm{path}}\) and \(r_{\mathrm{goal}}\), respectively, which have the same dimensions as the 2D grid. The states around the current cell of the reward map carry different probability values, which provide the action probabilities for the target; the target then chooses the action with the highest probability of movement.
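This approximate iteration can be sketched as standard soft (maximum-entropy) value iteration over the 2D grid. The action set, boundary handling, and initialization below are our assumptions; the paper's exact Algorithm 1 may differ in these details.

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(r_path, r_goal, n_iters=50):
    """Approximate soft value iteration on a 2D grid (sketch).

    r_path, r_goal: [H, W] reward maps from MLP_path / MLP_goal. Actions are
    4-neighbourhood moves plus an 'end' action rewarded by r_goal (assumed).
    """
    H, W = r_path.shape
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right
    V = np.full((H, W), -1e9)                           # soft state values V(s)
    for _ in range(n_iters):
        Q = np.full((len(moves) + 1, H, W), -1e9)       # soft values Q(s, a)
        for k, (di, dj) in enumerate(moves):
            succ = np.full((H, W), -1e9)                # V(T(s, a)), in-grid only
            si = slice(max(di, 0), H + min(di, 0))      # successor rows
            sj = slice(max(dj, 0), W + min(dj, 0))      # successor cols
            ti = slice(max(-di, 0), H + min(-di, 0))    # source rows
            tj = slice(max(-dj, 0), W + min(-dj, 0))    # source cols
            succ[ti, tj] = V[si, sj]
            Q[k] = r_path + succ
        Q[-1] = r_goal                                  # 'end' action terminates
        V = logsumexp(Q, axis=0)                        # soft maximum over actions
    return np.exp(Q - V)                                # policy pi(a|s): [A, H, W]
```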
Finally, the Gumbel-Softmax trick is introduced to sample the scene paths: Gumbel noise is added to the policy logits, and the argument of the maximum yields the selected action a at each step, producing the ith scene path \(P_{(i)}\).
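A possible realization of this sampling step is shown below, reusing the policy from the value-iteration sketch above; the move set and the start state mirror the settings reported later in “Experimental details”, and the rest is assumption.

```python
import numpy as np

def sample_scene_path(policy, start=(12, 12), max_steps=25):
    """Sample one scene path with the Gumbel-max trick (sketch).

    policy: [A, H, W] action probabilities, e.g. from soft_value_iteration;
    the last action is the terminating 'end' action. Adding Gumbel noise to
    the log-probabilities and taking the argmax draws from the categorical
    distribution; repeated rollouts yield the multiple scene paths P_(i).
    """
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    i, j = start
    path = [(i, j)]
    for _ in range(max_steps):
        logits = np.log(policy[:, i, j] + 1e-12)
        noise = -np.log(-np.log(np.random.rand(len(logits))))  # Gumbel(0, 1)
        a = int(np.argmax(logits + noise))
        if a == len(moves):                      # 'end' action: stop the path
            break
        i = int(np.clip(i + moves[a][0], 0, policy.shape[1] - 1))
        j = int(np.clip(j + moves[a][1], 0, policy.shape[2] - 1))
        path.append((i, j))
    return path
```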
Fine-grained fusion module
The IRL module produces path information through grid-based policy sampling; the resulting optimal path policies explore the various passable paths of the scene and delimit the scope of future paths at a fine-grained level. This module uses a scene-fusion Transformer to enhance the network's understanding of the scene based on the scene paths and outputs multiple feasible future trajectories. We present the scene-fusion Transformer to integrate, at fine granularity, the multiple scene paths generated by the inverse reinforcement learning module with the observed trajectories. Below, we describe the fine-grained fusion module in detail in terms of the scene-fusion Transformer, feature extraction, feature fusion, and output.
Scene-fusion Transformer
Figure 3 shows the architecture of the vanilla Transformer. We improve the vanilla Transformer for scene fusion to make it more suitable for trajectory prediction. Considering the importance of real-time performance for trajectory prediction, we use a sparse self-attention mechanism to extract features, reducing the computational complexity from \(O\left( L^{2}\right) \) to \(O\left( L\log L\right) \), where L denotes the length of the input sequence. The computational efficiency is improved while the performance remains on par with the traditional method. Besides, we adopt a parallel decoding strategy that predicts future trajectories directly instead of autoregressively, which improves prediction and inference speed while reducing error accumulation. Since we adopt a non-autoregressive training strategy, we remove the mask in the decoder. The overall framework of the scene-fusion Transformer is shown in Fig. 4.
Feature extraction
The scene path information \(P_{(i)}\) output by the IRL module is used for feature extraction in the scene-fusion Transformer encoder. Features of the observed trajectory \(T_{{\mathrm{obs}}}\) are extracted by the fully connected layer.
After the scene path information is input into the encoder, it is multiplied by linear transformations with weights \(W_{Q}^p\), \(W_{K}^p\) and \(W_{V}^p\), respectively, to output three matrices \(Q_{p}\), \(K_{p}\) and \(V_{p}\). The query sparsity measurement of [23] is adopted, defining the ith query’s attention over all keys as a probability \(p\left( k\mid q_{i}\right) \). If this probability is close to the uniform distribution \(q\left( k\mid q_{i}\right) \), the self-attention for that query is redundant with respect to the input. Therefore, the similarity between the distributions p and q can be used to distinguish which queries are “important”; the Kullback–Leibler divergence measures this similarity, and the ith query’s sparsity measurement is defined as \(M\left( q_{i}\mid K\right) \). The sparse matrix of \(Q_{p}\) is \(\bar{Q}\), which has the same size as \(Q_{p}\) and contains only the Top-u queries under the sparsity measurement \(M\left( Q_{p},K_{p}\right) \). The attention \(A_{p}\) of the scene path is then calculated by the multi-head sparse attention module as \(A_{p}=\hbox {Softmax}\left( \bar{Q}K_{p}^{\mathrm T}/\sqrt{d_{k}}\right) V_{p}\), where \(d_{k}\) denotes the dimension of \(K_{p}\) and T denotes the transposed matrix.
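Since the sparsity measurement and Top-u selection follow Informer [23], a simplified single-head sketch might look as follows; Informer's random key sampling, which makes M cheap to estimate, and the multi-head splitting are omitted here.

```python
import torch
import torch.nn.functional as F

def prob_sparse_attention(Q, K, V, u):
    """Single-head sparse attention in the style of Informer [23] (sketch).

    Q, K, V: [L, d]. Only the Top-u queries under the sparsity measurement
    M(q_i, K) = max_j(q_i k_j / sqrt(d)) - mean_j(q_i k_j / sqrt(d)) attend
    fully; the remaining 'lazy' queries output the mean of V.
    """
    d = Q.size(-1)
    scores = Q @ K.T / d ** 0.5                             # [L, L]
    M = scores.max(dim=-1).values - scores.mean(dim=-1)     # query sparsity
    top = M.topk(u).indices                                 # active queries
    out = V.mean(dim=0, keepdim=True).expand_as(V).clone()  # lazy-query default
    out[top] = F.softmax(scores[top], dim=-1) @ V           # full attention
    return out

A_p = prob_sparse_attention(torch.randn(16, 64), torch.randn(16, 64),
                            torch.randn(16, 64), u=4)
```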
The output of the encoder \(E_{p}\) is then obtained by passing \(A_{p}\) through a fully connected layer \(\text {MLP}()\) wrapped in a residual connection \(\hbox {ResBlock}()\).
The trajectory information \(T_{\mathrm{in}}=\{T_{{\mathrm{obs}}},T_{0}\}\) (where \(T_{0}\) denotes the future-trajectory placeholder, filled with zeros) is embedded by a linear layer with parameters \(W_{\mathrm{l}}\), yielding the trajectory features \(T_{\mathrm{li}}=T_{\mathrm{in}}W_{\mathrm{l}}\).
Feature fusion and output
In the decoder stage, the trajectory features \(T_{\mathrm{li}}\) are fed into the decoder. Analogously to the encoder, \(T_{\mathrm{li}}\) is multiplied by linear transformations with weight matrices \(W_{Q}^{t}\), \(W_{K}^{t}\) and \(W_{V}^{t}\) to obtain \(Q_{\mathrm{t}}\), \(K_{\mathrm{t}}\) and \(V_{\mathrm{t}}\), and the sparse self-attention of the observed trajectory is computed as \(A_{\mathrm{t}}=\hbox {Softmax}\left( \bar{Q}K_{\mathrm{t}}^{\mathrm T}/\sqrt{d_{k}}\right) V_{\mathrm{t}}\), where \(\bar{Q}\) denotes the sparse matrix of \(Q_{\mathrm{t}}\), T denotes the transposed matrix, and \(d_{k}\) denotes the dimension of \(K_{\mathrm{t}}\).
Through the above calculations, the feature output of the scene path \(E_{p}\) and the feature attention of the observed trajectory \(A_{\mathrm{t}}\) are obtained. Afterward, the cross-attention mechanism integrates them, \(A_{\mathrm{cross}}=\hbox {Softmax}\left( A_{\mathrm{t}}E_{p}^{\mathrm T}/\sqrt{d_{\mathrm{e}}}\right) E_{p}\), and the predicted trajectory \(T_{p}=\text {MLP}(A_{\mathrm{cross}};W_{m})\) is obtained through a fully connected layer, where \(A_{\mathrm{cross}}\) denotes the cross-attention of \(A_{\mathrm{t}}\) and \(E_{p}\), T denotes the transposed matrix, \(d_{\mathrm{e}}\) denotes the dimension of \(E_{p}\), \(\text {MLP}()\) is the multi-layer perceptron, and \(W_{m}\) denotes its weight matrix.
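Putting the decoder side together, a minimal sketch of the cross-attention fusion and the one-shot generative output follows; the single plain-softmax attention layer and the linear output head are simplifications of the paper's sparse multi-layer decoder.

```python
import torch.nn as nn
import torch.nn.functional as F

class FusionDecoder(nn.Module):
    """Decoder-side cross-attention fusion with one-shot output (sketch)."""
    def __init__(self, d_model=512):
        super().__init__()
        self.head = nn.Linear(d_model, 2)        # (x, y) coordinates per step

    def forward(self, a_t, e_p):
        # a_t: [T_in, d] trajectory features (observed steps plus zero-filled
        #      future placeholders); e_p: [L, d] encoder output of scene paths
        d = e_p.size(-1)
        a_cross = F.softmax(a_t @ e_p.T / d ** 0.5, dim=-1) @ e_p
        # all future positions are emitted in a single forward pass
        # (generative decoding), avoiding step-by-step error accumulation
        return self.head(a_cross)                # [T_in, 2]; future slice is T_p
```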
Experiments
Datasets
Stanford drone dataset The Stanford drone dataset (SDD) consists of the tracks of pedestrians, bicyclists, skateboarders and vehicles captured by drones in 60 different scenes at Stanford University. It provides a bird’s-eye view of each scene and the locations of the tracked agents in the pixel coordinates of the scene. SDD contains multiple scene elements, such as roads, sidewalks, buildings, parking lots, terrain, and foliage. Roads and sidewalks come in different configurations, including roundabouts and intersections. We use the evaluation setting defined in the TrajNet benchmark, which splits the dataset by scene, so the training, validation, and test sets contain disjoint subsets of the 60 scenes. This allows us to evaluate our model on unseen scenes for which no previous trajectory data are available.
NuScenes NuScenes is a large-scale autonomous driving dataset created by the autonomous driving company NuTonomy. It covers 1000 different scenarios with varied road layouts, each recorded for 20 s. All data were captured using onboard cameras and Lidar sensors. The official split is used to generate the training and test sets for evaluation.
Evaluation metrics
Two error metrics are used to evaluate the trajectory prediction performance as follows:

(1)
Average Displacement Error (ADE): the average L2 distance between the predicted and ground-truth trajectories over all time steps,
$$\begin{aligned} \hbox {ADE}=\frac{1}{N}\sum _{t=1}^{N}\left\| Y_{t}^{\mathrm{GT}}-Y_{t}^{\mathrm{pred}} \right\| _{2}, \end{aligned}$$(18)
where \(Y_{t}^{\mathrm{GT}}\) and \(Y_{t}^{\mathrm{pred}}\) represent the ground-truth and predicted positions at time step t, respectively, and N represents the total number of predicted time steps.

(2)
Final Displacement Error (FDE): the L2 distance between the ground-truth and predicted trajectories at the last time step n,
$$\begin{aligned} \hbox {FDE}=\left\| Y_{n}^{\mathrm{GT}}-Y_{n}^{\mathrm{pred}} \right\| _{2}. \end{aligned}$$(19)
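Since the experiments below report best-of-K variants (e.g., ADE\(_{\mathrm{min20}}\), FDE\(_{\mathrm{min20}}\)), both metrics are minimized over the K sampled trajectories. A small sketch of this computation:

```python
import torch

def min_ade_fde(preds, gt):
    """Best-of-K displacement errors (sketch of ADE_minK / FDE_minK).

    preds: [K, N, 2] K sampled future trajectories; gt: [N, 2] ground truth.
    ADE averages the per-step L2 error; FDE keeps only the last step;
    both take the minimum over the K samples.
    """
    dist = torch.linalg.norm(preds - gt.unsqueeze(0), dim=-1)  # [K, N]
    ade = dist.mean(dim=-1).min().item()   # best average displacement
    fde = dist[:, -1].min().item()         # best final displacement
    return ade, fde
```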
Experimental details
Samples of SDD and NuScenes are generated following [49]. For SDD, 3.2-second observed trajectories and 4.8-second ground-truth trajectories are used. For NuScenes, 2-second observed trajectories and 6-second ground-truth trajectories are used. The input scene image is centered at the position of the last observation, and the image size \((s_{i})\) is set to \(200\times 200\) pixels. In the coarse-grained fusion module, we only use the encoder of ESPNet, which is pretrained on ADE20k. The size of the fully connected layer \((s_{{\mathrm{fc}}})\) is set to 128. In the IRL module, the dimension of the 2D grid \((d_{{\mathrm{grid}}})\) is [25, 25], the initial state is set to [12, 12], and the size of the scene feature \((s_{{\mathrm{sf}}})\) is set to 64. In the scene-fusion Transformer of the fine-grained fusion module, the embedding size of the model \((e_{{\mathrm{model}}})\) is set to 512, the number of layers \((N_{\mathrm{l}})\) is set to 6, the number of heads of the multi-head attention \((\hbox {heads})\) is set to 8, and the dropout rate of the network is set to 0.01. ADE\(_{\mathrm{min20}}\) and FDE\(_{\mathrm{min20}}\) are used as evaluation metrics on SDD, and ADE\(_{\mathrm{min10}}\) and FDE\(_{\mathrm{min10}}\) on NuScenes. All experiments are implemented on Ubuntu with the PyTorch framework on an Nvidia 2080 GPU. The number of training epochs is 300 with the Adam optimizer, the batch size \((\text {bs})\) is set to 32, and the learning rate \((\text {lr})\) is set to 0.0001.
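For reference, the reported settings can be grouped into a single configuration; the key names below are ours, not from the released code.

```python
# Illustrative grouping of the reported hyperparameters (names are ours).
CONFIG = {
    "scene_image_size": (200, 200),  # s_i, pixels, centred on last observation
    "fc_size": 128,                  # s_fc
    "grid_dim": (25, 25),            # d_grid, IRL 2D grid
    "init_state": (12, 12),          # IRL start cell
    "scene_feature_size": 64,        # s_sf
    "embed_size": 512,               # e_model
    "num_layers": 6,                 # N_l
    "num_heads": 8,
    "dropout": 0.01,
    "epochs": 300,                   # Adam optimizer
    "batch_size": 32,                # bs
    "lr": 1e-4,
}
```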
Ablation experiments
The ablation study is performed on SDD to analyze the impact of each module. The scene-fusion Transformer that deals only with trajectory data is denoted as MGSU-A. Based on MGSU-A, MGSU-B introduces the ESPNet and cross-attention modules. Finally, MGSU-C adds the IRL module and performs fine-grained fusion on top of MGSU-B. Table 2 reports the results of the ablation study. Detailed discussions are presented as follows:
MGSU-A: To analyze the influence of scene information on trajectory prediction, the scene-fusion Transformer is used to process the observed trajectory without considering any scene information. Results show that the prediction performance is unsatisfactory when the model only considers the observed trajectory and ignores the scene information (ADE/FDE up to 20.19/32.93). Such high ADE/FDE values indicate that predicting agents’ future trajectories from their observed trajectories alone is insufficient, as it ignores other factors that affect future movement trends, such as agents’ interactions and the influence of the scene.
MGSU-B: This variant considers the context information of the scene. The ESPNet and cross-attention modules are added to perform coarse-grained fusion of the scene context with the observed trajectories. ESPNet is utilized to explore the semantic meaning of the objects contained in the scene, such as roads, trees, and buildings, so that the predicted trajectories fall within the feasible regions of the scene. After concatenating the outputs of the scene-fusion Transformer and ESPNet, the ADE/FDE values decrease by 4.3 and 3.75, respectively. Afterward, the cross-attention module is used to fuse the trajectory and scene information. Specifically, the attention mechanism makes the trajectory features attend to important areas of the scene, such as sidewalks and roads, so that the trajectory information integrates better with the scene information. As a result, the ADE/FDE values further decrease by 2.48 and 4.86, respectively.
MGSU-C: This variant introduces the IRL module to generate scene paths and uses the fine-grained fusion module to fuse the trajectory and path information. Concretely, the output of the coarse-grained fusion module is fed into the IRL module to concretize the fused information. After several feasible scene paths are output, fine-grained fusion is performed with the observed trajectories to generate the final prediction results. Compared with MGSU-B, the ADE/FDE values of MGSU-C decrease by 4.17 and 8.45, respectively. Such an improvement verifies the effect of the fine-grained fusion module.
Evaluation of the scene-fusion Transformer
In this work, a novel and efficient scene-fusion Transformer is proposed to improve the fusion efficiency. Specifically, the encoder processes the scene features, and the decoder extracts the trajectory features and fuses them with the scene features to generate the prediction results. Since the traditional attention module has a high computational complexity, the sparse attention mechanism is introduced to reduce the computational complexity from \(O(L^2)\) to \(O(L\log L)\), where L represents the length of the trajectory. Meanwhile, the decoder uses generative decoding, so that the prediction results are obtained in one step instead of generating the predicted trajectory step by step, reducing the time complexity of prediction from O(N) to O(1). Figures 5 and 6 illustrate the comparison with the vanilla Transformer in terms of training speed and prediction accuracy using the same model parameters.
Training speed Figure 5 compares the training speed of the two models when the number of training epochs is set to 10, 30, and 80, respectively. The training speed of the scene-fusion Transformer is greatly improved compared to the vanilla Transformer, by 73.3\(\%\), 75.2\(\%\), and 75.4\(\%\), respectively. Such an improvement reflects the efficiency advantages of the proposed architecture.
Prediction accuracy Figure 6 compares the ADE of the two models over 90 training epochs. The two methods have roughly the same accuracy at the beginning. Then, around epochs 30–70, the error of the vanilla Transformer declines slowly and levels off, with ADE remaining around 14, while the error of the scene-fusion Transformer continues to decline and the gap between the two gradually widens. Finally, during epochs 70–90, the scene-fusion Transformer stabilizes at a final ADE of 10.03, an improvement in prediction accuracy of 27.5\(\%\).
Evaluation of the fusion methods in the coarse-grained feature fusion
This section discusses the different fusion methods used in the coarse-grained fusion stage. The observed trajectory \(T_{{\mathrm{obs}}}\) and the corresponding scene image S are fed into the network through a fully connected layer and ESPNet, respectively. Afterward, we compare the performance of different fusion methods on the SDD dataset: concatenation, addition, and cross-attention. The fusion results are fed directly into the fine-grained fusion module, bypassing the IRL module, to better isolate the performance of the fusion methods. As presented in Table 3, the fusion method based on concatenation slightly outperforms simple addition, while the cross-attention fusion method achieves the best prediction accuracy.
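The three variants differ only in the fusion operator; a compact sketch, assuming the two feature sequences are already aligned in length:

```python
import torch
import torch.nn.functional as F

def fuse(traj, scene, mode="cross"):
    """The three coarse-grained fusion variants compared in Table 3 (sketch).

    traj, scene: [L, d] aligned feature sequences (alignment is assumed).
    """
    if mode == "add":
        return traj + scene
    if mode == "concat":
        return torch.cat([traj, scene], dim=-1)
    d = scene.size(-1)                    # "cross": cross-attention fusion
    return F.softmax(traj @ scene.T / d ** 0.5, dim=-1) @ scene
```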
Quantitative analysis
In this section, a quantitative analysis is performed by comparing MGSU with state-of-the-art methods on SDD and NuScenes. The methods used for comparison are briefly introduced as follows:
SGAN SGAN [30] uses an encoder–decoder network to learn pedestrian movement patterns in an adversarial way. The social pooling operation is used to capture pedestrians’ social interactions.
SoPhie SoPhie [50] combines a social attention mechanism with physical attention to help the model learn positions in a large scene and extract the most salient parts of the path-related image. It also uses a GAN to generate more realistic samples and captures the uncertainty of future paths by modeling their distribution.
P2TIRL P2TIRL [49] proposes an attention-based trajectory generator that produces future trajectories based on a sequence of states sampled from the MaxEnt policy. It reformulates MaxEnt IRL to allow policies to jointly infer plausible agent goals and paths to those goals on a coarse 2D grid defined over the scene.
SimAug SimAug [51] learns robust representations by augmenting simulated training data, allowing the representations to generalize better to unseen real-world test data. Its key idea is to combine features from the hardest camera views with adversarial features from the original views.
PECNet PECNet [52] presents a pedestrian endpoint conditioned trajectory prediction network that can predict rich and diverse multimodal socially compliant trajectories across a variety of scenes.
IRLSOT IRLSOT [25] proposes inverse reinforcement learning for scene-oriented trajectory prediction to better forecast pedestrians’ future trajectories in rare or complex environments.
Physics oracle Physics oracle [53] is a simple and explainable extrapolation model based on classical physics. The current velocity, acceleration, and yaw rate of the trajectory are used for prediction.
CoverNet CoverNet [53] is a method for multimodal probabilistic trajectory prediction for urban driving. It frames the trajectory prediction problem as the classification of a set of distinct trajectories.
SGDNet-ED SGDNet-ED [54] proposes a recursive trajectory prediction network, SGNet, that estimates and uses goals at multiple time scales.
MTP MTP [55] proposes a multimodal modeling method for vehicle motion prediction. It uses rasterized grid images to encode the context of each vehicle and a CNN to generate several possible trajectories with their corresponding probabilities.
Trajectron++ Trajectron++ [2] presents a generative multiagent trajectory forecasting approach that addresses the desiderata for an open, generally applicable and extensible framework.
Multipath Multipath [56] utilizes a fixed set of future-state sequence anchors that correspond to modes of the trajectory distribution. It predicts a discrete distribution over the anchors and, for each anchor, regresses offsets from the anchor together with the associated uncertainty, yielding a Gaussian mixture at each time step.
Table 4 reports the comparison between our method and SGAN, SoPhie, P2TIRL, SimAug, PECNet, and IRLSOT on SDD. The comparisons verify the effectiveness of fusing scene information for pedestrian trajectory prediction. Among these methods, SGAN uses only trajectory information, so its prediction performance is the lowest. SoPhie and CF-VAE take the scene information into account and use a CNN to extract scene features, so their performance is superior to SGAN. In contrast, P2TIRL employs the VGG network and reinforcement learning to further process the scene information, improving the prediction performance. SimAug uses multi-view simulation data to enhance the representation of the prediction model; by adding simulated training data, it learns robust representations that generalize better to unseen test data. PECNet assists long-range multimodal trajectory prediction by inferring distant trajectory endpoints, and its novel social pooling layer lets it consider social interactions, which improves its trajectory prediction performance. IRLSOT exploits an IRL framework to explore complex scenes and utilizes a novel scene-based attention block to fuse scene and trajectory information; as a result, it achieves the second-best performance. Our method combines the semantic segmentation network, cross-attention fusion, IRL, and the scene-fusion Transformer to fully fuse scene and trajectory information at different granularities, realizing an understanding of the scene context and achieving the best performance in terms of average displacement error.
Table 5 compares our method with Physics oracle, CoverNet, MTP, SGDNet-ED, Multipath and Trajectron++ on NuScenes. Since Physics oracle is a simple model based on classical physics, it can hardly capture the dynamic changes of agent trajectories accurately. CoverNet, MTP and SGDNet-ED take the scene information into account, resulting in a certain improvement in prediction accuracy. Trajectron++ efficiently incorporates high-dimensional data by encoding semantic maps and proposes a general approach to incorporate dynamic constraints into learning-based multi-agent trajectory prediction; hence its prediction accuracy is further improved. Multipath predicts the parameter distribution of real-world agent trajectories by considering the scene information, which further improves the prediction performance. Our method fuses information at different granularities to fully understand the scenario and achieves the best performance on this dataset as well.
Qualitative analysis
Qualitative analysis is conducted on SDD and NuScenes to evaluate the trajectory prediction performance after fusing the scene information. Figure 7 demonstrates the predicted trajectories on SDD. The first row shows the observed trajectories (denoted by white lines) and the corresponding scenes. The second row shows the scene paths generated by IRL. The third row illustrates the predicted trajectories (the red and black lines denote predicted and ground-truth trajectories, respectively). Pedestrians’ movements are often affected by the scene environment. MGSU can accurately infer pedestrians’ moving directions and potential destinations after integrating scene information, thus generating feasible future trajectories consistent with path constraints. As shown in Fig. 7a, MGSU precisely predicts the future trajectory of the target in the case of a straight path. In the case of a curved road, as shown in Fig. 7b, MGSU generates scene paths with a degree of curvature similar to the road, thus regularizing the predicted future trajectory. In Fig. 7c, MGSU successfully recognizes the stationary pedestrian. Figure 7d, e show the case of forked roads: MGSU can generate multiple feasible paths according to the observed trajectories and corresponding scene images. Specifically, as shown in the middle subgraph of Fig. 7e, our model generates two feasible paths (leading to the upper left and upper right) at the intersection, resulting in a multimodal distribution of the predicted future trajectory.
Figure 8 demonstrates the trajectory prediction of vehicles on NuScenes. The first row shows the input, including the vehicle trajectories and scene layouts. The second row shows the scene paths generated by IRL. The third row illustrates the predicted trajectories (the white, red, and black lines denote observed, predicted, and ground-truth trajectories, respectively). After fusing the observed trajectories and corresponding scene layouts, MGSU generates reasonable paths and precisely infers the moving directions of the lanes. Therefore, the model can forecast future trajectories that are consistent with the scene layout constraints. Figure 8a, b show the trajectory prediction performance on straight roads, which are common on highways; MGSU achieves accurate predictions in these cases. In Fig. 8c, MGSU precisely identifies the stationary vehicle. In the case of highway curves, as shown in Fig. 8d, e, MGSU predicts scene paths that share similar motion trends with the ground-truth trajectories; our model then accurately predicts vehicles’ future trajectories in these challenging scenarios, benefiting from its understanding of the road environment. Moreover, as can be noted from Fig. 8e, multiple feasible moving trends can be inferred from the scene layout, which makes the generated results more realistic.
Conclusion
A multi-granularity fusion architecture named MGSU is proposed to perform trajectory prediction based on an understanding of the scene. It consists of three modules: the coarse-grained fusion module, the inverse reinforcement learning module, and the fine-grained fusion module. With these modules, the scene and trajectory information are gradually fused from coarse granularity to fine granularity. A novel scene-fusion Transformer is presented to better integrate scene information and improve efficiency. Its encoder explores the scene context, while the trajectory information is encoded by a linear layer. A sparse cross-attention mechanism fuses the scene and trajectory information with high efficiency, and the decoder predicts future trajectories in a generative manner to avoid error accumulation. Quantitative and qualitative evaluations of MGSU are conducted on the public SDD and NuScenes datasets. Results show that the trajectory prediction performance of MGSU improves after fusing the scene information, and that it adapts well to various complex environments.
We believe that MGSU can be used for real-world applications such as service robots or self-driving cars. For example, it can forecast pedestrians’ crossing intentions [57] and provide decision information for intelligent vehicles. Our future work focuses on integrating pedestrians’ trajectories with their actions to perform long-term action prediction.
References
Kothari P, Kreiss S, Alahi A (2021) Human trajectory forecasting in crowds: a deep learning perspective. IEEE Trans Intell Transp Syst 13:137–146. https://doi.org/10.48550/arXiv.1907.03395
Salzmann T, Ivanovic B, Chakravarty P, Pavone M (2020) Trajectron++: dynamically-feasible trajectory forecasting with heterogeneous data. In: Vedaldi A, Bischof H, Brox T, Frahm JM (eds) Computer vision—ECCV 2020. ECCV 2020. Lecture notes in computer science, vol 12363. Springer, Cham. https://doi.org/10.1007/978-3-030-58523-5_40
Liu S, Wang L (2018) A self-adaptive point-of-interest recommendation algorithm based on a multi-order Markov model. Future Gener Comput Syst 89:506–514. https://doi.org/10.1016/j.future.2018.07.008
Yan M, Li SJ, Chan CA (2021) Mobility prediction using a weighted Markov model based on mobile user classification. Sensors 21(5):1740. https://doi.org/10.3390/s21051740
Barth A, Franke U (2008) Where will the oncoming vehicle be the next second? In: IEEE intelligent vehicles symposium, pp 1068–1073. https://doi.org/10.1109/IVS.2008.4621210
Qiao SJ, Han N, Zhu XW, Shu HP, Zheng JL, Yuan CA (2018) A dynamic trajectory prediction algorithm based on Kalman filter. Acta Electon Sin 46(2):418. https://doi.org/10.3969/j.issn.0372-2112.2018.02.022
Schneider N, Gavrila DM (2013) Pedestrian path prediction with recursive Bayesian filters: a comparative study. In: Weickert J, Hein M, Schiele B (eds) Pattern recognition. GCPR 2013. Lecture notes in computer science, vol 8142. Springer, Berlin, Heidelberg, pp 174–183. https://doi.org/10.1007/978-3-642-40602-7_18
Mathew W, Raposo R, Martins B (2012) Predicting future locations with hidden Markov models. In: Proceedings of the 2012 ACM conference on ubiquitous computing, pp 911–918. https://doi.org/10.1145/2370216.2370421
Cai YF, Dai L, Wang H, Chen L, Li YC, Sotelo MA, Li ZX (2021) Pedestrian motion trajectory prediction in intelligent driving from far-shot first-person perspective video. IEEE Trans Intell Transp Syst. https://doi.org/10.1109/TITS.2021.3052908
Yang B, Yan GC, Wang P, Chan CY, Song X, Chen Y (2021) A novel graph-based trajectory predictor with pseudo-oracle. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2021.3084143
Lee N, Choi W, Vernaza P, Choy CB, Torr PHS, Chandraker M (2017) DESIRE: distant future prediction in dynamic scenes with interacting agents. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp 2165–2174. https://doi.org/10.1109/CVPR.2017.233
Bartoli F, Lisanti G, Ballan L, Bimbo AD (2018) Context-aware trajectory prediction. In: 2018 24th international conference on pattern recognition (ICPR), pp 1941–1946. https://doi.org/10.1109/ICPR.2018.8545447
Hochreiter S (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncertain Fuzziness Knowl Based Syst 6(2):107–116. https://doi.org/10.1142/S0218488598000094
Chen M, Zuo Y, Jia XY, Liu Y, Yu XH, Zheng K (2020) CEM: a convolutional embedding model for predicting next locations. IEEE Trans Intell Transp Syst 22(6):3349–3358. https://doi.org/10.1109/TITS.2020.2983647
Zamboni S, Kefato ZT, Girdzijauskas S, Noren C, Col LD (2022) Pedestrian trajectory prediction with convolutional neural networks. Pattern Recognit 121:108252. https://doi.org/10.1016/j.patcog.2021.108252
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst. https://doi.org/10.48550/arXiv.1706.03762
Yao SW, Wan XJ (2020) Multimodal transformer for multimodal machine translation. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 4346–4350. https://doi.org/10.18653/v1/2020.acl-main.400
Dong LH, Xu S, Xu B (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 5884–5888. https://doi.org/10.1109/ICASSP.2018.8462506
Zhao XY, Xiao F, Zhong HM, Yao J, Chen HH (2020) Condition aware and revise transformer for question answering. In: Proceedings of the web conference 2020, pp 2377–2387. https://doi.org/10.1145/3366423.3380301
Giuliari F, Hasan I, Cristani M, Galasso F (2021) Transformer networks for trajectory forecasting. In: 2020 25th international conference on pattern recognition (ICPR), pp 10335–10342. https://doi.org/10.1109/ICPR48806.2021.9412190
Yu CJ, Ma X, Ren JW, Zhao HY, Yi S (2020) Spatio-temporal graph transformer networks for pedestrian trajectory prediction. In: European conference on computer vision, pp 507–523. https://doi.org/10.1007/978-3-030-58610-2_30
Cai YF, Wang ZH, Wang H, Chen L, Li YC, Sotelo MA, Li ZX (2021) Environment-attention network for vehicle trajectory prediction. IEEE Trans Veh Technol 70(11):11216–11227. https://doi.org/10.1109/TVT.2021.3111227
Zhou HY, Zhang SH, Peng JQ, Zhang S, Li JX, Xiong H, Zhang WC (2021) Informer: beyond efficient transformer for long sequence timeseries forecasting. In: Proceedings of the AAAI conference on artificial intelligence, pp 11106–11115. https://doi.org/10.48550/arXiv.2012.07436
Mehta S, Rastegari M, Caspi A, Shapiro L, Hajishirzi H (2018) ESPNet: efficient spatial pyramid of dilated convolutions for semantic segmentation. In: Proceedings of the European conference on computer vision (ECCV), pp 552–568. https://doi.org/10.1007/978-3-030-01249-6_34
He CZ, Chen LP, Xu LM, Yang CC, Liu XF, Yang B (2022) IRLSOT: inverse reinforcement learning for sceneoriented trajectory prediction. IET Intell Transp Syst. https://doi.org/10.1049/itr2.12172
Karasev V, Ayvaci A, Heisele B, Soatto S (2016) Intentaware longterm prediction of pedestrian motion. In: 2016 IEEE international conference on robotics and automation (ICRA), pp 2543–2549. https://doi.org/10.1109/ICRA.2016.7487409
Wang P, Yang J, Zhang J (2022) A spatial-contextual indoor trajectory prediction approach via hidden Markov models. Wirel Commun Mob Comput. https://doi.org/10.1155/2022/6719514
Malviya V, Kala R (2022) Trajectory prediction and tracking using a multi-behaviour social particle filter. Appl Intell 52(7):7158–7200. https://doi.org/10.1007/s10489-021-02286-6
Alahi A, Goel K, Ramanathan V, Robicquet A, Fei-Fei L, Savarese S (2016) Social LSTM: human trajectory prediction in crowded spaces. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 961–971. https://doi.org/10.1109/CVPR.2016.110
Gupta A, Johnson J, Fei-Fei L, Savarese S, Alahi A (2018) Social GAN: socially acceptable trajectories with generative adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2255–2264. https://doi.org/10.1109/CVPR.2018.00240
Xu CX, Mao WB, Zhang WJ, Chen SH (2022) Remember intentions: retrospective-memory-based trajectory prediction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6488–6497. https://doi.org/10.48550/arXiv.2203.11474
Zhang W, Yao G, Yang B, Zheng WF, Liu C (2022) Motion prediction of beating heart using spatiotemporal LSTM. IEEE Signal Process Lett 29:787–791. https://doi.org/10.1109/LSP.2022.3154317
Liu RW, Liang M, Nie J, Lim WYB, Zhang Y, Guizani M (2022) Deep learning-powered vessel trajectory prediction for improving smart traffic services in maritime internet of things. IEEE Trans Netw Sci Eng. https://doi.org/10.1109/TNSE.2022.3140529
Visin F, Kastner K, Cho K, Matteucci M, Bengio Y (2015) ReNet: a recurrent neural network based alternative to convolutional networks. Comput Sci 25(7):2983–2996. https://doi.org/10.1109/TIP.2016.2548241
Bell S, Zitnick CL, Bala K, Girshick R (2016) Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2874–2883. https://doi.org/10.1109/CVPR.2016.314
Liang XD, Shen XH, Feng JS, Lin L, Yan SC (2016) Semantic object parsing with graph LSTM. In: European conference on computer vision, pp 125–143. https://doi.org/10.1007/978-3-319-46448-0_8
Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention, pp 234–241. https://doi.org/10.1007/978-3-319-24574-4_28
Bai S, Gu WC, Kong LX (2022) Interweave features of deep convolutional neural networks for semantic segmentation. Eng Appl Artif Intell 109:104587. https://doi.org/10.1016/j.engappai.2021.104587
Gao P, Ma T, Li HS, Lin ZY, Dai JF, Qiao Y (2022) ConvMAE: masked convolution meets masked autoencoders. arXiv preprint, arXiv:2205.03892. https://doi.org/10.48550/arXiv.2205.03892
Wang PQ, Chen PF, Yuan Y, Ding L, Huang ZH, Hou XD, Cottrell G (2018) Understanding convolution for semantic segmentation. In: 2018 IEEE winter conference on applications of computer vision (WACV), pp 1451–1460. https://doi.org/10.1109/WACV.2018.00163
Chen LC, Papandreou G, Schroff F, Adam H (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint, arXiv:1706.05587. https://doi.org/10.48550/arXiv.1706.05587
Zhao HH, Shi JP, Qi XJ, Wang XG, Jia JY (2017) Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. Honolulu, Hawaii, pp 2881–2890. https://doi.org/10.48550/arXiv.1612.01105
Orhan S, Bastanlar Y (2022) Semantic segmentation of outdoor panoramic images. Signal Image Video Process 16(3):643–650. https://doi.org/10.1007/s11760-021-02003-3
Irwin R, Dimitriadis S, He JZ, Bjerrum EJ (2022) Chemformer: a pre-trained transformer for computational chemistry. Mach Learn Sci Technol 3(1):015022. https://doi.org/10.1088/2632-2153/ac3ffb
Tian TL, Song C, Ting J, Huang HY (2022) A French-to-English machine translation model using transformer network. Procedia Comput Sci 199:1438–1443. https://doi.org/10.1016/j.procs.2022.01.182
Yadav S, Gupta D, Abacha AB, Demner-Fushman D (2022) Question-aware transformer models for consumer health question summarization. J Biomed Inform 128:104040. https://doi.org/10.1016/j.jbi.2022.104040
Achaji L, Barry T, Fouqueray T, Moreau J, Aioun F, Charpillet F (2022) PreTR: spatio-temporal non-autoregressive trajectory prediction transformer. arXiv preprint, arXiv:2203.09293. https://doi.org/10.48550/arXiv.2203.09293
Yao HY, Wan WG, Li X (2022) End-to-end pedestrian trajectory forecasting with transformer network. ISPRS Int J Geo-Inf 11(1):44. https://doi.org/10.3390/ijgi11010044
Deo N, Trivedi MM (2020) Trajectory forecasts in unknown environments conditioned on grid-based plans. arXiv preprint, arXiv:2001.00735. https://doi.org/10.48550/arXiv.2001.00735
Sadeghian A, Kosaraju V, Sadeghian A, Hirose N, Rezatofighi H, Savarese S (2019) SoPhie: an attentive GAN for predicting paths compliant to social and physical constraints. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1349–1358. https://doi.org/10.48550/arXiv.1806.01482
Liang JW, Jiang L, Hauptmann A (2020) SimAug: learning robust representations from simulation for trajectory prediction. In: European conference on computer vision, pp 275–292. https://doi.org/10.1007/978-3-030-58601-0_17
Mangalam K, Girase H, Agarwal S, Lee KH, Adeli E, Malik J, Gaidon A (2020) It is not the journey but the destination: endpoint conditioned trajectory prediction. In: European conference on computer vision. Springer, Cham, pp 759–776. https://doi.org/10.1007/978-3-030-58536-5_45
PhanMinh T, Grigore EC, Boulton FA, Beijbom O, Wolff EM (2020) CoverNet: multimodal behavior prediction using trajectory sets. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14063–14071. https://doi.org/10.1109/CVPR42600.2020.01408
Wang C, Wang Y, Xu M, Crandall DJ (2022) Stepwise goaldriven networks for trajectory prediction. IEEE Robot Autom Lett. https://doi.org/10.1109/LRA.2022.3145090
Cui HG, Radosavljevic V, Chou FC, Lin TH, Nguyen T, Huang TK, Schneider J, Djuric N (2019) Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In: 2019 International conference on robotics and automation (ICRA), pp 2090–2096. https://doi.org/10.1109/ICRA.2019.8793868
Chai YN, Sapp B, Bansal M, Anguelov D (2019) MultiPath: multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint, arXiv:1910.05449. https://doi.org/10.48550/arXiv.1910.05449
Yang B, Zhan WQ, Wang P, Chan CY, Cai YF, Wang N (2022) Crossing or not? Context-based recognition of pedestrian crossing intention in the urban environment. IEEE Trans Intell Transp Syst 23(6):5338–5349. https://doi.org/10.1109/TITS.2021.3053031
Acknowledgements
This work is supported by Postdoctoral Foundation of Jiangsu Province no. 2021K187B; National Postdoctoral General Fund no. 2021M701042; Changzhou Science and Technology Program with Grant no. CJ20210052; General Project of Jiangsu Provincial Department of Science and Technology no. BK20221380.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Keywords
 Trajectory prediction
 Transformer
 Multi-granularity
 Scenario understanding
 Inverse reinforcement learning