Trajectory distributions: A new description of movement for trajectory prediction

Trajectory prediction is a fundamental and challenging task for numerous applications, such as autonomous driving and intelligent robots. Current works typically treat pedestrian trajectories as a series of 2D point coordinates. However, in real scenarios, the trajectory often exhibits randomness, and has its own probability distribution. Inspired by this observation and other movement characteristics of pedestrians, we propose a simple and intuitive movement description called a trajectory distribution, which maps the coordinates of the pedestrian trajectory to a 2D Gaussian distribution in space. Based on this novel description, we develop a new trajectory prediction method, which we call the social probability method. The method combines trajectory distributions and powerful convolutional recurrent neural networks. Both the input and output of our method are trajectory distributions, which provide the recurrent neural network with sufficient spatial and random information about moving pedestrians. Furthermore, the social probability method extracts spatio-temporal features directly from the new movement description to generate robust and accurate predictions. Experiments on public benchmark datasets show the effectiveness of the proposed method.


Introduction
A pedestrian's trajectory is multimodal, and closely depends on the person's hearing, vision, touch, thoughts, and personality, and is also affected by other factors such as the static environment, dynamic human-human interactions, and planned destinations. Nevertheless, pedestrians still can intuitively predict the future trajectories of others and adjust themselves in advance. For example, when people walk in shopping malls, streets, and stations, they quickly predict the trajectories of others so as to choose their own route at the next moment and avoid collisions. The purpose of trajectory prediction is to enable machines, such as robots, self-driving cars, and intelligent tracking systems, to have the ability to predict future trajectories based on historical trajectories. This is a fundamental but extremely challenging task.
In previous works, researchers mainly focused on the following problems in trajectory prediction: interaction between pedestrians [1][2][3][4][5][6], interaction between pedestrians and their environment [7][8][9], and multi-modality [10,11]. Recently, more and more effort has been devoted to predicting multiple future trajectories [12][13][14][15][16], owing to the uncertainty in predicted trajectories. In the real world, a trajectory appears as a probability distribution. Even when the historical trajectory is known and fixed, a person may follow many different future trajectories depending on dynamic influences. For example, if the same person walks twice from the same starting location to the same destination, the two trajectories usually differ somewhat. Although the methods above can predict multiple future trajectories, their inputs are still fixed points: they treat a trajectory as a sequence of two-dimensional coordinates $(x_t, y_t)$. Each coordinate is fixed, and these separate points fail to represent the randomness of the trajectory. Consequently, these methods cannot capture the uncertainty in the trajectory caused by inherent randomness.
Other earlier works [1-3, 17, 18] made great progress in modeling the impact of human-human interactions. However, great challenges remain, since most of this work models pedestrian interactions by combining hidden states. Because the input is a one-dimensional vector, these hidden states are also one-dimensional, so they carry little spatial information. The lack of spatial information makes modeling interactions complicated and difficult.
In order to solve the above problems, we propose the concept of a trajectory distribution, which is an intuitive and effective motion description. The trajectory is no longer described by a series of fixed coordinates, but by a probability distribution (see Fig. 1). Specifically, we use a probability density function to map the pedestrian coordinates $(x_t, y_t)$ to two-dimensional Gaussian distributions $G(x_t, y_t)$. Unlike fixed point coordinates, the new description can represent randomness.
Moreover, we can conveniently map all pedestrians' trajectories at time $t$ into a single two-dimensional space, which offers unique advantages in modeling human-human interactions.
Based on the proposed trajectory distribution, we further develop a new method called the social probability method for predicting robust and accurate pedestrian trajectories. Firstly, the inputs to our method are trajectory distributions, which enable our forecasting model to fully consider the randomness of the trajectory. Secondly, by adding convolution layers to a recurrent neural network, our forecasting model can learn spatio-temporal features efficiently. We extract location information for pedestrians from the two-dimensional probability space through the convolutional neural network; the two-dimensional probability space contains the locations of all pedestrians. By performing convolution operations on the space, we can extract all pedestrians' location information and easily capture changes in relative locations. These two factors are indispensable for modeling interaction.
In summary, the main contributions of this paper are as follows:
1. Trajectory distributions for representing pedestrian trajectories. They capture the inherent randomness of trajectories, facilitating subsequent modeling of their indeterminism.
2. The social probability method, based on trajectory distributions and recurrent neural networks. Convolution operations on trajectory distributions allow better modeling of human-human interactions.
3. Experimental verification of our ideas on the ETH and UCY public pedestrian datasets, using the ADE and FDE metrics, showing that our approach is competitive with state-of-the-art methods.
The rest of the paper is organized as follows. In Section 2, we first review related work on trajectory prediction. Then we introduce the social probability method in detail in Section 3. In Section 4, we further present and analyse experimental results. Finally, we conclude and discuss future directions in Section 5.

Multi-outcome trajectory forecasting
In recent years, some researchers have tried to model randomness in trajectory prediction. Gupta et al. [10] solved the trajectory prediction problem using generative adversarial networks (GANs) [19] and considered the fact that pedestrian trajectories may have multiple plausible predictions. SoPhie [8] combined semantic scene segmentation with GANs to model trajectories. Multiverse [11] was a joint model to generate multiple plausible future trajectories, using multi-scale location encodings and convolutional RNNs over graphs. Simultaneously, Refs. [14,15,20] also proposed probability networks to incorporate randomness into vehicle trajectory prediction. However, these works all treat position information as two-dimensional point coordinates for input into the prediction model; doing so cannot completely describe the random behavior of pedestrians. Unlike these works, we take a trajectory distribution transformed from the moving trajectory as input, and generate multiple future trajectories.

Human-human interaction modeling in trajectory forecasting
In social interaction, researchers have utilized multiple methods of modeling interactions between pedestrians, such as social force [2], social pooling [1], and attention [3]. Methods [21,22] based on social force use the principle that attractive forces are used to guide people toward their destinations, and repulsive forces are used to avoid collisions, both human-human and human-obstacle. Most social force-based models try to learn the parameters of the social force functions from real-world crowd datasets. However, Alahi et al. [1] showed that attraction and repulsion alone cannot simulate complex crowd interactions. Other approaches [1,10,23,24] used a social pooling layer to allow LSTMs to share hidden states. This novel design can model human interaction efficiently, but complexity increases with crowd density. Thus, methods based on the attention mechanism have emerged [3,8]. Pedestrians can automatically perceive the importance of certain targets that affect their location in following time steps. Further methods [25][26][27] simultaneously learn spatial and temporal interaction patterns to capture spatio-temporal correlations efficiently and comprehensively, making their interactive models more suitable for real scenarios. RSBG (recursive social behavior graph) [18] established a group-based social interaction model to explore relationships that are not affected by spatial distance, and a graph convolutional neural network [28] has also been applied to trajectory prediction. In this paper, trajectory distributions are introduced, and the influence of spatial interactions is automatically perceived through convolution operations, which avoids the need to design complex interaction modules. Experimental results show that this method has better interaction performance.

Sequence prediction model
Sequence prediction uses past sequences to predict future sequences, so it is a time series modeling problem. Convolutional neural networks are very useful in the field of computer vision, but it is difficult to learn the characteristics of time series data using them. Recurrent neural networks are especially suitable for dealing with sequence-related data such as audio, video, and text. Recurrent neural networks and their derivatives, LSTMs [29] and GRUs (gated recurrent units) [30], have proved their effectiveness in many fields, such as machine translation [31], text generation [32,33], speech recognition [34][35][36], and traffic flow prediction [37]. Some researchers have combined convolutional neural networks with recurrent neural networks for novel applications, such as image captioning [33,38,39] and video understanding [40,41]. In order to learn spatio-temporal features simultaneously, Shi et al. [42] added convolution layers to a recurrent neural network. Their ConvLSTM model not only learns temporal relationships, but also extracts spatial features using the convolution layer. We take advantage of ConvLSTM to obtain spatio-temporal features and directly model interaction between pedestrians.

Approach
In this section, we first present our new pedestrian motion description, the trajectory distribution, which solves the problem of modeling multiple trajectories from the data description level, and then we propose a prediction model based on trajectory distributions to describe human-human interactions conveniently.

Problem definition
Our goal is to predict the future trajectories of the pedestrians in a scene. The input is the historical location information for each pedestrian in the scene, and the output is the future trajectory information for all pedestrians. We define the historical trajectory distributions of the pedestrians as $X = X_1, \ldots, X_n$ and the predicted future trajectory distributions as $\hat{Y} = \hat{Y}_1, \ldots, \hat{Y}_n$, where $n$ is the number of pedestrians. The input trajectory of pedestrian $i$ is defined as $X_i \sim \mathcal{N}(x_t^i, y_t^i)$ for time steps $t = 1, \ldots, t_{obs}$, and the future trajectory is defined similarly as $Y_i \sim \mathcal{N}(x_t^i, y_t^i)$ for time steps $t = t_{obs+1}, t_{obs+2}, \ldots, t_{pred}$, where $\mathcal{N}$ represents a Gaussian distribution. The prediction is denoted $\hat{Y}_i$ and the ground truth is denoted $Y_i$.

Mathematical definition
Supposing the feasible area for the pedestrians is $\Omega$, we represent the location of a single pedestrian at time $t$ as a probability distribution on $\Omega$. We use a two-dimensional Gaussian distribution, which characterizes the location of a trajectory well. The location distribution at time $t$ has the highest probability density at the center position $(x_t, y_t)$. This means that the pedestrian does not have to be at this fixed position, but also has some probability of being located elsewhere; the further away from the central location, the smaller the probability density becomes. We suppose that $(x_t, y_t)$ follows a two-dimensional Gaussian distribution with parameters $(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$, where $\mu_1$ and $\mu_2$ are the mean values of $x_t$ and $y_t$ respectively, $\sigma_1$ and $\sigma_2$ are the variances of $x_t$ and $y_t$, and $\rho$ is the correlation coefficient of $x_t$ and $y_t$. $\mu_1$ is set to $x_t$ and $\mu_2$ is set to $y_t$. $\sigma_1$ and $\sigma_2$ are set to 0.3 according to experience, and $\rho$ is 0. Using this data structure to represent trajectories, we retain the randomness of trajectories. In two-dimensional space, a pedestrian trajectory is no longer a single point at time $t$, but a probability distribution, as shown in Fig. 2(a).
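As an illustration of the mapping described above, the following minimal sketch renders one position as a two-dimensional Gaussian on a discrete grid. The grid size of 100 × 100 matches the implementation details reported later; the function name `trajectory_distribution`, the square region `extent` standing in for $\Omega$, and the treatment of the value 0.3 as a standard deviation are assumptions made for the example, not details given in the paper.

```python
import numpy as np

def trajectory_distribution(x_t, y_t, sigma=0.3, size=100, extent=(-5.0, 5.0)):
    """Render an axis-aligned 2D Gaussian (rho = 0) centred at (x_t, y_t).

    Assumptions: `extent` is a hypothetical square region standing in for the
    feasible area, and `sigma` is used as a standard deviation.
    """
    lo, hi = extent
    axis = np.linspace(lo, hi, size)
    # xs varies along columns, ys varies along rows
    xs, ys = np.meshgrid(axis, axis, indexing="xy")
    norm = 1.0 / (2.0 * np.pi * sigma**2)
    return norm * np.exp(-((xs - x_t) ** 2 + (ys - y_t) ** 2) / (2.0 * sigma**2))

# A pedestrian at (1.0, -2.0): the density peaks at the nearest grid cell.
p = trajectory_distribution(1.0, -2.0)
row, col = np.unravel_index(np.argmax(p), p.shape)
```

The density falls off smoothly around the peak cell, which is the property the text relies on to represent positional uncertainty.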

Integrating neighbor information
At time $t$, we denote the trajectory distribution of pedestrian $i$ by $p_t^i$. However, the scene at time $t$ contains multiple pedestrians. Neighboring pedestrians have great influence on the movement decisions of each subject pedestrian. In order to enable the model to predict future trajectories based on the locations of surrounding pedestrians, we need to integrate the trajectory distributions of all pedestrians at time $t$ into the same two-dimensional probability space. The integrated trajectory distribution at time $t$ is denoted $p_t$. In two-dimensional space, we integrate the $p_t^i$ into $p_t$ using the max(·) function. Specifically, at each position in the trajectory distribution, we take the largest value as the consolidated value, as follows:
$$p_t = \max\left(p_t^1, p_t^2, \ldots, p_t^n\right)$$
where $n$ is the number of pedestrians at time $t$. In order to distinguish the current predicted pedestrian from the other surrounding pedestrians, we set their $\sigma$ values to 0.1 and 0.3 respectively. A comparison using different $\sigma$ is shown in Fig. 2(b).
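The integration step can be sketched as an element-wise maximum over per-pedestrian maps. In this hedged example, the helper names `gaussian_map` and `integrate_scene`, the grid extent, and the example positions are hypothetical; only the max(·) rule and the $\sigma$ values of 0.1 (predicted pedestrian) and 0.3 (neighbors) come from the text.

```python
import numpy as np

def gaussian_map(cx, cy, sigma, size=100, extent=(-5.0, 5.0)):
    """2D Gaussian density on a grid (assumed extent; sigma as std. deviation)."""
    axis = np.linspace(extent[0], extent[1], size)
    xs, ys = np.meshgrid(axis, axis, indexing="xy")
    sq_dist = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-sq_dist / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)

def integrate_scene(positions, target_index, sigma_target=0.1, sigma_other=0.3):
    """Merge all pedestrians' maps at one time step by element-wise maximum."""
    maps = [
        gaussian_map(x, y, sigma_target if i == target_index else sigma_other)
        for i, (x, y) in enumerate(positions)
    ]
    return np.maximum.reduce(maps)

# Three pedestrians; pedestrian 0 is the one being predicted.
positions = [(0.0, 0.0), (2.0, 1.0), (-3.0, 2.5)]
p_t = integrate_scene(positions, target_index=0)
```

Because the target uses the smaller $\sigma$, its peak is sharper and higher than those of its neighbors, which is one way the merged map can keep the predicted pedestrian distinguishable.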

Convolutional LSTM
Due to its unique structure, the long short-term memory network (LSTM) has great advantages in processing time-sequence data. Moreover, Shi et al. [42] proposed a variant of LSTM called ConvLSTM, which adds a convolutional layer to the LSTM module, and demonstrated experimentally that this model can learn spatio-temporal information. In its main operations, $X_t$ is the input at time $t$, and $h_t$ and $c_t$ are the hidden state and cell state, respectively. $i_t$, $f_t$, and $o_t$ are the gates of the ConvLSTM. They are all three-dimensional tensors whose last two dimensions are spatial dimensions (width, height). $W$ is a weight matrix, and "•" denotes the Hadamard product. At time $t$, $X_t$ provides input to the module for calculation only when the input gate is activated. Similarly, the past cell state $c_{t-1}$ is forgotten when the forget gate $f_t$ is activated, and the current cell state $c_t$ is transferred when the output gate $o_t$ is open. ConvLSTM uses the current input and past states to determine future states; the current input includes not only temporal features, but also spatial features. The temporal features can be learned through the gate structure mentioned above, and the spatial features can be extracted through the convolutional layer embedded in the module. Essentially, trajectory prediction can be regarded as a spatio-temporal sequence generation problem. Therefore, by applying ConvLSTM to it, we can model the temporal characteristics of the trajectory while also considering spatial interactions between different trajectories.
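For reference, the ConvLSTM update as formulated by Shi et al. [42] can be written as below, using the lower-case state symbols of this section. Here $*$ denotes convolution, $\sigma(\cdot)$ is the logistic sigmoid, and the bias terms $b$ and the peephole terms $W_{c\cdot} \bullet c$ follow the formulation in Ref. [42]; whether the model in this paper keeps the peephole terms is an assumption, since it is not stated in the text.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * h_{t-1} + W_{ci} \bullet c_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * h_{t-1} + W_{cf} \bullet c_{t-1} + b_f\right) \\
c_t &= f_t \bullet c_{t-1} + i_t \bullet \tanh\left(W_{xc} * X_t + W_{hc} * h_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * h_{t-1} + W_{co} \bullet c_t + b_o\right) \\
h_t &= o_t \bullet \tanh\left(c_t\right)
\end{aligned}
```

The only difference from a fully connected LSTM is that the input-to-state and state-to-state products are convolutions, so $h_t$ and $c_t$ keep their spatial layout.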

Social probability
As we show in Fig. 3, the social probability method is a trajectory prediction method based on trajectory distributions. Firstly, we map the position information of all pedestrians at time t into trajectory distributions. Then, the ConvLSTM module takes two-dimensional trajectory distributions as input and outputs predictive trajectory distributions at future time t + 1. The coordinates of trajectory points can be obtained by sampling the outputs.
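The final step, obtaining coordinates by sampling the output, could look like the following sketch. It assumes the predicted map is a non-negative grid that can be normalized into a categorical distribution over cells, and that cell indices map back to coordinates through a known grid extent; the function name `sample_coordinates`, the extent, and the toy map are hypothetical and not taken from the paper.

```python
import numpy as np

def sample_coordinates(density, num_samples, extent=(-5.0, 5.0), rng=None):
    """Draw (x, y) samples from a 2D density map defined on a regular grid."""
    rng = np.random.default_rng() if rng is None else rng
    size_y, size_x = density.shape
    # Normalize the map into a probability for each grid cell.
    probs = np.clip(density, 0.0, None).ravel()
    probs = probs / probs.sum()
    flat_idx = rng.choice(probs.size, size=num_samples, p=probs)
    rows, cols = np.unravel_index(flat_idx, density.shape)
    # Map cell indices back to coordinates (assumed grid extent).
    xs_axis = np.linspace(extent[0], extent[1], size_x)
    ys_axis = np.linspace(extent[0], extent[1], size_y)
    return np.stack([xs_axis[cols], ys_axis[rows]], axis=1)

# Toy stand-in for a predicted map: a Gaussian bump near (1.0, -2.0).
axis = np.linspace(-5.0, 5.0, 100)
gx, gy = np.meshgrid(axis, axis, indexing="xy")
toy = np.exp(-((gx - 1.0) ** 2 + (gy + 2.0) ** 2) / (2.0 * 0.3**2))
samples = sample_coordinates(toy, num_samples=20, rng=np.random.default_rng(0))
```

Each call returns a different set of candidate positions, which is how a single predicted map can yield multiple plausible trajectories.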

Probability-based prediction
The input to the ConvLSTM needs to be two-dimensional tensors. As discussed in Section 3.2, our trajectory distribution is a probability distribution in two-dimensional space. Therefore, it is suitable as input to the ConvLSTM model. Moreover, trajectory distributions are essentially probability density distributions. The value of the trajectory distribution indicates the level of probability density. Modeling trajectory distributions directly makes our method a probability-based forecasting method. Our method not only predicts multiple future trajectories, but also takes a multimodal historical trajectory as input, unlike previous methods. The problem of modeling multimodal features is thus addressed at the data level.

Human-human interaction modeling
The input to our model comprises the trajectory distributions of all pedestrians at time $t$, integrated into one two-dimensional space, so modeling human-human interactions is direct. As illustrated in Fig. 4, after trajectory distributions are input into the model, the convolutional layer extracts features from the two-dimensional trajectory distribution to obtain the hidden state, which plays the role of the feature vector in RNN-based models. Since the convolution kernel slides across the entire two-dimensional space like a sliding window, hidden states contain location information for each pedestrian. Thus, owing to the convolution operation, the model considers not only the density value at the current position, but also the density values at surrounding positions when predicting the future probability density value. Therefore, our model takes into account the locations of all pedestrians at time $t$, capturing human-human interactions without complex interaction modules.

Loss function
We empirically choose the loss function to train our model by following previous works [43,44]. Since our model focuses on the specific probability density value, rather than some high-dimensional features, such as style, graphics, or objects, we use an L2 loss function to encourage our model to generate accurate probability density distributions. The loss is computed between $\hat{Y}_t^i$ and $Y_t^i$, the predicted and ground truth trajectory distributions for person $i$ at time $t$, respectively.

Fig. 3 Overview of the social probability method. We use a separate ConvLSTM network for each trajectory in the scene. Inputs and outputs of the model are both trajectory distributions. The trajectory distribution is mapped from fixed point coordinates; a single trajectory distribution at time $t$ contains all pedestrians' trajectory information. The ConvLSTM network consists of a convolutional layer and gate modules, and has the ability to learn spatio-temporal features. In the prediction stage, the output of our model is again in the form of trajectory distributions, and trajectory coordinates can be obtained by sampling them.
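One plausible form of the L2 loss in this section, written with the symbols of the problem definition, is given below. The summation ranges (all $n$ pedestrians and all predicted time steps) and the absence of a normalizing constant are assumptions; the text only states that an L2 loss is taken between predicted and ground truth trajectory distributions.

```latex
\mathcal{L} = \sum_{i=1}^{n} \sum_{t=t_{obs+1}}^{t_{pred}} \left\| \hat{Y}_t^i - Y_t^i \right\|_2^2
```

Here the norm runs over all grid cells of the two-dimensional map, so the loss penalizes density errors at every spatial position rather than at a single coordinate.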

Experiments
In this section, we present experimental results using five public datasets, and compare our method to state-of-the-art methods, as well as analyzing the performance of our method.

Datasets
We validated the proposed model on the public ETH [22] and UCY [45] datasets, which are widely used benchmarks in the field of trajectory prediction. Most state-of-the-art methods have been evaluated on these datasets. They contain a total of 1536 labeled pedestrians in 4 different scenes. These datasets are based on binocular vision for research on pedestrian trajectory tracking and prediction. There are 5 sub-datasets altogether: ETH contains the ETH and HOTEL subsets, while UCY has three subsets: ZARA1, ZARA2, and UNIV. Following previous works, we observe the historical trajectory for the past 8 time steps (3.2 s) and predict the future trajectory for the next 12 time steps (4.8 s).

Evaluation metrics and methods
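Following previous works [1], we use two evaluation metrics:
• Average displacement error (ADE): The average Euclidean distance between the predicted trajectories and the true trajectories over all prediction time steps:
$$\mathrm{ADE} = \frac{\sum_{i=1}^{Z} \sum_{t=t_{obs+1}}^{t_{pred}} \sqrt{(\hat{x}_t^i - x_t^i)^2 + (\hat{y}_t^i - y_t^i)^2}}{Z \, (t_{pred} - t_{obs})}$$
• Final displacement error (FDE): The Euclidean distance between the predicted destination and the ground truth destination at the last prediction time step:
$$\mathrm{FDE} = \frac{\sum_{i=1}^{Z} \sqrt{(\hat{x}_{t_{pred}}^i - x_{t_{pred}}^i)^2 + (\hat{y}_{t_{pred}}^i - y_{t_{pred}}^i)^2}}{Z}$$
where $(\hat{x}_t^i, \hat{y}_t^i)$ and $(x_t^i, y_t^i)$ are the predicted and ground truth coordinates for pedestrian $i$ at time $t$ respectively, and $Z$ is the total number of pedestrians in the test set.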
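The two metrics can be computed in a few lines. This sketch assumes predictions and ground truth are stored as arrays of shape (Z, T, 2), with Z pedestrians and T predicted time steps; the function name `ade_fde` and the toy data are illustrative only.

```python
import numpy as np

def ade_fde(pred, gt):
    """Return (ADE, FDE) for arrays of shape (Z, T, 2)."""
    # Euclidean distance per pedestrian and time step, shape (Z, T)
    dist = np.linalg.norm(pred - gt, axis=-1)
    ade = dist.mean()          # average over pedestrians and time steps
    fde = dist[:, -1].mean()   # average over pedestrians at the last step
    return ade, fde

# Toy check: every prediction is offset by (3, 4), so each distance is 5.
gt = np.zeros((2, 12, 2))
pred = gt.copy()
pred[..., 0] += 3.0
pred[..., 1] += 4.0
ade, fde = ade_fde(pred, gt)
```

With a constant offset both metrics coincide; in general FDE is larger than ADE because errors tend to accumulate toward the end of the prediction horizon.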
We use a leave-one-out approach to evaluate the performance of the model. Four sets are used for training and validation, and the remaining one is used as the test set.
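The leave-one-out protocol amounts to five train/test rounds, one per subset. The sketch below only enumerates the splits; the generator name `leave_one_out_splits` is hypothetical, and how the four remaining subsets are divided into training and validation parts is not specified in the text.

```python
SUBSETS = ["ETH", "HOTEL", "UNIV", "ZARA1", "ZARA2"]

def leave_one_out_splits(subsets):
    """Yield (train_and_validation_subsets, held_out_test_subset) pairs."""
    for held_out in subsets:
        train_val = [s for s in subsets if s != held_out]
        yield train_val, held_out

splits = list(leave_one_out_splits(SUBSETS))
for train_val, test in splits:
    print(f"train/val on {train_val}, test on {test}")
```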

Implementation details
Five layers are used in the ConvLSTM model and the hidden state channel size in each layer is 128, 64, 64, 32, 32, respectively. The kernel size of the convolutional layer is 3 × 3 and the padding is 1. We train our model using Adam [46] with an initial learning rate of 0.001. The sizes of the trajectory distribution and the hidden state of our model are both 100 × 100. In the prediction stage, the variance of the current pedestrian to be predicted is set to 0.1, and the other pedestrians are set to 0.3. In the testing stage, we sample 20 times from the trajectory distribution predicted by the model, and select the best prediction in terms of Euclidean distance for quantitative estimation.
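The best-of-20 selection used at test time can be sketched as follows. The example assumes each candidate is a full trajectory of shape (T, 2) and that "best in terms of Euclidean distance" means the lowest mean distance to the ground truth over the prediction steps; that reading, the function name `best_of_k`, and the synthetic candidates are assumptions.

```python
import numpy as np

def best_of_k(samples, gt):
    """Pick the candidate closest to the ground truth.

    samples: (K, T, 2) candidate trajectories; gt: (T, 2) ground truth.
    Returns (index, trajectory, mean Euclidean error) of the best candidate.
    """
    errors = np.linalg.norm(samples - gt[None], axis=-1).mean(axis=1)  # (K,)
    best = int(np.argmin(errors))
    return best, samples[best], float(errors[best])

# Synthetic example: 20 noisy candidates around a straight-line ground truth.
rng = np.random.default_rng(0)
gt = np.cumsum(np.full((12, 2), 0.4), axis=0)
samples = gt[None] + rng.normal(scale=0.3, size=(20, 12, 2))
samples[7] = gt + 0.01  # plant one near-perfect candidate
best, traj, err = best_of_k(samples, gt)
```

Reporting the best of several samples follows the evaluation convention of the stochastic baselines the paper compares against, so the numbers measure whether at least one sampled future is close to what actually happened.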

Comparison with other methods
As shown in Table 1, we choose the following methods for comparison:
1. Linear: A linear regression model which predicts the trajectory by minimizing the least square error.
2. Plain-LSTM: Uses the LSTM model to predict the future trajectory. This method only considers the pedestrian's own historical trajectory and does not consider any other factors.
3. Social-LSTM [1]: A social-pooling layer is added to the LSTM, so that the model has the ability to model human-human interactions.
4. Social-GAN [10]: A trajectory prediction model trained with a GAN architecture, designed to improve existing models in terms of rationality, diversity, and prediction speed. The model pays attention to the feasibility of generating trajectory predictions using social rules.
5. Social-GAN-P [10]: As Social-GAN, but without the pooling mechanism.
6. SoPhie [8]: An interpretable framework based on GAN for trajectory prediction. It uses two information sources, the historical trajectories of all pedestrians in a scene and the scene contextual information from the scene image.
7. RSBG [18]: A group-based social interaction model to explore pedestrian relationships that are not affected by spatial distance. A graph convolutional neural network is applied to trajectory prediction in this model.
8. NEXT [7]: An end-to-end multi-task learning system that uses pedestrian behavior information and its surrounding scene environment to predict trajectory. This method uses behavior information for the first time to improve the accuracy of trajectory prediction.
Table 1 presents average displacement error and final displacement error for our method and existing methods, given the task of predicting 12 future time steps from 8 historical time steps. We follow comparative works in choosing the best prediction among multiple samples for quantitative analysis. It can be seen that the linear model usually performs worst, because it is only suitable for predicting straight trajectories, and is insensitive to pedestrian interaction. Social-LSTM and Social-GAN perform better than the linear method since they can handle interactions between pedestrians via the corresponding interaction module. We can see that our method outperforms all others in terms of FDE for the ETH and UNIV datasets, avoiding more potential future collisions. Although the performance of our method is not the best on other datasets, it is still very competitive and significantly superior to ...

These examples have large gaps in predictions, or go in the wrong direction. By analyzing the source videos, we found that such cases were generated when pedestrians stopped walking or turned suddenly, the main reason being the unpredictability of pedestrian intent.
Another reason is that when pedestrians interact with their physical surroundings, the model does not handle scene information. Integrating information about the scene is a direction for our future research.

Attention mechanism
In our experiments, we tried using a spatial attention mechanism [47] to improve the prediction accuracy of our model. In the two-dimensional trajectory distribution space, an attention module was applied to capture which locations have more influence. However, we found that the attention mechanism did not improve our results as expected (see Table 2). The reason may be that the trajectory distribution already plays a role similar to attention: the probability density at each spatial position represents the importance of that location, much like the weight value in an attention mechanism.

Integration of trajectory distributions
As explained in Section 3.2, we integrate the trajectory distributions of all pedestrians at time t into a single two-dimensional probability space.
Here, we omit the other trajectory distributions to verify the ability of our model to capture interaction. Thus, when predicting the trajectory of person $i$, the trajectory distribution only contains that person's own trajectory information, and the trajectory information of surrounding people is omitted. We conducted experiments using the ETH dataset (see Table 3). The method with integration has clearly better ADE and FDE, showing that integrating trajectory distributions provides the ability to model human-human interaction.

Size of trajectory distribution
The trajectory distribution is two-dimensional, so a suitable size must be determined. We set up a comparison, using sizes of 80 × 80, 100 × 100, 120 × 120, 150 × 150, 170 × 170, and 200 × 200. Results are shown in Fig. 6. The predicted result is best when the size is 100 × 100. Too large or too small a size decreases prediction accuracy. We sampled from the ground truth and found that as the size increases, the sampling error also increases, which may explain the decrease in prediction accuracy at larger sizes. Conversely, as the size decreases, the trajectory distribution becomes incapable of modeling large enough amounts of data.

Conclusions
In this paper, we have proposed the concept of trajectory distributions, which have advantages in representing the randomness of trajectories, and explored a new trajectory prediction method based on them. To encode social interaction features, we used ConvLSTM, a sequence-to-sequence prediction model with the ability to model spatio-temporal information. Experiments on public datasets show the effectiveness of our method. Although it is not the best on all datasets, our method is simple and has great potential. Our current work does not incorporate the physical environment; adding such information to our model is the direction of our future work.