Introduction

Modern society is characterized by high-density flows of people and vehicles, so forecasting the future trajectories of moving targets plays an increasingly critical role in many applications. For example, in autonomous driving [1,2,3,4], an assistant system equipped with a trajectory prediction module can help drivers control vehicles [5,6,7], anticipate pedestrians’ walking intentions [8] on crowded roads, and reduce traffic accidents caused by driver fatigue or inattention. Meanwhile, studying crowd walking paths and behavior is also significant for intelligent social robots [9], smart city construction [10], and the cultural and entertainment industries. For example, restaurant service robots [11] need to predict the trajectories of guests to optimize their service paths, and intelligent tracking and monitoring systems [12] in cities need to understand the interactions between pedestrians to prevent dangerous situations and maintain social stability. Therefore, predicting pedestrian trajectories has become a pressing research problem.

Fig. 1

The illustration of training a trajectory prediction model with the trajectory data in multiple scenes using a federated training paradigm. The three scenes in the figure represent different real-life scenes, thus containing diverse trajectory patterns. The federated server cannot directly access the data in federated clients, thus protecting data privacy

The high performance of pedestrian trajectory prediction models relies mainly on rich trajectory data from different scenes. However, current research mostly adopts a centralized manner, artificially aggregating data from multiple scenes for centralized training. Such a manner does not match the reality that data are scattered across various surveillance devices in cities that cannot be connected and shared. Moreover, the most severe issue is the possible leakage of user privacy, which undermines data security. With the growing self-protection consciousness of ordinary people, the security of personal data [13] has aroused great concern across all social strata. According to the General Data Protection Regulation [14], data is the user’s private property, and no organization has the right to use it without prior consent. Therefore, the research challenge is to combine the data of multiple scenes and federally train the trajectory prediction model on the server without resorting to traditional centralized aggregation or violating users’ privacy.

Federated learning [13, 15, 16] can provide an intelligent solution for breaking data island limitations and securing privacy in collaborative training over data from multiple scenes. It has excellent potential for federated training of multi-scene data [17]. Instead of gathering data from different scenes, which may compromise data privacy [18, 19], federated learning maintains a shared global model on a federated server, which exchanges information with federated clients where local models are trained on the data of different scenes. Federated learning has been applied in many fields [13], especially in medicine [20] for protecting patients’ privacy. However, studies that combine pedestrian trajectory prediction with federated learning remain lacking. Therefore, we introduce a pedestrian trajectory prediction model trained on data collected from different scenes based on a federated learning paradigm (as shown in Fig. 1).

In summary, the main contributions of this work are as follows:

  1. (1)

    To forecast trajectories in a local scene, we propose a lightweight destination-oriented LSTM-based trajectory prediction (DO-TP) network in an encoder-decoder manner. The learned destination information can guide the model to generate credible future trajectories with limited trajectory data.

  2. (2)

    To improve trajectory prediction performance, we leverage trajectory data from different scenes while preserving data privacy. A federated learning strategy is introduced to solve the data island problem by keeping data on federated clients for cooperative training.

  3. (3)

    To find the most suitable federated learning paradigm for trajectory prediction, we conduct quantitative and qualitative evaluations on the reintegrated ETH, UCY, and SDD datasets to fairly compare different federated learning algorithms.

The rest of this work is arranged as follows. The following section reviews the related work. The proposed method is introduced in detail in “Methods”. The evaluation results are presented in “Experiments”. Conclusions and discussions are presented in “Conclusion and future works”.

Related work

Federated learning

Federated learning is a recently proposed machine learning framework that has become popular for protecting personal privacy and addressing the data island problem [21]. FedAvg [15], proposed in 2016, learns a shared global model by jointly training on local data across mobile devices, so data privacy is largely protected. The survey by Yang et al. [13] showed a bright future for integrating federated learning into various fields. Tan et al. [22] introduced federated learning into online recommendation systems, solving the data island problem in recommendation and protecting user privacy. Bai et al. [23] proposed the FedFace framework to apply federated learning to face recognition, training on multi-party data while avoiding user privacy leakage.

In recent years, federated learning has shown a clear trend of being combined with reinforcement learning, knowledge distillation, and contrastive learning. Nadiger et al. [24] combined reinforcement learning with federated techniques so that users with similar tasks could learn from each other, which in turn reduced the personalization time. Qi et al. [25] summarized existing research on federated reinforcement learning and its classification, and discussed future directions. Chen et al. [26] proposed a federated learning concept based on cyclic knowledge distillation: by removing the central server, public knowledge within each federation is accumulated cyclically and then personalized through training, leading to MetaFed, a highly credible and personalized federated framework. Li et al. [27] proposed model-contrastive federated learning (MOON), which performs contrastive learning at the model level to address data heterogeneity and improve the performance of federated models on image datasets.

At present, federated learning has received little attention in the trajectory prediction community. However, research on pedestrian trajectory prediction urgently needs to break the current situation in which data are siloed and cannot be effectively shared. Therefore, we propose a pedestrian trajectory prediction model based on federated learning, which can efficiently use trajectory data from multiple scenes while protecting data privacy. Moreover, the performance of three federated algorithms in trajectory prediction is explored, and the most suitable one is selected.

Trajectory prediction

The core of trajectory prediction is to predict the future trajectory from the historical trajectory of pedestrians [28]. Early studies relied on kinematic, statistical [29], and probabilistic models [30]. However, due to their weak fitting ability, these methods often produced predictions that deviated from the actual trajectories.

With the development of deep learning [31], data-driven trajectory prediction has become a significant research direction. In 2016, Alahi et al. proposed Social-LSTM [32], which learned pedestrians’ motion patterns with LSTM and captured social interactions with a social pooling layer. Later, SGAN [33] introduced generative adversarial networks to generate socially acceptable and diverse trajectories. Kosaraju et al. [34] proposed a graph-based generative adversarial network (Social-BiGAT) that improves the multi-modality of future trajectories with a latent scene encoder; a graph attention module models all pedestrian interactions within a scene. A multi-modal end-to-end trajectory prediction network (Goal-GAN) [35] was proposed that combines goal estimation and route navigation modules: the pedestrian’s final goal location is predicted from the observed path and environmental information, and a feasible trajectory to reach that goal is then generated.

Fig. 2

The overall framework of Fed-TP, which does not share trajectory data between different federated clients. Fed-TP trains different local trajectory prediction models in different federated clients based on the global model sent by the federated server. Afterward, all model parameters are aggregated in the federated server to train a new global model. Then, the federated server broadcasts the new global model to the federated clients participating in each round. Specifically, Y denotes the historical trajectory in the time period \(t=1 \sim T_\textrm{obs}\), and \({\hat{Y}}\) represents the future trajectory in the time period \(t=T_{\textrm{obs}+1} \sim T_\textrm{pred}\). In the right rectangle, solid and dashed lines represent pedestrians’ historical and future trajectories, respectively

Pedestrian trajectory prediction suffers from future uncertainties arising from pedestrians’ internal and external factors, such as potential destinations, interference from surrounding pedestrians, and scene constraints. To better handle the uncertainty of future pedestrian movements and improve prediction performance, Yang et al. [36] proposed a POP (pseudo oracle predictor) module that generates an informative latent variable by learning the future behavior of pedestrians; it can be used directly in the testing phase and facilitates the broad application of trajectory prediction. A scene-oriented inverse reinforcement learning method [37] was proposed for trajectory prediction that exploits the strong correlation between scene and trajectory, alleviating the over-fitting problem of existing trajectory methods. In realistic crowded scenes, a person’s walking route is often influenced by a group. Therefore, Bae et al. [38] proposed GP-Graph, which first assigns pedestrians to the group with the highest similarity and then builds graphs based on the interpersonal interactions within and between groups.

The above research ignores the privacy of data and the reality that data are scattered and difficult to collect. We introduce federated learning to make up for these shortcomings and explore the feasibility of combining federated learning with pedestrian trajectory prediction. For the trajectory prediction model itself, a lightweight DO-TP is proposed to perform prediction in each scene. Unlike former works, we resort only to trajectory data to estimate the potential destination, which avoids the added complexity of semantic segmentation.

The research mentioned above and our proposed method are based on data-driven models. With the vigorous development of artificial intelligence, we also note that recent research uses mathematical structures [39] to constrain machine learning algorithms and guide the generation of optimal strategies. Borodin et al. [40] established a mathematical model of the indicators that affect financial situations, studied the profitability of enterprises, and analyzed their development prospects. Tutsoy et al. [41] proposed a multi-dimensional artificial-intelligence-based decision algorithm that uses a mathematical model to derive constraints; experimental results showed that it can generate specific optimal strategies according to the importance of each sub-model. Bouchnita et al. [42] combined a mathematical model with deep learning to rapidly predict patients’ specific responses to anticoagulant therapy, supporting clinical decision-making and effective management of coagulopathy.

Methods

Trajectory prediction performance is difficult to improve when scene data cannot be effectively shared, yet manually concentrating trajectory data from different scenes carries a high risk of privacy leakage. Therefore, we propose a federated learning-based trajectory prediction (Fed-TP) method to address these two significant limitations. Fed-TP consists of two parts: (1) a lightweight DO-TP that forecasts pedestrians’ future trajectories from their historical trajectories with an LSTM-based encoder-decoder, where potential destinations are learned from observed ground-truth and predicted positions without any scene information; and (2) a federated learning framework that trains the trajectory prediction model with data from multiple scenes in a privacy-preserving manner. As presented in Fig. 2, the trajectory data of each scene are retained locally, and the model is trained in each federated client, avoiding privacy leakage during data transmission. All parameters are then aggregated in the federated server to train a global model. Afterward, different federated learning algorithms are evaluated to select the most suitable training paradigm.

Fig. 3

The pipeline of DO-TP. \(Pos\) and \(\widetilde{Pos}\) represent the observed and future predicted coordinates, respectively. Z represents the motion feature vector

Destination-oriented trajectory prediction

This section presents the definition of trajectory prediction. Assuming pedestrians \(P_{1}, P_{2},\ldots , P_{n}\) exist in the scene, we first term the position \(\left\{ x_{i}^t, y_{i}^t \right\} \) of pedestrian \(P_{i}\) at time step t as \(P_{i}^t\). The purpose of trajectory prediction is to forecast the future trajectory \( \hat{Y_{i}} =\left\{ P_{i}^{T_{\textrm{obs}+1}},\ldots , P_{i}^{T_\textrm{pred}}\right\} \), considering the historical trajectory \(Y_{i}=\left\{ P_{i}^1,\ldots , P_{i}^{T_\textrm{obs}}\right\} \). \(T_\textrm{obs}\) and \(T_\textrm{pred}\) are the lengths of observation and prediction, respectively.

A lightweight DO-TP is proposed considering the computational burdens in intelligent edge devices. DO-TP takes historical trajectories of all pedestrians in the scenes as input and outputs their predicted trajectories. To achieve trajectory prediction, the model comprises the LSTM-based encoding and decoding modules, which encode pedestrian motion patterns from their historical trajectories and decode the predicted trajectories from the learned motion patterns. Moreover, the model contains two LSTMs and two fully connected (FC) layers to predict destinations. Similar to [36], pedestrians’ relative displacements are fed into the encoding module to obtain the hidden states that represent pedestrians’ motion patterns from their observed trajectories, as follows:

$$\begin{aligned} Z_i^t=\theta \left( \left\{ \Delta x_i^t, \Delta y_i^t \right\} ; W_v \right) \end{aligned}$$
(1)
$$\begin{aligned} H_i^t = F_\textrm{enc} (H_i^{t-1}, Z_i^t; W_e ) \end{aligned}$$
(2)

where a linear transformation layer \(\theta (\cdot )\) with learnable parameter \(W_v\) is used to map the input displacement \(\left\{ \Delta x_i^t,\Delta y_i^t\right\} \) into the 64-dimensional motion feature vector \(Z_i^t\). \(F_\textrm{enc}\) denotes the LSTM-based encoding module with learnable parameter \(W_e\). After Xavier initialization, \(W_v\) and \(W_e\) are gradually learned while updating the network through error back-propagation until given epochs. \(H_i^{t-1}\) and \(H_i^t\) represent the hidden states of \(F_\textrm{enc}(\cdot )\) at time steps \(t-1\) and t, respectively. The relative displacement \(\left\{ \Delta x_i^t, \Delta y_i^t\right\} \) is defined as:

$$\begin{aligned} \Delta x_i^t = \left( x_i^t - x_i^{t-1} \right) \end{aligned}$$
(3)
$$\begin{aligned} \Delta y_i^t = \left( y_i^t-y_i^{t-1} \right) \end{aligned}$$
(4)

where \((x_i^t, y_i^t)\) denotes the two-dimensional spatial coordinate of pedestrian \(P_i\) at time step t.

Inspired by GTPPO [36], we propose a destination prediction strategy without scene semantic segmentation to keep the model lightweight. Specifically, we use two LSTMs to extract sequential position information from the observed and future predicted spatial coordinates. Afterward, two fully connected (FC) layers map the position information into 32-dimensional destination-aware latent vectors \(D_i\) and \(\hat{D_i}\). Considering that the observed ground-truth spatial coordinates contain information about potential destinations, we minimize the KL divergence between \(D_i\) and \(\hat{D_i}\) during training. Subsequently, the latent vector \(D_i\) learned from the observed coordinates can provide destination information that guides the decoding module to generate more precise future trajectories.

\(F_\textrm{dec}(\cdot )\) denotes the LSTM-based decoding module with learnable parameter \(W_d\), which generates future trajectories based on the encoded motion feature vector, the hidden state, and the destination-aware latent vector, as follows:

$$\begin{aligned} Q_i^{T_{\textrm{obs}+1}} = F_\textrm{dec} \left( H_i^{T_\textrm{obs}}, Z_i^{T_\textrm{obs}} \Vert D_i; W_d \right) \end{aligned}$$
(5)
$$\begin{aligned} \left\{ \Delta x_i^{T_{\textrm{obs}+1}}, \Delta y_i^{T_{\textrm{obs}+1}} \right\} = \delta \left( Q_i^{T_{\textrm{obs}+1}}, W_c \right) \end{aligned}$$
(6)

where \(\Vert \) denotes the concatenation operation. \(\delta (\cdot )\) is a linear transformation layer with learnable parameter \(W_c\) that converts the hidden state of the decoder \(Q_i^{T_{\textrm{obs}+1}}\) into the predicted relative displacement \(\left\{ \Delta x_i^{T_{\textrm{obs}+1}}, \Delta y_i^{T_{\textrm{obs}+1}} \right\} \), which is further used to forecast pedestrian \(P_i\)’s future trajectory \(\hat{Y_i}\) through the inverse operation of Eqs. (3) and (4). Figure 3 shows the pipeline of DO-TP. After local training, the model in a local scene changes from \(\omega _g\) to \(\omega _{g+1}^k\), where k denotes the index of the participating client scene.
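The following is a minimal PyTorch sketch of the DO-TP pipeline under our reading of Eqs. (1)–(6): a 64-dimensional motion embedding, 32-dimensional LSTM hidden states and destination latents (matching the implementation details given later), an LSTM/FC destination branch, and autoregressive decoding of relative displacements. The module names, the softmax-based KL treatment of the latent vectors, and the decoding details are illustrative assumptions, not the authors’ exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DOTP(nn.Module):
    """Sketch of the destination-oriented trajectory predictor (DO-TP)."""

    def __init__(self, feat_dim=64, hidden_dim=32, dest_dim=32):
        super().__init__()
        self.embed = nn.Linear(2, feat_dim)                       # theta(.), Eq. (1)
        self.encoder = nn.LSTM(feat_dim, hidden_dim)              # F_enc, Eq. (2)
        self.decoder = nn.LSTM(feat_dim + dest_dim, hidden_dim)   # F_dec, Eq. (5)
        self.out = nn.Linear(hidden_dim, 2)                       # delta(.), Eq. (6)
        # destination branch: one LSTM + FC per coordinate stream
        self.dest_obs_lstm = nn.LSTM(2, hidden_dim)
        self.dest_fut_lstm = nn.LSTM(2, hidden_dim)
        self.dest_obs_fc = nn.Linear(hidden_dim, dest_dim)
        self.dest_fut_fc = nn.Linear(hidden_dim, dest_dim)

    def forward(self, obs_pos, fut_pos=None, pred_len=12):
        # obs_pos: (T_obs, N, 2) absolute coordinates of N pedestrians
        disp = obs_pos[1:] - obs_pos[:-1]                         # Eqs. (3)-(4)
        _, (h, c) = self.encoder(self.embed(disp))                # Eqs. (1)-(2)

        # destination-aware latent D_i from the observed coordinates
        _, (h_obs, _) = self.dest_obs_lstm(obs_pos)
        d_obs = self.dest_obs_fc(h_obs[-1])
        kl = None
        if fut_pos is not None:                                   # training only
            _, (h_fut, _) = self.dest_fut_lstm(fut_pos)           # \hat{D}_i branch
            d_fut = self.dest_fut_fc(h_fut[-1])
            # one possible KL treatment over the latents (assumption)
            kl = F.kl_div(F.log_softmax(d_obs, dim=-1),
                          F.softmax(d_fut, dim=-1), reduction="batchmean")

        # autoregressive decoding of relative displacements
        last_pos, last_disp = obs_pos[-1], disp[-1]
        preds = []
        for _ in range(pred_len):
            inp = torch.cat([self.embed(last_disp), d_obs], dim=-1).unsqueeze(0)
            out, (h, c) = self.decoder(inp, (h, c))               # Eq. (5)
            last_disp = self.out(out.squeeze(0))                  # Eq. (6)
            last_pos = last_pos + last_disp                       # invert Eqs. (3)-(4)
            preds.append(last_pos)
        return torch.stack(preds), kl          # (pred_len, N, 2) trajectory, KL term
```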

Federated learning framework

To overcome the drawbacks of privacy leakage in data aggregation-based trajectory prediction, we introduce the federated learning framework and propose Fed-TP. Given that each scene in \((S_1, S_2,\ldots , S_m)\) has its own trajectory data, where m denotes the number of scenes, Fed-TP is trained with the data in each scene without aggregating the private data of all scenes. In each training round of federated learning, K scenes are randomly selected from the total m scenes. The training is divided into local and global steps, as follows:

  1. (1)

    Local training: each federated client trains the model \(\omega _g\) for E local rounds to update the model parameters on its own data. As shown in Eq. (7), after E rounds of training, each client updates the global model \(\omega _g\) to its local model \(\omega _{g+1}^k\). The model parameters of all federated clients are then transmitted to the federated server for the joint update using encryption and privacy protection technology. The local training loss \(L_k\) is defined in Eq. (8) below:

    $$\begin{aligned} \forall k \quad \omega _{g+1}^k \leftarrow \omega _g - \eta * \gamma _k \end{aligned}$$
    (7)
    $$\begin{aligned} L_k = \min \left\| Y_i - {\hat{Y}}_i \right\| _2 + \beta * \textrm{KL}(D_i, {\hat{D}}_i) \end{aligned}$$
    (8)

    where \(\gamma _k\) represents the model parameters of the k-th selected scene. The hyper-parameter \(\beta \) balances the trajectory loss and the KL divergence. Since the KL divergence in Eq. (8) serves as an additional constraint on the trajectory loss, we empirically set \(\beta \) to a value less than 1. We then calculate the ADE/FDE values of the proposed method on the used datasets while increasing \(\beta \) from 0.1 to 0.9 with a step of 0.05, and the best performance is achieved when \(\beta \) is set to 0.1.

  2. (2)

    Global training: the federated server jointly exploits all local datasets without any raw data being transmitted between the clients and the server, thus protecting data privacy while enhancing data diversity. The federated server first sends the initial global model \(\omega _{g}\) to each client. The new global model \(\omega _{g+1}\) is then obtained by aggregating the client models according to the federated client weight parameters \(u_k\) and the federated client model parameters \(\gamma _k\). The federated server encrypts the global model \(\omega _{g+1}\) and sends it back to each federated client, overwriting the original model, as shown in Eq. (9). SGD is used to update the model until the loss defined in Eq. (10) converges. The calculations are as follows:

    $$\begin{aligned} \omega _{g+1} \leftarrow \omega _g - \eta \sum _{k=1}^{K} u_k * \gamma _k \end{aligned}$$
    (9)
    $$\begin{aligned} L_s = \frac{1}{K} \sum _{k=1}^K L_k \end{aligned}$$
    (10)

    where \(L_s\) denotes the global training loss obtained by averaging \(L_k\) over the federated clients. In Eq. (9), the federated server model is aggregated from the federated client models; because no raw data is transmitted between them, the risk of data leakage is reduced and data privacy is protected. The pseudo-code of the proposed Fed-TP is presented in Algorithm 1, and a minimal code sketch of one communication round is given below.
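The sketch below illustrates one communication round of the local and global steps described above (Eqs. (7)–(10)), with the server-side update written as the standard weighted average of client parameters. Function and variable names are our own, the model interface follows the DO-TP sketch above, and encryption of the exchanged parameters is omitted for brevity.

```python
import copy
import torch

def federated_round(global_model, client_loaders, local_epochs, lr, client_weights, beta=0.1):
    """One communication round: local updates (Eqs. (7)-(8)) followed by
    server-side weighted aggregation (Eqs. (9)-(10)). No raw trajectories
    leave the clients; only model parameters are exchanged."""
    client_states, client_losses = [], []
    for loader in client_loaders:
        local_model = copy.deepcopy(global_model)               # start from omega_g
        opt = torch.optim.SGD(local_model.parameters(), lr=lr)
        for _ in range(local_epochs):                           # E local rounds
            for obs, fut in loader:                             # local data only
                pred, kl = local_model(obs, fut)
                loss = (fut - pred).norm(dim=-1).mean() + beta * kl   # L_k, Eq. (8)
                opt.zero_grad()
                loss.backward()
                opt.step()
        client_states.append(local_model.state_dict())          # omega_{g+1}^k
        client_losses.append(loss.item())                       # last-batch loss as a proxy for L_k

    # aggregate client parameters with weights u_k into omega_{g+1}, Eq. (9)
    new_state = copy.deepcopy(client_states[0])
    for key in new_state:
        new_state[key] = sum(u * s[key] for u, s in zip(client_weights, client_states))
    global_model.load_state_dict(new_state)
    return global_model, sum(client_losses) / len(client_losses)   # L_s, Eq. (10)
```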

Different federated learning frameworks

A suitable federated learning framework is critical for Fed-TP to achieve satisfactory trajectory prediction and privacy protection performance. Therefore, we compare commonly used federated learning frameworks, including FedAvg, FedProx, and FedAtt. Details are introduced as follows:

  1. (1)

    FedAvg [15] consists of four steps: ① The federated server sends a global model to each participating client. ② All participating clients use local data to perform stochastic gradient descent to train local models. ③ Each participating client sends its trained model parameters to the federated server. ④ The federated server averages the aggregated model parameters to generate a global model for the next round of training.

  2. (2)

    FedProx [43] adds a proximal term to the local objective function of each federated client on top of FedAvg to limit the deviation of the client-updated model from the global model, which improves stability compared with FedAvg (see the proximal-term sketch after this list).

  3. (3)

    FedAtt [44] highlights the respective importance of each participating client during model aggregation by introducing an attention mechanism. It minimizes the weighted distance between the federated server and the federated clients by iteratively updating the parameters, thereby achieving good generalization.
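As a concrete illustration of how FedProx modifies the local objective relative to FedAvg, the following hedged sketch adds the proximal term \((\mu /2)\left\| \omega - \omega _g \right\| ^2\) to the client loss; the coefficient \(\mu \) shown here is an illustrative value, not a setting reported in this work.

```python
def fedprox_local_loss(task_loss, local_model, global_model, mu=0.01):
    """FedProx-style local objective: the task loss plus a proximal term that
    penalizes the distance between client and global parameters. mu is an
    illustrative value only."""
    prox = 0.0
    for w_local, w_global in zip(local_model.parameters(), global_model.parameters()):
        prox = prox + ((w_local - w_global.detach()) ** 2).sum()
    return task_loss + 0.5 * mu * prox
```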

Algorithm 1

m is the total number of clients; K is the number of clients participating in the training; B is the batch size; E is the number of client training rounds; G denotes the number of federated server training epochs; \(\eta \) is the learning rate; \(u_k\) is the federated client weight parameter.
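Using the notation above, the overall Fed-TP training loop can be sketched as follows. This builds on the federated_round sketch given earlier and is an illustrative reconstruction of Algorithm 1, not a verbatim transcription.

```python
import random

def fed_tp_training(global_model, all_client_loaders, G, K, E, lr, weights):
    """Overall Fed-TP loop: for G global epochs, sample K of the m clients,
    run one federated_round (defined above), and keep the updated global model."""
    m = len(all_client_loaders)
    losses = []
    for _ in range(G):
        chosen = random.sample(range(m), K)                     # K participating clients
        loaders = [all_client_loaders[k] for k in chosen]
        u = [weights[k] for k in chosen]
        global_model, l_s = federated_round(global_model, loaders, E, lr, u)
        losses.append(l_s)                                      # global loss L_s per round
    return global_model, losses
```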

Experiments

Datasets

The proposed method is evaluated on three public datasets, ETH [45], UCY [46], and the Stanford Drone Dataset (SDD) [47], which are widely used for pedestrian trajectory prediction [32, 33, 35, 36, 48]. These datasets are taken from diverse scenes, including hotel, university, and zara scenes as well as different places at Stanford. The trajectory data are captured with cameras from different angles, and the scene layouts and the number of pedestrians vary greatly, so the trajectory data are reliable and rich. Concretely, the ETH and UCY datasets contain 1536 pedestrians with walking interactions and other social activities. The ETH dataset contains two scenes: eth and hotel. The UCY dataset contains three scenes: univ, zara1, and zara2. The SDD dataset collects the trajectory data of pedestrians and vehicles with a drone and contains eight scenes: gates, little, nexus, coupa, bookstore, deathCircle, quad, and hyang. All trajectories are sampled at 2.5 Hz. We pre-process the trajectory data by extracting 20 consecutive frames to form a sample, in which the first 8 frames are the input and the last 12 frames are the ground truth. Therefore, the observed and predicted horizons are 8 (3.2 s) and 12 (4.8 s) time steps [49], respectively. A small pre-processing sketch follows.
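The windowing described above can be sketched as follows; the per-frame stride is an assumption made for illustration.

```python
def make_samples(track, obs_len=8, pred_len=12):
    """Split one pedestrian track (coordinates sampled at 2.5 Hz, array-like of
    shape (T, 2)) into 20-frame samples: the first 8 frames form the observation
    and the last 12 frames form the ground-truth future."""
    total = obs_len + pred_len
    samples = []
    for start in range(len(track) - total + 1):                 # stride of 1 frame (assumed)
        window = track[start:start + total]
        samples.append((window[:obs_len], window[obs_len:]))
    return samples
```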

Table 1 The division of training and testing data in FD1 and FD2

To evaluate our method under the federated framework, we reintegrate ETH, UCY, and SDD into federated dataset 1 (FD1) and federated dataset 2 (FD2). As presented in Table 1, FD1 contains five scenes: eth, hotel, univ, zara1, and zara2. Since the training and testing scenes need to correspond one to one with complete data for comparative experiments, FD2 comprises the coupa, gates, hyang, and nexus scenes from SDD. The training and testing sets of FD1 and FD2 follow the official data partitioning of the three public datasets. In practical applications, data are distributed across scattered scenes; accordingly, for FD1 and FD2, the training set of each scene only contains the local training data of that scene. We then use the federated framework to train the scattered data jointly: data never leave the local scene, in contrast to the previous centralized training. Such a training manner avoids data leakage during transmission, thus protecting data privacy. After training the federated model, testing is performed separately in each scene with the federated model.

Evaluation metrics

Two metrics are used to evaluate trajectory prediction performance: the average displacement error (ADE) and the final displacement error (FDE), which are defined as follows (a small computational sketch is given after the definitions):

  1. (1)

    At each time step, ADE calculates the L2 distance between the ground-truth and predicted trajectories. ADE is defined as follows:

    $$\begin{aligned} \textrm{ADE} = \dfrac{ \sum _{i=1}^n \sum _{t=T_{\textrm{obs}+1}}^{T_\textrm{pred}} \left\| \left( x_i^t,y_i^t\right) - \left( {\hat{x}}_i^t,{\hat{y}}_i^t \right) \right\| _2 }{n \times T_\textrm{pred}} \end{aligned}$$
    (11)

    where n is the total number of observed pedestrians, and \(\left( {\hat{x}}_i^t,{\hat{y}}_i^t \right) \) and \(\left( x_i^t,y_i^t \right) \) represent the predicted and ground-truth coordinates of pedestrian i at time step t, respectively.

  2. (2)

    FDE calculates the L2 distance between the ground-truth and predicted trajectories at the final time step. FDE is defined as follows:

    $$\begin{aligned} \textrm{FDE} = \dfrac{ \sum _{i=1}^n \left\| \left( x_i^{T_\textrm{pred}},y_i^{T_\textrm{pred}}\right) - \left( {\hat{x}}_i^{T_\textrm{pred}},{\hat{y}}_i^{T_\textrm{pred}} \right) \right\| _2 }{ n } \end{aligned}$$
    (12)
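A small computational sketch of the two metrics, assuming the predicted and ground-truth trajectories are stored as arrays over the 12 predicted steps; following common practice, the errors are averaged over the predicted horizon.

```python
import numpy as np

def ade_fde(pred, gt):
    """ADE/FDE in the spirit of Eqs. (11)-(12).

    pred, gt: arrays of shape (n, 12, 2) holding the predicted and ground-truth
    coordinates of n pedestrians over the 12 predicted time steps."""
    dist = np.linalg.norm(pred - gt, axis=-1)   # L2 distance per pedestrian and step
    ade = dist.mean()                           # average over pedestrians and predicted steps
    fde = dist[:, -1].mean()                    # error at the final time step only
    return ade, fde
```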

Implementation details

One-layer LSTMs are used for the encoder and decoder, with 32-dimensional hidden states. The total number of training epochs is set to 300. The initial learning rates for FD1 and FD2 are 0.001 and 0.0001, respectively. The proposed Fed-TP is built with the PyTorch framework and trained on an NVIDIA RTX-3080 GPU.

Table 2 Comparison results of different parameter K using three federated algorithms (lower is better)
Table 3 Comparison results of different parameter E using three federated algorithms (lower is better)
Table 4 Comparison results of different parameter B using three federated algorithms (lower is better)
Fig. 4

ADE curves of the three federated algorithms for a univ and b coupa

Table 5 Comparison results of different federated algorithms (FA) on FD1 (lower is better)
Table 6 Comparison results of different federated algorithms (FA) on FD2 (lower is better)

Evaluation of key parameters

This section evaluates Fed-TP’s key parameters (K, E, and B) with the three federated paradigms. Tables 2, 3, and 4 report the comparison results for different key parameters. When a specific parameter is evaluated, the other two are fixed. From the results, we conclude the following:

  1. (1)

    Parameter K denotes the number of clients participating in the training. The trajectory data involved in training are not artificially integrated but retained locally for collaborative training; hence, data privacy protection is strengthened because there is no data transmission. Table 2 shows that a larger K leads to lower ADE/FDE values; that is, the more clients participate in training, the better the trajectory prediction performance. Considering the different scenes in the two datasets, K is set to 5 for FD1 and 4 for FD2.

  2. (2)

    Parameter E denotes the number of training rounds for each federated client. This parameter affects the computational efficiency and controls the local model training performance. A small E indicates insufficient client training, whereas a large E may result in over-fitting. Table 3 reports that all three methods achieve nearly their best performance when E is set to 7. Considering the computational efficiency and trajectory prediction performance, E is set to 7 in subsequent evaluations.

  3. (3)

    Parameter B denotes the batch size. Table 4 shows that the setting of B slightly influences FD1 but significantly influences FD2. Considering the comparison results, the batch sizes for FD1 and FD2 are set to 128 and 16, respectively.

Table 7 Comparison of global training time (s) for three federated algorithms (FA) on FD1 and FD2 for one round
Fig. 5

The training effect of federated multi-scene training compared with single-scene training on FD1. a shows the total amount of data held by each of the five scenes, and b shows the ADE curves for training on a single scene, federated training on two scenes, and federated training on four scenes, respectively

Table 8 Comparison results of different training paradigms (TP) on FD1 (lower is better)
Table 9 Comparison results of different training paradigms (TP) on FD2 (lower is better)
Fig. 6

Trajectory visualization results in three scenes of eth, univ, and coupa. The red, green, and blue lines represent the observed, predicted, and ground-truth trajectories. a Shows that trajectories predicted by the single-scene training paradigm significantly differ from the ground-truth trajectories. b Indicates that trajectories predicted by the centralized training paradigms are closer to the ground-truth trajectories. c Shows that trajectories predicted by Fed-TP fit the ground-truth trajectories better than those predicted by the single-scene training paradigm but are slightly inferior to trajectories predicted by the centralized training paradigm

Comparisons of different federated algorithms

Comparisons of different federated algorithms are conducted using the key parameters K, B, and E determined experimentally on FD1 and FD2. Using the same datasets and hyper-parameters ensures a fair comparison of the three federated algorithms. Generally, the three federated algorithms present similar trajectory prediction performance in different scenes, and all of them can solve the trajectory data island problem and avoid data privacy leakage when jointly training over scenes. Figure 4a, b show the error curves of the three algorithms for univ of FD1 and coupa of FD2, respectively; the disparity between the ADE curves of the three algorithms is small. Comparison results for the other scenes of the two datasets are reported in Tables 5 and 6. On FD1, the results of the three federated algorithms are similar. On FD2, however, the average ADE/FDE of FedAtt is 0.45/0.62 lower than that of FedAvg and 0.16/0.12 lower than that of FedProx. Meanwhile, Table 7 shows that the global training time per round is close for the three federated algorithms on FD1 and FD2, with FedAtt taking the least time. Therefore, FedAtt is used as the training paradigm of the proposed Fed-TP in the following evaluations.

Comparisons of different training paradigms

“Evaluation of key parameters” and “Comparisons of different federated algorithms” compared the key parameters and the effects of the three federated algorithms. This section compares the performance of different training paradigms: single-scene, centralized, and federated. For single-scene training, only the training data of one scene is used to train the trajectory prediction model, and the ADE/FDE values are then calculated on the testing data of all scenes. Compared with Fed-TP, centralized training uses manually integrated training data from all scenes instead of the federated manner.

Tables 8 and 9 show that single-scene training cannot perform satisfactorily due to the lack of training samples. Compared with single-scene training, centralized training achieves the best performance by introducing multi-scene collaborative training: by aggregating all scene data for unified training, the average ADE/FDE values decrease by 0.34/0.66 and 4.68/8.71 on FD1 and FD2, respectively. However, directly aggregating all scene data ignores data privacy and may lead to leakage. The proposed Fed-TP can protect data privacy while exploiting all scene data to train a satisfactory trajectory prediction model. As shown in Fig. 5a, there are five scenes, and each scene holds a different amount of data. Owing to the data island problem, a single scene can only be trained locally, so the prediction error of a scene with little data is high, as shown by the black line in Fig. 5b, whereas the error decreases as the number of scenes trained collaboratively under the federated framework increases. However, as shown in Tables 8 and 9, Fed-TP is slightly inferior to centralized training, with average ADE/FDE values increasing to 0.41/0.83 and 13.90/27.39 on FD1 and FD2, respectively.

Qualitative evaluations

Figure 6 shows the visualization of generated trajectories in the three scenes of eth, univ, and coupa to evaluate the proposed Fed-TP qualitatively. The red, green, and blue lines represent the observed, predicted, and ground-truth trajectories. Figure 6a shows the trajectory visualization under single-scene training, and the predicted trajectory deviates significantly from the ground-truth trajectories. In contrast, trajectories predicted by the centralized and Fed-TP training paradigms are closer to the ground-truth trajectories, which indicates the effectiveness of multi-scene training. Moreover, Fed-TP can protect data privacy, which is more suitable for real-world applications.

Conclusion and future works

Fed-TP is proposed to forecast pedestrians’ future trajectories with data privacy protection. A lightweight DO-TP is used to conduct trajectory prediction in each local scene. Subsequently, the privacy and security of personal data in each scene are protected by co-training the multi-scene trajectory data under the federated learning architecture. Moreover, three federated algorithms are compared to find the most suitable training paradigm for trajectory prediction. Evaluations are carried out on reintegrated ETH, UCY, and SDD. Results demonstrate that Fed-TP can effectively balance the trajectory prediction performance and user data privacy protection.

The proposed method addresses the real-world problem that pedestrians’ motion behaviors in different scenes cannot be effectively analyzed jointly because of data islands. At the same time, no data are transmitted during training, protecting pedestrian privacy from being leaked. However, Fed-TP involves the transmission of model parameters between the server and clients, which still exposes privacy to threats such as model inversion attacks in real networks. In the future, we will study the combination of federated learning with privacy protection technologies, such as homomorphic encryption and differential privacy, to better protect pedestrian privacy. On the other hand, despite its good performance in several simple scenes, Fed-TP may degrade in complex scenes because social interactions between pedestrians are not considered. Therefore, our future work will focus on introducing pedestrian interactions and a lightweight scene understanding strategy to improve the robustness of the model against unexpected changes in dynamic environments.