Introduction

With the development of automobile intelligence, there is an urgent need for advanced driver assistance systems (ADAS) that have a deeper comprehension of traffic contexts. Moreover, many studies show that incorrect prediction of driving intention is a major cause of crashes [1,2,3]. Therefore, high-precision driving intention prediction is essential to improve the active safety of automated vehicles. Driving intention prediction analyzes the historical trajectory information of surrounding vehicles and then infers future driving situations. It may come naturally to experienced human drivers, but for ADAS and automated vehicles, correctly recognizing the intentions of other drivers remains a major challenge. As shown in Fig. 1, in a mixed-traffic scenario with autonomous and human-driven vehicles, awareness of the intentions of surrounding vehicles is vital for improving the rationality of decisions and the control safety of autonomous vehicles [4, 5].

Fig. 1

Autonomous vehicles in real perception. The autonomous vehicle (blue) needs to sense and predict the intent of surrounding vehicles (green)

Existing studies have found that supervised learning methods are widely used to learn representations of vehicle trajectories [1, 2, 4]. Among them, recurrent neural networks (RNNs) and their variants are the most widely used in vehicle intent prediction tasks. To extract high-dimensional features of time series, existing algorithms and models use a large number of trajectory labels as learning targets. However, this may lead to a series of issues [6, 7]:

  1. The underlying trajectory data has a richer structure than labels can capture. Existing methods often require large numbers of samples and labels, which not only increases the cost of acquiring labels but also tends to converge to brittle solutions;

  2. Because the learned trajectory representations are fragile, existing methods are not robust to adversarial attacks;

  3. Trajectory labels are useful only for solving specific tasks and cannot be effectively transferred to other tasks. A model trained with supervised learning is usually adapted to a specific environment, and its generalization ability is not stable.

In intention prediction tasks, it is not reasonable to design a separate model for each driving environment because of constraints on computational resources and cost. However, the driving environment changes dynamically during autonomous driving. If the model’s generalization ability is weak, its prediction accuracy can vary considerably across environments, which may affect the driving safety of autonomous vehicles [8, 9]. In recent years, self-supervised learning has attracted much attention for generating representations from unlabeled data for downstream tasks [10,11,12]. Compared with supervised learning, self-supervised pre-trained models can achieve the same performance with little labeled data [13]. Meanwhile, self-supervised learning methods extract more robust features. For example, Mohsenvand et al. [14] developed a contrastive learning method for biological signals that recombines channels from multi-channel recordings and thereby increases the number of samples quadratically per recording. Eldele et al. [8] proposed a time-series representation learning framework via temporal and contextual contrasting (TS-TCC). Tonekaboni et al. [9] proposed a self-supervised learning method called temporal neighborhood coding (TNC). However, since existing self-supervised learning methods ignore the spatio-temporal information embedded in trajectory data, their trajectory representations are sub-optimal.

Given the excellent performance of self-supervised learning, which learns generalized and high-quality representations from unlabeled data, we use contrastive learning to learn representations of trajectories. Contrastive learning is an effective way to achieve self-supervised learning, so we design a novel trajectory representation learning method based on a contrastive model. The pretext task is a key component of contrastive learning, and a good pretext task strategy enables contrastive learning to capture significant information [15]. We therefore design a data augmentation mechanism and a bidirectional cross-prediction task to construct a pretext task. The pretext task promotes the learning of high-quality representations of trajectory data and improves the model’s robustness to adversarial attacks. In addition, we found that forward and backward vehicle trajectory data not only follow the same social norms but also obey the same interaction behavior constraints. For example, both forward and backward vehicle trajectories are subject to the same traffic rules; the only difference is their time direction. The forward prediction network and the backward prediction network are closely coupled through these reciprocal constraints, which allows them to be learned together [16, 17]. Based on this property, we propose a novel method for contrastive learning from unlabeled trajectory data, termed bidirectional trajectory contrastive learning (BTCL). Our model uses the features of bidirectional trajectories to generate more positive training samples, with the goal of obtaining a generalized trajectory representation through challenging cross-view prediction tasks. Moreover, forward and backward representations of vehicle trajectories are coupled to learn the spatio-temporal features embedded in trajectory data. The bidirectional trajectory contrastive learning method also significantly enhances the transfer ability of the BTCL model.

In summary, the main contributions of this paper are as follows:

  1. We construct a novel pretext task that consists of different trajectory data augmentation strategies and a cross-view trajectory prediction task. It not only encourages high-quality trajectory representations but also improves the robustness of the BTCL model to adversarial attacks.

  2. We propose a new bidirectional trajectory contrastive learning module. The module obtains more training samples to learn a generalized trajectory representation. It further increases the prediction accuracy in downstream tasks and the transfer ability of the BTCL model.

  3. We design a vehicle trajectory representation model (BTCL). Experimental results on the real-world NGSIM and HighD datasets show that the prediction accuracy and recall of our model are better than those of several state-of-the-art methods. In particular, our model also achieves excellent performance in the transfer learning and adversarial attack experiments.

The remainder of the paper is organized as follows: “Related works” presents related work. “Problem formulation” formally defines the problem, mainly the selection of input data for the model and the form of its outputs. “BTCL model” describes the proposed algorithm, including the role of each module of our model and the general algorithmic flow. The experimental setup is presented in “Experimental setup”. Experiments are described in “Experimental results”, including model performance, generalization ability, and ablation experiments. Finally, conclusions and future prospects are presented in “Conclusions”.

Related works

Intention prediction methods

In recent years, a large and growing body of studies has investigated vehicle situation prediction. Around 2010, because neural network technology was still underdeveloped, much of the literature was model-driven rather than data-driven. Methods based on kinematics or dynamics build a physical model from the vehicle’s speed, heading, acceleration, and other information. Examples include constant turn rate and velocity (CTRV) [18], the switching Kalman filter [19], and Monte Carlo simulation [20]. However, these methods have limitations. For example, because of their linear modeling assumptions in specific scenarios, they cannot adapt to dynamic and nonlinear real scenarios, which may lower prediction accuracy. With the emergence of machine learning, researchers naturally applied machine learning methods to vehicle situation prediction. Driving intention prediction can be treated as a classification task and addressed with support vector machines (SVM) [21], hidden Markov models (HMM) [22], and Bayesian networks [23]. Most of this research found that machine learning models achieve higher prediction accuracy than the physical models.

Although these traditional methods have achieved some success in driving intention prediction, problems remain, such as the large number of constraints to be considered in model building, insufficient nonlinear modeling capability, and weak model transferability. A large volume of published studies describes the role of historical trajectory sequences in driving intention prediction [5, 24,25,26]. To improve prediction accuracy, the first and most critical stage is learning the properties of the trajectory sequences. With the continuous development of deep learning algorithms, Zyner et al. [27] proposed a recurrent neural network (RNN) based model for intent prediction at unsignalized roundabouts. The algorithm was verified on a real-world dataset. Compared with a quadratic discriminant analysis (QDA) classifier, it converges faster and is more accurate.

To improve the performance of intention prediction, many models consider richer extrinsic information. Phillip et al. [24] proposed an LSTM-based solution to infer the intention of a vehicle before it enters an intersection and verified it on the NGSIM, Lankershim, and Peachtree datasets. They designed an ablation experiment to evaluate the influence of historical movement characteristics, neighbor characteristics, and traffic characteristics. The study indicated that multi-modal prediction can increase the driving safety of unmanned vehicles. To predict multi-modal behavior, Zyner et al. [25] first used a three-layer encoder–decoder RNN model to predict the situation of vehicles entering an intersection without traffic lights. In addition, a weighted Gaussian mixture model (GMM) was used to make a multi-modal prediction of the future trajectories of the vehicles. In real driving scenarios, accurate intention prediction can improve the accuracy of trajectory prediction. Zhang [26] proposed an LSTM-based situation prediction framework whose upper and lower structure makes full use of the guiding effect of intention prediction on the trajectory and improves the accuracy of trajectory prediction. Studies of human driver intentions have found that future driving intention is influenced by other road participants. Deng et al. [5] developed an RNN model to infer lane-change intention on highways, considering the dynamic interactions among surrounding vehicles and the state of the ego vehicle.

Self-supervised learning

In recent years, a growing body of literature has recognized that supervised learning models suffer from problems such as weak generalization ability and high labeling cost [10,11,12,13]. In the autonomous driving field, driving scenes are dynamic and unstable. The studies presented thus far provide evidence that a model that cannot be transferred to new scenes can expose the vehicle to serious danger during driving. Self-supervised contrastive learning starts from the data itself and learns the true features of the data by contrasting positive and negative samples [28,29,30]. Self-supervised learning was first applied in the computer vision field. For image data, data augmentation is applied to construct different views of the input, and the feature representation is then learned by maximizing the similarity between different views of the same sample and minimizing the similarity between views of different samples [31, 32]. With such methods, self-supervised learning has shown strong ability in image classification and image restoration.

However, these image-based contrastive learning methods cannot handle the temporal dependence of data. Moreover, previous approaches to data augmentation in self-supervised algorithms have not addressed time series. Mohsenvand et al. [14] and Cheng et al. [33] designed contrastive learning methods for biological signals, such as EEG and ECG. However, these two methods were proposed for specific applications and cannot be generalized to other time series data. To solve these problems, Eldele et al. [8] proposed a time series contrastive learning model based on temporal and contextual contrasting and verified it on multiple datasets. The experimental results show that the model matches or outperforms existing supervised models. Tonekaboni et al. [9] designed the TNC model, in which a discriminator learns to identify the source of a sample by constructing positive samples and unlabeled samples. It uses the local smoothness of the signal generation process to define a stable neighborhood of the feature. Finally, the latent state of the signal is expressed by sliding a window over time. This model also performs well on time series classification tasks. Both studies were verified on datasets with time series characteristics, and the experimental results show the effectiveness of self-supervised models on time series data.

Many studies have demonstrated the feasibility of contrastive learning models for time series data on real datasets. Moreover, from the research in [16, 17], we find that forward and backward vehicle trajectory data not only follow the same social norms but also obey the same interaction behavior constraints. Based on these studies, we design a bidirectional trajectory contrastive learning model, a self-supervised contrastive learning model for learning high-quality trajectory representations of vehicles. Our model not only improves the accuracy of intention prediction but also offers better transferability and robustness against adversarial attacks.

Problem formulation

In this section, we formulate the input and output of the driving intention prediction model.

Input of intention prediction

We take 30 frames of data for each trajectory from a real-world dataset as model input. A lane-change sample takes the 30 frames before the lane change, and a lane-keeping sample takes 30 consecutive frames chosen at random. This study aims to determine the intention of the vehicle at the next moment by analyzing these 30 frames of historical trajectory information. Therefore, we define the model inputs as:

$$\begin{aligned} \text {Input} = \{ {I_0},{I_1},{I_2}, \ldots {I_t}\}, \end{aligned}$$
(1)

where \(I_t\) is the input at frame t. We select 30 frames as the total input, so \(t \in [0,30)\) .

Fig. 2

Model input. The prediction of surrounding vehicles takes into account not only the historical trajectory information but also the steering angle of the vehicle

Figure 2 shows that experienced drivers often predict the situation of surrounding vehicles based not only on the location of the vehicle in the real scene but also on the acceleration of the vehicle and the deflection angle of its front [1]. Thus, \(I_t\) takes the following form:

$$\begin{aligned} {I_t} = \{ {x_t},{y_t},{a_t},{\theta _t}\} , \end{aligned}$$
(2)

where \({x_t}\) and \({y_t}\) are the x and y coordinates of the vehicle at each timestamp, \({a_t}\) is the acceleration of the vehicle at each moment, and \({\theta _t}\) is the deflection angle of the vehicle shown in Fig. 2.
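As an illustration of Eqs. (1) and (2), one trajectory sample can be stored as an array with one row per frame and one column per feature. The sketch below is only an assumption about the data layout; the helper `build_input` and the synthetic values are hypothetical and are not taken from the datasets.

```python
import numpy as np

NUM_FRAMES = 30                      # frames per sample, as stated in the text
FEATURES = ("x", "y", "a", "theta")  # per-frame features from Eq. (2)


def build_input(x, y, a, theta):
    """Stack the per-frame features into an array of shape (NUM_FRAMES, 4)."""
    sample = np.stack([x, y, a, theta], axis=-1).astype(np.float64)
    expected = (NUM_FRAMES, len(FEATURES))
    if sample.shape != expected:
        raise ValueError(f"expected shape {expected}, got {sample.shape}")
    return sample


# Synthetic values for illustration only (hypothetical, not from HighD or NGSIM).
frames = np.arange(NUM_FRAMES)
sample = build_input(
    x=0.5 * frames,                  # longitudinal position
    y=np.zeros(NUM_FRAMES),          # lateral position
    a=np.full(NUM_FRAMES, 0.1),      # acceleration
    theta=np.zeros(NUM_FRAMES),      # deflection angle
)
print(sample.shape)  # (30, 4)
```

Row t of `sample` then corresponds to \(I_t\) in Eq. (2).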

Fig. 3

Bidirectional trajectory contrastive learning model. A new self-supervised contrastive learning model, consisting mainly of the pretext task, the bidirectional trajectory contrastive module and the driving style contrastive module

Output of intention prediction

The intention prediction problem can be regarded as a classification task, so our output is:

$$\begin{aligned} \text {Output} = {{\textrm{argmax}}}\, (P (\text {LK}), P (\text {LL}), P (\text {LR})), \end{aligned}$$
(3)

where \(P (\text {LK}), P (\text {LL}), P (\text {LR})\) denote the probabilities that the vehicle will keep its lane, change lanes to the left and change lanes to the right, respectively. The output of the prediction model is the event with the maximum predicted probability.
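Equation (3) can be sketched in a few lines. The text does not state how the three probabilities are produced, so the softmax over raw scores used here is an assumption, and `predict_intention` is a hypothetical helper name.

```python
import numpy as np

LABELS = ("LK", "LL", "LR")  # class order follows Eq. (3)


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def predict_intention(logits):
    """Return the label with maximum probability and the probabilities."""
    probs = softmax(np.asarray(logits, dtype=float))  # assumed normalisation
    return LABELS[int(np.argmax(probs))], probs


label, probs = predict_intention([0.1, 2.0, -1.0])
print(label)  # LL
```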

BTCL model

This paper proposes a driving intention prediction model based on bidirectional trajectory contrastive learning. We first learn the generalizable representation through bidirectional trajectory contrastive learning, and then use a multilayer perceptron (MLP) to predict the driving intention.

A growing body of research on time series shows that self-supervised learning models can not only ensure the quality of the trajectory representation but also improve the transferability of the representation. Therefore, we propose a novel trajectory contrastive learning model to learn the features of the trajectory sequence. Figure 3 displays an overview of the BTCL model. We first generate two different views of the input data using different data augmentation methods. Then, a bidirectional trajectory contrastive learning module uses an autoregressive model to explore the temporal features of the data. The model learns latent representations of historical trajectory information through a cross-prediction task across views. Through the driving style contrastive learning module, we further maximize the consistency between the contextual features of the autoregressive model. This section analyzes the model from three aspects: the pretext task, the bidirectional trajectory contrastive module, and the driving style contrastive module.

Pretext task

Constructing an appropriate pretext task is the key to the success of contrastive learning. This paper uses data augmentation and a cross-view trajectory prediction task to construct the pretext task. The main goal of the cross-view trajectory prediction task is to maximize the similarity between different views of the same sample and minimize the similarity with other samples. The data augmentation module must be designed to suit the features of vehicle trajectory data. Generally, contrastive learning passes the input through the same data augmentation module to obtain the two views, \({x_1}\) and \({x_2}\) . The studies in [8, 34, 35] found that using different data augmentation modules can improve the robustness of representation learning. We therefore design two data augmentation modules for our model: a strong augmentation and a weak augmentation. For the strong augmentation, we first slice the data into M segments, randomly permute the segments to create a new window, and then add random jitter to the processed data. The weak augmentation is achieved by adding random noise to the original sequence.

For each input I, we define its strong augmentation as \({I^s}\) and its weak augmentation as \({I^w}\),

$$\begin{aligned} \left\{ \begin{array}{l} {I^s} = {f_s}(I)\\ {I^w} = {f_w}(I) \end{array} \right. , \end{aligned}$$
(4)

where \({f_s}\) and \({f_w}\) are the strong and weak augmentation functions, respectively.
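The two augmentation functions in Eq. (4) can be sketched as follows. This is a minimal illustration under stated assumptions: the segments are taken to be of roughly equal length, the jitter and noise are Gaussian, and the values of `num_segments` and `sigma` are hypothetical, since the text does not specify them.

```python
import numpy as np


def weak_augment(seq, rng, sigma=0.01):
    """f_w: add small random noise to the original sequence."""
    return seq + rng.normal(0.0, sigma, size=seq.shape)


def strong_augment(seq, rng, num_segments=5, sigma=0.05):
    """f_s: slice into M segments, permute them, then add random jitter."""
    segments = np.array_split(seq, num_segments, axis=0)  # M (near-)equal slices
    order = rng.permutation(len(segments))                # random segment order
    permuted = np.concatenate([segments[i] for i in order], axis=0)
    return permuted + rng.normal(0.0, sigma, size=permuted.shape)


rng = np.random.default_rng(0)
I = rng.normal(size=(30, 4))  # one synthetic sample: 30 frames, 4 features
I_s = strong_augment(I, rng)  # strong view
I_w = weak_augment(I, rng)    # weak view
print(I_s.shape, I_w.shape)
```

Both views keep the shape of the input, so they can be passed to the same encoder.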

To obtain a high-dimensional representation, we pass the augmented views to the encoder. The encoder is composed of three convolutional blocks. For each input, as shown in Eq. (5), the encoder maps it to a high-dimensional feature representation:

$$\begin{aligned} \left\{ \begin{array}{l} {h_1} = {{\textrm{Block}} _1}(x)\\ {h_2} = {{\textrm{Block}} _2}({h_1})\\ Z = {{\textrm{Block}} _3}({h_2}) \end{array} \right. , \end{aligned}$$
(5)

where \(\text {Block} = {{\textrm{MaxPool}}} \,({{\textrm{ReLU}}} ({{\textrm{Conv}}} 1d()))\) is composed of a 1-dimensional convolution, a ReLU activation function and a max-pooling layer. \({{{\textrm{Block}}} _i}\) denotes the i-th convolutional block, and x denotes the data after data augmentation. The features obtained after each convolutional block are denoted by h, and \(Z = [{z_1},{z_2}, \ldots ,{z_T}]\) , where T is the total number of time steps.
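To make the structure of Eq. (5) concrete, the following NumPy sketch stacks three blocks of the form MaxPool(ReLU(Conv1d)). The channel widths, kernel size, padding and pooling size are assumptions chosen for illustration and the weights are random; none of these values come from the text.

```python
import numpy as np


def conv1d(x, weight, bias):
    """1-D convolution, stride 1, zero 'same' padding (odd kernel size).

    x: (C_in, T), weight: (C_out, C_in, K), bias: (C_out,). Returns (C_out, T).
    """
    c_out, _, k = weight.shape
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad)))
    out = np.empty((c_out, x.shape[1]))
    for t in range(x.shape[1]):
        window = padded[:, t:t + k]  # (C_in, K) slice centred on step t
        out[:, t] = np.tensordot(weight, window, axes=([1, 2], [0, 1])) + bias
    return out


def max_pool1d(x, size=2):
    """Non-overlapping max pooling along time; a trailing remainder is dropped."""
    t = (x.shape[1] // size) * size
    return x[:, :t].reshape(x.shape[0], -1, size).max(axis=-1)


def block(x, weight, bias):
    """Block = MaxPool(ReLU(Conv1d(x))), as in Eq. (5)."""
    return max_pool1d(np.maximum(conv1d(x, weight, bias), 0.0))


def init_encoder(rng, channels=(4, 32, 64, 128), kernel_size=3):
    """Random weights for three blocks (assumed widths, illustration only)."""
    return [(rng.normal(0.0, 0.1, size=(c_out, c_in, kernel_size)), np.zeros(c_out))
            for c_in, c_out in zip(channels[:-1], channels[1:])]


def encode(x, params):
    """Apply Block_1, Block_2 and Block_3 in sequence."""
    h = x
    for weight, bias in params:
        h = block(h, weight, bias)
    return h


rng = np.random.default_rng(0)
params = init_encoder(rng)
x = rng.normal(size=(4, 30))  # (features, frames) for one augmented view
Z = encode(x, params)
print(Z.shape)  # (128, 3)
```

With these assumed settings, each pooling step halves the time axis (30, 15, 7, 3), so Z holds T = 3 feature vectors of dimension 128.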

First, the data augmentation module produces the two views \({I^s}\) and \({I^w}\) of the input. The encoder then maps them to the high-dimensional representations \({Z^s}\) and \({Z^w}\). Finally, we input the obtained high-dimensional representations into the trajectory contrastive module.

Bidirectional trajectory contrastive module

The bidirectional trajectory contrastive module uses an autoregressive model and constructs a contrastive loss function to extract latent temporal and spatial features. A growing number of studies have found that both forward and backward prediction are meaningful for trajectories with time series features. Both the forward and backward trajectories follow the same social norms and the same physical constraints; the only difference between forward and backward prediction is the time direction. Therefore, we propose a bidirectional contrastive learning model that learns a more robust trajectory representation through two cross-view, cross-prediction tasks (Fig. 4).

Fig. 4

Bidirectional trajectory contrastive. Forward and backward contrastive learning tasks are performed separately on historical trajectory sequences to promote the learning of vehicle trajectory representation (the solid line indicates the trajectory data input, and the dashed line indicates the trajectory prediction output)

Because of the efficiency and speed of the Transformer, we design a Transformer-based autoregressive model \({f_{ar}}\) to extract the context vector. It is mainly composed of a multi-head attention (MHA) module and a multilayer perceptron (MLP) module,

$$\begin{aligned} \left\{ \begin{array}{l} \text {Attention} (Q,K,V) = \text {softmax}\left( \frac{{Q{K^T}}}{{\sqrt{{d_k}} }}\right) V;\\ \text {MHA} (Q,K,V) = \text {Concat} (\text {head}_1, \ldots ,\text {head}_h){W^O};\\ \text {head}_i = \text {Attention} (QW_i^Q, KW_i^K, VW_i^V); \end{array} \right. \end{aligned}$$
(6)

where MHA is built on (self-)attention. We compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. The projections are parameter matrices \(W_i^Q \in {\mathbb {R}^{{d_{\text {model}}} \times {d_k}}},\) \(W_i^K \in {\mathbb {R}^{{d_{\text {model}}} \times {d_k}}},\) \(W_i^V \in {\mathbb {R}^{{d_{\text {model}}} \times {d_v}}},\) and \( {W^O} \in {\mathbb {R}^{h{d_v} \times {d_{\text {model}}}}}\).
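The attention computation in Eq. (6) can be written out directly. The sketch below uses random projection matrices and small, hypothetical dimensions purely to show the shapes involved; it is not the trained model.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (T, T), each row sums to 1
    return weights @ V


def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Self-attention with h heads; W_q, W_k: (h, d_model, d_k), W_v: (h, d_model, d_v)."""
    heads = [attention(X @ W_q[i], X @ W_k[i], X @ W_v[i])
             for i in range(W_q.shape[0])]
    return np.concatenate(heads, axis=-1) @ W_o  # Concat(head_1..head_h) W^O


rng = np.random.default_rng(0)
T, d_model, h = 6, 8, 4  # hypothetical sizes for illustration
d_k = d_v = d_model // h
X = rng.normal(size=(T, d_model))
W_q = rng.normal(size=(h, d_model, d_k))
W_k = rng.normal(size=(h, d_model, d_k))
W_v = rng.normal(size=(h, d_model, d_v))
W_o = rng.normal(size=(h * d_v, d_model))
out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (6, 8)
```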

$$\begin{aligned} \text {MLP} = \text {Dropout}\, (\text {ReLU} (\text {FC} (\text {input}))), \end{aligned}$$
(7)

the MLP module is composed of two fully connected (FC) layers with a nonlinear ReLU function and dropout between the layers. To generate a more stable gradient, we use pre-norm residual connections. We stack L identical layers to get the final output feature.

The operation of our autoregressive model is as follows. First, the feature vector \({Z_{ \le t}} = [{z_1},{z_2}, \ldots ,{z_t}]\) is mapped to the hidden dimension through the linear layer \({W_{\text {Tran}}}\):

$$\begin{aligned} {Z_c} = {W_{\text {Tran}}}({Z_{ \le t}}), \end{aligned}$$
(8)

Next, we prepend a token c to the feature vector \( {Z_c} \) to obtain the input feature \({\varphi _0} = [c;{Z_c}]\); the state of c in the final output serves as the representative context vector. The subscript 0 denotes the input level, that is, the feature fed to the first layer. The calculation proceeds as follows:

$$\begin{aligned} \left\{ \begin{array}{l} {\varphi _{lc}} = {{\textrm{MHA}}} \,({{\textrm{Norm}}} \,({\varphi _{l - 1}})) + {\varphi _{l - 1}},1 \le l \le L;\\ {\varphi _l} = {{\textrm{MLP}}}\, ({{\textrm{Norm}}}\, ({\varphi _{lc}})) + {\varphi _{lc}},1 \le l \le L, \end{array} \right. \end{aligned}$$
(9)

where Norm is the normalization method LayerNorm.

$$\begin{aligned} \text {Norm} = \text {LayerNorm} \,(\text {input, normshape, eps}), \end{aligned}$$
(10)

the normshape is the normalized dimension and eps is a constant to prevent the denominator from being 0 in the calculation.
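The pre-norm residual structure of Eqs. (9) and (10) is sketched below. The sub-layers are passed in as plain functions so that the residual wiring is visible; the layer normalisation here omits the learnable gain and bias, and the toy sub-layers are hypothetical stand-ins for the MHA and MLP modules.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalise over the last axis; eps keeps the denominator away from 0."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def pre_norm_layer(phi_prev, mha, mlp):
    """One layer of Eq. (9): normalise first, then add the residual."""
    phi_c = mha(layer_norm(phi_prev)) + phi_prev  # attention sub-layer
    return mlp(layer_norm(phi_c)) + phi_c         # MLP sub-layer


def toy_mha(x):
    """Hypothetical stand-in: mixes information across time steps."""
    return 0.1 * x.mean(axis=0, keepdims=True) + 0.0 * x


def toy_mlp(x):
    """Hypothetical stand-in: position-wise nonlinearity."""
    return 0.1 * np.maximum(x, 0.0)


rng = np.random.default_rng(0)
phi = rng.normal(size=(7, 100))  # (tokens, hidden dimension)
for _ in range(4):               # L = 4 stacked layers
    phi = pre_norm_layer(phi, toy_mha, toy_mlp)
print(phi.shape)  # (7, 100)
```

Because the residual path bypasses the normalisation, a layer whose sub-layers output zeros reduces to the identity, which is one reason pre-norm stacks tend to train stably.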

We use the autoregressive model \({f_{ar}}\) to summarize the high-dimensional features \({Z_{ \le t}} = [{z_1},{z_2}, \ldots ,{z_t}]\) into a context vector \({c_t}\) , that is, \({c_t} = {f_{ar}}({Z_{ \le t}})\) . This context vector \({c_t}\) is then used to predict the time steps from \({z_{t + 1}}\) to \({z_{t + k}}.\) To predict a future time step, we use a bilinear model \({f_k}\) to score the relationship between \({x_{t + k}}\) and \({c_t}\) :

$$\begin{aligned} {f_k}({x_{t + k}},{c_t}) = \exp ({({W_k}({c_t}))^T}{z_{t + k}}), \end{aligned}$$
(11)

where \({W_k}\) is a linear function that maps to the same dimension as z .

In our bidirectional trajectory contrastive model, the autoregressive model \({f_{ar}}\) generates not only the strong and weak augmentation context vectors \(c_{ft}^s\) and \(c_{ft}^w\) in the forward contrast, but also the strong and weak augmentation context vectors \(c_{bt}^s\) and \(c_{bt}^w\) in the reverse contrast. The BTCL model then learns a robust representation of the time series through bidirectional cross-view prediction tasks. That is, the forward strong augmentation context \(c_{ft}^s\) is used to predict the forward weak augmentation \(Z_{t + k}^w\) , and vice versa. The reverse trajectory contrast uses the same method: the reverse strong augmentation context \(c_{bt}^s\) is used to predict the reverse weak augmentation \(Z_k^w\) , and vice versa. The contrastive loss maximizes the dot product between the predicted representation and the true representation of the same sample and minimizes the dot product with the other samples \({N_{t,k}}\). For this purpose, we design four loss functions:

$$\begin{aligned} \left\{ \begin{array}{l} L_{\text {FTC}}^s = - \frac{1}{K}\sum \limits _{k = 1}^K {\log \frac{{\exp ({{({W_k}(c_{ft}^s))}^T}Z_{t + k}^w)}}{{\sum \nolimits _{n \in {N_{t,k}}} {\exp ({{({W_k}(c_{ft}^s))}^T}Z_n^w)} }}} \\ L_{\text {FTC}}^w = - \frac{1}{K}\sum \limits _{k = 1}^K {\log \frac{{\exp ({{({W_k}(c_{ft}^w))}^T}Z_{t + k}^s)}}{{\sum \nolimits _{n \in {N_{t,k}}} {\exp ({{({W_k}(c_{ft}^w))}^T}Z_n^s)} }}} \end{array} \right. ,\end{aligned}$$
(12)
$$\begin{aligned} \left\{ \begin{array}{l} L_{\text {bTC}}^s = - \frac{1}{K}\sum \limits _{k = 1}^K {\log \frac{{\exp ({{({W_k}(c_{bt}^s))}^T}Z_k^w)}}{{\sum \nolimits _{n \in {N_{t,k}}} {\exp ({{({W_k}(c_{bt}^s))}^T}Z_n^w)} }}} \\ L_{\text {bTC}}^w = - \frac{1}{K}\sum \limits _{k = 1}^K {\log \frac{{\exp ({{({W_k}(c_{bt}^w))}^T}Z_k^s)}}{{\sum \nolimits _{n \in {N_{t,k}}} {\exp ({{({W_k}(c_{bt}^w))}^T}Z_n^s)} }}} \end{array} \right. , \end{aligned}$$
(13)

where K is the length of the predicted sequence, \(L_{\text {FTC}}^s\) and \(L_{\text {FTC}}^w\) are the pair of cross-view losses in the forward contrast, and \(L_{\text {bTC}}^s\) and \(L_{\text {bTC}}^w\) are the pair of cross-view losses in the reverse contrast.
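The cross-view losses in Eqs. (12) and (13) share one form, which the sketch below implements for a minibatch. It rests on several assumptions that the text does not fix: the set \({N_{t,k}}\) is taken to be all samples in the minibatch (including the positive), the loss is averaged over the batch, and the names `cross_view_loss`, `context`, `future` and `W` are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Row-wise log-softmax, computed stably."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def cross_view_loss(context, future, W):
    """Contrastive prediction loss between two views.

    context: (B, D_c) context vectors c_t from one view.
    future:  (B, K, D_z) encodings at the K predicted steps from the other view.
    W:       (K, D_z, D_c) one linear map W_k per predicted step.
    """
    _, K, _ = future.shape
    total = 0.0
    for k in range(K):
        pred = context @ W[k].T         # W_k(c_t), shape (B, D_z)
        logits = pred @ future[:, k].T  # logits[i, n] = W_k(c_i)^T z_n
        # Positives lie on the diagonal: sample i predicts its own future.
        total += -np.mean(np.diag(log_softmax(logits)))
    return total / K


rng = np.random.default_rng(0)
B, K, D_c, D_z = 8, 3, 16, 12
c_s, c_w = rng.normal(size=(B, D_c)), rng.normal(size=(B, D_c))
Z_s, Z_w = rng.normal(size=(B, K, D_z)), rng.normal(size=(B, K, D_z))
W = rng.normal(scale=0.1, size=(K, D_z, D_c))

loss_s = cross_view_loss(c_s, Z_w, W)  # strong context predicts weak view
loss_w = cross_view_loss(c_w, Z_s, W)  # weak context predicts strong view
print(loss_s, loss_w)
```

Under these assumptions, the reverse-direction losses would reuse the same function on contexts and encodings computed from the time-reversed sequence.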

Driving style contrastive module

The context vector of each trajectory in effect represents the driving style of its driver. We further propose a module that contrasts driving styles through context vectors, aiming to learn more features of the spatio-temporal information. The input of the driving style contrastive module is obtained from the autoregressive model proposed in the trajectory contrastive module, namely \({c_t} = \varphi _L^0\). First, we use a non-linear projection head to transform the context vector. The projection head maps the contexts into the space where the contextual contrasting is applied.

Given N input samples, we obtain two contexts for each sample from its two augmented views, so there are 2N context vectors. For a context \(c_t^i\) , we define \(c_t^{{i^ + }}\) as its positive sample, that is, the context from the other augmented view of the same input. The remaining 2N-2 contexts in the same batch are treated as negative samples of \(c_t^i\) , each forming a negative pair with it. We can therefore derive a context contrast loss that maximizes the similarity between positive pairs and makes negative pairs dissimilar.

We now define the loss function for the driving style contrast. To normalize the loss, we divide the similarity between \(c_t^i\) and its positive sample \(c_t^{{i^ + }}\) by the total similarity between \(c_t^i\) and the other 2N-1 samples:

$$\begin{aligned} \left\{ \begin{array}{l} L_{\text {DSC}}^f = - \sum \limits _{i = 1}^N {\log \frac{{\exp (\text {sim}(c_{ft}^i,c_{ft}^{{i^ + }})/\tau )}}{{\sum {_{m = 1}^{2N}{1_{[m \ne i]}}\exp (\text {sim}(c_{ft}^i,c_{ft}^m)/\tau )} }}} \\ L_{\text {DSC}}^b = - \sum \limits _{i = 1}^N {\log \frac{{\exp (\text {sim}(c_{bt}^i,c_{bt}^{{i^ + }})/\tau )}}{{\sum {_{m = 1}^{2N}{1_{[m \ne i]}}\exp (\text {sim}(c_{bt}^i,c_{bt}^m)/\tau )} }}} \end{array} \right. , \end{aligned}$$
(14)

where \(\text {sim}(u,v) = {u^T}v/\left\| u \right\| \left\| v\right\| \) is the cosine similarity, \({1_{[m \ne i]}} \in \{ 0,1\} \) is an indicator function that equals 1 when \(m \ne i\) and 0 otherwise, and \(\tau \) is a temperature parameter. The final driving style contrast loss is \( {L_{\text {DSC}}} = L_{\text {DSC}}^f + L_{\text {DSC}}^b \) .
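A loss of the form in Eq. (14) can be sketched as follows for one direction (forward or backward). As an assumption that differs slightly from the summation in Eq. (14), this sketch averages over all 2N anchors; the function name `style_contrast_loss` is hypothetical.

```python
import numpy as np


def style_contrast_loss(c_a, c_b, tau=0.2):
    """Contrast the contexts of two augmented views, each of shape (N, D).

    Row i of c_a and row i of c_b form a positive pair; the remaining
    2N - 2 contexts in the batch act as negatives.
    """
    n = c_a.shape[0]
    c = np.concatenate([c_a, c_b], axis=0)            # (2N, D)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)  # unit norm -> cosine sim
    sim = c @ c.T / tau                               # sim(u, v) / tau
    np.fill_diagonal(sim, -np.inf)                    # indicator 1_[m != i]
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob = sim[np.arange(2 * n), positives] - log_denominator
    return -log_prob.mean()


rng = np.random.default_rng(0)
c_strong = rng.normal(size=(16, 100))                 # contexts from one view
c_weak = c_strong + 0.05 * rng.normal(size=(16, 100))  # nearby contexts, other view
print(style_contrast_loss(c_strong, c_weak, tau=0.2))
```

With \(\tau = 0.2\) as in the experimental setup, well-aligned positive pairs dominate the denominator and the loss is small; unrelated views give a larger value.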

The overall self-supervised loss is a combination of bidirectional trajectory contrastive loss and driving style contrastive loss:

$$\begin{aligned} L = \alpha (L_{\text {FTC}}^s + L_{\text {FTC}}^w + L_{\text {bTC}}^s + L_{\text {bTC}}^w) + \beta {L_{\text {DSC}}}, \end{aligned}$$
(15)

where \(\alpha \) and \(\beta \) are weights that set the shares of the trajectory contrastive loss and the driving style contrastive loss in the total loss.

Details of the model architecture

We propose a driving intention prediction model based on bidirectional trajectory contrastive learning. The pseudocode is shown in Algorithm 1.

Algorithm 1

First, in the pretext task, our model applies two different data augmentation functions to the trajectory information and obtains two views \({I^s}\) and \({I^w}\) of the same sample. The BTCL model then encodes the augmented sequences to obtain high-dimensional features \({Z^s}\) and \({Z^w}\). To better learn the context vector, the BTCL model first maps the input data to a high-dimensional space and then uses an autoregressive model formed by stacked Transformers to extract the context vector from the high-dimensional features. For the bidirectional trajectory contrast, we design two sets of forward and backward cross-prediction tasks to learn the trajectory representation. To learn a more robust representation, we combine the autoregressive model with a log-bilinear model and design two loss functions that guide the model in capturing feature information. Finally, we design a driving behavior contrastive learning model based on context contrast, which maximizes the similarity between positive sample pairs and the dissimilarity between negative sample pairs, and we design a dedicated loss function to guide its learning. The bidirectional trajectory contrastive loss and the driving style contrastive loss are weighted and summed to obtain the total loss of the BTCL model. To a certain extent, the smaller the total loss, the more expressive and robust the learned trajectory representation.

Learning a robust trajectory representation through self-supervised contrastive learning is the first step of our driving intention prediction. Next, we train an MLP on top of the learned representation to predict driving intention (Fig. 5). The MLP consists of three fully connected layers with 128, 32 and 3 neurons, respectively. Using only an MLP for prediction also shows more directly that our model has a strong trajectory representation ability. Our code will be published at the address given in Footnote 1.
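The downstream classifier described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: only the layer widths 128, 32 and 3 come from the text, while the input dimension of 100 (borrowed from the hidden dimension reported in the experimental setup), the ReLU activations, the softmax output and all function names are assumptions.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, layer):
    """Apply y = W x + b to a single feature vector."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def build_mlp(input_dim, sizes=(128, 32, 3), seed=0):
    """Layer widths follow the text; input_dim is an assumption."""
    rng = random.Random(seed)
    dims = [input_dim, *sizes]
    return [init_layer(a, b, rng) for a, b in zip(dims[:-1], dims[1:])]


def predict_proba(mlp, features):
    """Forward pass: ReLU on hidden layers (assumed), softmax over 3 intentions."""
    h = features
    for layer in mlp[:-1]:
        h = relu(linear(h, layer))
    return softmax(linear(h, mlp[-1]))


mlp = build_mlp(input_dim=100)
probs = predict_proba(mlp, [0.1] * 100)  # one probability per intention class
```

The weights here are untrained, so `probs` only demonstrates the shape of the output: one probability for each of the three intention classes.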

Fig. 5
figure 5

Intention prediction. Implementing a downstream prediction task using an MLP to predict the likely intention of green vehicles

Experimental setup

This section describes the experimental setup: data preprocessing, model training, baselines and evaluation metrics.

Datasets processing

We use two real-world datasets, HighD and NGSIM, to validate the performance of our model and the effectiveness of each module.

We split the data into training, validation and test sets with a ratio of 7:1:2. After extensive experimental verification, we use a batch size of 128 and the Adam optimizer with a learning rate of \(3 \times 10^{-4}\), \({\beta _1} = 0.9\) and \({\beta _2} = 0.99\). The model achieves its best performance with \(\alpha = 1\) and \(\beta = 0.7\). In the Transformer, we set \(L = 4\) and the number of heads to 4. In addition, we set the hidden dimension to 100 and \(\tau = 0.2\). Pretraining and the downstream task run for 100 and 50 epochs, respectively, because we observed that model performance does not improve with further training.
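The setup above can be collected into a small configuration and a split helper. This is a sketch under stated assumptions: the numeric values are those reported in the text, but the dictionary keys, the function name `split_indices`, the use of a shuffled (non-stratified) split and the fixed seed are illustrative choices that the text does not specify.

```python
import random

# Hyperparameters as reported in the text; the key names are illustrative.
CONFIG = {
    "batch_size": 128,
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "adam_betas": (0.9, 0.99),
    "alpha": 1.0,            # loss weight alpha
    "beta": 0.7,             # loss weight beta
    "transformer_L": 4,
    "num_heads": 4,
    "hidden_dim": 100,
    "tau": 0.2,
    "pretrain_epochs": 100,
    "downstream_epochs": 50,
}


def split_indices(n_samples, parts=(7, 1, 2), seed=0):
    """Shuffle sample indices and cut them into train/validation/test by ratio.

    Integer arithmetic avoids floating-point rounding at the cut points.
    Shuffling without stratification is an assumption.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    total = sum(parts)
    n_train = n_samples * parts[0] // total
    n_val = n_samples * parts[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
```

With 1000 samples this yields 700, 100 and 200 indices. Given the class imbalance discussed later in the paper, a stratified split would be a reasonable alternative.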

HighD dataset

The HighD dataset is a large dataset of naturalistic vehicle trajectories released by the School of Automotive Engineering at RWTH Aachen University in Germany in 2018. The data is recorded from a bird's eye view by a UAV equipped with a high-definition camera, which avoids occlusion and captures vehicle information at high resolution. The dataset covers six highway locations and contains driving information for 110,000 vehicles, with a total recorded mileage of 45,000 km. The recorded vehicles include cars and trucks, and the published trajectories have been extracted by a neural network with centimetre-level accuracy [36].

Figure 6 shows a bidirectional highway section in the HighD dataset, in which the lane numbers increase from top to bottom. We take the vehicle's direction of travel as the X axis and the lateral direction as the Y axis. We then map the information in the dataset to this coordinate system to construct the coordinate information of the vehicles.

Fig. 6
figure 6

HighD dataset

NGSIM dataset

The other benchmark is the NGSIM highway dataset, which contains vehicle information for the highways US-101 and I-80. The dataset was collected in 2005 using multiple overhead cameras observing sections of the highway. The data was recorded over three periods of 15 minutes each and contains the trajectories of roughly 5000 vehicles. Visual tracking techniques were used to extract vehicle trajectories from the image data at a rate of 10 Hz [37]. Figure 7 illustrates the US-101 section of NGSIM in the real scene. The section has 6 lanes in total: 5 one-way main lanes and 1 ramp lane for entering and exiting. In our work, only the five one-way lanes are used to establish the coordinate system, with the direction of travel as the X axis and the lateral direction as the Y axis.

Fig. 7
figure 7

NGSIM dataset (US101)

Data processing

For each trajectory containing a lane change, we select the 30 frames before the lane change as the input of the proposed model. The input is \({I_t} = \{ {x_t},{y_t},{a_t},{\theta _t}\} \). The position x, y and the acceleration a are taken directly from the dataset, and the steering angle \(\theta \) is computed by Eq. (16), where \(\varDelta x\) and \(\varDelta y\) denote the change in the vehicle's position along the x and y axes per unit time slice, respectively.

$$\begin{aligned} \theta = \arctan \left( \frac{{\varDelta y}}{{\varDelta x}}\right) . \end{aligned}$$
(16)
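As an illustration of Eq. (16), the following sketch builds the four-feature input \(\{x_t, y_t, a_t, \theta_t\}\) for one trajectory. It rests on several assumptions that the text does not state: the function names are hypothetical, `math.atan2` is used in place of a plain arctangent of the ratio to avoid division by zero (the two agree whenever \(\varDelta x > 0\)), the first frame reuses the first available difference, and the series is assumed to end at the last frame before the lane change.

```python
import math


def steering_angles(xs, ys):
    """Per-frame angle from consecutive position differences, as in Eq. (16).

    atan2(dy, dx) equals arctan(dy / dx) for dx > 0 and stays defined at dx = 0.
    """
    if len(xs) < 2:
        return [0.0] * len(xs)
    thetas = [
        math.atan2(ys[t] - ys[t - 1], xs[t] - xs[t - 1])
        for t in range(1, len(xs))
    ]
    # Assumption: the first frame has no predecessor, so reuse the first angle.
    return [thetas[0]] + thetas


def build_input(xs, ys, accs, window=30):
    """Return the last `window` frames as (x, y, a, theta) tuples."""
    thetas = steering_angles(xs, ys)
    frames = list(zip(xs, ys, accs, thetas))
    return frames[-window:]


# Toy trajectory: 40 frames moving diagonally, so every angle is pi / 4.
xs = [float(t) for t in range(40)]
ys = [float(t) for t in range(40)]
accs = [0.0] * 40
model_input = build_input(xs, ys, accs)
```

Here `model_input` holds 30 tuples, matching the 30-frame window described in the text.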

We process the NGSIM and HighD datasets according to the above requirements. After processing and screening, we obtain 3844 trajectories from the NGSIM dataset and 17175 trajectories from the HighD dataset. The processed NGSIM dataset contains 2000 straight, 1162 left-turn and 682 right-turn trajectories. The HighD dataset yields more trajectories: 8074 straight, 3705 left-turn and 5396 right-turn.

Fig. 8
figure 8

The training process of BTCL. a Self-supervised learning training. b The training process of driving intention prediction

Model training process

Figure 8 shows the training process of our model, which is divided into two parts: self-supervised trajectory representation learning and downstream task training. Figure 8a shows the training process of the self-supervised model. As training proceeds, the loss gradually decreases and eventually stabilizes, which we take as evidence that the self-supervised stage has learned an adequate trajectory representation. Figure 8b shows the training process of the classifier, in which the learned trajectory features are applied to the three-class intention prediction task. As training proceeds, the loss and the model accuracy converge to stable values.

Baselines and evaluation metrics

Baselines

To demonstrate the validity of the model, we select as baselines several algorithms that have performed well on spatio-temporal sequence classification tasks in recent years. We train the baselines on the same data and compare their results with those of the BTCL model.

  • MLSTM-FCN [38]: The multivariate LSTM-FCN model combines a fully convolutional block with an LSTM and can effectively deal with sequence problems.

  • TS-TCC supervision [8]: This model learns sequence features through an encoder consisting of three convolutional blocks, followed by a classifier.

  • LEM [39]: A gradient-based method for tasks with long sequence dependencies, capable of learning complex high-dimensional feature information.

  • CoRNN [40]: The coupled oscillatory recurrent neural network, an RNN architecture obtained by time-discretizing a system of second-order ordinary differential equations that models controlled nonlinear oscillators.

  • TCL: A Trajectory Contrastive Learning (TCL) model that we propose for comparison. It differs from BTCL in that it contrasts trajectories in the forward direction only.

For all baselines, we set the batch size to 128, the embedding dimension to 64 and the learning rate to 0.0003. In the LEM model, the time parameter is \(\varDelta t = 0.0017\). After testing, we set the CoRNN parameters to \(\varDelta t = 0.25\), \(\gamma = 1.3\) and \(\varepsilon = 6.4\).

Since LSTM performs strongly on time series problems, we choose MLSTM-FCN as a baseline to demonstrate the capability of BTCL on spatio-temporal sequence problems. TS-TCC supervision serves as a supervised counterpart with a similar structure, so comparing it with BTCL verifies the advantages of contrastive learning. LEM is chosen to test the ability of our model to handle long-sequence dependencies. CoRNN has a strong ability to handle complex information, so we also choose it as a baseline.

Evaluation metrics

Accuracy is the most common evaluation metric in classification tasks. However, relying on accuracy alone can be misleading on an imbalanced dataset. For example, lane changes in a highway driving dataset are usually less frequent than lane keeping, so an intention predictor that always outputs lane keeping achieves a high accuracy score regardless of the input data. Therefore, we use two indicators, Accuracy and Recall, to validate the effect of the model.

  • Accuracy: It is defined as the number of correctly classified data samples divided by total number of data samples. The higher the accuracy rate, the better the classification result of the model.

    $$\begin{aligned} {Accuracy} = \frac{{{pred}}}{{{true}}}, \end{aligned}$$
    (17)

    where pred indicates the number of samples with correct predictions, true denotes the total number of samples.

  • Recall: For a given class, recall is defined as the ratio of the number of samples of that class that are correctly classified to the total number of samples in that class. A low recall indicates that many samples of the given class are incorrectly assigned to other classes.

    $$\begin{aligned} \text {Recall}_i = \frac{\text {pred}_i}{\text {true}_i}, \end{aligned}$$
    (18)

    where \(\text {pred}_i\) is the number of samples of class i that are correctly classified and \(\text {true}_i\) is the total number of samples of class i.
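The two metrics in Eqs. (17) and (18) can be computed as in the sketch below, which also reproduces the failure case mentioned above: a predictor that always outputs lane keeping. The function names and the ten-sample toy labels are illustrative assumptions, not data from the experiments.

```python
def accuracy(y_true, y_pred):
    """Eq. (17): correctly classified samples divided by all samples."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall_per_class(y_true, y_pred, classes):
    """Eq. (18): for each class, correct samples of that class over its total."""
    recalls = {}
    for c in classes:
        true_c = sum(1 for t in y_true if t == c)
        pred_c = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls[c] = pred_c / true_c if true_c else float("nan")
    return recalls


# Toy imbalanced labels: 8 lane-keeping samples, 1 left and 1 right lane change.
y_true = ["keep"] * 8 + ["left", "right"]
# A degenerate predictor that always answers "keep".
y_pred = ["keep"] * 10

acc = accuracy(y_true, y_pred)
rec = recall_per_class(y_true, y_pred, ["keep", "left", "right"])
```

On this toy example the accuracy is 0.8 even though the recall for both lane-change classes is 0, which is why per-class recall is reported alongside accuracy.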

Experimental results

In this section, we follow the experimental setup in Section 5 for the experimental analysis. To reduce the effect of randomness, we report the average of five runs with different random seeds as the final result.

Table 1 Comparison of the accuracy and recall on HighD datasets
Table 2 Comparison of the accuracy and recall on NGSIM datasets

Performance analysis

We verify the accuracy and feasibility of our model by comparing it with the baselines, using Accuracy and Recall as metrics. The results on the HighD and NGSIM datasets are reported in Tables 1 and 2. First, the comparison between MLSTM-FCN and BTCL shows that our model is more accurate than the LSTM-based model. Second, the comparison with TS-TCC supervision shows that BTCL is more accurate than supervised learning with a similar model structure. The reason is that the self-supervised method can obtain higher-order trajectory information and capture spatio-temporal information. Next, the comparison with LEM shows that the accuracy of BTCL is higher than that of LEM, while its Recall is either greater than or slightly below that of LEM. This confirms that BTCL can accurately learn feature information in long sequences. Finally, the comparison with CoRNN confirms that BTCL is highly competent in handling complex information. TCL denotes the variant that contrasts trajectories in the forward direction only. Compared with BTCL, TCL performs worse because it lacks the bidirectional trajectory contrastive structure. Overall, our model is superior or equal to existing state-of-the-art models. This demonstrates that the bidirectional trajectory contrastive module learns a generalized representation of vehicle trajectories by exploiting the coupling between the forward and backward trajectories. It also demonstrates the effectiveness of the pretext task, which helps the model learn a higher-quality and more robust representation by breaking the limits of manual labels.

The experimental results show that our model outperforms the baselines on the HighD dataset. However, its prediction accuracy for the left-turn class is poor, for two reasons: (1) the HighD dataset contains few left-turn samples, so the self-supervised model cannot fully learn an accurate representation of left-turn trajectory features; (2) to verify the trajectory learning ability of our model, we use only a linear classifier to decode the learned representation, which may decode the trajectory information insufficiently.

The results on the NGSIM dataset are similar to those on the HighD dataset, but BTCL performs better on NGSIM. We attribute this to two reasons: (1) although the NGSIM dataset is also imbalanced, the differences in sample counts among the three categories are less pronounced; (2) the lane change intentions in the NGSIM dataset are more evident.

The results on both datasets make it apparent that data collected in real scenarios is often imbalanced. The reason is that, in a real road environment, vehicles are much more likely to keep straight than to turn left or right. Under imbalanced data, the accuracy on classes with few samples can be adversely affected. Our results also show that prediction accuracy increases with the amount of data. For example, in the HighD dataset, straight samples are the most numerous, followed by right-turn and then left-turn samples, and the prediction accuracy of the three classes decreases in the same order.

Transfer learning experiments

Because the information provided by data labels is fragile, models trained on labels can transfer poorly: the trajectory representations they learn are too brittle to be transferred to similar scenarios and tasks. In autonomous driving, however, the driving scene changes constantly, and a failed or suboptimal transfer can lead to catastrophic situations. The transferability of BTCL therefore underpins the safe driving of autonomous vehicles. In this subsection, we verify the generalization ability of our model through a transfer learning experiment: we predict driving intentions on the NGSIM dataset using the representation learned by contrastive learning on the HighD dataset, and vice versa.

Fig. 9
figure 9

Transfer learning experiments. a NGSIM model transfers to HighD scenario validation. b HighD model transfers to NGSIM scenario validation

The experimental results are shown in Fig. 9. Comparing the supervised learning algorithm CoRNN with the self-supervised learning algorithms TS-TCC and BTCL shows that self-supervised learning transfers significantly better than supervised learning. Comparing BTCL with the other two algorithms shows that BTCL performs better than both. The pretext task breaks the constraints of label information, and the bidirectional trajectory contrastive module learns a generalized trajectory representation. These results show that our model learns a generalized trajectory representation, which allows it to transfer to a wider range of scenarios.

Adversarial attacks experiments

We also run experiments on synthetic datasets to show that our model can cope with sensor failures, malicious attacks and other data perturbations. We take the HighD dataset as an example. First, we train our model on the HighD dataset to obtain the representation of the real trajectories. Then, we synthesize three datasets by adding noise drawn from a normal distribution to HighD, as shown in Eq. (19).

$$\begin{aligned} \left\{ \begin{array}{l} {jitter} = \frac{1}{{\sqrt{2\pi } \delta }}\exp \left( - \frac{{{{(x - \mu )}^2}}}{{2{\delta ^2}}}\right) \\ {output} = {input} + {jitter} \,(\mu ,\delta ) \end{array} \right. , \end{aligned}$$
(19)

where jitter is the probability density of the normal distribution, \(\mu \) is the mean, \(\delta \) plays the role of the standard deviation in Eq. (19) (so \(\delta ^2\) is the variance), and input is the real data in the HighD dataset. In the three synthesized datasets, we set \(\mu \) to 0 and \(\delta \) to 0.1, 0.3 and 0.5, respectively. The experimental results are presented in Fig. 10. When the perturbation is added, the prediction accuracy of our model decreases only slightly. The joint execution of the pretext task and the bidirectional trajectory contrastive module gives BTCL a more robust representation of the vehicle trajectory. These experiments demonstrate that BTCL is robust to adversarial perturbations of its input.
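The perturbation in Eq. (19) can be sketched as follows. The function name `add_jitter`, the fixed seed, the toy trajectory and the choice to perturb every feature of every frame are assumptions made for illustration; the text specifies only the mean of 0 and the three values of \(\delta \). The sketch passes \(\delta \) to `random.gauss` as the standard deviation, matching its position in the density of Eq. (19).

```python
import random


def add_jitter(trajectory, mu=0.0, delta=0.1, seed=0):
    """Return a copy of `trajectory` with Gaussian noise added to each value.

    `trajectory` is a list of frames, each frame a list of feature values.
    `delta` is used as the standard deviation of the noise.
    """
    rng = random.Random(seed)
    return [[value + rng.gauss(mu, delta) for value in frame] for frame in trajectory]


# Toy 30-frame trajectory with four features per frame (x, y, a, theta).
clean = [[float(t), 0.0, 0.0, 0.0] for t in range(30)]

# One perturbed copy per noise level used in the experiments.
noisy_sets = {delta: add_jitter(clean, delta=delta) for delta in (0.1, 0.3, 0.5)}
```

Each entry of `noisy_sets` has the same shape as the clean trajectory, so it can be fed to the trained model without further changes.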

Fig. 10
figure 10

Adversarial attacks experiments

Ablation experiments

In this subsection, we design a set of ablation experiments to verify the effectiveness of each module. The results are displayed in Fig. 11.

Fig. 11
figure 11

Ablation study experiments

Fig. 12
figure 12

Three sensitivity analysis experiments on HighD dataset. a The effect of parameter K on experimental results. b The effect of parameter \(\alpha \) on experimental results. c The effect of parameter \(\beta \) on experimental results

To verify the effect of pre-training our self-supervised model, we use an LSTM to extract trajectory features and the same linear classifier to decode them and predict the intention. The results show that the prediction accuracy of feature extraction with the LSTM alone is far lower than that of our proposed model, mainly because our model captures temporal dependencies more accurately. Next, we omit the Driving Style Contrast (DSC) module from BTCL and compare the result with the complete model. The prediction accuracy decreases by about 3% after the DSC module is removed, so the DSC module helps improve the accuracy of our model. Without the DSC module, contrastive learning cannot accurately learn the trajectory features of drivers with different driving styles, which may reduce prediction accuracy. Finally, we train the model without the Bidirectional Trajectory Contrastive (BTC) module. The prediction result is far lower than that of the full BTCL model, with the total accuracy reduced by as much as 30%, which demonstrates the effectiveness of the bidirectional trajectory contrastive algorithm we designed. Our BTCL model can effectively address the inaccurate extraction of hidden features in trajectory information.

Sensitivity analysis

We take the HighD dataset as an example to analyze the parameter sensitivity of the proposed model, focusing on the number of predicted future timesteps K and the loss weights \(\alpha \) and \(\beta \).

The results of the parameter analysis are shown in Fig. 12. K is the length of the predicted sequence, and its value in the bidirectional trajectory contrastive learning model can greatly affect the accuracy of feature extraction. The experiments show that prediction accuracy is highest at K=2, so we set K=2. A possible explanation is that contrastive learning cannot obtain enough features when K is too small, while for \(K>2\) the available trajectory features are not sufficient to complete the contrastive learning of trajectories, which may decrease model performance.

As for the loss weights, the experiments confirm that the model performs best when \(\alpha = 1.0\) and \(\beta = 0.7\). We attribute this setting to the key role that the bidirectional trajectory contrastive module plays in the BTCL model, whereas the driving style contrastive module provides only an auxiliary benefit.

Conclusions

In this paper, we propose a novel self-supervised bidirectional trajectory contrastive learning model for driving intention prediction in autonomous driving. The BTCL model improves the accuracy of intention prediction, the generality of the learned representation, and the robustness against adversarial attacks. A pretext task that considers vehicle features and a bidirectional trajectory contrastive learning model are employed jointly to learn a generalized, high-quality representation of vehicle trajectories. Finally, we validate the effectiveness and robustness of the model on two real-world datasets, NGSIM and HighD, through performance analysis and adversarial attack experiments.

In the future, we will combine classifiers and regression models to propose a self-supervised model for multi-task prediction, which enables both the prediction of driving intentions and the high-precision prediction of vehicle trajectories. Moreover, providing a theoretical justification for the BTCL model is also a focus of future research.