1 Introduction

The interaction and cooperation of road users is an integral part of urban traffic. In highly automated and connected traffic, a self-driving car must be able to identify and evaluate an interaction situation as well as perform suitable cooperative manoeuvres. This is especially important with vulnerable road users (VRUs) [1] to guarantee comfort and safety in space-sharing conflicts (see previous book chapter by Trommler et al.). In the project “Cooperative Interaction of Cyclists and Autonomous Vehicles” (KIRa), the behaviour of cyclists in mixed traffic was investigated. As described in the previous book chapter by Trommler et al., the research focuses on low-speed areas, specifically urban intersection scenarios in which at least one cyclist and automated vehicles are present. To allow an assessment of such situations, gathering and communicating the traffic participants’ data is essential. This is especially challenging with VRUs, as technical solutions like ITS-G5 and ETSI-standardized messages like the cooperative awareness message (CAM) [2] or the manoeuvre coordination message (MCM) [3] can only be used between connected vehicles. Alternatively, the on-board sensors of an automated car could be used to detect and track VRUs, but this might not be possible at intersections if the VRU is occluded by buildings or parked cars. Instead, a more decentralized approach was chosen: mobile sensors on the bicycle, like GPS, accelerometers or gyroscopes, can be used to gather position and movement data of a cyclist by utilizing devices like smartphones, which already include these sensors. This approach requires a connection between the road users to allow the exchange of the data.

The kinematic and position data of the traffic participants alone are, however, not sufficient to assess a shared-space conflict properly. The behaviour of a cyclist is influenced by various factors. While the velocity and movement direction of the bicycle are decisive, the manoeuvres are also heavily affected by external conditions, for example the infrastructure or other road users. The main goal of this project was to infer and predict the cyclist’s behaviour, considering and utilizing the different influencing factors. To achieve this, behavioural indicators have to be identified and a suitable behavioural model is needed. The main indicator of the bicycle’s movement is its trajectory, which describes the future position and state of the bicycle. The proposed algorithm is capable of combining various types of data in order to produce an accurate trajectory forecast for cyclists. This information can later be used by automated vehicles to allow an early and appropriate cooperative manoeuvre. That way, a safe and comfortable traffic flow can be ensured for all road users.

2 Related Work

Trajectory forecasts for bicycles are not a very common research topic; however, there is a lot of work on car trajectories that has been adapted to this use case. Typically, there are two types of approaches: physical models and machine learning models, each with individual advantages and strengths. The research on physical approaches is usually focused on finding a suitable movement model for bicycles. These can, for example, be derived from pre-existing car models [4]. While a physical model can yield high accuracies in very specific use cases, a machine learning model like a neural network (NN) will often outperform it when evaluated in a wider array of scenarios [5]. The reason for that is the higher adaptability of a machine learning model when it is trained with a large amount of naturalistic data. Another advantage of the NN is its capability to incorporate data of various types to allow for a more accurate trajectory forecast. This makes the approach the best choice for this project, as the aim is to combine potential influencing factors on the cyclist’s behaviour. One of these influencing factors can be other road users in a given traffic scenario, especially at intersections. A proper trajectory forecast algorithm should therefore be interaction-aware and aim to avoid conflicts, which is realised, for example, in the algorithms proposed by Huang [6] and Ju et al. [7]. In addition to potential interaction partners, the infrastructure of a traffic site can have a significant influence on the cyclist’s behaviour. The movement of a cyclist can depend on different features of an intersection, like the lane markings, road borders and visibility of other road users [1]. This infrastructural data can be included in the prediction model, for example by converting it into rasterized maps [8].

In this project, we propose a neural network that combines and extends the aforementioned approaches. The model can make use of different data types and sources in order to accurately predict a cyclist’s future movement in interaction with AVs. For this, not only the cyclist’s kinematic data, but also data from all surrounding vehicles and area maps are utilized.

2.1 Datasets

Alongside the used training procedure, the quality, type and scope of the training data is crucial for the performance of the resulting model. For example, if the used position coordinates are unreliable, the accuracy of the model will be limited from the start and a meaningful evaluation will be hindered. For this reason, multiple publicly available trajectory datasets were analysed for their applicability in this project.

The criteria for a good training dataset depend on the use case of the desired model. In this work, cooperative traffic scenarios were investigated, as already outlined in Sect. 1. Given this area of application, the following characteristics are used for the selection of a suitable dataset:

1. The type of featured scenarios: These should be comparable with the scenario defined in Sect. 1, an intersection situation where multiple road users, including at least one cyclist, have to interact cooperatively with other vehicles. This limits the selection to recordings from urban areas. In naturalistic data, the density of the traffic can vary a lot. A high density can make the analysis of individual interactions difficult, but offers a larger amount of potential training data.

2. The included road users: While there is a multitude of recordings of cars in real traffic, the number of datasets which include the movement of the cyclists in the area is rather small. In addition, the types of road users must be explicitly distinguishable from each other, for example through a prior classification and a unique label.

3. The type of measurement data: Since this work is focused on the trajectory of cyclists, the most important measurements are position coordinates combined with corresponding timestamps. Further data of interest are the velocity, acceleration and driving direction (heading) of the vehicles. Image and video material can be helpful to improve the understanding of a traffic scenario and individual situations.

4. Frequency of the data: The frequency of the data should not be too low, to ensure that even the movement of faster vehicles can be described seamlessly. Some traffic monitoring data are recorded with frequencies of 1 Hz or less and are therefore not suitable for this project. Datasets with a recording frequency of 2 Hz or more are sufficient for the analysis of low-speed scenarios in this use case.

5. Size of the dataset: For the training of machine learning algorithms, especially neural networks, a large-scale and diverse dataset is required. The more complex the desired model, the more data is needed to train it effectively. The minimum required amount of training data is in most cases much smaller for an MLP than for more complex neural networks like an LSTM or a CNN, varying from a few hundred data points to several thousand. Measurement data with a high diversity will be required if the trained model should be able to adapt to new scenarios. Alternatively, data augmentation methods can be used to enhance the model’s transferability to new conditions.

Under consideration of these criteria, three public datasets were chosen: the “ApolloScape trajectories” dataset [9], the “Lankershim boulevard dataset” from the NGSIM project [10] and the “Intersection Drone Dataset” (InD) [11]. Of the datasets analysed, their measurement data are best suited for the investigation of cyclists’ trajectories. Nevertheless, compromises must be made when using a public dataset that was not specifically designed for one’s own use case. For this reason, the mentioned datasets are further described below regarding their features, strengths and weaknesses.

The ApolloScape dataset was recorded in the city centre of Hong Kong and includes recordings from multiple big intersections. That means it features mainly dense and complex traffic scenarios with a large number of traffic participants. The data contains position coordinates of cars, pedestrians and cyclists that are distinctly classified and labelled. All data were recorded with a frequency of 2 Hz, which is rather low, but sufficient in this case, since the average velocities of the road users are not very high. In conclusion, ApolloScape offers a large and accurate dataset. The downside is the complexity of the traffic scenarios, which makes it hardly possible to analyse individual interactions between specific traffic participants. Furthermore, the position coordinates cannot be localized in a map, because they are all relative to a local coordinate system and not a world coordinate system.

The second dataset is part of a collection of trajectory datasets from the NGSIM project. It contains recordings from multiple intersections along the Lankershim Boulevard in Los Angeles. The measurement data does not only consist of world coordinates from cars, trucks and cyclists, but also their velocity, acceleration and heading. The featured traffic scenarios are less dense than in ApolloScape, and allow for a better investigation of interaction situations between these road users. The biggest disadvantage of this dataset is the sparse number and therefore limited diversity of cyclist recordings.

The InD dataset was created with drones at various intersections in German cities. The high number of cyclists at these places creates diverse measurement data, which are very suitable for the training of machine learning models. The observed intersections are smaller than the ones in the NGSIM dataset and not regulated by any traffic lights, which leads to very heterogeneous behaviour of the cyclists. One special feature of this dataset is that a drone image is provided for each measurement location. The position coordinates of all road users can later be mapped to these images, which is a decisive factor for the proposed algorithm (see Sect. 3).

The described datasets were used to develop a proof of concept for a trajectory forecasting algorithm in cooperative traffic scenarios. The large-scale ApolloScape dataset was used to design and validate a workflow for extracting, structuring and pre-processing data, as well as a first training algorithm. The NGSIM dataset could later be used to develop a method to incorporate maps into the training procedure. This approach will be further described in Sect. 3.2. The InD Dataset is best suited for the training and validation of complex neural networks because of its high diversity in cyclist-car interactions and its high amount of data.

3 Algorithm

3.1 Model Implementation

To calculate a cyclist’s trajectory as an indicator for their behaviour, multiple algorithms were implemented. A trajectory forecast allows for a time-discrete approximation of the future state of a bicycle through, e.g., its position and velocity. One goal of this project was to investigate different influencing parameters on VRUs’ behaviours and to explore factors that can support the development of accurate predictions. A cyclist’s movement in traffic is not only dependent on the kinematics of the bicycle itself but can also be affected by, for example, the given infrastructure or individual factors of each person. For that reason, NNs have been utilized, as they are capable of incorporating different types of data into the trajectory forecast. When choosing a suitable architecture for an NN, there are many possible model types, each with their own possibilities, strengths and weaknesses. In this project, three specific NN variations were implemented and compared for their applicability in trajectory forecasting. The concepts of the used architectures, a multi-layer perceptron (MLP), a long short-term memory network (LSTM) and a convolutional neural network (CNN), are described below.

1. MLP The first implemented NN was an MLP [12], as it is the simplest of the aforementioned variants. The input for the model is a 3 s time series of the cyclist’s position coordinates and the output consists of multiple 2D coordinates of the cyclist’s estimated future position in 0.5 s intervals. This means the measurement data has to be converted and split into sequences of equal length as a preparation for the training. The focus of this first model implementation was to validate the data extraction, transformation and sequencing as well as the training procedure itself. The MLP consists of one input layer, three dense layers and one output layer, with ReLU activation functions between them. The number of layers and the best suited activation function were determined experimentally and could later be reused to design the fully connected parts of the more complex models. Despite its rather simple structure, the results showed that the MLP is already capable of predicting a cyclist’s trajectory, especially for straight movements.

2. LSTM The described MLP model calculated the cyclists’ trajectory based on only its past position, but the behaviour of the cyclist is strongly dependent on the movement of all other road users in a scene. Therefore, the next step was to extend the model to include the cooperative aspect of a traffic scenario into the prediction to allow for an interaction-aware trajectory forecast. The chosen algorithm for this task is a LSTM based on the TrajNet++ framework [13], which was originally designed to predict movements of pedestrians in crowded areas with focus on the cooperation and interaction of the individual attendants. The pre-existing implementation was altered and adapted for the use on bicycle trajectories rather than pedestrians. The model includes a classification of potential interaction situations through given parameters like the distance between the road users, the movement direction or their field of view. The conditions for this classification were adjusted to fit the faster movement of cyclists and cars in this use case and the bigger possible distances between interaction partners. The input of the model can now not only include the cyclist’s previous positions, but the movement of all road users within a specific area around the cyclist. The output of the model has been kept the same as in the MLP, a time series of estimated position coordinates for the bicycle. This ensures the comparability between the implemented models.

3. CNN The performance of the LSTM in the bicycle trajectory forecast at intersections shows that the model excels in predicting forward movement, but has weaknesses when estimating tight and sudden cornering of the cyclist. Cyclists in those kinds of manoeuvres show a high diversity in their actual movement, because of their ability for dynamic and fast steering. That means the greater possible range of motion for a cyclist when turning makes it harder for a model to predict the correct driving path. One way to increase the accuracy in those situations is to narrow the potential action radius of the bicycle. This is accomplished by providing infrastructural data to the model, like street markings and borders. In order to process this new kind of input data, a third NN type has been chosen. A CNN makes it possible to extend the input for the trajectory forecast by extracting infrastructural data from maps, and just like the LSTM it is also capable of incorporating the movement data of all road users in one given scene. How this is accomplished is described in Sect. 3.2.

The main part of the CNN, the feature extractor, is based on MobileNetV2 [14], a well-established CNN with high accuracy and fast computation times. Transfer learning makes it possible to adapt the pre-trained model to the new use case of this project, while keeping its already high precision. The fully connected part of the model is loosely based on the MLP that was described before, but experimentally optimized for this application. The output of the model is kept the same, again to allow a comparison with the other implemented models.
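Returning to the sequence preparation described for the MLP, the splitting of a recorded track into input and target sequences can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `make_windows`, the parameter `sample_hz`, and the 3 s target horizon (taken from the evaluation setting) are assumptions, and the track is assumed to be sampled at a fixed rate.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def make_windows(track: List[Point], sample_hz: float,
                 history_s: float = 3.0, horizon_s: float = 3.0,
                 step_s: float = 0.5) -> List[Tuple[List[Point], List[Point]]]:
    """Split one track into (history, future) pairs (hypothetical helper).

    history: all samples of the last `history_s` seconds
    future:  one position every `step_s` seconds up to `horizon_s`
    """
    n_hist = int(round(history_s * sample_hz))   # samples in the input window
    stride = int(round(step_s * sample_hz))      # samples between two targets
    n_future = int(round(horizon_s / step_s))    # number of predicted points
    if stride < 1:
        raise ValueError("sample rate too low for the requested step")

    pairs = []
    last_start = len(track) - n_hist - n_future * stride
    for start in range(last_start + 1):
        history = track[start:start + n_hist]
        end_hist = start + n_hist - 1            # index of the latest observation
        future = [track[end_hist + k * stride] for k in range(1, n_future + 1)]
        pairs.append((history, future))
    return pairs


# Synthetic example: a straight 10 s ride at 4 m/s, sampled at 10 Hz.
track = [(0.4 * i, 0.0) for i in range(100)]
pairs = make_windows(track, sample_hz=10.0)
history, future = pairs[0]
print(len(pairs), len(history), len(future))
```

Because every window has the same length, the flattened history can be fed directly into a fixed-size input layer, and the same targets can be reused for the other model types.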

Out of the three described model architectures, the CNN showed the highest potential in integrating different data types for a trajectory forecast with the highest possible accuracy. The input of the model could always be extended by additional parameters, as an indicator for the cyclist’s behaviour. It will be part of future works to investigate such parameters and utilize them in the proposed algorithm.
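The classification of potential interaction partners described for the LSTM (distance, movement direction and field of view) can be illustrated with a simple geometric filter. The sketch below is a standalone illustration and does not reproduce the TrajNet++ code; the function name and the thresholds of 30 m and 180° are placeholder assumptions, not the values used in the project.

```python
import math


def is_interaction_candidate(cyclist_pos, cyclist_heading_rad, other_pos,
                             max_distance_m=30.0, fov_deg=180.0):
    """Return True if `other_pos` is close enough and inside the field of view.

    All thresholds are hypothetical placeholders.
    """
    dx = other_pos[0] - cyclist_pos[0]
    dy = other_pos[1] - cyclist_pos[1]
    distance = math.hypot(dx, dy)
    if distance > max_distance_m:
        return False
    if distance == 0.0:
        return True
    bearing = math.atan2(dy, dx)
    # Smallest signed angle between the heading and the bearing to the other user.
    diff = (bearing - cyclist_heading_rad + math.pi) % (2 * math.pi) - math.pi
    return abs(diff) <= math.radians(fov_deg) / 2


# Cyclist at the origin, riding along the positive x axis.
print(is_interaction_candidate((0.0, 0.0), 0.0, (10.0, 5.0)))
print(is_interaction_candidate((0.0, 0.0), 0.0, (-10.0, 0.0)))
```

Only the road users that pass such a filter would then be included in the model input for the respective time step.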

3.2 Data Augmentation

When training the described CNN, the data pre-processing and augmentation are as important as the model implementation itself. As mentioned earlier, the reason for using a CNN is its capability of processing images quickly, in this case recordings of the infrastructure in the form of maps, satellite images or drone images. Furthermore, the model should also be given the position coordinates and kinematic data of the cyclist and all surrounding vehicles. An algorithm is proposed to merge those different data types into one combined image, which can then be used as a training input for the CNN. This data processing is described step by step below.

Fig. 1
A diagram of the y position versus the x position in meters plots a short C-shaped curve at the center.

The cyclist’s position coordinates (black line) added to a blank image. The global coordinates of the bicycle (UTM) were converted into a local coordinate system that is shown here

The first step is to convert the cyclist’s past position coordinates into local image coordinates (see Fig. 1). For consistency, the origin of the movement is always the centre of the image. The dimensions of the area shown in the image are a parameter of the model and were optimized for the given use case, here 100 m \(\times \) 100 m. The chosen image size must be compatible with the input layer of the CNN.
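This coordinate conversion can be sketched as follows. The helper names `to_pixel` and `draw_track`, the image size of 224 pixels, and the choice of the first track point as the origin are assumptions made for illustration; only the 100 m area and the centred origin come from the description above.

```python
def to_pixel(point, origin, area_m=100.0, image_px=224):
    """Map a world coordinate (e.g. UTM, in metres) to a (row, col) pixel.

    `origin` is placed at the image centre. Returns None outside the image.
    """
    scale = image_px / area_m                        # pixels per metre
    col = int(round(image_px / 2 + (point[0] - origin[0]) * scale))
    # Image rows grow downwards, so the y axis is flipped.
    row = int(round(image_px / 2 - (point[1] - origin[1]) * scale))
    if 0 <= row < image_px and 0 <= col < image_px:
        return row, col
    return None


def draw_track(image, track, origin, value, area_m=100.0):
    """Write `value` into every pixel of a square image hit by a track point."""
    image_px = len(image)
    for point in track:
        cell = to_pixel(point, origin, area_m, image_px)
        if cell is not None:
            image[cell[0]][cell[1]] = value


# Synthetic example: five positions, one metre apart, heading east.
image = [[0] * 224 for _ in range(224)]
track = [(500000.0 + i, 5800000.0) for i in range(5)]
draw_track(image, track, origin=track[0], value=255)
```

Points that fall outside the chosen area are simply skipped, which is one possible way to handle road users at the edge of the scene.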

Fig. 2
A diagram of the y position versus the x position in meters plots a short C-shaped curve at the center with various dots on the left and right, and a concave down, decreasing curve on the left.

All other vehicles’ position coordinates in the scene are added to the same image (grey lines)

As a second step, the position coordinates of all surrounding vehicles in the scene are converted in the same way and added to the image (see Fig. 2). Distinct grey scale values can be used to separate individual vehicle types from each other. In this case, the black line shows the cyclist and the grey lines show the cars.

Fig. 3
A diagram of the y position versus the x position in meters plots a short C-shaped curve at the center with various dots on the left and right, and a concave down, decreasing curve on the left in different color gradients.

Kinematic data of all vehicles and the cyclist is scaled to values between 0 and 255 (typical image color values). These values are then added to the corresponding points in the image. The cyclist is shown in blue and the car in purple

The last step of the data conversion is to include kinematic data in the image, in this case the velocity and heading value of each vehicle. For this, the given data is transformed to a scale between 0 and 255, or colour values for an image. The scaled values can then be added to one corresponding colour channel at each corresponding position of the vehicle. This means in practice that the trajectory of the vehicles is visualized in different colours depending on their velocity and heading (see Fig. 3).
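The scaling of kinematic values to colour values can be sketched as below. The channel layout (road-user type, velocity, heading), the maximum speed of 15 m/s and the function names are assumptions for illustration; the text only states that the scaled values are written into colour channels at the vehicle's position.

```python
def scale_to_byte(value, lo, hi):
    """Linearly map `value` from [lo, hi] to an integer in [0, 255], clipping outliers."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    ratio = (value - lo) / (hi - lo)
    ratio = min(1.0, max(0.0, ratio))
    return int(round(ratio * 255))


def encode_sample(image, row, col, speed_mps, heading_deg, type_value,
                  max_speed_mps=15.0):
    """Write one road-user sample into a 3-channel image (hypothetical layout).

    channel 0: grey value identifying the road-user type
    channel 1: scaled velocity
    channel 2: scaled heading
    """
    image[row][col] = [
        type_value,
        scale_to_byte(speed_mps, 0.0, max_speed_mps),
        scale_to_byte(heading_deg % 360.0, 0.0, 360.0),
    ]


# A small 4 x 4 image with three channels per pixel.
image = [[[0, 0, 0] for _ in range(4)] for _ in range(4)]
encode_sample(image, 2, 3, speed_mps=5.0, heading_deg=90.0, type_value=128)
print(image[2][3])
```

Clipping at the upper bound means that implausibly high velocities do not wrap around to small colour values.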

Fig. 4
2 diagrams of the y position versus the x position in meters of the original intersection image and the corresponding greyscale image labeled A and B. A has a photograph of the 4-way intersection, while B has a diagram of the intersection.

Important features in the original image are highlighted using an edge detection algorithm

Next, the infrastructural data is extracted from given images of the intersection. This can be done by utilizing satellite images or, as in this case, drone images of the scene (see Fig. 4a). Since the colour channels of the image are already used in the previous step to encode kinematic data, the map is converted to grey scale. To further highlight important information, e.g., street borders, the image is processed with an edge detection algorithm (see Fig. 4b). This new image is merged with the previously described converted vehicle data by matching the position coordinates exactly to the cut-out. Now, the generated image (see Fig. 5) contains all relevant data and can be used as input for the CNN. This procedure is repeated for each subsequent time step until all measurement data are converted into images. This way, a training set of more than 40,000 images could be created from the InD data, an amount that is more than sufficient to train deep neural networks.
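The map processing can be sketched in a few lines. The text does not name the edge detection algorithm, so a Sobel operator with a fixed threshold is used here purely as a stand-in; the merge rule (trajectory pixels overwrite map pixels) and all function names are likewise assumptions.

```python
def to_grey(rgb_image):
    """Convert an RGB image (nested lists) to grey scale with common luma weights."""
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
            for row in rgb_image]


def sobel_edges(grey, threshold=100):
    """Binary edge map from the Sobel gradient magnitude (border pixels stay 0)."""
    h, w = len(grey), len(grey[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (grey[y - 1][x + 1] + 2 * grey[y][x + 1] + grey[y + 1][x + 1]
                  - grey[y - 1][x - 1] - 2 * grey[y][x - 1] - grey[y + 1][x - 1])
            gy = (grey[y + 1][x - 1] + 2 * grey[y + 1][x] + grey[y + 1][x + 1]
                  - grey[y - 1][x - 1] - 2 * grey[y - 1][x] - grey[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 255
    return edges


def merge(edges, trajectory_layer):
    """Overlay the 3-channel trajectory layer on the grey edge map."""
    merged = []
    for edge_row, traj_row in zip(edges, trajectory_layer):
        merged.append([list(t) if any(t) else [e, e, e]
                       for e, t in zip(edge_row, traj_row)])
    return merged


# Synthetic 5 x 6 "map": dark on the left, bright on the right.
rgb = [[(0, 0, 0)] * 3 + [(255, 255, 255)] * 3 for _ in range(5)]
edges = sobel_edges(to_grey(rgb))
layer = [[[0, 0, 0] for _ in range(6)] for _ in range(5)]
layer[2][0] = [10, 85, 64]
combined = merge(edges, layer)
```

In practice an image library would be used for these operations; the pure-Python version only makes the individual steps explicit.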

Fig. 5
A diagram of the y position versus the x position in meters plots a short C-shaped curve and a concave down, decreasing curve on the diagram of the 4-way intersection.

Combined image with the position and kinematic data of the vehicles and the preprocessed map image of the scene. The cyclist is shown in blue and the car in purple

When creating a training set with this many images, it is important that none of the possible scenarios is overrepresented, otherwise the model will over-fit on these scenes while not performing well in other conditions. In this case one specific intersection was featuring a lot more cyclists than any other location and therefore offered way more measurement data, which means that the NN could possibly be trained on this specific intersection and fail to adapt to other infrastructures. To avoid this problem, a random rotation is applied to each image that counteracts overfitting and makes the trained models usable with other scenes as well. The adaptability of the NNs is proven by utilizing a test dataset from an intersection that was completely excluded from the training.
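The random rotation can be sketched as a nearest-neighbour rotation about the image centre. This is an illustrative, single-channel version with hypothetical function names; the angle range of 0° to 360° is an assumption, and the target positions of a training sample would have to be rotated by the same angle to stay consistent with the image.

```python
import math
import random


def rotate_image(image, angle_deg, fill=0):
    """Rotate a single-channel image about its centre (nearest neighbour)."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: look up the source pixel that lands on (y, x).
            dx, dy = x - cx, y - cy
            src_x = int(round(cx + cos_a * dx + sin_a * dy))
            src_y = int(round(cy - sin_a * dx + cos_a * dy))
            if 0 <= src_x < w and 0 <= src_y < h:
                out[y][x] = image[src_y][src_x]
    return out


def random_rotation(image, rng):
    """Apply one random rotation and return the image together with the angle."""
    angle = rng.uniform(0.0, 360.0)
    return rotate_image(image, angle), angle


grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
rotated, angle = random_rotation(grid, random.Random(0))
```

Pixels that are rotated in from outside the original image are filled with a constant value, so the corners of the augmented images may be empty.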

4 Evaluation

To evaluate the prediction accuracy of the implemented models, a test dataset has been created. For that, 1,000 images were used, which were generated from a scene that was completely excluded from the models’ training. This way, the models’ adaptability to new scenarios and conditions can be investigated. The test dataset is the same for all used NN variations to ensure comparability between the calculated trajectory forecast precisions.

With the given test dataset, a statistical evaluation was conducted to investigate the prediction accuracy of the three NNs. In the field of trajectory forecasts, a specific error metric is commonly used to calculate the deviation of an estimated position from its ground truth. This error metric is called average displacement error (ADE) and is widely used in the literature to evaluate similar approaches [15]. The advantage of using an established error metric is that it allows a comparison not only between the proposed models but also with pre-existing ones.

$$\begin{aligned} dx &= \sum _{i=1}^{n} (\hat{x}_i - x_i)^2 \\ dy &= \sum _{i=1}^{n} (\hat{y}_i - y_i)^2 \\ ADE &= \sqrt{\frac{dx + dy}{2n}} \end{aligned}$$
(1)

To determine the ADE, the deviation of a predicted point from the ground truth is calculated in lateral (dx) as well as longitudinal (dy) direction (see Eq. 1). Both individual values are used to evaluate a model’s precision for specific manoeuvres of the bicycle. The lateral error indicates how well turning movements can be predicted and the longitudinal error shows the precision during straight driving. The combined displacement of the prediction is then calculated by the mean of these two values. These values were averaged for all predictions that correspond to each image in the test dataset in order to obtain the average error for every implemented model.
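The error computation can be sketched as follows. The sketch reads Eq. 1 as the square root of (dx + dy) divided by 2n, which is the interpretation suggested by the phrase “the mean of these two values”; the function name and the return format are illustrative assumptions.

```python
import math


def displacement_error(predicted, ground_truth):
    """Return (dx, dy, ade) for one predicted trajectory, following Eq. 1.

    dx, dy: summed squared deviations in lateral and longitudinal direction
    ade:    sqrt((dx + dy) / (2 * n)), with n the number of predicted points
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    n = len(predicted)
    dx = sum((p[0] - g[0]) ** 2 for p, g in zip(predicted, ground_truth))
    dy = sum((p[1] - g[1]) ** 2 for p, g in zip(predicted, ground_truth))
    return dx, dy, math.sqrt((dx + dy) / (2 * n))


# Synthetic example: the prediction drifts sideways by 1 m and then 2 m.
truth = [(0.0, 0.0), (0.0, 0.0)]
pred = [(1.0, 0.0), (2.0, 0.0)]
dx, dy, ade = displacement_error(pred, truth)
print(dx, dy, ade)
```

Averaging the returned value over all test samples then gives one error figure per model.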

4.1 Evaluation of Different Input Data

The main purpose of the evaluation process was to compare the influence of different data types on the trajectory prediction accuracy for the cyclist. For that, the three NN types were compared, each of which incorporates additional data. As described in Sect. 3, the MLP uses only the cyclist’s past positions for a trajectory prediction, the LSTM is capable of incorporating movements of potentially interacting vehicles, and the CNN additionally uses infrastructural data as input for the model. A Kalman filter was used to compare the proposed machine learning algorithms with a physical approach. Like the MLP, the Kalman filter uses only the cyclist’s position as input, and therefore both models offer a baseline accuracy for the trajectory predictions.
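To illustrate what such a physical baseline can look like, the following sketch filters the past positions with a constant-velocity Kalman filter per axis and extrapolates the last state. The motion model, the noise parameters `q` and `r`, and all names are assumptions; the text does not specify how the Kalman filter in the evaluation was configured.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one axis, state = [position, velocity]."""

    def __init__(self, p0, q=0.5, r=0.25):
        self.p, self.v = p0, 0.0
        self.P = [[1.0, 0.0], [0.0, 10.0]]   # initial state covariance
        self.q, self.r = q, r                # process / measurement noise (placeholders)

    def predict(self, dt):
        self.p += self.v * dt
        P = self.P
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]]; Q only on the velocity term.
        p00 = P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1]
        p01 = P[0][1] + dt * P[1][1]
        p10 = P[1][0] + dt * P[1][1]
        p11 = P[1][1] + self.q
        self.P = [[p00, p01], [p10, p11]]

    def update(self, z):
        P = self.P
        s = P[0][0] + self.r                 # innovation covariance
        k0, k1 = P[0][0] / s, P[1][0] / s    # Kalman gain
        y = z - self.p                       # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
                  [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]


def kalman_forecast(history, dt, horizon_s=3.0, step_s=0.5):
    """Filter the observed positions, then extrapolate with the final velocity."""
    fx, fy = AxisKalman(history[0][0]), AxisKalman(history[0][1])
    for x, y in history[1:]:
        fx.predict(dt)
        fy.predict(dt)
        fx.update(x)
        fy.update(y)
    steps = int(round(horizon_s / step_s))
    return [(fx.p + fx.v * k * step_s, fy.p + fy.v * k * step_s)
            for k in range(1, steps + 1)]


# Synthetic example: 3 s of straight riding at 4 m/s, sampled at 10 Hz.
history = [(0.4 * i, 0.0) for i in range(30)]
forecast = kalman_forecast(history, dt=0.1)
```

Such a model extrapolates straight movement well but has no means to anticipate a turn, which matches its role as a position-only baseline.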

Table 1 ADE comparison of the four used models. A constant prediction time frame of 3 s has been used with all models

The table (see Table 1) shows the ADE of the three implemented NNs and the Kalman filter. A constant prediction time frame of 3 s was used for all models to make the results comparable.

The MLP revealed the highest average error of 1.9 m, followed closely by the Kalman filter with an error of 1.78 m. Both of these algorithms only use the cyclist’s past position as input for the trajectory forecast, which means a lower accuracy is to be expected here. Despite the higher overall error, the lower longitudinal error indicates that a prediction of movements in driving direction is possible even with these comparatively simple models.

The LSTM column shows that the inclusion of interactions between the cyclist and vehicles in close proximity yields a significant accuracy increase. A LSTM is generally better suited for a time series forecast than the aforementioned models, which means the lower average error is caused by not only the interaction-aware prediction but also a more viable algorithm for this use case.

The best model in this comparison is the CNN with an ADE of 0.84 m. It includes all relevant vehicles’ past positions and kinematic data as well as infrastructural data of the cyclist’s surroundings. An accuracy increase of up to 42% can be seen in longitudinal and lateral direction compared to the LSTM. Especially the improvement in lateral movements is important, as it makes the CNN the best model to accurately predict cyclists’ turning manoeuvres out of the compared approaches.
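As a worked example of how such relative improvements are obtained, the ADE values stated above for the MLP (1.9 m), the Kalman filter (1.78 m) and the CNN (0.84 m) give the following relative error reductions of the CNN over the two baselines. The per-direction comparison with the LSTM cannot be reproduced in the same way from the values quoted in the text.

$$\begin{aligned} \frac{ADE_{MLP} - ADE_{CNN}}{ADE_{MLP}} &= \frac{1.9 - 0.84}{1.9} \approx 0.56 \\ \frac{ADE_{Kalman} - ADE_{CNN}}{ADE_{Kalman}} &= \frac{1.78 - 0.84}{1.78} \approx 0.53 \end{aligned}$$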

The influence of the infrastructural data on the trajectory forecasts was further investigated by implementing a variation of the CNN that is not using this data. All other training and model parameters remained constant, which makes this version comparable to the aforementioned CNN (Table 2).

Table 2 ADE comparison of two CNN variations. One with and one without the usage of map data

The comparison of the ADEs for both CNN variations shows a clear improvement in the forecast accuracy when using infrastructural data.

4.2 Evaluation of Prediction Time Horizons

The prediction time horizon is one of the most important parameters of any trajectory forecast model. It defines the time frame over which the prediction is supposed to happen. Naturally, the accuracy of a calculated trajectory decreases with a higher time horizon. The overall precision of a model determines how long the prediction time can be chosen before the accuracy is not sufficient anymore for the given use case. The previous results were generated using a prediction time horizon of 3 s. In the following, the influence of different time horizons on the ADE is investigated. This can be achieved by training a separate model with each of the respective prediction time frames from one to six seconds (Fig. 6).

Fig. 6
A multi-line graph of A D E in meters versus prediction time in seconds. It plots the C N N, M L P, L S T M, and Kalman filter lines with an increasing trend where M L P has the highest A D E and prediction time.

Comparison of the four used trajectory forecast models with increasing prediction time horizons

The comparison shows that the ADE of each model increases with higher prediction times. The MLP showed the worst results at high prediction times, to the point where the model could no longer be trained if the time horizon were increased further. The best performing model is the CNN, which yields a better performance than the other models at all prediction times. All error values eventually increase to a level where they are no longer reasonable for the use case of an intersection. The crucial difference is that the CNN could be used to predict trajectories of cyclists of up to 4 s length, while the Kalman filter and the MLP are already too inaccurate at a prediction time of 3 s.

In future work, possibilities of increasing the forecast time frame can be further investigated. The most important advantage of the used NN is the fact that additional types of data, e.g., behavioural or psychological parameters can be incorporated in order to allow for an even more precise and long-term trajectory prediction.

5 Discussion and Future Work

In both this book chapter and the previous one by Trommler et al., influencing factors on the behaviour of cyclists in interaction situations with cars were investigated. The analysis of naturalistic cycling studies in the book chapter by Trommler et al. showed that the cyclists’ actions can be very dynamic and depend on a variety of conditions. Determining every deciding factor on a certain manoeuvre is not always possible. A neural network like the proposed model of this chapter is a well-suited method to predict such manoeuvres. Not only is this machine learning algorithm capable of utilizing input parameters of all sorts but it may also implicitly learn behaviour patterns of cyclists, by training it with a large amount of real traffic data.

The implementation and evaluation of the neural networks proved to be a viable approach for an accurate trajectory forecast of cyclists in cooperative interaction situations with cars. The dynamic and fast movement of a bicycle makes the cyclist’s behaviour generally hard to anticipate. Providing information about the scenery and the other vehicles helps to improve this estimation by reducing the cyclist’s possible range of action. Comparing the effect of additional input data showed that the cyclist’s movement can be predicted better by utilizing influencing factors in its surroundings, like the infrastructure. This can be seen in the contrast between the accuracy of the proposed CNN and that of the Kalman filter (see Sect. 4). The advantage of a neural network in this use case is its capability to incorporate parameters that a physical model could not. In future work, this approach can be expanded to utilize even more behaviour indicators, like individual factors of the cyclist.

The evaluation of the proposed CNN for cyclist trajectory prediction showed that its accuracy can vary depending on the type of movement. The number of behaviour patterns that a machine learning algorithm can model is limited by the scenarios that are included in the training dataset, but also by the used input parameters and whether they can serve as indicators for a certain manoeuvre. One example of a manoeuvre that cannot be predicted early is the start of a cyclist who was standing and waiting at an intersection. The currently used parameters, the kinematic data of the bicycle and its past position, give no indication of such a motion, making it impossible to estimate when the cyclist is going to start. To allow for such a prediction, additional input parameters would need to be incorporated. Such potential indicators for the cyclist’s starting behaviour were investigated in the previous book chapter by Trommler et al. A body pose detection algorithm could be used to extract these features and use them as additional input for the proposed neural network. This would widen the algorithm’s potential field of use and will be subject to future research.

6 Conclusion

Neural networks can combine the autonomous vehicle’s trajectory with infrastructural data to forecast a collision-free trajectory for cyclists in interaction situations. The research in this project showed that a cyclist’s behaviour in mixed traffic is dependent on many factors. A neural network like the CNN that was discussed in this work proved to be a potent algorithm to incorporate data of various types, e.g., the kinematic data of the road users and map data, into a trajectory forecast. Combining these data types significantly increased the overall accuracy of the movement prediction. In future work, this algorithm will be extended by including more potential influencing parameters on the VRUs’ behaviour.