1 Introduction

Traffic is what moves us. Due to rapid developments in hardware, software, communication, and connectivity over the past years, the point at which we no longer have to steer ourselves is within close reach. In special use cases, e.g., on motorways at reduced speeds, we are already there. Autonomous cars aim to combine additional comfort with greater efficiency when it comes to traffic jam avoidance, pathfinding, and car sharing. The most crucial aspect, though, is the potential to create safer traffic with fewer accidents and fatalities. This is the key issue that currently prevents the deployment of autonomous cars.

The project Detecting Intentions of Vulnerable Road Users Based on Collective Intelligence as a Basis for Automated Driving (DeCoInt\(^2\)) aims at providing the benefits of cooperation in traffic with a focus on pedestrians and cyclists. The consortium consists of three partners who work together, contributing novel ideas and algorithms to tackle this crucial challenge in future traffic. Collaboration and cooperation are necessary in the automated domain as well. A single source of data always lacks information. The shift from forward-looking sensors to 360\(^\circ \) perception systems is a first step to alleviate this issue, but a single sensor-equipped vehicle is still not able to resolve occlusions or to sense behind corners. Yet, this is crucial for a holistic understanding of the current situation and upcoming dangers, in order to protect VRUs and to enable efficient and comfortable driving.

Autonomous cars will not operate in isolation. Their introduction drastically influences every aspect of mobility in public space. Vulnerable road users (VRUs), i.e., micromobility users or pedestrians, share parts of the same space as autonomous vehicles. Whereas cars have the capability to communicate and share information on a technical level, VRUs cannot tell an autonomous car that they want to cross the street by establishing eye contact, as they would with a human driver. Special care has to be taken to make automated traffic safe for all vulnerable traffic participants.

Fig. 1 Vision of a connected and cooperating world to provide safety for VRUs: equipped vehicles share information about their perception and prediction directly with the vehicle on the other side of the road and with the individual through the signal post

Fig. 2 Cooperative perception and movement prediction sensor setup. a Exemplary cyclist body area network: smartphone, smartwatch, and smart helmet connected via Bluetooth. b Research vehicle equipped with LiDAR, stereo camera, and motion analyzer. c Wide-angle stereo camera setup at the research intersection; the camera views are marked by the boxes

The vision of future traffic on which the DeCoInt\(^2\) [8] project is based is illustrated in Fig. 1. Equipped vehicles share information about their perception and predictions and thus extend their individual limits. Even VRUs themselves and static infrastructure can communicate and contribute to this local ad hoc information environment. Accordingly, there are three core components in our project to perceive pedestrians and cyclists: the intersection infrastructure, the sensor-equipped vehicle, and sensor-equipped VRUs. Together, we collect data to build a labeled ground truth database, apply existing approaches under real-world circumstances, and learn from the observed behavior by training novel models, thus pushing the state of the art of VRU protection systems in an automated and connected world. Figure 2 depicts the actual sensor setup. At the bottom, the static, wide-angle, synchronized stereo camera setup mounted at the research intersection [28] in Aschaffenburg, Germany, is shown together with two sample images illustrating the fields of view. The common field of view is the area around the corner where the main road meets a side road. Additionally, we collected data with a research vehicle [36] equipped with a LiDAR, a stereo camera, and an automotive dynamic motion analyzer (ADMA). The latter provides a self-localization ability. We created a local coordinate system in which the ADMA of the research vehicle and the stereo camera setup of the intersection share the same origin. The third component is the VRUs themselves. We conducted measurement campaigns following specific scenarios involving the research vehicle and VRUs in the area of the research intersection. The VRUs are equipped with smart devices [7], as, for example, depicted in Fig. 2a. In the common field of view, labeled data, together with the precise calibration of the stereo camera system, provides a positional ground truth. The VRU smart devices provide inertial and positional information about the VRU. Altogether, throughout the project, we collected short sequences with instructed and uninstructed VRUs capturing the movements listed in Table 1. Curated parts of the database are publicly available [41, 44, 65, 66]. Extensive descriptions of the data collection and preparation, the approaches, algorithms, and evaluations can be found in the Ph.D. theses [7, 36, 53, 64] that evolved from the project.

Table 1 The gathered dataset with the number of scenes, persons, and motion primitives describing the possible motion states of VRUs

1.1 Main Goals

The main focus of our approach, and consequently of the DeCoInt\(^2\) project, is the investigation of techniques for cooperative intention detection and trajectory forecasting of VRUs. Our overall goal is to detect the intentions of VRUs early and reliably using the collective intelligence of all road users. A schematic of this process is depicted in Fig. 3.

Fig. 3 Schematic representation of the overall process of cooperative perception and cooperative intention detection [8]: signals from the vehicle, the infrastructure, body-worn devices, and other agents pass through cooperative perception and detection, situation understanding, and cooperative maneuver and trajectory planning

Due to the ability of VRUs to suddenly start a motion or to change their direction of motion, a dangerous situation may arise within hundreds of milliseconds. To avoid accidents, autonomous vehicles must be aware of their surroundings at all times. This includes not only the current but also the future positions of VRUs. Based on position forecasts, each autonomous vehicle can then plan a safe trajectory in mixed traffic. To achieve this goal, we aim to perform cooperative trajectory forecasting for VRUs. We generate forecasts over a short time horizon of 2.5 s, which is sufficient to perform emergency braking or evasive maneuvers [7, 64]. While we aim at forecasting trajectories with high positional accuracy, all predictions of VRU behavior are subject to error. This is especially true for larger forecast horizons, since VRUs can change direction quickly without any evidence of that behavior at the time at which the forecast is made. Therefore, we also need to quantify the uncertainty of our forecasts. This can be achieved by probabilistic trajectory forecasting, where, instead of a single position for every forecast time step, regions with a certain probability are predicted. The main goal of probabilistic trajectory forecasting is to generate reliable estimates, i.e., if we estimate regions with a probability of 95%, the true position of the VRU should fall into that region in 95% of all forecasts. Another goal of probabilistic trajectory forecasting is to estimate regions that are as small as possible to allow efficient maneuver planning. Since a single sensor setup is prone to occlusion in dense traffic, our goal is to perform these forecasts cooperatively.
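
As a minimal illustration of this reliability criterion, the following sketch checks the empirical coverage of predicted confidence regions, assuming for simplicity that each forecast step outputs a bivariate Gaussian (one possible region representation; our models may represent regions differently):

```python
import numpy as np
from scipy.stats import chi2

def empirical_coverage(means, covs, truths, level=0.95):
    """Fraction of ground-truth positions inside the predicted confidence
    ellipses. A reliable forecaster should roughly match `level`.

    means : (N, 2) predicted positions
    covs  : (N, 2, 2) predicted covariances
    truths: (N, 2) observed ground-truth positions
    """
    threshold = chi2.ppf(level, df=2)          # Mahalanobis bound in 2D
    hits = 0
    for mu, cov, x in zip(means, covs, truths):
        d = x - mu
        m2 = d @ np.linalg.inv(cov) @ d        # squared Mahalanobis distance
        hits += m2 <= threshold
    return hits / len(means)

# Toy usage: a well-calibrated forecaster should report roughly 0.95.
rng = np.random.default_rng(0)
covs = np.repeat(np.eye(2)[None] * 0.25, 1000, axis=0)
means = rng.uniform(-5, 5, size=(1000, 2))
truths = means + rng.multivariate_normal([0, 0], covs[0], size=1000)
print(empirical_coverage(means, covs, truths))
```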

According to the Oxford Dictionary, cooperation is the action of working together to the same end [51]. In our case, this action is the process of combining information originating from different sources, i.e., vehicles, sensor-equipped infrastructure, and the VRUs themselves, to increase the safety of VRUs. In the following, all involved entities are referred to as agents. In our work, we see the cooperative system from the perspective of an ego vehicle. All agents (including the ego vehicle) perform cyclist detection and intention detection locally. The agents exchange information via a wireless ad hoc network (i.e., a V2X network). Ego-vehicle information (such as the position) and fused information from earlier stages are always available to the ego vehicle. For the sake of brevity, the corresponding arrows are not shown in Fig. 3. Perception incorporates the detection of cyclists, e.g., in camera images, RADAR, or LiDAR scans. Smart devices and other wearables detect the position using their integrated GNSS receivers, predict the VRU class, and perform intention detection (i.e., basic movement detection and trajectory forecasts) using their inertial sensors (cf. Sect. 4.2). Furthermore, we assume that the time between the agents is synchronized, e.g., via GPS time.

Fig. 4 Two fusion approaches for the cyclist's basic movement detection [7]: the feature-level and the decision-level fusion paradigm

We conduct cooperative intention detection on the feature- and the decision-level. We can further subdivide this into the fusion of basic movements and trajectory forecasts. We depict a schematic showing the feature- and the decision-level fusion paradigms for the cyclist’s basic movement detection in Fig. 4. We refer to the feature-level fusion paradigm as the fusion of sensor measurements and features from different sensors and sources. This combined information is used to detect the cyclist’s intention better. In the decision-level fusion paradigm, predictions from basic movement and forecasting models of different road users are fused.
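
The following sketch contrasts the two paradigms in an abstract form; the class set, feature vectors, and equal weighting are hypothetical and not the configuration used in our experiments:

```python
import numpy as np

def feature_level_fusion(features_per_source, classifier):
    """Concatenate features from all sources and classify once."""
    fused = np.concatenate(features_per_source)
    return classifier(fused)

def decision_level_fusion(probs_per_source, weights=None):
    """Average (optionally weighted) class probabilities predicted
    independently by each source's own model."""
    probs = np.asarray(probs_per_source, dtype=float)
    if weights is None:
        weights = np.ones(len(probs)) / len(probs)
    fused = np.average(probs, axis=0, weights=weights)
    return fused / fused.sum()

# Hypothetical example with three basic movement classes
# (e.g., waiting, starting, moving) and two sources.
probs_intersection = [0.7, 0.2, 0.1]
probs_smart_device = [0.4, 0.5, 0.1]
print(decision_level_fusion([probs_intersection, probs_smart_device]))
```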

1.2 Outline

In this chapter, we describe how we solved the issues of detecting and tracking VRUs, together with the follow-up process of intention detection, as our main contribution. Our focus lies on the cooperative information gain based on the multimodal setting depicted in Fig. 2. In doing so, we first evaluate the weaknesses of uni-modal object detection and determine the strengths and opportunities of multiple different sensor sources via cooperative tracking in Sect. 2. Towards confidence estimation, we identify external factors influencing the tracking performance, i.e., context information. VRU tracks form the essential input to the intention detection process in Sect. 3, which we divide into basic movement detection and trajectory forecasting. We perform both steps separately on the three different device types we have available, i.e., the stationary cameras, the moving vehicle, and the smart devices. This is beneficial, as we point out throughout the work, due to their different constraints and possibilities. Moreover, we identify additional input data derived from the sensors, such as optical flow images and poses. The basic movements form an additional input for trajectory forecasting, the essential part of our work. First, we predict the movement of pedestrians and cyclists in a deterministic manner, again separately for our three sources. Then, in our work on probabilistic trajectory forecasts, we show approaches of how to estimate, predict, and evaluate the confidence in the forecasts made with stationary cameras and a moving vehicle. At this point, we emphasize the contribution of our work to the trajectory planning task of autonomous vehicles. Object detection, together with tracking and probabilistic trajectory forecasts, is of direct use in the planning of efficient and safe vehicle paths. In Sect. 4, we showcase the benefits of cooperatively using our three information sources in the intention detection process, as we already have for the tracking stage. As shown in Fig. 2, the VRUs themselves, equipped with multiple smart devices, form an information source. Therefore, we show how the different devices and wearing positions contribute specific information gains and how we can combine them beneficially. In the next step, we examine different methods for cooperative intention detection, including feature- and decision-level fusion for basic movements and trajectories. In this context, we present innovative solutions for a great variety of problems, such as delay, sensor outage and occlusion, out-of-sequence fusion, and information loops. Moreover, we investigate and compare different approaches and elaborate on the possibilities of implementing such a system utilizing current V2X protocols and standards, such as collective perception messages (CPM) and cooperative awareness messages (CAM).

2 Cooperative Perception and Tracking

The first step on the road to intention detection and prediction of VRUs is to detect pedestrians and cyclists and to identify them throughout the scene, i.e., to perform tracking. This has to be done in a precise and accurate way, as it is the basis for all following steps. Accuracy addresses the ability to ensure that a detected VRU actually exists and is indeed of the claimed type. If a VRU is not detected although it is in the field of view of the sensors and not occluded, the accuracy is reduced. The precision measures the distance of a detected VRU to the ground truth real-world object. Our approach is to achieve reliable and precise detection and tracking in a multimodal and multi-agent setting, i.e., in a cooperative way. We make use of a stereo camera setup at a static roadside unit, a stereo camera mounted in a vehicle, and the VRUs themselves equipped with smart devices to provide better coverage and more precise results than a single sensor could provide. Additionally, we show the impact of context information on the performance of our system, which is representative of state-of-the-art detection algorithms, and the ability to gather context information more extensively and accurately in cooperation.

2.1 Context Dependent Detection

In this section, we explore the performance of state-of-the-art object detection methods. On the way to fault-free and therefore safe autonomous driving, such perception techniques should be reliable in any case. For two exemplary types of context knowledge, the viewing angle and the lighting situation, we discover significant differences from the regular performance.

2.1.1 Viewing Angle Dependent Bicycle Detection

Neural networks (NNs), and in the field of image processing especially convolutional neural networks (CNNs), define the state of the art for detecting objects, segmenting images, and many more tasks. Although they raise the bar in this domain and outperform all previous approaches, provided the test data is at least similar to the training data, there is still room for improvement. In this section, we want to highlight some flaws of state-of-the-art detection algorithms based on our data.

Fig. 5 Bicycle, person, and car detections from the left and right camera view: in one camera view, a car and a bicycle are detected; in the other, a car and a person

Figure 2c shows the viewing angles of our static stereo camera setup. There is a bike lane next to the sidewalk on the main street. The lane is directed towards the left camera; the right camera has an orthogonal view of the lane. An example of a cyclist riding on the described bike lane can be seen in Fig. 5. The figure also shows the detection boxes with classes and confidences created by a Faster R-CNN [56] network with a ResNet-101 [29] backbone trained on the COCO dataset [46], which offered state-of-the-art performance at the time of the evaluation considering a close to real-time execution speed. A cyclist is not labeled as a separate class. Instead, a cyclist is detected as a person bounding box and a bicycle bounding box whose intersection over union (IoU) lies above a determined threshold; we found an IoU of 0.3 to be sufficient. The detection of cyclists is a central part of our project, as we want to increase the cyclists' safety by predicting their behavior. In particular, reliable bicycle detection is crucial. Figure 5 shows the bicycle detected in the right, orthogonal camera view, but not in the left, straight-ahead camera view. This is indeed no exception when evaluating the bicycle detection rate with respect to the two mentioned camera angles. We evaluated 51 scenarios of cyclists riding on the bicycle lane or on the pavement next to it in the direction indicated in Fig. 5. The scenarios cover most of the ways cyclists can be visible in the two cameras.
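
A minimal sketch of this matching rule is given below; the box format and the example detections are hypothetical and only illustrate the IoU criterion:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match_cyclists(person_boxes, bicycle_boxes, threshold=0.3):
    """Pair person and bicycle detections whose IoU exceeds the threshold
    and report the pairs as cyclist detections."""
    cyclists = []
    for p in person_boxes:
        for b in bicycle_boxes:
            if iou(p, b) >= threshold:
                cyclists.append((p, b))
                break
    return cyclists

# Hypothetical detections in pixel coordinates
persons = [(100, 80, 160, 220)]
bicycles = [(95, 150, 170, 240)]
print(match_cyclists(persons, bicycles))
```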

Table 2 Viewing angle dependent bicycle detection rate evaluated on 51 scenarios

Table 2 lists the detection rates with respect to all frames of all 51 scenarios. The detection rate in the right camera is 0.8993. It may be a little lower than the detection rate expected for an object detection task, but the facts that people are sitting on the bicycles in every image and that the contrast with respect to the background is low in some parts make the task more difficult than in the training dataset. Moreover, the weather conditions are challenging in some scenarios. We elaborate on this in more detail in the following. Nevertheless, the detection rate of 0.2476 in the left camera is significantly worse than the one in the right. The detection rate is lower for the left camera angle in 50 of the 51 scenarios. The causes can be manifold. It might be an underrepresentation of such images in the training dataset or simply a more challenging task due to the fact that less of the bicycle is visible and more of it is hidden by the rider. Whatever the case, the consequence is that we are not able to reliably detect a cyclist in such a situation if only the straight-ahead camera is available.

We therefore argue that at least a second camera angle is necessary to be able to track cyclists without gaps. This will be discussed in more detail in Sect. 2.3. Moreover, we created a novel algorithm to separate the foreground from the background in static camera setups [55] that is able to identify moving objects. Regions of movement without detected objects indicate missed detections.
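
The algorithm from [55] is not reproduced here; as an illustration of the underlying idea, the following sketch uses a generic OpenCV background subtractor to flag moving regions that overlap no detection:

```python
import cv2
import numpy as np

# Generic background subtractor as a stand-in for the algorithm in [55].
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

def motion_regions(frame, min_area=500):
    """Bounding boxes (x, y, w, h) of moving regions in a static camera frame."""
    mask = subtractor.apply(frame)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours
            if cv2.contourArea(c) >= min_area]

def overlaps(motion_box, det_box):
    """Axis-aligned overlap test; motion_box is (x, y, w, h), det_box (x1, y1, x2, y2)."""
    mx1, my1 = motion_box[0], motion_box[1]
    mx2, my2 = mx1 + motion_box[2], my1 + motion_box[3]
    return not (mx2 < det_box[0] or det_box[2] < mx1 or
                my2 < det_box[1] or det_box[3] < my1)

def missed_detections(motion_boxes, detection_boxes):
    """Motion regions overlapping no detection hint at missed objects."""
    return [m for m in motion_boxes
            if not any(overlaps(m, d) for d in detection_boxes)]
```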

Table 3 Lighting situation dependent person detection evaluated on 51 scenarios

2.1.2 Lighting Situation Dependent Person Detection

The differences in detection performance with respect to the viewing angle give a hint that it is important to take additional information into account when estimating our trust in the output of our perception system. We call this additional information context. It is an explicit goal of our project to investigate the influence of context information on the detection and prediction capabilities of our algorithms. Therefore, we split our database with respect to another criterion, the time of day. In Sect. 2.4, we take a look at further types of weather and lighting condition context. Nevertheless, the main focus in the creation of our dataset was to provide a basis for the development and evaluation of algorithms for VRU intention detection. The different times of day and viewing angles are a side product of the data-capturing process; they are represented in a significant number of scenarios, so we are able to make deductions. We also observed rain and sun glare, but in too few scenarios to provide statistically significant results. In the current projects of the three participating teams, we continue to extend the existing database specifically with respect to such further aspects to evaluate and address model weaknesses. To rule out the viewing angle as an influencing factor, we consider person detection in the two camera views for the 51 scenarios already mentioned. The mean detection performance is similar for both cameras, with a recall of 0.825 in the right camera view and 0.780 in the left one. There are 11 scenarios captured in the evening or during a thunderstorm, which resulted in less daylight and darker images. We refer to such scenarios as dawn_dusk, in contrast to daytime, which is the regular case. Table 3 shows the detection rates. The detection rate is lower for the 11 dawn_dusk scenarios for both cameras. In the left camera, the difference is about 0.11, and in the right one even 0.21. The sample size is still small, so with more data and a network trained on more dawn_dusk images, the difference might shrink. Nevertheless, this motivates us to further work on the detection and determination of context information and to include context information in the data collection process and in the assessment of detection confidences.

2.2 Cooperative Detection and Tracking of Cyclists

Cooperation is an integral component on the way to a comprehensive and reliable detection of VRUs in an automated traffic environment. To showcase the ability of an equipped multi-agent system to overcome the limitations of, for example, a single ego vehicle, we gathered data in multiple measurement campaigns in real-world environments based on the setup described in Sect. 1.

Fig. 6 Cyclist occluded by a truck in the right camera of the static stereo camera system: in the left camera view, the cyclist is detected; in the right camera view, no cyclist is detected

In a reduced setting, we showcase the benefit of cooperation in [54]. We concentrate on the tracking of cyclists, i.e., we assume that the objects we want to track are cyclists. The relevant device the cyclists carry is a smartphone in the trouser pocket. Follow-up works described in Sect. 4.2 elaborate on additional devices mounted at different wearing positions. The output of the VRU sensors is an estimate of the velocity, the yaw rate, and the GPS-based position. Additionally, we detect and determine the 3D positions of the cyclists with the static stereo camera setup mounted at the research intersection. The intersection sensor setup provides a positional accuracy of less than 10 cm in every direction. The accuracy of the intersection detections is superior to the one provided by smart devices. Therefore, if both sources are available, the smart devices do not contribute to better tracking performance. Nevertheless, in cases of occlusion, e.g., a truck blocking the view of a cyclist from one camera, the stereo camera system can no longer perform 3D object detection. An example of such a scenario can be seen in Fig. 6. Smart devices are still able to communicate their measurements. In [54], we show that in more than 84% of the turning scenes under occlusion, the additional smart device information benefits the tracking performance significantly. Figure 7 depicts a scene in which the cooperative tracking follows the ground truth closely, whereas the tracking based on the static camera setup alone loses sight and can only predict the following positions based on the tracking model.

The chosen model is the bicycle model tracked with an extended Kalman filter (EKF) [3, Chap. 10]. The state transition for the state space \(\textbf{x} := [x, y, \gamma , \dot{\gamma }, v]^T\) with the positional coordinates x and y, the orientation \(\gamma \), its derivative \(\dot{\gamma }\), i.e., the yaw rate, and the velocity v is given by

$$\begin{aligned} f(\textbf{x}) := \begin{bmatrix} x + \cos (\gamma )\, a - \sin (\gamma )\, b \\ y + \sin (\gamma )\, a + \cos (\gamma )\, b \\ \gamma + \dot{\gamma }\, T\\ \dot{\gamma }\\ v\\ \end{bmatrix} \end{aligned}$$
(1)

with \(a =\frac{\sin (\dot{\gamma }\, T)\, v}{\dot{\gamma }}\) and \(b = \frac{(1 - \cos (\dot{\gamma }\, T))\, v}{\dot{\gamma }}\) for a time step T. Besides x and y, the z coordinate can be determined by the stereo camera setup, but it is not used in the referenced evaluation. The smart devices contribute the velocity v and the yaw rate \(\dot{\gamma }\). The occlusion in the case referred to in Fig. 7 starts at the white-filled circle. From this point on, the trajectories drift apart. Due to the yaw rate information from the smart devices, the green cooperative track can follow the ground truth closely. The grey- and black-filled circles depict one and two seconds after the start of the occlusion.
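
A minimal sketch of the state transition from Eq. (1) is given below; the near-zero yaw rate case is handled explicitly because the formula divides by \(\dot{\gamma }\). The Jacobian and the measurement models required by the full EKF are omitted:

```python
import numpy as np

def f(state, T):
    """State transition of the bicycle model, Eq. (1).

    state = [x, y, gamma, gamma_dot, v]; T is the time step in seconds.
    For gamma_dot close to zero the model degenerates to straight-line
    motion, which is handled explicitly to avoid division by zero.
    """
    x, y, gamma, gamma_dot, v = state
    if abs(gamma_dot) > 1e-6:
        a = np.sin(gamma_dot * T) * v / gamma_dot
        b = (1.0 - np.cos(gamma_dot * T)) * v / gamma_dot
    else:
        a, b = v * T, 0.0
    return np.array([
        x + np.cos(gamma) * a - np.sin(gamma) * b,
        y + np.sin(gamma) * a + np.cos(gamma) * b,
        gamma + gamma_dot * T,
        gamma_dot,
        v,
    ])

# One prediction step: cyclist at the origin, heading along x,
# turning with 0.2 rad/s at 4 m/s, predicted 0.1 s ahead.
print(f(np.array([0.0, 0.0, 0.0, 0.2, 4.0]), T=0.1))
```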

Fig. 7 Comparison of static stereo camera tracking only (red triangles) with cooperative tracking including smart device data (green circles) under a temporal occlusion of 2 s. The blue squares show the ground truth trajectory

2.3 Pedestrian and Cyclist Tracking Including Class Probabilities

So far, we have discussed the perception of cyclists, which combines overlapping detections of a bicycle and a person, and the tracking of the cyclists themselves. The movement model used for the cyclist tracking in Sect. 2.2 describes movements by arcs and is therefore especially suitable for cyclists, but it is unstable in cases of sudden, non-smooth, or even backward-oriented changes in the movement direction, which are in the nature of pedestrian trajectories. Therefore, a linear constant velocity or acceleration model that is independent in the lateral and longitudinal directions is more suitable for a pedestrian. We have already mentioned that being able to track VRUs is the basis for further steps in our VRU intention detection approach. Besides the choice of the tracking algorithm, a tracker tends to perform best if it is applied to an object of the class it has been designed or trained for. In our case, we have the cyclist model with the state space \([x, y, \gamma , \dot{\gamma }, v]\) and the pedestrian model with the state space \([x, \dot{x}, y, \dot{y}]\). In the following, we also develop models to predict the behavior of cyclists and pedestrians; they depend on the knowledge of the VRU class, too. Therefore, we want to extend the aforementioned state spaces and the tracking described in Sect. 2.2 by an additional class probability functionality. There are two information sources for the class probability: the fit of the respective model to the observed movement behavior and the object class predictions of the NN classifier. The former is a problem studied in the literature in the field of multiple model approaches. The idea is to have a set of possible models, each of which is fed with the measurements, i.e., the detected object positions. In every step, a probability score evaluates how well each model fits the perception. We implement the individual model tracking with Kalman filters following Sect. 2.2: the bicycle model via an EKF and the pedestrian model via a two-dimensional constant velocity Kalman filter. The interacting multiple model (IMM) [15, 24] approach is popular, especially together with Kalman filters, and shows robust behavior with respect to model mismatch [49]. In addition to the individual model states, the IMM holds a common mixed state and covariance estimate that form the state of the IMM model. We name the state estimates at a given point in time for the bicycle and the pedestrian model \(x_b\) and \(x_p\), respectively. Every model is assigned a model probability \(\mu _b\) and \(\mu _p\). The IMM state estimate is given by

$$\begin{aligned} x_{\text {IMM}} := x_b \mu _b + x_p \mu _p. \end{aligned}$$
(2)

The covariance \(P_{\text {IMM}}\) is deduced analogously. To perform the prediction step of the IMM, mixed states \(\hat{x}_j := x_b \mu ^{b|j} + x_p \mu ^{p|j}\) are calculated for every model j with \(\mu ^{i|j} := \frac{1}{\psi ^j}\rho ^{i,j}\mu _i\) being the conditional model probability for model i given j, \(\psi ^j\) being a normalization factor, and \(\rho ^{i,j}\) being the respective entry in the state switching matrix \(\rho \). The state switching matrix adds to the stability of the IMM. Initially, the probabilities of staying in a state or switching states are set to 0.5. With the growing age of the track, the probabilities of staying in a state iteratively grow. At every prediction step, the mixed model states \(\hat{x}_b\) and \(\hat{x}_p\) are propagated, together with the covariances, as new states to the individual models. The prediction step is performed on the propagated states in the way defined by the individual models to obtain \(\tilde{x}_b\) and \(\tilde{x}_p\). The update step is performed on the individual models given the incoming measurements, i.e., person or cyclist detections. The residuals \(r_b\) and \(r_p\), given by the differences between the measurements and the predicted model states, define the model likelihoods \(\lambda _b\) and \(\lambda _p\). The likelihoods are the log of the probability density function of the zero-mean normal distribution with the covariance given by the innovation covariance matrices of the Kalman filters of models b and p. The likelihoods are used to update the model probabilities by

$$\begin{aligned} [\mu _b, \mu _p] = \frac{c \cdot [\lambda _b, \lambda _p]}{c [\lambda _b, \lambda _p]^T} \quad \text {with} \quad c := [\mu _b, \mu _p] \rho \end{aligned}$$

and ‘\(\cdot \)’ denoting a point-wise product. The IMM state \(x_{\text {IMM}}\) can again be calculated following formula (2). One adaptation has to be made with respect to the standard IMM algorithm described so far: the state spaces of the bicycle and the person model differ. The state space of the IMM is the union of the individual state spaces, thus \([x, \dot{x}, y, \dot{y}, \gamma , \dot{\gamma }, v]\). To make the IMM state and covariance compatible with the individual model ones during propagation and update, the individual model states have to be lifted to the IMM state space following [63].
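
The following sketch runs one IMM mixing and probability-update cycle on states that are assumed to be already lifted to the common state space; the numbers are placeholders, and covariance mixing is omitted for brevity:

```python
import numpy as np

def imm_cycle(states, mu, rho, likelihoods):
    """One IMM mixing and probability-update cycle for the two models (b, p).

    states      : (2, n) model states, already lifted to the common IMM space
    mu          : (2,) current model probabilities [mu_b, mu_p]
    rho         : (2, 2) state switching matrix
    likelihoods : (2,) model likelihoods [lambda_b, lambda_p]
    Covariance mixing is omitted; it uses the same conditional weights.
    """
    # Mixing: conditional probabilities mu^{i|j} and mixed states x_hat_j
    psi = mu @ rho                                  # normalization per model j
    mu_cond = (rho * mu[:, None]) / psi[None, :]    # entry (i, j) = mu^{i|j}
    mixed = mu_cond.T @ states                      # row j: sum_i mu^{i|j} x_i

    # Model probability update from the likelihoods
    c = mu @ rho
    mu_new = c * likelihoods
    mu_new /= mu_new.sum()

    # Combined IMM output, Eq. (2)
    x_imm = mu_new @ states
    return mixed, mu_new, x_imm

# Placeholder values in the 7-dimensional lifted space [x, xd, y, yd, gamma, gammad, v]
states = np.array([[2.0, 3.9, 3.0, 1.0, 0.1, 0.05, 4.0],
                   [2.1, 1.2, 2.9, 0.8, 0.0, 0.00, 1.4]])
mu = np.array([0.7, 0.3])
rho = np.array([[0.9, 0.1], [0.1, 0.9]])
print(imm_cycle(states, mu, rho, likelihoods=np.array([0.4, 0.6])))
```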

The standalone IMM tracker is able to classify pedestrians in 38 scenarios with a precision of 0.914 by its inherent model probabilities. Nevertheless, if a cyclist is waiting at traffic lights, for example, the bicycle model is unstable and does not fit the behavior very well due to small, rapid movements during impatient waiting. Even in the regular movements observed in our scenarios, cyclists did not follow the bicycle model closely enough for the IMM to classify them reliably: the average precision in 46 scenarios with moving cyclists is 0.335. In comparison to a pedestrian, a cyclist is still classifiable on average, as, from a frame-wise perspective, a true cyclist track holds more cyclist classifications than a pedestrian track. Still, for a standalone classification, one would expect more from a classifier. The reason might be that we track the head of the cyclist. Nevertheless, by additionally taking the detected class labels into account, the classification can be improved. The relative amount of assigned bicycle detections, measured by IoU with a person detection as described in Sect. 2.1.1, with respect to the age of the track provides a sufficient feature. The classification precision is 0.970 for the pedestrian scenarios and 0.969 for the cyclist ones.

Table 4 Comparison of cyclist tracking based on a bicycle model taking only cyclist detections into account with IMM tracking taking pedestrian and cyclist detections into account

The IMM tracker extends our setup by the ability to track two classes of VRUs simultaneously without having to decide at the level of the object detector output which measurement is assigned to which kind of tracker. In the case of the viewing angle dependent bicycle detection described in Sect. 2.1.1, the cyclist detections are unreliable; therefore, a standalone cyclist tracker receives only a few cyclist measurements. Table 4 depicts the tracking performance averaged over the 51 cyclist scenarios already evaluated in Sect. 2.1.1, comparing the bicycle model taking only cyclist detections into account with the IMM approach based on both pedestrian and cyclist detections but classifying a cyclist. The performance measures MOTP and MOTA are standard tracking metrics [4] measuring the precision and accuracy of the given track, respectively. From the small MOTA score, one can infer that the bicycle model is far less capable of tracking the object at all; this results from the missing detections. Because of the lower MOTP value, the IMM not only covers the object better but is additionally capable of giving a more precise estimate of the location due to the mixed-in pedestrian component.

2.4 Cooperative Context Determination

We have already mentioned the relevance of context in the field of object detection. Moreover, we have shown that cooperation in detection and tracking can overcome the limitations of singular sensor sources and extend the tracking ability. The sources of context information can be as varied as its types. In Sect. 2.1, the context information is based on external ground truth information that can be determined manually, as the dataset is relatively small and the scenery is fixed. This is not possible in general. Therefore, in this section, we take a look at how we can extend the generation of context information and gather it in a cooperative way to achieve more reliability and better coverage.

2.4.1 Cooperative Semantic Maps

A straightforward idea for extending the available information with extra knowledge is to use maps, in particular maps that are enriched with additional annotations. We call these semantic maps. Especially in the fields of prediction and motion planning, maps can help to rule out invalid or impossible paths. This will be discussed in more detail in the following, especially in Sect. 3.2.2. But the viewing angle dependent object detection evaluation in Sect. 2.1.1 also shows that knowledge of bike lanes with respect to the camera mounting positions and orientations contributes to a more accurate assessment of the expected detection performance.

Fig. 8 Semantic map fused from sensor information with enriched OSM maps. a OSM map with roads and buildings. b Enriched OSM map with parking areas added to the roads and buildings. c Map created from sensor data. d Sensor map fused with the enriched OSM map

We use a local map provided by the OpenStreetMap (OSM) [48] project to gather static map information. The amount of information and the accuracy vary depending on the contributions to the map pool by the community. In our case, houses and streets are contained, cf. Fig. 8a. Regarding additional annotations, high-definition (HD) maps hold precise and extensive information. Nevertheless, it is expensive to capture HD maps, and thus they are only available in certain areas. To extend our initial maps, we included information about sidewalks (yellow) and parking slots (brown) in the map visible in Fig. 8b. Whereas the bare OSM map can be retrieved automatically, the extensions are made with human interaction. Still, it is expected that maps like Fig. 8b will be available in the near future and already are in many locations. Another way to obtain maps is to use sensor information. Using LiDAR point clouds and image-based segmentation provided by a stereo camera, both mounted on our research vehicle, the map in Fig. 8c was created; more detail is given in Sect. 3.2.2. It is created and fused over multiple capturing drives. This output has the advantage that it can be captured automatically and can contain all the information the segmentation classifies. The disadvantages lie in a limited field of view and in the dependency on the accuracy of the ego-positioning of the vehicle and of the classification together with its association to the LiDAR point cloud. The latter can contain serious errors due to perspective issues. Moreover, it is difficult to create smooth and convex solutions; for example, in Fig. 8c, the holes at the edges of the pavement are visible. Therefore, it is beneficial to fuse the two information sources. Figure 8d shows the result: the houses are complete, and the edges of the pavement are sharper. The benefit of such a semantic map for movement prediction is shown in Sect. 3.2.2, and the fusion benefits the accuracy in a straightforward way.
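
As an illustration of such a fusion on rasterized semantic grids, the following sketch applies a simplified cell-wise rule (an assumption, not our exact method), letting curated OSM classes override the sensor map and filling remaining unknown cells:

```python
import numpy as np

# Hypothetical class ids for a rasterized semantic map
UNKNOWN, ROAD, SIDEWALK, BUILDING, PARKING = 0, 1, 2, 3, 4

def fuse_semantic_grids(osm_grid, sensor_grid, trusted=(BUILDING, SIDEWALK)):
    """Fuse an enriched OSM grid with a sensor-derived grid cell by cell.

    Simplified rule: classes from the curated OSM layer listed in `trusted`
    override the sensor map; everywhere else, the OSM map only fills cells
    the sensor map leaves unknown.
    """
    fused = sensor_grid.copy()
    override = np.isin(osm_grid, trusted)
    fused[override] = osm_grid[override]
    fill = (fused == UNKNOWN) & (osm_grid != UNKNOWN)
    fused[fill] = osm_grid[fill]
    return fused

# Toy 3x3 example: the sensor map has a hole that the OSM map closes.
osm = np.array([[BUILDING, ROAD, SIDEWALK],
                [BUILDING, ROAD, SIDEWALK],
                [UNKNOWN,  ROAD, PARKING]])
sensor = np.array([[UNKNOWN, ROAD, ROAD],
                   [BUILDING, ROAD, SIDEWALK],
                   [ROAD,    ROAD, PARKING]])
print(fuse_semantic_grids(osm, sensor))
```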

2.4.2 Cooperative Weather, Road, and Lighting Conditions

In Sect. 2.1.2, we show the effect that the brightness of the daylight has on the detection rate of objects. We extend this context information to weather and road conditions. The assumption is that not only the task of perception but also the behavior of traffic participants is affected, for example, by heavy rain or icy roads. We train our models to detect objects and predict their behavior under the assumption that they act the same way we have seen during training under similar conditions. It is crucial that these conditions have been covered in the training phase; otherwise, unpredictable behavior is the consequence. To support the description of conditions, context information might be useful. In this section, we describe what kinds of context information we consider interesting and how we detect them cooperatively. As already mentioned, it was not possible to conduct enough field studies to evaluate the influence of the specific context types on the performance of our algorithms in a statistically significant way. This is one topic of the current project KI Data Tooling [33], to which the partners of this project also contribute.

Table 5 Types of context with the sets of possible values

Table 5 lists the types of context and conditions we considered with the expected, i.e., labeled, values. Not all of them are contained in the dataset described in Sect. 1. Moreover, the list is not comprehensive and is being extended in the context of KI Data Tooling. As already mentioned in the evaluation in Sect. 2.1.2, the times of day ‘daytime’ and ‘dawn_dusk’ can be found in the research intersection dataset. To be able to detect the context types automatically, we trained a model for every type based on a ResNet-50V2 architecture [30]. The challenging parts are building a good training dataset and conducting consistent labeling. Without further knowledge, it is not easy for a human spectator to recognize, e.g., rain in images. Nevertheless, we labeled 28563 single images manually. The images originate from our own dataset, the University of Passau Weather in Autonomous Robotic Driving (UPWARD) dataset containing 15566 samples, and from the DENSE SeeingThroughFog dataset [13] providing another 12997 samples. We train and validate on 23028 samples (12630 UPWARD, 10398 DENSE) and test on 5535 images (2936 UPWARD, 2599 DENSE). To address the heavy class imbalance, we apply undersampling. This necessitates training a separate model independently for each attribute.
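
A sketch of the per-attribute setup is given below; the preprocessing, augmentation, and training schedule we actually used are not reproduced, and the undersampling and model head are simplified:

```python
import numpy as np
import tensorflow as tf

def undersample(labels, rng=np.random.default_rng(0)):
    """Indices that balance all classes down to the rarest class."""
    labels = np.asarray(labels)
    counts = np.bincount(labels)
    n_min = counts[counts > 0].min()
    idx = [rng.choice(np.flatnonzero(labels == c), n_min, replace=False)
           for c in np.unique(labels)]
    return np.concatenate(idx)

def build_context_model(num_classes, input_shape=(224, 224, 3)):
    """One ResNet-50V2 classifier per context attribute (e.g., road state).

    ImageNet initialization and the simple pooling/softmax head are
    illustrative choices, not the exact published configuration.
    """
    backbone = tf.keras.applications.ResNet50V2(
        include_top=False, weights="imagenet", input_shape=input_shape)
    x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(backbone.input, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```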

Although not all context classes are contained in the 50 scenarios used in Sect. 2.1—one scenario does not contain vehicle data and is therefore removed—we want to show the performance of the context detection models on them, because the setup allows us to evaluate the benefit of cooperation. We use the two cameras mounted at the research intersection and the camera mounted in the research vehicle as sensors. The mounting angle of the right intersection camera is such that reflections on the street lead to a ‘wet’ road context classification in every case. This is due to the fact that the training data was gathered from lower-mounted cameras. Consequently, the right camera also does not contribute to the fused result.

Table 6 Number of true context detections on images of 50 scenes from the research vehicle, the left and right intersection cameras, and a fused result

The evaluation results are shown in Table 6 as the number of correctly classified scenarios. To reduce the labeling effort, one ground truth label was created for every scenario. This might not be very accurate in the case of illumination, for example, as sun glare can be limited to a short time span while the rest of the scenario is not affected. This is also the reason why the vehicle performs much worse than the static cameras for the illumination context. The left camera is mounted in a way that allows a good detection performance with respect to the time_of_day. Precipitation is detected best by the vehicle.

Overall, we identified two major takeaways from the process of cooperative context determination. Firstly, it is difficult for a human spectator to determine consistent labels for the specified classes and to detect them properly. The granularity of the labeling is also a factor that has to be covered in more detail. Secondly, the fused result does not always give the top result, but it exceeds the vehicle's performance in almost every case. For a single car equipped with a camera, it is not possible, at least with the current state of our training data, to detect the defined context classes at an acceptable rate. Additional information sources are necessary.

3 Intention Detection

In this section, we describe our work in the field of VRU intention detection. The goal of intention detection is to create a basis for maneuver planning algorithms to be able to interact with VRUs safely. Therefore, we have to make a forecast about the VRUs’ future trajectory, including uncertainties. The main focus of our project is on cooperative intention detection. In the first step, we investigate methods for intention detection in a non-cooperative way using different sensor modalities and analyze their strengths and weaknesses. These investigations are described in this section. We then select suitable methods for use in a cooperative manner and explore how much improvement can be achieved through cooperation. The investigations regarding cooperative intention detection are described in the next section.

We define intention detection as a two-stage process comprised of basic movement detection and trajectory forecasting. A schematic of the process is depicted in Fig. 9.

Fig. 9 Schematic of the two-stage cyclist intention detection and trajectory forecasting model [5]: sensor data feeds the basic movement primitive detection and the trajectory forecast; the outputs serve situation prediction and maneuver planning

The first stage is basic movement detection to identify the VRUs' current state of motion, e.g., waiting or starting. As the results from basic movement detection alone do not allow us to make a statement about future VRU positions, the state estimations are used as an intermediate result within the intention detection process. Our goal is to demonstrate that the state estimations can help to improve the trajectory forecast results. Furthermore, we show that basic movement detection results can be significantly improved by incorporating video and pose information into the process. Additionally, we investigate using data from smart devices worn by VRUs as a basis for basic movement detection. Our methods for basic movement detection are described in Sect. 3.1.

The second stage of the intention detection process is trajectory forecast, which generates estimates of future VRU positions. The forecasted trajectories are the output of the intention detection process and form the basis for maneuver planning in automated vehicles. One of our goals is to include video information and basic movement detections in the forecasting process. Secondly, we aim to generate probabilistic trajectory forecasts to quantify the uncertainties of our estimates. To demonstrate the applicability of our methods, we combine our probabilistic forecasts with a maneuver planning method. Trajectory forecasting is described in Sect. 3.2.

3.1 Basic Movement Detection

Basic movement detection of VRUs has become an active field of research over the past decade. While many existing methods focus on specific scenarios or movement states, often placed in a lab environment with ideal conditions [32, 52], we aim to demonstrate holistic approaches covering all possible scenarios and states. We investigate the use of different sensor sources, i.e., stationary cameras mounted at an intersection, a stereo camera within a moving vehicle, and smart devices worn by the VRUs themselves (Sect. 1). Furthermore, we examine different representations of the VRU sensor data in the form of trajectories, human poses, or video sequences. In this section, we describe different methods for basic movement detection. In Sect. 3.2, we discuss methods to incorporate basic movement detections into the forecast process.

Basic Movement Detection Using Stationary Cameras

The use of stationary cameras for intention detection offers multiple advantages. Compared to sensors in a moving vehicle, stationary cameras can be mounted at higher positions and at an angle to each other to resolve occlusions and to reduce the uncertainties of single sensors. Stationary cameras also have the advantage that the environment is known and the background is static. Furthermore, since stationary systems are less limited by space and power consumption requirements than systems inside vehicles, more powerful systems with regard to their computing capabilities can be used. We exploit these advantages by incorporating video information into our basic movement detection.

Many existing methods use a single past VRU trajectory as input data for basic movement detection [1, 27]. However, compared to the original video feed from the sensors, a lot of information about the VRU behavior is lost, e.g., movements of the upper body may signal a starting motion, or the VRU's gaze direction can indicate a turning motion.

An approach to preserve information about the VRU's body gestures uses motion history images (MHI) [35]. To generate the MHI, the binarized silhouette of the VRU is extracted from every image. The silhouettes from the current image and from past images of a certain observation period are then stacked into a single image, where the most recent silhouette has the value 1.0, and older silhouettes receive smaller values between 1.0 and 0.0 according to their timestamps (Fig. 10). This creates an image that encodes the past movements of the VRU, which can then be used with a simple image classifier to perform basic movement detection. However, the method strongly depends on the quality of the extracted silhouettes. Also, a lot of information is lost through the binarization of the images.
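
A minimal sketch of the MHI construction is shown below; the linear decay of the weights over the observation period is an assumption for illustration:

```python
import numpy as np

def motion_history_image(silhouettes):
    """Stack binarized VRU silhouettes into a motion history image.

    silhouettes : list of (H, W) boolean masks, ordered from oldest to newest.
    The newest silhouette is weighted 1.0; older ones decay linearly towards
    0.0, so the resulting image encodes the recent movement of the VRU.
    """
    n = len(silhouettes)
    mhi = np.zeros(silhouettes[0].shape, dtype=np.float32)
    for i, mask in enumerate(silhouettes):      # oldest first
        weight = (i + 1) / n                    # newest gets 1.0
        mhi = np.where(mask, weight, mhi)       # newer silhouettes overwrite
    return mhi
```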

To increase the level of information, more recent approaches utilize human pose trajectories for basic movement detection. Instead of using a single trajectory from an anchor point, such as the center of the VRU’s head, multiple trajectories of the VRU’s joints are used. This way, important features such as distinct body poses or leg movements are preserved while greatly reducing the feature size compared to the original video stream. One disadvantage of the method is that it depends strongly on the quality of the pose detection. While larger joints can be detected relatively reliably, detecting smaller features, such as the eyes, which can be used to extract the gaze direction, proves difficult. Furthermore, information about the surroundings, such as road markings or obstacles in the VRU’s way, is lost.

Fig. 10 Exemplary MHI generation of a starting cyclist: cyclist detection, segmentation of bicycle and person, binarization and stacking of the segmented images into the motion history image

Fig. 11 Extraction of a video sequence from the original video feed of camera 1. In the VRU detection step, every VRU is detected, and a region of interest is created (left). In the second step, image (right, top) and optical flow (right, bottom) sequences are stacked and used as input for our model

Therefore, in our approach to basic movement detection with stationary cameras, we directly utilize video sequences. Figure 11 describes the extraction of video sequences from the original video feed. In the first step, the VRU is detected in the current image. The detection window is used to create a region of interest that covers the near vicinity of the VRU, which is used to extract images from the current time step and past time steps within the observation period. In our case, the past observation horizon covers one second. The extracted images are then stacked into a short video of the VRU moving inside the region of interest, which is used as input for a three-dimensional convolutional neural network (3D-CNN). In a preliminary investigation, where we focused on detecting starting motions of cyclists, we used these image sequences as the only input for the network [10]. However, more recent studies in the field of action recognition outside of intention detection in road traffic have shown that the use of an optical flow sequence in addition to the image sequence leads to significant improvements with regard to detection accuracy [18]. Therefore, we additionally use the optical flow sequence in our investigations. Furthermore, to reduce negative effects caused by occlusions, our movement detection is performed using inputs from both cameras of our wide-angle stereo camera system described in Sect. 1. We investigate the use of single cameras individually, of both cameras simultaneously, and of only image sequences or only optical flow sequences, respectively. Since the past VRU trajectory is known, we also examine its additional use as input data. For our investigations of stationary systems, we use the dataset created with the wide-angle stereo camera system described in Sect. 1. Our methods are compared to a single trajectory approach, as well as to an MHI-based approach. The feature extraction from the video sequences is performed using the network architecture proposed in [18]. To evaluate the results of individual time steps, we use standard classification metrics. To evaluate the detection results over time, we use the segment-based approach proposed in [7, 22], allowing us to rate the detection method in terms of how often a motion state is wrongly detected over time. We see this as an important metric, since wrong detections during a motion state can lead to a false trigger of an emergency brake assistant of an automated vehicle. A detailed description of the used algorithms and the conducted experiments can be found in [71].
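
The following sketch only illustrates the preparation of the stacked image and optical flow inputs; crop size, normalization, and the Farnebäck flow are illustrative choices, and the actual feature extraction follows the architecture from [18]:

```python
import cv2
import numpy as np

def extract_sequences(frames, roi, size=(112, 112)):
    """Build stacked image and optical flow sequences for one VRU.

    frames : list of grayscale frames covering the 1 s observation period
    roi    : (x, y, w, h) region of interest around the detected VRU
    Returns an image clip of shape (T, H, W) and a flow clip of shape
    (T-1, H, W, 2).
    """
    x, y, w, h = roi
    crops = [cv2.resize(f[y:y + h, x:x + w], size) for f in frames]
    images = np.stack(crops).astype(np.float32) / 255.0

    flows = []
    for prev, cur in zip(crops[:-1], crops[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, cur, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
    return images, np.stack(flows)
```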

Our experiments regarding the used input data show that the best results are achieved by using all inputs, i.e., image and optical flow sequences from both cameras and the past trajectory. However, only slightly worse results are achieved if we omit the trajectory input. If we compare the use of input data from both cameras to only one camera, we see a significant improvement from adding the second camera. This is partly due to the resolution of occlusions, but we also found that some motion states are better detected from a certain camera angle. For example, starting motions are better detected when the VRU is viewed from the side. We compared our motion sequence (MS) based method to the MHI- and trajectory-based methods and found that our approach outperforms both in terms of frame-based classification scores and segment scores. The inference time of the algorithm using an NVIDIA RTX 2080 Ti GPU is about 33 ms, so it can be used in a real-time system. The detailed results of our experiments can be found in [64, 71]. While our results show that our method outperforms existing approaches, we cannot make a statement about whether or not the improvements transfer to the use in trajectory forecasting methods. This aspect will be discussed in Sect. 3.2, where we investigate the use of basic movement detections to improve trajectory forecasts.

Basic Movement Detection from a Moving Vehicle

When we compare stationary intention detection to intention detection from within a moving vehicle, the requirements for the algorithms change significantly. Since the sensors are usually mounted behind the windshield of the car or behind the radiator grille, VRUs are often occluded by other vehicles or objects at the roadside. The consequence is that we often have a significantly shorter observation period to estimate future VRU behavior. Compared to stationary cameras, we do not know the surroundings of the cameras, and we have to deal with changing backgrounds. Furthermore, the vehicle cannot accommodate large PCs, and the power consumption is limited. Due to these requirements, we investigate the use of human pose trajectories for intention detection from within a moving vehicle and compare them to single trajectory approaches. The sparsity of the representation allows us to design lightweight models that achieve real-time capability despite limited resources. At the same time, we maintain a high level of information about the VRU behavior by capturing the trajectories of the body extremities.

In the first step, we evaluate the quality of human pose estimation from within a moving vehicle. While some datasets regarding 2D pose estimation exist, e.g., [2], they are not designed for research in traffic environments. The amount of data with annotated 3D poses is quite limited. Typically, such datasets are created in lab environments, e.g., [31], and do not include any cyclists. As a consequence, the recorded scenarios lack realism with regard to the behavior of the recorded people. Furthermore, there are too many dissimilarities compared to real-world traffic scenarios, such as the surroundings and occlusions of the VRUs. Therefore, we created a dataset recorded in real traffic. The human poses are labeled manually, and we extracted 2D and 3D poses. For the generation of reasonably good ground truth for the 3D poses, we use our wide-angle stereo camera system at the research intersection. Using this dataset, we evaluate two methods for human pose estimation. The first method detects 2D poses in an image. The second method uses 3D lifting to estimate 3D poses, which we transfer to the world coordinate system using a stereo camera. Our investigations show that both methods perform well and can be used as a basis for vehicle-based intention detection. The detailed results can be found in [38].

Fig. 12 Example scene of a starting cyclist recorded from within a moving vehicle. On the left, two images from the scene are shown: waiting in the first image and starting in the second. The starting motion is clearly visible in the poses extracted from the images (right): the cyclist's upper body is bent forward, and the foot pushes off the ground, a distinct motion indicating the starting process

Based on these results, we conduct experiments regarding the applicability of human pose trajectories for vehicle-based basic movement detection. In a preliminary investigation, we limit our traffic scenario to starting cyclists. An example scene is visualized in Fig. 12. The goal is to detect starting motions as early as possible while maintaining high detection scores. The method is compared to a single trajectory approach. The focus of the evaluation is on comparing the two approaches using different observation periods. As mentioned above, from the perspective of a moving vehicle, VRUs are often occluded, highlighting the importance of a method that works well for small observation times. In our experiments, we evaluate observation periods between 0.12 and 1.0 s. Both methods use the same model architecture, i.e., a fully connected network (FCN). Only the inputs differ: the input of the single trajectory model is the past head trajectory, whereas the pose-based model receives all joint trajectories. We find that both models show similar performance for input periods of 1.0 s. The results of the single trajectory model strongly deteriorate with smaller observation periods, while the pose-based model maintains significantly higher scores for all investigated periods. The investigations regarding observation periods for starting cyclists are published in [39].

Building on our findings on the observation period, we develop a holistic approach to pose-based basic movement detection for pedestrians and cyclists. Compared to the previous method, our investigation is not limited to a single scenario but includes all possibly occurring motion states. Furthermore, we switch our model architecture from an FCN to a recurrent neural network (RNN). The advantage is that RNNs are specifically designed to model time series and allow for variable input lengths. Our method is therefore able to estimate motion states despite short observation periods and successively improves with larger periods. As in the previous investigation, we compare our method to a single trajectory approach, where the pose-based method outperforms the single trajectory approach, especially for short observation periods. The evaluation can be found in [40].
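
To make the model choice concrete, the following minimal sketch shows a GRU-based classifier of the kind described above, mapping a variable-length sequence of 3D poses to motion-state probabilities. The joint count, hidden size, and set of states are illustrative assumptions and do not reflect the configuration used in [40].

```python
# Minimal sketch (not the project's actual model): a GRU maps a variable-length
# sequence of 3D poses to basic-movement probabilities.
import torch
import torch.nn as nn

class PoseMovementRNN(nn.Module):
    def __init__(self, num_joints=14, hidden_size=64, num_states=4):
        super().__init__()
        self.gru = nn.GRU(input_size=num_joints * 3, hidden_size=hidden_size,
                          batch_first=True)
        self.head = nn.Linear(hidden_size, num_states)

    def forward(self, poses, lengths):
        # poses: (batch, max_T, num_joints * 3); lengths: true observation lengths
        packed = nn.utils.rnn.pack_padded_sequence(
            poses, lengths, batch_first=True, enforce_sorted=False)
        _, h_n = self.gru(packed)              # h_n: (1, batch, hidden_size)
        return self.head(h_n.squeeze(0))       # logits over motion states

# Example: a batch with observation periods of 0.24 s and 1.0 s at 25 Hz (6 vs. 25 frames)
model = PoseMovementRNN()
batch = torch.zeros(2, 25, 14 * 3)
logits = model(batch, lengths=torch.tensor([6, 25]))
probs = torch.softmax(logits, dim=-1)
```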

Basic Movement Detection Using Smart Devices

In the previous sections, we described stationary and vehicle-based basic movement detection, where we used camera sensors in both cases. While both methods have different advantages and disadvantages, they both suffer from the shortcomings of camera sensors. Camera-based approaches depend on lighting and weather conditions and are affected by occlusions. In contrast, these conditions do not affect smart devices worn by the VRUs themselves. Therefore, smart devices have great potential to serve as additional sensor sources for intention detection.

Fig. 13
A schematic of the six-step detection process that includes smart device sensors, preprocessing, segmentation, feature extraction, feature selection, classification, postprocessing and probability calibration, and basic movements.

Process for basic movement detection using smart devices consisting of six stages: Preprocessing, segmentation, feature extraction, feature selection, classification, as well as post-processing and probability calibration [12]

In [6, 9, 10, 12, 58], we investigate how one can use the inertial sensors of smart devices for basic movement detection. The approaches presented in the different publications are based on human activity recognition involving a machine learning classifier at their core [17]. A schematic of the six-step detection process using accelerometer and gyroscope data is shown in Fig. 13. First, the inertial sensor measurements are preprocessed (i.e., the data is transformed into a rotationally invariant coordinate system), then the signal is windowed, and features are extracted based on the windowed data. Subsequently, feature selection is performed to filter the features relevant for detection. These filtered features are then used for detection. For this purpose, the detection problem is modeled as a classification problem. The classifier (e.g., an extreme gradient boosting classifier) is trained with labeled example data. Finally, a probability calibration of the detection probabilities output by the classifier is performed, and a temporal filter removes outliers. More details about the approach can be found in [7, 10, 12]. Regarding the early detection of cyclists’ starting movements, we showed that our approach reaches an F\(_1\) score of 67% within 0.33 s after the first movement of the bicycle wheel. Further investigations concerning the influence of the device wearing location show that for devices worn in the trouser pocket, the detector has fewer false detections and detects starting movements faster on average. Moreover, we found that we can improve the results by training distinct classifiers for different wearing locations. In this case, we reach an F\(_1\) score of 94% with a mean detection time of 0.34 s for the device worn in the trouser pocket.
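
The following Python sketch illustrates the flavor of this pipeline under simplifying assumptions: rotation invariance is approximated by using sensor magnitudes, only a handful of window statistics are extracted, and the window size, feature count, and class labels are made up for the example. The actual system is described in [7, 10, 12].

```python
# Illustrative sketch of the six-stage detection pipeline on synthetic data.
import numpy as np
from scipy.ndimage import median_filter
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

def window_features(acc, gyr, win=50, step=10):
    """Segment 3-axis accelerometer/gyroscope streams and extract simple per-window statistics."""
    feats = []
    for start in range(0, len(acc) - win + 1, step):
        a_mag = np.linalg.norm(acc[start:start + win], axis=1)  # magnitudes are rotation invariant
        g_mag = np.linalg.norm(gyr[start:start + win], axis=1)
        feats.append([a_mag.mean(), a_mag.std(), a_mag.max(),
                      g_mag.mean(), g_mag.std(), g_mag.max()])
    return np.asarray(feats)

# Feature selection, classification, and probability calibration in one pipeline
detector = make_pipeline(
    SelectKBest(f_classif, k=4),
    CalibratedClassifierCV(XGBClassifier(n_estimators=100), method="isotonic", cv=3),
)

# Toy data standing in for a labeled recording (1 = starting, 0 = waiting)
rng = np.random.default_rng(0)
acc, gyr = rng.normal(size=(500, 3)), rng.normal(size=(500, 3))
X = window_features(acc, gyr)
y = rng.integers(0, 2, size=len(X))
detector.fit(X, y)
p_start = detector.predict_proba(X)[:, 1]
p_smoothed = median_filter(p_start, size=5)   # temporal filtering of outliers
```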

Based on these findings, we investigate an extended smart-device-based approach to detect longitudinal (i.e., waiting, starting, moving, and stopping) and lateral (turning left, going straight, and turning right) basic movements. Smart devices can be used very well for the detection of longitudinal basic movements; our approach achieves a macro F\(_1\) score of 72% with an average detection time of only 0.36 s, i.e., on average a movement change is detected within 0.36 s. Curves or changes of the direction of movement (i.e., lateral basic movements) can be detected even more reliably (F\(_1\) score of 82%) and equally fast (mean detection time of 0.38 s). A detailed evaluation and further results can be found in [7]. In [16], we showed the successful transfer of our smart-device-based movement detection approach to the early anticipation of pedestrians’ movements. In [11], we went one step further and moved from movement transition detection to short-term forecasting of cyclists’ movement transitions.

3.2 Trajectory Forecasting

The goal of trajectory forecasting is to estimate future VRU positions. The forecasts build the basis for automated vehicles to safely interact with VRUs, where the forecast horizon, i.e., the time span for which the positions are estimated, depends on the application. In our case, the goal is to perform a short-term forecast for a horizon of 2.5 s, which is often cited as a relevant horizon for emergency braking maneuvers. To perform forecasting, we consider the VRU behavior we extract from video sequences. We avoid incorporating information about the traffic situation, such as traffic lights, since VRUs may disregard it, which can lead to potentially dangerous situations. In the next section, we describe our methods for deterministic trajectory forecasting, where the goal is to forecast the VRU positions in the form of points. Afterward, we describe our approaches to add uncertainty estimation to our methods.

3.2.1 Deterministic Trajectory Forecasting

To perform deterministic trajectory forecasts, we utilize methods similar to the ones used for basic movement detection described in Sect. 3.1. While the same network architectures can be used, the problem is modeled as a regression problem.

Deterministic Trajectory Forecasting Using Stationary Cameras

We investigated the incorporation of video information into trajectory forecasting using stationary cameras. We used the same representation as for basic movement detection, i.e., the image and optical flow sequences from both cameras and the past trajectories. We investigated the use of different inputs and compared the results to a method solely based on the past trajectory. In contrast to the results for basic movement detection, the results achieved using only the optical flow sequences from both cameras and the past trajectory outperform those achieved using all inputs. We attribute this to the fact that the optical flow sequences mainly contain information about the movement of the VRU, while excess information, such as the image background, is removed. The extraction of the optical flow is therefore comparable to an attention mechanism [62]. The positional accuracy is improved by 16.9% using the optical flow sequences compared to 8.2% when all inputs are used. We found that, compared to the trajectory-based method, turning motions are better distinguished from straight motions due to a distinct head movement of the VRU towards the turning direction visible in the optical flow sequence. The detailed results of our investigations can be found in [64, 72].

Deterministic Trajectory Forecasting from a Moving Vehicle

From within a moving vehicle, we utilize 3D poses for trajectory forecasting, similar to the basic movement detection described in the previous section. In our evaluation, we perform trajectory forecasts for both pedestrians and cyclists and compare the results to a single trajectory method [36, 42]. The focus of our investigation is again on the length of the observation period, where periods between 0.2 and 1.0 s are considered. Furthermore, we compare two different variants of the poses. One variant uses the joints of the entire body. In the second variant, the arms, i.e., the elbows and wrists, are not used as input. We hypothesize that the main features for forecasting the future trajectory are the orientation of the VRU and the head and leg motions. We found that using poses improves the forecast accuracy by up to 6.93% for pedestrians and 17.9% for cyclists. In both cases, no significant improvements were achieved by using the complete poses compared to the armless poses, indicating that the arm movements do not add additional information about the VRUs’ future positions. While, especially in the case of cyclists, this may seem counterintuitive since they are supposed to indicate turning motions by hand signals, we found that turns are seldom signaled. However, cyclists often perform a shoulder check before turning, demonstrating the importance of tracking head movements. In [43], we use RNNs for pose-based trajectory forecasting of pedestrians and cyclists based on observation periods varying between 0.04 and 1 s. The use of 3D poses improves forecasting accuracy, especially for short observation periods, compared to a single trajectory method.

As discussed in the previous section, basic movement detection aims to add additional information to the trajectory forecast process. Therefore, we developed a two-stage approach to incorporate basic movement detections into the forecasting and compared the results to a single-stage approach [27, 64]. Instead of a single forecast model, we train specialized models for different VRU movements, such as starting or waiting. The forecast is generated by performing a forecast for every motion state and weighting the results with the probabilities estimated by the basic movement detection. The methods for basic movement detection, as well as trajectory forecasting, are interchangeable. In our evaluation, we compare all possible combinations of the single trajectory and video-based methods. We found that the forecast accuracy can be significantly improved if the basic movement detection adds new information to the forecast. No improvements were achieved when the basic movement detection does not introduce new information to the model. Compared to the best video-based single-stage model, the best two-stage model did not improve the accuracy. This leads to the conclusion that, in the case of deterministic trajectory forecasting, incorporating basic movement detection is only helpful if new information is introduced by the detection method, e.g., through cooperation with smart devices. While this holds for deterministic forecasts, probabilistic forecasts are a different matter, which we discuss in the next section.
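
A minimal sketch of the weighting step may help to illustrate the two-stage idea; the motion states, dummy forecast models, and probabilities below are purely illustrative and not the project's actual models.

```python
# Two-stage idea: specialized per-state forecasts are combined as a weighted mean
# using the motion-state probabilities from basic movement detection.
import numpy as np

def two_stage_forecast(observation, specialized_models, state_probs):
    """observation: model input; specialized_models: dict state -> forecast function
    returning an (n_horizons, 2) array; state_probs: dict state -> probability."""
    forecast = np.zeros_like(next(iter(specialized_models.values()))(observation))
    for state, model in specialized_models.items():
        forecast += state_probs[state] * model(observation)
    return forecast  # weighted mean of per-state forecasts

# Toy usage with two dummy "models" (constant-velocity extrapolation vs. standstill)
horizons = np.arange(1, 6)[:, None]                     # 5 future time steps
models = {
    "moving":  lambda obs: obs["velocity"] * horizons,  # keeps the current velocity
    "waiting": lambda obs: np.zeros((5, 2)),            # stays in place
}
obs = {"velocity": np.array([1.0, 0.2])}
print(two_stage_forecast(obs, models, {"moving": 0.8, "waiting": 0.2}))
```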

Deterministic Trajectory Forecasting Using Smart Devices

Furthermore, we investigate the use of smart devices for trajectory forecasting. For this work, we focus on a single wearing position and consider a Samsung Galaxy S6 device placed in the trouser pocket. In this investigation, we do not use the two-stage intention detection process consisting of basic movement detection followed by trajectory forecasting. Instead, we focus on the realization of a trajectory forecasting module using the smart device sensors and examine the potential of this approach in principle. Since GNSS positioning is too inaccurate, we only forecast relative positions in the ego-frame. In doing so, we do not need absolute positioning information for forecasting. If we want to use the issued forecast with respect to a global coordinate system, we merely have to transform it back from the ego-frame. A possible use case would be, for example, that the smartphone issues a trajectory forecast in the ego-frame and transmits this forecast to an oncoming vehicle. The vehicle sees the cyclist and can determine the cyclist’s position. The vehicle can then use the cyclist’s position to transform the received forecast into a global or its local coordinate system. The advantage of forecasting the trajectories in the ego-frame is that the trajectory forecast is independent of the possibly large absolute positioning error of the GNSS receiver integrated in the smart device. Furthermore, this approach allows us to predict trajectories based on the inertial sensors only.

In the following case study, we investigate an approach to cyclist trajectory forecasting using only the smart device inertial sensors. We use a neural network for trajectory forecasting [57]. The forecasting time horizon is 2.5 s, and we have a lead time increment of 40 ms. Hence, the neural network has an output dimensionality of 126 (63 \( \times \) 2, i.e., one for the longitudinal and one for the lateral position). The preprocessing and feature extraction of our smart device-based forecasting approach are mostly analogous to the approach for basic movement detection, i.e., we use multiple statistical features computed from sliding windows of various sizes as input for the neural network. However, feature selection for trajectory forecasting is more difficult because we have not one output variable but two output variables for each forecasting lead time, i.e., 126 in total. Hence, we cannot transfer the feature selection method designed for classification tasks, i.e., basic movement detection, in a straightforward way. To solve this, our approach converts the multivariate regression task into a multi-class classification task. We first perform clustering in the output domain, i.e., in the 126-dimensional target space. We use the cluster assignments to discretize the output variables into a set of 100 target classes. In this way, we reduce the multivariate regression to a classification task and may apply feature selection methods for classification tasks. Note that we only use this modeling for feature selection. We apply two feature selection approaches, a filter based on the chi-squared statistic and a model-based approach using a gradient-boosting classifier. We take the union of the features selected by both methods. As before, the intuition behind combining two different feature selection methods is to obtain a diverse set of features. Subsequently, we train a neural network with these features. For this purpose, we first standardize the features.
We optimize the neural network on the ASAEE [26] using the Adam optimizer [34]. The hyperparameters of the neural network, i.e., the learning rate, the number of hidden layers, the number of neurons in the hidden layers, and the number of epochs for training are determined using Bayesian optimization [61]. We use exponential linear units (ELU) as the activation function [20].
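
The following sketch illustrates the feature selection trick with scikit-learn under stated assumptions: synthetic data, a reduced cluster count (the project uses 100 clusters), an additional min-max scaling so the chi-squared statistic receives non-negative inputs, and scikit-learn's gradient boosting as a stand-in for the classifier actually used.

```python
# Discretize the multivariate regression targets by clustering, then run two
# classification-based feature selectors and take the union of the selections.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 60))      # window features from the inertial sensors
Y = rng.normal(size=(400, 126))     # 63 lead times x 2 (longitudinal, lateral)

# 1) Turn the regression targets into class labels by clustering the target space.
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(Y)

# 2a) Filter selection: chi-squared statistic (requires non-negative features).
X_pos = MinMaxScaler().fit_transform(X)
chi2_idx = SelectKBest(chi2, k=15).fit(X_pos, labels).get_support(indices=True)

# 2b) Model-based selection: top features by gradient-boosting importance.
gb = GradientBoostingClassifier(n_estimators=50).fit(X, labels)
gb_idx = np.argsort(gb.feature_importances_)[-15:]

# 3) Union of both selections; these features feed the forecasting network.
selected = np.union1d(chi2_idx, gb_idx)
print(len(selected), "features selected")
```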

The results of our investigation are presented in Fig. 14. We compare the feature selection approach to a model that does not reduce the number of features. As we can see, the model that uses all features achieves better ASAEE scores across all movement types. Additionally, we compared the smart device-based model that uses all features to infrastructure- and vehicle-based trajectory forecasting approaches. We observed that the smart device-based approach has worse ASAEE scores for almost all movement types than the vehicle- or infrastructure-based approaches. However, there are a few exceptions, e.g., the ASAEE for starting movements is lower. Furthermore, the forecasting errors for turning, i.e., right and left, are comparable to those of the infrastructure-based approach. Here, the smart device-based approach performs better than the vehicle-based approach. We observe a similar result for moving cyclists. Besides, we also observe that the variance or interquartile range (IQR) of the ASAEE is usually noticeably greater for the smart device-based approach. This applies in both directions, i.e., in some cases, the smart device-based approach is considerably better but sometimes also notably worse. These results show that the smart device-based approach is not yet fully competitive with the vehicle- or infrastructure-based approaches. However, the smart device-based approach performs comparably or better in some cases.

Fig. 14
A table has 2 rows and 8 columns. The column headers are complete, waiting, starting, moving, stopping, straight, turning right, and turning left, for the 2 types of models, all features and selected features.

The table shows the ASAEE in m/s of the respective motion types. We consider two different smart device trajectory forecasting models: One model using all features and a second using only the features selected by the feature selection procedure

3.2.2 Probabilistic Trajectory Forecasting

Most existing methods for VRU trajectory forecasting create deterministic forecasts, i.e., estimates of the future VRU positions in the form of points (e.g., [27, 52]). Since these estimates are error-prone, methods to quantify their uncertainties are needed to create a basis for maneuver planning methods in automated vehicles. While there are a few existing approaches to model uncertainties of trajectory forecasts (e.g., [1, 50]), the authors’ focus is on the positional accuracy achieved by their methods. The estimated uncertainties are treated as byproducts, and no further evaluations are performed to rate the quality of the estimates. However, to use uncertainty estimates as the basis for safe maneuver planning, it is crucial to validate that the chosen methods create reliable outputs. Furthermore, the estimated uncertainties should be kept as small as possible.

Fig. 15
A gray-scale image of a road presents probabilistic forecasts. It estimates confidence regions for future time steps. It forecasts the regions with 68 percent and 95 percent probabilities where cyclists will reside at 1.5 seconds and 2.5 seconds.

Example of a probabilistic forecast of a cyclist trajectory for time steps 0.5 s, 1.5 s, and 2.5 s into the future. The inner regions (solid lines) describe that the cyclist will reside within the region with a probability of 68%, and the outer regions (dashed lines) with 95%, respectively. The red line describes the cyclist’s head trajectory over the past second

To achieve these goals, we perform probabilistic forecasts, where instead of single point estimates, we estimate confidence regions for future time steps. The regions describe an area where the VRU will reside within with a certain probability (see Fig. 15). We propose the use of three different approaches based on widely known techniques for uncertainty modeling. The first approach forecasts the parameters of probability distributions from which confidence regions can be created. The second approach extends quantile regression (QR) to multivariate targets, called quantile surfaces (QS). Both approaches are implemented using stationary cameras. The third approach is used within a moving vehicle and is based on occupancy grid maps. Furthermore, we compare standard metrics and propose novel approaches to rate the quality of our uncertainty estimates. Finally, we combine our approaches with a method for maneuver planning to demonstrate their applicability in the real world.

Probabilistic Trajectory Forecasting Using Stationary Cameras

A widely used method to add uncertainty quantification to the output of neural networks is to estimate the parameters of a probability distribution. Usually, Gaussian distributions are used. In the field of VRU trajectory forecasting, this method has been used to forecast bivariate Gaussian distributions in earlier work (e.g., [1, 50]). However, the focus of these articles is on the positional accuracy of the forecasts, i.e., only the means of the distributions are used for evaluation, and no statements about the quality of the uncertainty estimates are made. Therefore, we created a method that forecasts cyclist trajectories in the form of bivariate Gaussian distributions and evaluated the confidence regions generated from the estimated distributions with regard to their reliability [69]. We consider the regions to be reliable if the frequency with which the real position lies within the estimated region equals the probability of the region. For example, if we look at the 80% confidence region, the real position should fall into the region 80% of the time. Our evaluations using our real-world dataset [67] show that the method is not able to create reliable outputs. More precisely, the method produced underconfident probabilities, meaning that the regions’ probabilities are smaller than the percentage of real positions within the regions. This especially applies to waiting conditions, where an early forecast of the exact starting time is not possible, leading to the conclusion that VRU trajectories are inherently multimodal and cannot be modeled by a single Gaussian distribution.
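
For illustration, the following sketch shows how such a bivariate Gaussian output can be parameterized and trained with a negative log-likelihood loss; the parameterization and shapes are assumptions for the example and not the exact formulation of [69].

```python
# Sketch of a bivariate-Gaussian forecast head with a negative log-likelihood loss.
import torch
from torch.distributions import MultivariateNormal

def gaussian_from_params(params):
    """params: (..., 5) = (mu_x, mu_y, log_sigma_x, log_sigma_y, rho_raw)."""
    mu = params[..., :2]
    sigma = params[..., 2:4].exp()
    rho = torch.tanh(params[..., 4])
    cov_xy = rho * sigma[..., 0] * sigma[..., 1]
    cov = torch.stack([
        torch.stack([sigma[..., 0] ** 2, cov_xy], dim=-1),
        torch.stack([cov_xy, sigma[..., 1] ** 2], dim=-1),
    ], dim=-2)
    return MultivariateNormal(mu, covariance_matrix=cov)

# Toy check: NLL of the observed future positions under the forecast distributions;
# the confidence regions are the ellipses of the resulting Gaussians.
params = torch.randn(8, 5, requires_grad=True)   # a batch of 8 forecasts for one horizon
target = torch.randn(8, 2)                       # observed future positions
nll = -gaussian_from_params(params).log_prob(target).mean()
nll.backward()
print(float(nll))
```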

Fig. 16
Three-part image. A, 3 line graphs of motion state detection. The graphs are labeled move, left, and right. B, 3 contour plots present the forecasts of motion-specific Gaussian distributions. C, a contour plot of the Gaussian mixture.

Multi-modal forecasting pipeline: a Motion state detection with example probabilities for the states move, left, and right, with high probabilities for move and left. b Individual Gaussian forecasts for the same states for forecast horizons 0.5, 1.5, and 2.5 s. c Gaussian mixture generated by weighting the Gaussian forecasts with the motion state probabilities

To solve this problem, we developed a two-stage approach to forecast multimodal distributions, similar to the deterministic approach from the previous section [70]. The pipeline of our approach is visualized in Fig. 16. The first stage performs basic movement detection by creating a probability for every possible VRU motion state (e.g., starting or waiting). Simultaneously, a Gaussian distribution is forecasted for every motion state using the uni-modal model from [69], where we train one specialized model for each motion state. In the second stage, the motion state probabilities are used to weight the estimated density functions of the specialized models, leading to a Gaussian mixture distribution. For the detection of the current motion state, we investigate the use of the trajectory-based and image-based methods for basic movement detection described in the previous section [68, 71]. Compared to the deterministic two-stage approach, the probabilistic approach has a significant advantage. While in the deterministic case, only a weighted mean of the position is created, incorporating basic movements adds multimodality to the probabilistic approach. Every estimated mode represents a motion state of the VRU. For example, Fig. 15 shows a cyclist beginning to make a right turn. The basic movement detection outputs high probabilities for the motion states moving straight and turning right and low probabilities for the remaining states, leading to two dominant modes. Our evaluations show that incorporating both detection methods into the probabilistic forecasts leads to reliable uncertainty estimates, solving the problem caused by the uni-modal approach. As indicated by the results of the basic movement detection, the regions estimated using the video-based method achieve a better sharpness than those of the trajectory-based method. The 95% confidence regions estimated by the video-based method are on average 14% smaller compared to the trajectory-based method’s estimates, demonstrating that the gains from basic movement detection carry over to the probabilistic trajectory forecast process. While we evaluated the method using data from the stationary cameras, the method can also be applied in a moving vehicle since the method for basic movement detection and the architecture for trajectory forecasting are interchangeable.
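
The following toy example illustrates how such a mixture is formed from state probabilities and per-state Gaussians, and how reliability can be checked as the empirical coverage of a highest-density confidence region; all means, covariances, and state probabilities are made up for the example and do not correspond to [70].

```python
# Gaussian mixture over future positions, weighted by detected motion-state probabilities,
# with a Monte-Carlo coverage check of the 68% highest-density region.
import numpy as np
from scipy.stats import multivariate_normal

states = {
    "straight":   dict(w=0.55, mean=[4.0, 0.0], cov=[[0.6, 0.0], [0.0, 0.3]]),
    "turn_right": dict(w=0.40, mean=[2.5, -2.0], cov=[[0.5, 0.1], [0.1, 0.5]]),
    "waiting":    dict(w=0.05, mean=[0.0, 0.0], cov=[[0.1, 0.0], [0.0, 0.1]]),
}

def mixture_pdf(xy):
    return sum(s["w"] * multivariate_normal(s["mean"], s["cov"]).pdf(xy)
               for s in states.values())

# Derive the density threshold of the 68% highest-density region by sampling the mixture.
rng = np.random.default_rng(0)
samples = np.concatenate([
    rng.multivariate_normal(s["mean"], s["cov"], size=int(10000 * s["w"]))
    for s in states.values()])
threshold = np.quantile(mixture_pdf(samples), 1 - 0.68)

# Reliability check: the fraction of observed future positions inside the region should
# be close to 68%. Here the "observations" are drawn from the mixture itself.
observed = samples[rng.choice(len(samples), size=2000, replace=False)]
coverage = np.mean(mixture_pdf(observed) >= threshold)
print(f"empirical coverage of the 68% region: {coverage:.2f}")
```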

Our second method to forecast reliable confidence regions is based on QR. By extending the single output of QR to multivariate targets, we obtain QS [7], which serve the same purpose as the confidence regions created by the Gaussian mixture approach. The method consists of a two-stage model described in Fig. 17. The first stage performs deterministic point forecasting, followed by the probabilistic QS estimation that uses the point estimate as its center. The method is capable of producing star-shaped estimates. While the method is based on a uni-modal approach, the star shape of the estimated regions allows us to model the uncertainties of our forecasts reliably. In contrast to the Gaussian mixtures, the method is not able to estimate multiple separate regions for a single probability, possibly leading to larger regions. However, due to the two-stage approach, any existing forecasting method that produces deterministic outputs can be extended by a probabilistic output without requiring an additional detection of basic movements. This leads to a much leaner model with no need to train specialized models for every motion state, especially eliminating the need for time-consuming labeling of motion states.

Fig. 17
A schematic of a two-stage model, central tendency estimation, and quantile surface estimation. An example presents a pedestrian and a truck. The forecast of the pedestrian trajectories is marked. The longitudinal and lateral directions are marked.

Quantile surfaces forecasting pipeline: In the first step, the central tendency estimation is performed using a classic deterministic forecasting approach. In the second step, we pass the central tendencies together with the used input features to the quantile surface estimation, which generates the probabilistic outputs for different confidences

Probabilistic Trajectory Forecasting from a Moving Vehicle

An approach from within a moving vehicle is described in [37]. As in the deterministic forecast, the method utilizes 3D poses. Additionally, we incorporate semantic maps to represent the surroundings of the VRUs, allowing us to prevent implausible forecasts, such as a VRU moving through an obstacle. The maps are created using 3D positions from the LiDAR in combination with a semantic segmentation performed on images from a stereo camera and contain information about static obstacles, such as buildings, and dynamic obstacles, such as cars. Our forecast model is described in Fig. 18. The probabilistic forecast is performed in a discrete way in the form of occupancy grids. We forecast one grid for every forecast time horizon, centered at the current position of the respective VRU. Instead of a continuous probability distribution, we predict a probability for every cell within the grid. In our evaluation of the discrete method, we compare the use of only the head position to the complete pose, with and without the semantic maps. Furthermore, we compare the discrete method to the single Gaussian approach from [69]. We compare the reliability, sharpness, and positional accuracy of the models. The comparison of poses with the single trajectory approach shows that the positional accuracy is improved by 9.7% in the case of the Gaussian approach and by 7.2% for the discrete method. In both cases, reliability and sharpness are also improved by using poses. While the semantic maps lead to a slight improvement in accuracy, the improvements are more apparent when evaluated qualitatively, showing that fewer forecasts intersect with obstacles. Comparing the Gaussian method to the discrete method, we find that both have advantages and disadvantages. While the Gaussian model overall achieves a better reliability score, only the discrete method is able to model certain motion types, e.g., waiting, reliably due to its ability to model multimodality.
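
As a rough illustration of the grid-based output (grid size, cell size, and the number of horizons are assumptions, not the configuration of [37]), the sketch below turns per-cell logits into occupancy probabilities and queries the probability mass around the grid center.

```python
# Toy occupancy-grid forecast: one grid per horizon, softmax over cells.
import torch

grid_size, cell_m, horizons = 32, 0.25, 5                     # 8 m x 8 m grid, 5 horizons
logits = torch.randn(horizons, grid_size, grid_size)          # stand-in for the network output
probs = torch.softmax(logits.view(horizons, -1), dim=-1).view_as(logits)

# Probability mass within 1 m of the grid center at the last horizon
coords = (torch.arange(grid_size) - grid_size / 2 + 0.5) * cell_m
xx, yy = torch.meshgrid(coords, coords, indexing="ij")
mask = (xx ** 2 + yy ** 2) <= 1.0
print(probs[-1][mask].sum())
```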

Fig. 18
A schematic of the probabilistic forecast model. The past trajectory and 3D human poses representing the past motions of the VRU, together with a semantic map representing the surroundings, are fed to the model, which generates a probabilistic trajectory forecast in a discrete grid for each forecast time horizon.

Grid-based discrete probabilistic forecast: The model input consists of the past 3D pose trajectory of the VRU. Additionally, we use a semantic map representing the VRU’s surroundings. The model outputs grid maps for each forecasted time horizon containing probabilities for every grid cell describing the likelihood of the VRU occupying the respective cell at the forecast time horizon

3.2.3 Application in Planning Algorithms for Autonomous Vehicles

To investigate whether our methods can serve as a basis for maneuver planning, we conducted a case study of an autonomous vehicle overtaking a cyclist [59], with the goal of overtaking safely while maintaining a lateral safety distance of at least 1.5 m. We combine our probabilistic methods with a model predictive planning (MPP) approach to achieve this goal. We simulate overtaking maneuvers based on cyclist trajectories from our real-world dataset, leading to two different outcomes: either a successful overtaking maneuver is performed, or the vehicle stays behind the cyclist without overtaking due to larger uncertainties in the forecasted regions. While the second behavior is less desirable, it is considered safe. The MPP algorithm expects the estimated confidence region to have a convex hull in the form of a polygon. Since neither the multimodal nor the QR approach outputs a convex hull, we compare different approximation methods. We choose a method that approximates the region, per forecast horizon, by a single rectangle aligned with the VRU’s ego coordinate system. The rectangle shape is chosen to keep the computational load of the MPP small since every edge adds to the load. For safety reasons, the rectangle over-approximates the actual region. Comparing the forecast methods showed that both methods can estimate reliable confidence regions. The multimodal approach can estimate sharper regions compared to the QR approach, which becomes evident especially for larger forecast horizons. In our case study, the most desirable outcome is a successful overtaking of the cyclist. An example of a successful overtaking is displayed in Fig. 19. The less desirable yet acceptable behavior is for the vehicle to abort the overtaking maneuver and stay behind the cyclist until overtaking is possible. The second case mainly occurred for large confidence regions. None of our tests resulted in a collision. Our results show that our methods can be used as the basis for interaction between autonomous vehicles and VRUs and highlight the importance of reliable and sharp uncertainty estimates.
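
A minimal sketch of the rectangle over-approximation, assuming the confidence region is given as boundary points in world coordinates and the cyclist pose is known, is shown below; all values are illustrative.

```python
# Over-approximate a confidence region by a rectangle aligned with the VRU's ego frame.
import numpy as np

def ego_aligned_bounding_box(region_pts, vru_pos, heading):
    """region_pts: (N, 2) boundary points of a confidence region in world coordinates."""
    c, s = np.cos(heading), np.sin(heading)
    R = np.array([[c, -s], [s, c]])                 # ego -> world rotation (orthogonal)
    local = (region_pts - vru_pos) @ R              # world -> ego (row vectors)
    lo, hi = local.min(axis=0), local.max(axis=0)   # axis-aligned box in the ego frame
    corners_local = np.array([[lo[0], lo[1]], [hi[0], lo[1]],
                              [hi[0], hi[1]], [lo[0], hi[1]]])
    return corners_local @ R.T + vru_pos            # rectangle corners in world coordinates

# Example: an elliptical 95% region around a cyclist heading 30 degrees
t = np.linspace(0, 2 * np.pi, 100)
region = np.c_[2.0 * np.cos(t), 0.8 * np.sin(t)] + np.array([10.0, 5.0])
box = ego_aligned_bounding_box(region, vru_pos=np.array([10.0, 5.0]),
                               heading=np.radians(30))
print(box.round(2))
```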

Fig. 19
A schematic of an example of successful overtaking. Two rightward arrows at the top denote the direction of car motion. The rectangles represent the car positions. The illustration at the bottom represents successful overtaking.

Planned overtaking maneuver based on the forecasted confidence regions. The rectangles starting in the car lane represent the planned car positions. The rectangles on the bike lane are the forecasted cyclist regions. Future time steps are color coded so that the depicted boxes correspond to the same point in time

4 Cooperative Intention Detection

Up to now, we have focused on investigating intention detection using different sensor modalities independently. This helped us to gain an understanding of the different challenges of the individual modalities. The goal of our project, however, is cooperative intention detection. Therefore, the following section describes our methods to combine intention detection from stationary cameras, vehicles, and smart devices into one system in order to improve the intention detection results. Before we describe our methods for cooperative intention detection, we give a short interim summary of what we have learned about the strengths and weaknesses of the different sensor modalities used independently.

4.1 Interim Summary of Vehicle, Infrastructure, and Smart Device Based Intention Detection

In the previous section, we covered VRU intention detection methods. We especially showed that approaches for intention detection from within a moving vehicle face very different challenges than approaches using stationary cameras.

The process for moving vehicles is especially complicated due to occlusions from the viewpoint of the vehicle’s sensors caused by other vehicles or objects on the roadside. This is challenging since we often have only a short time frame within which we can observe the VRU’s behavior to estimate future behavior. Therefore, the focus of our investigation was on finding appropriate methods that allow us to take the short observation period into account, which we achieved by incorporating human 3D poses into the intention detection process. We were able to improve the results for both basic movement detection and trajectory forecasting, especially for short observation periods. By utilizing recurrent neural networks, we were able to consider observation periods of different lengths.

Compared to vehicle-based intention detection, stationary intention detection has many advantages. By mounting cameras at a higher elevation and using multiple cameras in a wide-angle stereo-camera system, we were able to resolve most occlusions. Furthermore, we are not as restricted in terms of space and power consumption as we are inside a vehicle, allowing us to use a dense representation of the surroundings as a basis for our intention detection algorithms. Therefore, we investigated the direct incorporation of video sequences into our methods, leading to significant improvements compared to existing methods for basic movement detection. While stationary intention detection solves many problems of vehicle-based intention detection, it is not feasible to equip every existing road with cameras. However, stationary systems can be installed at busy traffic junctions, where many occlusions and most accidents with VRU involvement occur.

Another possibility we investigated is the use of smart device sensors for intention detection. Smart devices are worn by the VRU directly and are therefore not affected by occlusion at all. However, compared to camera-based intention detection, we achieve far less accurate results due to sensor limitations. Therefore, we do not see smart device sensors as a feasible stand-alone solution for intention detection. However, we think that a combination of smart devices and vehicle-based intention detection can be used to improve the overall results.

4.2 Cyclists as Additional Sensors

Nowadays, almost everyone carries a smart device in the form of a smartphone, smartwatch, or similar while taking part in traffic. Accordingly, we examine the use of smartphones and other wearable devices for the task of intention detection of vulnerable road users. These devices are equipped with a great variety of sensors, e.g., inertial measurement units or GNSS receivers. Most smartphones are permanently online; they share their location or send accelerometer profiles to the servers of fitness application providers for further analysis. Essential for this are communication technologies such as UMTS, 4G, and 5G or, in the future, 6G, which allow us to send and receive large amounts of data within a few milliseconds. In 2010, David and Flach [21] proposed using smartphones for advanced pedestrian protection, i.e., as a sort of wireless safety belt. Many studies are investigating the usage of smartphones and other wearables for pedestrians in cooperative intelligent transport systems (C-ITS) [60]. However, cyclists have gained little attention. In contrast to vision-based approaches, smart devices also enable reliable intention detection in cases of occlusion. The position and the detected intentions, e.g., of crossing cyclists appearing from an occlusion, can then be communicated between approaching traffic participants using modern means of communication (such as 5G, V2V). In our work, the focus of our experimentation was the utilization of smart devices worn by cyclists for intention detection. We investigate various aspects, including smart device-based positioning as well as the influence of the wearing location of the smart devices [6]. We propose a novel approach for robust and yet fast basic movement detection using only the smart device inertial sensors [12]. We investigate the usage of smart devices for cyclist trajectory forecasting [7]. Moreover, we propose a novel cyclist ad-hoc network involving the usage of multiple cooperating smart devices (e.g., smartphone, smartwatch, or sensor-equipped helmets) for intention detection at the same time [7, 22]. The main challenges of cooperative intention detection for cyclists are:

  1. The localization of the cyclist [7]

  2. The detection of the cyclist and their intention [7, 58]

  3. The forecasting of the cyclist's trajectory (probabilistically) [7]

  4. The incorporation of multiple smart devices [7, 22].

4.3 Smart Device Cooperation for Intention Detection

Instead of a single smart device, people will carry many devices in the future, e.g., a smartphone, smartwatch, and smart helmet. Smartwatches, for example, are already widely used today. Additionally, these may also include clothes containing sensors or helmets equipped with sensors, i.e., smart helmets. It is also likely that future bicycle generations will be equipped with intelligent assistance systems, sensors (e.g., cameras, Lidar, or Radar), and V2X communication capabilities [14]. All of these smart devices can potentially be used to anticipate cyclists’ movements, to communicate them (e.g., to an oncoming vehicle), and thereby make an important contribution to improving cyclists’ safety. The smart devices described previously measure different aspects of the cyclist’s movement due to their different wearing locations or sensor types. If these devices are connected, for example, using a kind of wireless body area network (BAN) [45] for cyclists, then the smart devices can exchange information. This information can be fused, refined, and subsequently used for cyclist intention detection. An example of this is depicted in Fig. 2a. The smart devices can communicate with each other, e.g., via Bluetooth, and the smartphone might provide communication abilities with cloud services.

The worn devices provide both redundant and complementary information. The smart helmet and the smartwatch, for example, might have a better GNSS signal due to their wearing location, so their information should preferably be used for positioning. The smartphone, which is located, for example, in the cyclist’s trouser pocket, can give information about the pedaling frequency. If we combine these two pieces of information, we can, for example, improve the positioning or the forecasting of the future trajectory. We can fuse the communicated information either in a centralized manner (e.g., on the smartphone) or in a decentralized fashion (e.g., on each device itself). This allows safe handling of a user’s data with regard to privacy. Still, the information could also be processed non-locally on a remote server through a secure cloud connection should the computational requirements exceed the capabilities of the smart devices or simply to save battery power.

In the following, we present two case studies to demonstrate the potential of a body area network incorporating the usage of multiple smart devices for a cyclist’s movement anticipation. In the first case study, we investigated the use of a helmet equipped with sensors, i.e., a smart helmet. However, because off-the-shelf and ready-to-use smart helmets are not yet commonly available, we utilize a smartwatch attached to the cyclist’s helmet. In the second case study, we investigate the use of multiple smart devices for longitudinal basic movement detection.

4.3.1 Combining a Smart Helmet with a Smartphone for Improved Orientation Estimation

In this section, we investigate the possibility of using a smart helmet as an additional device connected to a smartphone. In our investigations concerning GNSS-based position, velocity, and orientation estimation, we found that especially the device placed on the helmet provides excellent velocity and orientation measurements. However, its sampling rate of 1 Hz is far too low for our intended applications, e.g., basic movement detection. Therefore, we present an approach combining inertial sensor measurements with GNSS measurements. In this case study, we combine the GNSS measurements from the smart helmet with the inertial sensors of a smartphone carried in the trouser pocket. The utilized data comprises 48 test subjects and 257 trajectories. An implementation of our approach could be that the smart helmet sends its current GNSS measurement via Bluetooth to the cyclist’s smartphone. On the smartphone, the GNSS data is then combined with the smartphone’s inertial sensor data to obtain an improved velocity or orientation estimate. For the orientation estimation, we use a Kalman filter running on the smartphone. The velocity estimation based on the combination of GNSS and inertial sensor data was much more difficult. However, we achieved very good results using machine learning models and HAR techniques.
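
As an illustration of this kind of fusion, the following simplified one-dimensional Kalman filter predicts the heading at the gyroscope rate and corrects it with the 1 Hz GNSS course; the noise parameters, rates, and synthetic data are assumptions and do not reflect the tuned filter of the project.

```python
# Simplified heading Kalman filter: gyroscope prediction, 1 Hz GNSS correction.
import numpy as np

def wrap(angle):
    return (angle + np.pi) % (2 * np.pi) - np.pi

def fuse_heading(gyro_z, gnss_course, dt=0.01, gnss_every=100, q=1e-4, r=0.05):
    """gyro_z: yaw rates [rad/s] at 1/dt Hz; gnss_course: headings [rad] at 1 Hz."""
    x, p = gnss_course[0], 1.0                     # initialize from the first GNSS fix
    headings = []
    for k, omega in enumerate(gyro_z):
        x, p = wrap(x + omega * dt), p + q         # predict with the gyroscope
        if k % gnss_every == 0 and k // gnss_every < len(gnss_course):
            z = gnss_course[k // gnss_every]
            kgain = p / (p + r)
            x, p = wrap(x + kgain * wrap(z - x)), (1 - kgain) * p   # GNSS update
        headings.append(x)
    return np.array(headings)

# Toy data: constant turn of 0.2 rad/s observed by noisy sensors
rng = np.random.default_rng(0)
true = 0.2 * np.arange(0, 10, 0.01)
gyro = 0.2 + rng.normal(0, 0.02, size=len(true))
gnss = true[::100] + rng.normal(0, 0.1, size=10)
est = fuse_heading(gyro, gnss)
print(np.abs(wrap(est - true)).mean())
```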

Fig. 20
Two clustered bar graphs. Graph A plots trouser pocket and helmet and only helmet R square scores for all motion types, moving, and waiting. Graph B plots trouser pocket and helmet and only helmet R M S E for all motion types, moving, and waiting.

Performance of the cyclist’s orientation estimation using a smartphone in the trouser pocket and a smart helmet, as well as the smart helmet only. The evaluation was carried out for different velocity ranges, i.e., all of the available data, moving faster than 0.5 m/s, and slower than 0.5 m/s

First, we present the results of the orientation estimation involving a smartphone and a smart helmet. We examine the following device combinations: first, GNSS from the smart helmet with gyroscope data from the smartphone in the trouser pocket and, second, only the smart helmet (i.e., GNSS and gyroscope data from the helmet). The experiments are conducted offline with real data. We do not consider any communication delays, as these are small compared to the delay of the GNSS measurement. We tune the hyperparameters of the Kalman filter, i.e., the process and measurement noise, using a grid search. We depict the results of our investigation in Fig. 20. The fusion of the GNSS measurements obtained from the smart helmet and the gyroscope measurements of the smartphone in the trouser pocket greatly improves the orientation estimation. Furthermore, we observe that the orientation estimation based on the smart helmet and smartphone performs differently well depending on the velocity. This can be explained by the fact that the cyclist might look around at low velocities, e.g., when waiting at a traffic light, which can be mistaken for a change in the orientation of the bicycle. The smartphone is less prone to such misinterpretations when it is kept in the trouser pocket. Nevertheless, the orientation of the smart helmet is a very helpful source of information for predicting the intended cycling direction.

4.3.2 Inter-Device Cooperation for Basic Longitudinal Movement Detection

In this section, we present a case study for longitudinal basic movement detection using multiple smart devices. For this purpose, we consider the smartphone carried in the trouser pocket, the smartwatch at the wrist, and the device at the helmet. The results of the case study presented in the following have been published in [22]. In this case study, we restrict ourselves to data originating from the inertial sensors, i.e., we do not consider GNSS measurements. For comparison, we train classifiers for each of the three considered devices. These are our baseline models. For classification, we apply XGBoost classifiers [19], followed by an isotonic regression for probability calibration [47]. To assess the trade-off between robustness and detection time, we consider Pareto fronts. Therefore, we evaluate different hyperparameters using a randomized search with 250 trials. In this respect, we apply ten-fold cross-validation over the test subjects.

For fusion, we consider three different methods to combine the measurements of the three smart devices: (a) fusion of the feature spaces of all devices (feature stacking); (b) fusion at the decision level of the basic movement detections (classifier stacking); (c) a hybrid approach combining the fusion of the feature spaces and the decision-level fusion. In the case under consideration, we assume that the fusion of the measurements and predictions of the smart devices is performed in a centralized manner on the smartphone. The choice of the smartphone as the point of fusion is based on the premise that today’s smartphones have the necessary computing power, enabling more complex calculations to be performed there. However, this is only an example; the fusion could also be carried out on any other device. In this case study, we do not consider communication delays, i.e., we assume that the communication delay between the devices is negligible. As the devices are all worn at different locations on the body, they also measure different aspects of the motions performed by the cyclist. To prevent a loss of information, we decided against fusing the individual features (e.g., by averaging) and instead stack the feature spaces. We reduce the dimensionality of this feature space by applying a two-stage feature selection procedure. Based on the selected features, we then train a classifier to detect the longitudinal basic movements.

The fusion at the decision level is based on the trained classifiers of the individual smart devices. For each smart device, we train a dedicated classifier. These are referred to as base classifiers. Their outputs (i.e., the predicted probabilities of the individual classes) constitute a new feature space. Subsequently, we train a new classifier based on this feature space. In the literature, this approach is also known as classifier stacking or stacking ensemble [73]. We obtain the predictions of the base classifiers used for training the stacked classifier using cross-validation.
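
A compact way to sketch this decision-level fusion is scikit-learn's StackingClassifier, which also obtains the base predictions via cross-validation; the per-device feature columns, classifier choices, and data below are illustrative assumptions rather than the project's configuration.

```python
# Decision-level fusion sketch: one base classifier per device, stacked via cross-validated
# predicted probabilities and a meta classifier.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

# Assumed layout of the stacked window features: columns 0-9 smartphone,
# 10-19 smartwatch, 20-29 helmet device (purely illustrative).
device_columns = {"phone": list(range(0, 10)),
                  "watch": list(range(10, 20)),
                  "helmet": list(range(20, 30))}

base = [(name, make_pipeline(ColumnTransformer([("sel", "passthrough", cols)]),
                             XGBClassifier(n_estimators=50)))
        for name, cols in device_columns.items()]

# Base predictions are obtained via cross-validation (ten-fold over subjects in the
# project; here a plain 5-fold split) and fed to a meta classifier.
fusion = StackingClassifier(estimators=base, final_estimator=LogisticRegression(),
                            stack_method="predict_proba", cv=5)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 30))
y = rng.integers(0, 4, size=300)        # four longitudinal basic movements
fusion.fit(X, y)
print(fusion.predict_proba(X[:3]).round(2))
```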

The third and last approach is a hybrid approach. This hybrid approach uses the stacked feature space of all smart devices and, additionally, the predicted probabilities, as described before. The feature space is again reduced by applying the two-stage feature selection procedure.

Overall, when first evaluating the individual smart devices, we observed that the smart helmet performs rather poorly in terms of the considered scores. Altogether, we can conclude that the classifiers based on the data from the smartwatch worn at the wrist provide the best detection results. The results of longitudinal basic movement detection using multiple cooperating smart devices indicate that the combination of data originating from multiple smart devices leads to both faster and more robust longitudinal basic movement detection. However, the results also show that the different fusion paradigms yield considerably different results in some cases. The decision-level fusion classifiers using multiple devices have smaller detection delays than the other approaches. Their detection delays range from 0.194 to 0.38 s. The hybrid approach achieves detection delays between 0.24 and 0.72 s. Thus, the hybrid approach is slower in terms of detection speed but reaches higher scores. The feature stacking approach usually performs slightly worse than the hybrid approach, both in terms of detection delay and score. Further detailed consideration and an extensive evaluation regarding the use of multiple smart devices for basic movement detection are provided in the work of Depping [22].

4.4 Cooperative Basic Movement Detection

Another approach concerns the use of cooperation to improve basic movement detection. These cooperatively determined basic movements can then be used for trajectory forecasting, i.e., for the parameterization of the forecasting models. In this regard, we examine different approaches:

Stacking of Feature Spaces: In feature space stacking, we assume that the agents exchange preprocessed features with each other. These features originating from different sensors are combined and used for basic movement detection. We realize fusion by concatenating the feature spaces of different sensors. This is, for example, the concatenation of orthogonal expansion coefficients (describing the past cyclist’s trajectory) with Fourier coefficients (describing the acceleration profile derived from the smart device inertial sensors).

Stacking Ensemble: In the stacking ensemble fusion methodology, we fuse basic movement predictions employing a machine learning ensemble. These basic movement predictions, which originate from the basic movement detection models of other agents, are combined using a dedicated machine learning model. The combination of a stacking ensemble and a stacking of feature spaces is referred to as a hybrid model.

Probabilistic Fusion: Another method that we examine for cooperative basic movement detection is the independent likelihood pool (ILP) fusion. This is a probabilistic fusion technique (similar to the Bayes filter) which is based on the assumption that the measurements of the sensors are independent of each other given the current state. It combines basic movement predictions originating from different agents; a minimal sketch of this fusion rule is given after this list.

Coopetitive Soft-Gating Ensemble (CSGE): The Coopetitive Soft Gating Ensemble (CSGE) [25] is an ensemble technique that is used to fuse forecasts of different base models. The CSGE has three different weighting aspects, i.e., global-, local-, and time-dependent, which are used to compute an overall weight for each ensemble member. We modified the original CSGE to cope with the special requirements of the task at hand, i.e., handling delayed or missing predictions.

Orthogonal Polynomials: This approach is a classifier fitted on the cooperatively acquired orthogonal expansion coefficients.
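
The following minimal sketch illustrates the independence-based fusion rule referred to in the probabilistic fusion item above, using a uniform prior; the agent probabilities are made-up example values.

```python
# ILP-style fusion: combine per-agent state probabilities multiplicatively and renormalize.
import numpy as np

def ilp_fuse(agent_probs, prior=None):
    """agent_probs: (n_agents, n_states) array of per-agent state probabilities."""
    agent_probs = np.asarray(agent_probs, dtype=float)
    n_states = agent_probs.shape[1]
    prior = np.full(n_states, 1.0 / n_states) if prior is None else np.asarray(prior)
    # p(state | z_1..z_n) is proportional to prior * prod_i [ p(state | z_i) / prior ]
    fused = prior * np.prod(agent_probs / prior, axis=0)
    return fused / fused.sum()

# Vehicle, infrastructure, and smart device each report probabilities for
# (waiting, starting, moving, stopping):
agents = [[0.10, 0.60, 0.25, 0.05],   # vehicle
          [0.05, 0.70, 0.20, 0.05],   # infrastructure
          [0.20, 0.40, 0.30, 0.10]]   # smart device
print(ilp_fuse(agents).round(3))
```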

We used real data acquired at the urban research intersection in Aschaffenburg to evaluate and compare the different approaches. In this context, we examine the cooperation among three agents: a research vehicle, a sensor-equipped infrastructure, and a cyclist, i.e., a smart device carried in the cyclist’s trouser pocket. We evaluate the results of the cooperative approaches from the perspective of a non-cooperatively acting ego vehicle. The vehicle-based approach is, therefore, our baseline against which we compare the cooperative approaches. We observe that almost all fusion methods outperform the baseline for almost all considered agent configurations.

Fig. 21
A clustered bar graph presents the F 1 scores of C S G E, orthogonal polynomial, feature stacking, I L P, stacking ensemble, and hybrid for vehicle, infra, and smart device; vehicle and smart device; infra and smart device; and vehicle and infrastructure.

Micro average F\(_1\) scores of cooperative longitudinal basic movement detection for different agent configurations. The colored bars represent different fusion types. The baseline, i.e., the ego vehicle only, is given by the black, dashed line [7]

As we can deduce from Fig. 21, cooperation is nearly always advantageous. Especially remarkable is the performance of the ILP approach. This method is almost parameter-free and performs better than or at least as well as other methods with significantly more parameters. The CSGE shows the most significant improvement of up to 30% compared to the baseline. Hence, we can increase the F\(_1\) score for basic movement detection significantly through cooperation. Not only does the detection performance improve, but the mean detection time also improves by up to 30% [7]. In addition, it is important to note that cooperative basic movement detection is currently the only cooperation method that effectively allows the integration of smartphones. Although this is also possible with the other methods, the use of the smartphone position often has a negative effect on the fusion result due to the poorer position estimation. The practical implementation of the cooperation techniques with current communication protocols is possible. Still, depending on the type of cooperation, it is not as straightforward to realize as with the probabilistic trajectory fusion technique.

4.5 Cooperative Trajectory Forecasting Using the CSGE

In this section, we outline an approach for cooperative cyclist trajectory forecasting using the CSGE. The underlying idea is that agents share predictions about their future trajectory. The trajectory forecasts are then combined using the CSGE. The approach described in this section fuses deterministic trajectory forecasts. The fused forecast is the starting point of the probabilistic trajectory forecast. We look at the fusion from the perspective of an ego vehicle, i.e., the fusion is conducted on the vehicle. The approach can be considered as decision-level fusion. From the perspective of the sensor configuration, the approach can be classified as competitive fusion. The CSGE has three parameters, i.e., the soft gating parameters, which determine the weights of the individual ensemble members according to three influencing factors. We use the ASAEE as the target function to optimize these parameters of the CSGE. Moreover, we assume that the ensemble members are already trained. We also assume that there is a dataset not yet used for training the ensemble members, which can be used for the CSGE training. We use ten-fold cross-validation to create this ensemble training dataset. The agents share their trajectory forecasts in the cyclist’s ego-frame. The usage of this coordinate system has the advantage that errors in the absolute positioning (e.g., in the global coordinate system) of the respective agent do not influence the actual trajectory forecast. This allows us to include trajectory forecasts of agents with poor absolute positioning. This is the case, for example, with smart devices, whose absolute positioning is not comparable to that of modern infrastructure- or vehicle-based approaches. Nevertheless, smart device-based trajectory forecasts can be helpful in some situations, e.g., when the field of view of the infrastructure or vehicle cameras is occluded. The CSGE natively supports the outage of a sensor or ensemble member. Similar to the CSGE approach for cooperative basic movement detection, we only have to re-compute the respective weights. The introduction of a new ensemble member can be handled similarly. The prerequisite for this is that the corresponding error estimates, i.e., global, local, and lead-time-dependent errors, are available. However, in both cases (i.e., outage and introduction of a new ensemble member), we cannot guarantee that the soft gating parameters are still optimal.
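
To give an intuition of the weighting, the following simplified sketch derives per-member weights from global, local, and lead-time-dependent error estimates via a soft gating exponent and fuses the member forecasts. It is a simplification of the CSGE of [25], and all error values, exponents, and forecasts are illustrative.

```python
# Simplified soft-gating fusion of ensemble-member trajectory forecasts.
import numpy as np

def soft_gate(errors, eta):
    """Map errors to normalized weights; larger eta -> stronger preference for low error."""
    w = (1.0 / (np.asarray(errors, dtype=float) + 1e-9)) ** eta
    return w / w.sum(axis=0)

# Estimated ASAEE-like errors for three members: vehicle, infrastructure, smart device
global_err = [0.45, 0.30, 0.60]                        # one value per member
local_err = [0.50, 0.25, 0.55]                         # errors for the current situation
lead_err = np.array([[0.3, 0.4, 0.5],                  # rows: members,
                     [0.2, 0.3, 0.4],                  # columns: 3 lead times
                     [0.5, 0.6, 0.8]])

w = (soft_gate(global_err, eta=2)[:, None]
     * soft_gate(local_err, eta=2)[:, None]
     * soft_gate(lead_err, eta=2))
w /= w.sum(axis=0)                                     # overall weight per member and lead time

forecasts = np.stack([np.full((3, 2), v) for v in (1.0, 1.2, 0.8)])  # (member, lead, xy)
fused = (w[:, :, None] * forecasts).sum(axis=0)
print(w.round(2), fused.round(2), sep="\n")
```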

4.5.1 Extending the CSGE for the Fusion of Delayed Trajectory Forecasts

Additionally, we proposed an extension of the CSGE for the fusion of trajectory forecasts that allows the integration of delayed trajectory forecasts. We investigate the extension in a case study considering the fusion of vehicle- and infrastructure-based trajectory forecasts. The fundamental idea of our modeling is analogous to the one used to integrate time-delayed basic movement predictions. The provider of the forecast always provides an estimate of the forecast quality, i.e., the expected error. The receiver uses this as a starting point and tries to model the increased expected error due to the delay. We distinguish three different types of expected errors, i.e., global, local, and lead-time-dependent errors. The challenge we face with delayed forecasts is that the cyclist’s ego-frame changes over time. This offset is not only temporal but also spatial, i.e., a simple temporal shift of the forecast is not sufficient. In addition to the temporal shifting, we must also simultaneously translate and rotate the cyclist’s ego-frame. Hence, we cannot simply compare and fuse the trajectory forecasts of two agents without first considering the temporal and spatial alignment of the ego-coordinate frames. We have two possibilities for this purpose. First, the vehicle itself can estimate the change of the cyclist’s ego-frame, i.e., the translation and the rotation, and apply these to the received trajectory forecast. For this purpose, the vehicle must estimate the current and the past (i.e., at the time of the creation of the trajectory forecast) position and orientation of the cyclist. Subsequently, the vehicle can use these estimates to determine the translational and rotational shift. The second possibility we investigated is to use the trajectory forecast itself to estimate the change of the cyclist’s ego-frame and then use this estimation to translate and rotate the forecast accordingly. This method has the advantage that we can fuse two trajectory forecasts even if we cannot exactly reconstruct the past position and orientation at the time of the creation of the trajectory forecast. By artificially shifting the forecasting origin, our maximum lead time changes as well. We compensate for this by extrapolating the forecast based on its local trend and then padding it again.
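
A hedged sketch of the first alignment possibility is given below: the receiver re-expresses a delayed ego-frame forecast in the current ego-frame, drops the already elapsed lead times, and pads the horizon by extrapolating the local trend. All poses, delays, and step sizes are illustrative assumptions.

```python
# Align a delayed ego-frame forecast with the current ego-frame of the cyclist.
import numpy as np

def rot(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s], [s, c]])

def align_delayed_forecast(fc_old, pose_old, pose_now, delay, dt=0.04):
    """fc_old: (n, 2) forecast in the ego frame at creation time; pose_*: (x, y, heading)."""
    world = fc_old @ rot(pose_old[2]).T + np.asarray(pose_old[:2])   # old ego frame -> world
    now = (world - np.asarray(pose_now[:2])) @ rot(pose_now[2])      # world -> current ego frame
    shift = int(round(delay / dt))
    now = now[shift:]                                 # drop already elapsed lead times
    trend = now[-1] - now[-2]                         # pad by local linear extrapolation
    pad = now[-1] + trend * np.arange(1, shift + 1)[:, None]
    return np.vstack([now, pad])

fc = np.c_[np.linspace(0.2, 5.0, 25), np.zeros(25)]   # straight-ahead forecast, 25 lead times
aligned = align_delayed_forecast(fc, pose_old=(10.0, 5.0, 0.0),
                                 pose_now=(10.4, 5.0, 0.1), delay=0.2)
print(aligned.shape, aligned[:3].round(2))
```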

4.5.2 Case Study: Delayed Trajectory Forecasts

In another case study, we examine the handling of delayed messages in the case where a vehicle receives delayed infrastructure-based trajectory forecasts and fuses these with its own trajectory forecasts using the CSGE. We use the previously described modeling of the delays for the different weighting aspects of the CSGE. We assume that only the messages from the infrastructure are delayed; a delay on the side of the vehicle (e.g., due to data processing) is not considered. The results of this analysis are given in Fig. 22. We see that the improvement due to the combination of the trajectory forecasts diminishes with increasing delay and observe a slow convergence towards the ASAEE of the purely vehicle-based trajectory forecasting. From this, we conclude that the fusion of trajectory forecasts is advantageous for delays of up to approximately 1 s.

Fig. 22
A line graph of ASAEE improvement over baseline versus delay, showing an exponentially decreasing trend for the CSGE. A horizontal dashed line at 0 represents the vehicle baseline.

CSGE forecasting performance improvements over the vehicle baseline for different delays [7]

4.5.3 Comparing Different Approaches for Cooperative Intention Detection

In the following, we compare the presented approaches to cooperative intention detection, i.e., cooperation on the data level using the probabilistic trajectory fusion method, cooperation on the level of basic movements using various approaches, and cooperation on the level of trajectory forecasts using the CSGE. In our comparison, we examine the cyclist trajectory forecasting results of the different approaches using the example of three cooperating agents: vehicle, infrastructure, and smart devices carried by the cyclist. As a baseline, we use the forecast of a non-cooperatively acting ego vehicle. The results of our investigation are depicted in Fig. 23. Almost all cooperative methods perform better than the baseline in terms of the median ASAEE, and their spread is also considerably smaller. The trajectory fusion CSGE approach achieves the lowest ASAEE. Furthermore, the low ASAEE of the purely infrastructure-based approach is particularly striking. This result underlines the potential of using infrastructure-based technologies for C-ITS in general and cyclist intention detection in particular.

Fig. 23
A box plot of the ASAEE for the different approaches: trajectory fusion CSGE, orthogonal polynomial, BM fusion CSGE, BM fusion stacking, BM ILP, vehicle baseline, infrastructure, and smart device.

Box plot showing the ASAEE for different approaches to cyclist trajectory forecasting. All cooperative approaches combine data originating from three different agents, i.e., an intelligent vehicle, sensor-equipped infrastructure, and a smart device carried by the cyclist [7]

Additionally, we performed a statistical analysis of the results to determine whether there are statistically significant differences between the performances of the cooperative methods and the baseline. The trajectory fusion CSGE approach ranks first and is significantly better than all other approaches except the infrastructure-based trajectory forecasting approach. All cooperative approaches outperform the baseline, although not all of the differences in average rank are statistically significant. It is not surprising that the ranks of the cooperative methods for basic movements do not differ significantly from the baseline: the actual trajectory forecasts only use the ego-vehicle data and the cooperatively determined basic movements. Nevertheless, the superior average rank shows the potential of cooperative basic movement detection. In future work, cooperative basic movement prediction may be supplemented by cooperation based on trajectory forecasts.
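The chapter does not detail the statistical procedure behind this ranking; a common choice for comparing several forecasting approaches across many test sequences is a Friedman test on the per-sequence errors together with the resulting average ranks (post-hoc pairwise tests would then establish which individual differences are significant). The following SciPy-based sketch assumes a hypothetical array asaee of shape (n_sequences, n_approaches), with one column per approach.

import numpy as np
from scipy import stats

def rank_based_comparison(asaee, approach_names):
    """Compare forecasting approaches by their per-sequence ASAEE values.

    asaee          : (n_sequences, n_approaches) error per test sequence and approach
    approach_names : list of approach names, one per column
    """
    # rank the approaches within each sequence (rank 1 = lowest error)
    ranks = np.apply_along_axis(stats.rankdata, 1, asaee)
    avg_ranks = ranks.mean(axis=0)
    # Friedman test: do the approaches differ significantly at all?
    _, p_value = stats.friedmanchisquare(*asaee.T)
    order = np.argsort(avg_ranks)
    ranking = [(approach_names[i], float(avg_ranks[i])) for i in order]
    return ranking, p_value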

4.6 Cooperative Probabilistic Trajectory Fusion Using Orthogonal Polynomials

Assuming that road users make at least partial use of the same set of features, e.g., the absolute velocity or angular velocity, the cyclist’s trajectory is approximated using polynomials with orthogonal basis functions [23]. This representation is abstract, independent of the sensor’s cycle time, and robust against noise due to implicit data smoothing. The feature-level fusion is realized using a weighted polynomial approximation. We exploit two specific properties of the orthogonal polynomials and the approximation technique: (1) fast incremental approximations are possible (update mechanisms are available [23]), and (2) information can be weighted individually. The former keeps the runtime short, and the latter allows us to fade out outdated information or to emphasize more recent information. Furthermore, by additionally modeling the posterior distribution over the polynomial coefficients in a Bayesian approach, we obtain a fully probabilistic model of the trajectory. New measurements are integrated via the likelihood, i.e., in a sequential update scheme. We obtain the weighting of information originating from different road users through a measurement model, which describes the likelihood of an observation given the currently estimated polynomial coefficients. We derive the weight of each measurement by combining a global weight (e.g., how good a measurement of an agent’s sensor is globally) and a situation-dependent weight (e.g., how good a measurement of an agent’s sensor is in the current situation). Moreover, due to the use of a polynomial approximation instead of a state-space model-based approach, e.g., a recursive Bayesian filter, we can cope with situations where, e.g., due to communication problems, the information does not arrive in the correct temporal order (out-of-sequence fusion). The coefficients of the orthogonal expansion of the approximating polynomial are optimal estimators of the average, slope, curvature, and change of curvature of the approximated trajectory in the considered time window [23]. Hence, in terms of the cyclist’s trajectory, the coefficients are optimal estimators of the average position, velocity, acceleration, and jerk. As shown in [26], these are useful features for detecting the intentions of VRUs. We use these coefficients as features for basic movement detection and trajectory forecasting. A schematic of this cooperative intention detection approach is depicted in Fig. 24.

Fig. 24
A flow diagram. A scatter plot of the longitudinal position over time with an approximating polynomial trajectory feeds into the transformation to the ego coordinate frame, the orthogonal expansion coefficients, basic movement detection, and trajectory forecasting.

Cooperative cyclist intention system using trajectory fusion from the view of a single agent, e.g., vehicle. The position and orientation estimates (indicated by the blue crosses and gray triangles) received via collective perception messages (CPM) or collective awareness messages (CAM) are fused probabilistically using a polynomial approximation with orthogonal basis polynomials. Subsequently, the orthogonal expansion coefficients are transformed into the ego-frame. These coefficients are used for basic movement detection and trajectory forecasting [7]
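To make the weighted approximation more concrete, the following sketch fits one coordinate of a trajectory over a sliding time window with a Legendre (orthogonal) basis and per-measurement weights and reads off position, velocity, and acceleration estimates at the current time. The choice of basis, the window mapping, and all names are assumptions for illustration; the Bayesian treatment of the coefficients and the incremental update mechanisms of [23] are omitted here.

import numpy as np
from numpy.polynomial import legendre

def fit_window(times, positions, weights, degree=3):
    """Weighted orthogonal-polynomial approximation of one trajectory coordinate.

    times     : (n,) measurement timestamps inside the sliding window
    positions : (n,) measurements of, e.g., the longitudinal position,
                pooled from all cooperating agents
    weights   : (n,) per-measurement weight, e.g., global weight * situational weight
    """
    t0, t1 = times.min(), times.max()
    tau = 2.0 * (times - t0) / (t1 - t0) - 1.0       # map the window to [-1, 1]
    coeffs = legendre.legfit(tau, positions, degree, w=weights)
    scale = 2.0 / (t1 - t0)                          # d(tau)/dt for derivative rescaling
    pos = legendre.legval(1.0, coeffs)               # estimate at the current time (window end)
    vel = legendre.legval(1.0, legendre.legder(coeffs)) * scale
    acc = legendre.legval(1.0, legendre.legder(coeffs, 2)) * scale ** 2
    return coeffs, pos, vel, acc

The resulting coefficients, or the kinematic quantities derived from them, can then be transformed into the cyclist's ego-frame and used as features for basic movement detection and trajectory forecasting, as described above.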

For the evaluation, we utilize data from real cyclists riding in real traffic at the research intersection. We recorded the cyclists’ trajectories using a wide-angle stereo-camera system (i.e., an intelligent, sensor-equipped infrastructure), a camera-equipped vehicle, and a smartphone carried by the cyclists. First, we consider the evaluation of the position and orientation estimation derived from the probabilistic trajectory fusion. To this end, we evaluate the approximating polynomial at the current time. We compare the probabilistic trajectory fusion approach to a Kalman filter for the fusion of the position measurements, showing that our probabilistic approach is on par with the Kalman filter. Furthermore, we study the behavior of the probabilistic trajectory fusion approach under message and measurement delays. We can show that the use of position and orientation estimates supplied by the infrastructure is beneficial from an ego vehicle’s perspective, even for larger delays. The fused estimate never becomes worse and, up to a delay of about 0.7 s, always leads to an improvement. In another experiment with simulated vehicles, we showed that the approach scales well to larger vehicle collectives (cf. Fig. 25). Since this method only relies on the exchange of positions or velocities between the agents, it can readily be implemented using existing standards such as CAM or CPM.
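For reference, the error measures reported in Fig. 25 can be computed as follows. These are the standard definitions of the average Euclidean error and of an angle-wrapped orientation RMSE, assumed here for illustration rather than taken from the project code.

import numpy as np

def average_euclidean_error(pred_xy, true_xy):
    """AEE: mean Euclidean distance between predicted and ground-truth positions."""
    return float(np.mean(np.linalg.norm(pred_xy - true_xy, axis=-1)))

def orientation_rmse(pred_psi, true_psi):
    """RMSE of the orientation error, with the difference wrapped to (-pi, pi]."""
    err = np.arctan2(np.sin(pred_psi - true_psi), np.cos(pred_psi - true_psi))
    return float(np.sqrt(np.mean(err ** 2)))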

Fig. 25
Two 3-D surface plots of the number of vehicles versus delay versus error, both depicting a curved surface. Plot A shows the average Euclidean position error; plot B shows the average orientation RMSE.

The average position error (AEE) and orientation error (RMSE) for different numbers of vehicles and delays [7]

5 Prospects

We want to conclude this outline of our contributions to VRU safety through detection, tracking, basic movement detection, and trajectory forecasting with a short summary of our main findings.

First, we do not see the detection of objects as a solved problem. Despite significant improvements due to the success of data-driven learning in the past couple of years, the resulting models still lack generality, reliability, and trustworthy confidence estimates. We introduce additional annotations of the collected data to be able to determine the types of data that cause poor results. This so-called context information forms a basis for further research. Such research may include a thorough determination of the relevant context, concepts for efficiently gathering data with respect to a specific context, and an evaluation showing that a model trained on the enhanced database consistently outperforms the original model.

Second, basic movements are an intuitive way of judging the current and short-term future behavior of a VRU. Beyond that, they contribute methodologically to the probabilistic trajectory prediction: in a multimodal approach, they reduce the future confidence regions in a way that a single end-to-end learning approach with a single resulting distribution cannot.

Third, trajectory forecasts must include probabilistic estimates of the future whereabouts of VRUs. Only in that way is a safe and efficient coexistence of VRUs and autonomous cars possible. Ethical considerations will be needed to determine how much risk is acceptable in the case of intersecting confidence regions.

Finally, an infrastructure to share information between traffic participants and to supply additional static knowledge is essential for exceeding the limitations of single sources, i.e., solely ego-vehicle sensors, and for reaching a level of performance acceptable for autonomous driving. Each data source can contribute beneficially at every processing step up to trajectory prediction. Even relatively imprecise smart device data increased the tracking and trajectory forecasting performance in cases where infrastructure or ego-vehicle sensors were occluded. Altogether, we consider the main result of our project to be a proof of concept that can estimate and predict the unsteady behavior of VRUs and thus make VRUs accessible to autonomous cars. To the best of our knowledge, fully realistic operating conditions and real-time performance have not been reached so far.