1 Introduction

Autonomous driving is one of the major challenges facing society and is currently of great interest in research. Autonomous vehicles are a promising approach to reduce traffic jams and the number of accidents and, furthermore, to increase the comfort of drivers and passengers. To achieve market readiness, autonomous vehicles have to be safe, which requires a complete and correct perception of the environment.

Different sensors such as LiDAR, RADAR, and cameras are available for perception; of these, cameras are the most frequently employed [107]. Since camera sensors are vision-based, they are affected by weather conditions such as rain or fog [107]. The effect of adverse weather on the frequency of car crashes is not to be neglected, as shown by the National Highway Traffic Safety Administration: averaged over the past 10 years, adverse weather is responsible for 21% of car accidents in the United States [106].

Therefore, these characteristics must be taken into account when developing perception algorithms in order to obtain a robust perception, which is crucial for a safe system. Car manufacturers are able to capture data using their own test vehicles; in research, however, publicly accessible data sets are often utilized. Most of them contain no or only little data recorded under adverse weather conditions. One method to mitigate this is to simulate artificial weather conditions and use them to enhance the data set.

Two major issues come with simulating weather conditions. First, to gain a benefit, it is necessary to simulate a wide range of potential conditions, such as rain, snow, and fog. Second, these simulations must be physically accurate.

Harsh weather affects not just the image itself but also neural network-based object detectors, whose performance is highly dependent on their training [108]. Hence, neural networks must be trained under different weather conditions to achieve a safe and robust perception.

Even with robust neural networks, the perception is limited due to occlusion and sensor ranges. At this point the so-called cooperative or collective perception (CP) comes into play. Using multiple distributed vehicles to perceive objects locally and share these detections with other vehicles via Vehicle-to-Everything (V2X) communication helps to acquire information even about vehicles which cannot be perceived locally.

In Sects. 1.1–1.4 we describe the concept of our work as well as the simulation framework “RESIST”. Section 2 covers our physically correct image-based weather augmentations. The following section presents a new way to evaluate the safety of object perception systems. Section 4 presents optimization approaches for local and cooperative perception. Finally, in Sect. 5 we draw our conclusions and give an outlook on further research topics.

Fig. 1

Images from [19, 108, 110, 112]

Overview of the project’s overall concept. We focus on b weather simulation (see Sect. 2), the improvement of c local perception (see Sect. 4.2) and e cooperative perception (see Sect. 4.3). Furthermore, we investigate f how to evaluate the safety of object perception (see Sect. 3.2). a Scenario generation is out of the scope of this work due to space limitations. d V2V communication is part of the workflow but not a focus of our research; it is covered in more detail in Chap. 6.

1.1 Concept

A complete perception of the environment under all circumstances is crucial for autonomous driving. Hence, it must be robust against environmental influences like rain and fog as well as physical restrictions such as limited sensor ranges. This work mainly considers environmental perception with camera sensors, since this sensor type is one of the most frequently used [107]. Our goal is a safe and complete perception of the environment. An overview of our work is shown in Fig. 1. We consider two different ways to achieve this goal: local perception and cooperative perception. First, we consider the local perception of an ego vehicle and investigate strategies to enhance vision-based perception. The proposed robustness enhancement is based on training CNNs for object detection on a more comprehensive data set including weather augmentations. Thus, a significant part of our research covers the physically correct simulation of the weather conditions rain, snow and fog. Realistic weather simulations allow augmenting existing data sets and increase the data variety for the training of neural networks. In addition, the influence of weather on object detection itself must be investigated in detail. Some findings are transferred to LiDAR sensors by our group, since these are also vision-based (see the work of Teufel et al. [100]). Moreover, the robustness improvement of RADAR sensors is investigated in our group by Zlavik [95]. That work presents a noise-modulated pulsed radar system which outperforms commercial state-of-the-art radar systems; additionally, compressive sensing reduces the effort for signal acquisition by 70% [95]. Instead of sensor-specific optimization, other approaches from our team use more generic deep learning techniques to improve robustness. Rusak et al. [84] demonstrate that a simple but properly tuned training with additive Gaussian and Speckle noise generalizes surprisingly well to unseen corruptions. Michaelis et al. [64] extend the ImageNet-C robustness benchmark from image classification to object detection in order to provide an easy-to-use benchmark to assess how object detection models perform when image quality degrades.

Even with significant improvements in local perception, a full perception of the environment is not possible due to physical restrictions regarding sensor ranges as well as lines of sight blocked by infrastructure elements or buildings. Thus, as a second improvement strategy, we investigate cooperative perception and an optimization approach to determine the validity and trustworthiness of collectively perceived information before performing a fusion. Cooperative perception aims to extend the sensor range through distributed perception and helps to see traffic participants that are occluded by, e.g., buildings.

Since the goal is to achieve safety, we also have to consider how to evaluate safety. Therefore, we present a novel metric for evaluating the safety of local and cooperative perception systems which incorporates important factors such as velocity and the object class.

1.2 Related Work

Most simulations of rain are made for computer games, and only a few simulations consider physical correctness [6]. Therefore, Hospach et al. [42] proposed a realistic rain simulation based on falling, white-colored triangles. They use alpha blending to simulate different intensities, but this approach does not consider effects like refraction. The approach of Wang et al. [115] uses ray tracing for the rendering of raindrops; therefore, they have to know the exact position of the light source. Sato et al. [87] use a single hemisphere in front of the camera, but this approach ignores the real distance of objects to the camera. Furthermore, they had to use various simplifications to achieve real-time capability. Many more publications consider the rendering of realistic fluid dynamics of water droplets on different surfaces [47, 51, 52, 116]. A further study by Garg and Nayar considers the shape of falling water drops [29]. Moreover, interesting physical properties of rain can be learned from [36, 119]; both works address the detection and removal of raindrops in images. A comprehensive work on the physically correct simulation of rain and fog was presented by Hasirlioglu [38].

Fewer works exist on the water spray of vehicles driving on wet roads (in the following: road spray). The size of road spray was investigated by Kooij et al. [56]. Beginning with the work of Kamm and Wray [50], different researchers considered the movement of road spray [31, 32, 44, 49]. Slomp et al. [96] presented a fast and efficient rendering method for water droplets based on OpenGL.

To simulate snow, a simple approach without considering depth or falling speed was introduced by Wang and Wade [117]; they produced a texture with 2D snowflakes that surrounds the camera. Another approach is presented by Zhou and Libaicheng [103], who propose a method to draw falling snowflakes. Since these snowflakes are visible at all times, the approach lacks realism regarding the rotation of the flakes. While falling snowflakes, how snow covers the ground, and even the process of snow melting are investigated very well [24, 69, 86, 114], the movement of the flakes while falling has received less attention. This was covered by Langer and Zhang [59], who used Fourier transformations to add noise; this results in snow-like artifacts depending on a virtual depth.

To augment fog, very basic fog simulations are presented by Sellers et al. [91] and Aleshin et al. [3]. Other authors generate noise to simulate different fog densities [120]. Even recent works use only simple light attenuation formulas, such as the one presented by Sakaridis et al. [85]. This fog simulation was used to augment the well-known Cityscapes data set [14] and create Foggy Cityscapes. More realistic and advanced methods are presented by Dumont [20] as well as Jensen and Christensen [46]. Both approaches use Monte-Carlo-driven methods with multiple rays per pixel to obtain a realistic virtual result. Another more advanced method was presented by Biri and Michelin [9], who even integrated wind into their simulation. More realistic fog data can be produced with synthetic fog in fog chambers, as shown by Colomb et al. [12]. They have built a 30 \(\times \) 5.5 m fog chamber, which allows fog simulation for some static scenarios.

For vision-based object detection, mostly CNN-based methods are used. The effects of blurring, image compression and different types of noise on object detection were investigated by Dodge and Karam [18] and Costa et al. [15]. Both works show that noise or image corruptions lead to a lower accuracy in object detection and classification. The same result was shown by Nazaré et al. [70]. Since the accuracy of neural networks depends on their training, it is common to extend existing data sets by image transformations such as geometric and color transformations, as proposed by Montserrat et al. [66]. The approaches in [15, 18, 70] mainly consider generic errors but no realistic environmental influences. A more realistic data set extension was presented by Hasirlioglu and Riener [39], who proposed a rain simulation to investigate the effect of weather on the detection performance. The strong effect of synthetic rain on object detection accuracy was also shown by Müller et al. [68]. Tian et al. [102] proposed DeepTest, a methodology to evaluate neural networks for autonomous driving and detect erroneous behavior by augmenting the data. Similar to DeepTest, Pei et al. [73] proposed DeepXplore to evaluate neural networks; additionally, they performed an optimization with augmented data and achieved a higher detection accuracy. Further works [5, 25, 53] considered using Generative Adversarial Networks (GANs) to augment data. Luc et al. [61] used GANs and achieved a reduction of overfitting for semantic segmentation. Karacan et al. [53] created synthetic environmental conditions through a combination of GANs and semantic image information. It must be pointed out that for the augmentations of [5, 25, 53, 73, 102] there is no proof of realism.

As mentioned above, local perception is not only affected by environmental conditions but also limited by sensor ranges and occlusion. These problems can be addressed by cooperative perception (CP).

Two initial works on CP were presented by Rauch et al. [77, 78]. They present different approaches for handling and fusing information from distributed vehicles. Methods for multiple-object tracking and CP using camera and LiDAR were proposed by Obst et al. [72] and Kim et al. [54]. To evaluate the capability and advantages of CP, a correct V2V communication model must be used. An approach for modeling the reception probability and communication delay was presented by Torrent-Moreno et al. [104]. However, this approach does not consider environmental influences such as weather and buildings which block the line-of-sight between sender and receiver. More advanced models, which are parameterizable and consider different environments such as buildings at an intersection, are proposed in [2, 62] or by Boban et al. [10]. Nowadays, the European Telecommunications Standards Institute (ETSI) works on a standard for a message format and exchange frequency for information about the ego vehicle and detected objects in cooperative perception [21, 22]. These work-in-progress standards and the rules for message generation are reviewed by different researchers [17, 30, 101]. Since the simulations of [21] lack realism due to missing delays and simplified sensor models, Allig and Wanielik [4] extended this simulation setup with more realistic vehicle dynamics and sensor models. Another simulation approach is presented by Schiegg et al. [89]. A real-world demonstration of the capabilities of CP was done by Shan et al. [94]. Besides simulations and real-world demonstrations, there exist some analytical models for CP, as presented in [45, 88].

To evaluate object detection, common benchmarks like COCO [60] or KITTI [34] use simple performance indicators like precision, accuracy, average precision (AP), and mean average precision (mAP) [16, 23, 74]. Since these do not reflect the safety constraints of autonomous vehicles, they are not sufficient to evaluate object detection systems. Stiefelhagen et al. [97] proposed a slightly more comprehensive metric, using the Intersection over Union (IoU) [83] and the distance between track estimation and real position. A metric considering real-time aspects is proposed by Kim et al. [58]; it considers the detection time of video surveillance systems, which is also a relevant factor for autonomous vehicles. A model to achieve safety is the Responsibility-Sensitive Safety (RSS) model: Shalev-Shwartz et al. [93] proposed a guideline that mathematically describes how safety can be achieved in autonomous driving. The RSS model has become well known but does not include any metric to evaluate safety.

1.3 Data Sets

Comprehensive data is crucial for development in the field of autonomous driving. Basically, the data can be split into two groups: real-world data and simulation data. Here we present the data sources used for our experiments; it should be pointed out that this is not a complete overview of data sources for object detection in the field of autonomous driving.

As real-world data, the KITTI data set by Geiger et al. [34] was used. KITTI consists of different benchmarks for 2D and 3D object detection as well as tracking. The KITTI data set was recorded in Karlsruhe (Germany) using a stereo-camera setup and a LiDAR sensor [34]. The data set consists mostly of inner-city recordings. The 2D object detection benchmark consists of about 15,000 images with over 80,000 labeled objects [34]. Further examples of real-world data sets are Cityscapes [14] and the Waymo data set [98]. A more comprehensive data set regarding adverse weather was presented by Bijelic et al. [8]: the DENSE data set contains real-world recordings including different weather conditions such as rain, snow or fog as well as recordings from a fog chamber. For a more sophisticated evaluation, we created our own 2D image data set with heavy rain scenes. To this end, we collected images of challenging rainy road scenes from our archive of self-conducted test drives and from dashcam videos on YouTube. This resulted in a very diverse data set of international road scenes, which we call the realrain data set in the following [112]. It contains 2062 images with 9551 labeled objects. The objects are labeled according to the KITTI label format. The realrain data set contains 7368 cars, 626 vans, 955 trucks, 395 pedestrians, 205 cyclists and one tram. The scenes are well spread from urban to freeway scenarios and contain heavy rain, mist and drops on the windshield, representing challenging environmental conditions for vision-based object detection systems.

Since publicly available real-world data sets are limited, they may not cover all scenarios that should be tested during development. This disadvantage can be addressed by using realistic and parameterizable simulation frameworks. An exemplary commercial simulation framework is Vires VTD [1]. VTD provides different simulation scenarios such as rural roads, freeway sections or an inner-city intersection. A more extensive, highly realistic (see Fig. 2) and open-source simulator called CARLA was presented by Dosovitskiy et al. [19]. CARLA is based on the Unreal Engine and provides a set of different maps, containing many inner-city scenarios as well as rural sections and multiple freeway sections. Besides the maps, CARLA provides a wide range of different vehicles (bicycle, motorbike, truck, van, different types of cars) and pedestrians which can be spawned at different locations. Each vehicle can be equipped with different sensors such as camera, RADAR and LiDAR. More information about the available sensors can be found in [19]. CARLA also includes weather variations as well as day and nighttime. Furthermore, if a specific route is to be driven, the vehicles can be controlled by a user. Scenarios can also be described using the OpenScenario standard, which CARLA can execute.

Fig. 2

Example image generated with CARLA simulator [19] with the proposed weather augmentations (original, rain, snow, fog) from Sect. 2

1.4 RESIST Framework and Workflow

For algorithm development and simulation, a configurable and deterministic pipeline is necessary. Therefore, we use the RESIST framework developed in our team by Müller et al. [67] with the improvements by Volk et al. [108]. RESIST is a Qt-based C++ framework which allows combining different plugins into a perception pipeline with different inputs and an evaluation. The framework’s main focus lies on local and cooperative perception with simulation of the weather conditions rain, snow, and fog as proposed in Sect. 2. RESIST can read a wide range of data sets such as KITTI [34] or Cityscapes [14]. Moreover, the simulation frameworks Vires VTD [1] and CARLA [19] can be used as input for the sensor data. This allows a comprehensive evaluation of perception algorithms on a broad range of data sets. The sensor data is used to simulate the perception using realistic camera models. Various well-known vision-based object detection algorithms like Faster-RCNN [82], RRC [81] and YOLOv3 [79] are implemented in the framework, which allows a comparison between different architectures. For object tracking, a Kalman filter [118] with different motion models such as constant velocity, constant acceleration or constant turn rate can be used.

RESIST is also capable of simulating cooperative perception. To this end, RESIST includes a comprehensive communication channel simulation and processing delays [109]. The transmission of locally detected objects is done via V2X communication; a V2X channel simulation based on the analytical model of IEEE 802.11p by Sepulcre et al. [92] is integrated into RESIST. For CP, the focus lies on the perception and less on the V2X communication, but to obtain valid results a correct communication model is necessary.

In the area of CP, different algorithms for matching and fusion are integrated. For the matching of measurements to existing tracks, Hungarian matching [57], Nearest Neighbor or Expectation Maximization can be used with different cost metrics such as Euclidean distance or IoU. For the track-to-track fusion, various algorithms are available as well, such as covariance intersection [48], a Kalman filter [118] or a simple mean fusion.
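As an illustration of the association step (the RESIST implementation itself is C++ and not reproduced here), the following Python sketch assigns detections to existing tracks via Hungarian matching with a Euclidean cost; the 2D center representation and the gating threshold are assumptions made for the example.

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_centers, detection_centers, max_dist=2.0):
    # Pairwise Euclidean distances between track and detection centers [m].
    cost = np.linalg.norm(
        track_centers[:, None, :] - detection_centers[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)          # Hungarian matching
    # Reject assignments that exceed the gating distance.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]

# Example: two tracks and two nearby detections.
# associate(np.array([[0., 0.], [5., 1.]]), np.array([[0.2, 0.1], [5.1, 0.9]]))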

To evaluate algorithms a comprehensive evaluation plugin exists. This plugin allows an evaluation of a defined environment with different metrics such as precision, recall, mAP or the safety metric [110] presented in Sect. 3.2.

In conclusion, RESIST is a comprehensive framework for a realistic simulation of local and cooperative object perception with a physically correct vision-based weather simulation.

2 Simulation of Environmental Conditions

Simulating realistic weather influences allows extending existing data sets, which mainly consist of images taken in clear weather. Therefore, we present different weather augmentations for image data in this section.

2.1 Rain

The simulation of realistic rain is based on two approaches developed in our team: the simulation of falling rain by Hospach et al. [42] and the simulation of raindrops on the windshield by von Bernuth et al. [6]. Combining these two steps makes it possible to achieve a photorealistic simulation of rain. The rain simulation workflow is illustrated in Fig. 3; examples of the proposed simulation are illustrated in Figs. 4 and 5. The first step is the reconstruction of the 3D scene with a depth image containing the scene depth for each pixel. Afterwards, the falling rain as introduced by Hospach et al. [42] is applied. The reconstructed 3D scene is used to distribute rain streaks in the space between camera and background, respecting the well-known Marshall-Palmer distribution [63]. The simulation of rain streaks respects camera parameters such as focal length, field of view, aperture, pixel size and shutter speed. Hence, the length of the simulated rain streaks varies depending on the configured shutter speed, and their sharpness depends on the aperture and the distance of the simulated rain streak from the camera. As a next step, raindrops on the windshield are generated with the approach presented by von Bernuth et al. [6]. Raindrops are distributed on a virtual windshield and ray tracing is used for a physically correct rendering of these raindrops. Finally, the brightness of the image can be altered to achieve a realistic setting. This can be necessary if rain is to be simulated on a sunny image; by reducing the overall brightness of the image, the simulated rain looks more realistic.

Fig. 3

Image from [112]

Rain simulation workflow from reading the input data through the scene reconstruction to the configured rain simulation.

Fig. 4

Image from [112]

Comparison of our synthetic rain and brightness augmentation technique against real rain.

The proposed rain simulation can be parametrized with six parameters. Falling rain is parametrized by the rain intensity \(r_i\) and the rain angle \(r_a\) of the vertical rain streaks. The simulation of raindrops on the windshield uses \(r_i\) as well as the additional parameter drop count \(d_{count}\) which specifies the number of drops resting on the virtual windshield. The mean drop radius \(d_\mu \) and the standard deviation \(d_\sigma \) specify the drop size distribution on the windshield. With parameter \(r_b\) the brightness of the image can be adapted.
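To illustrate how the rain intensity \(r_i\) could translate into a drop population, the following sketch samples drop diameters from the Marshall-Palmer distribution mentioned above; the classical constants \(N_0 = 8000\) m\(^{-3}\) mm\(^{-1}\) and \(\Lambda = 4.1\,R^{-0.21}\) mm\(^{-1}\) are assumed here, and the sketch does not reproduce the actual streak rendering of [42].

import numpy as np

def sample_drop_diameters(rain_rate_mm_h, volume_m3, rng=None):
    # Marshall-Palmer drop size distribution N(D) = N0 * exp(-Lambda * D).
    rng = np.random.default_rng() if rng is None else rng
    n0 = 8000.0                               # [m^-3 mm^-1]
    lam = 4.1 * rain_rate_mm_h ** (-0.21)     # [mm^-1]
    n_drops = int(n0 / lam * volume_m3)       # integral of N(D) over all diameters
    # Exponentially distributed drop diameters in [mm].
    return rng.exponential(scale=1.0 / lam, size=n_drops)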

Compared to other solutions such as applying a simple rain filter mask as in [40], our approach allows a more realistic rain simulation by taking the current environment, such as the scene depth, together with sensor characteristics into account. Additionally, our approach allows simulating variations of different rain instances by adapting the six presented parameters.

To show the visual realism of the presented synthetic rain model, we compare the same scene without rain (Fig. 4a), with real rain (Fig. 4b) and with synthetic rain (Fig. 4c). The same image extract is enlarged in Fig. 4d for better visibility. The real rain image (Fig. 4b) and the synthetic rain image (Fig. 4c) show comparable rain streaks, blur effects and drops, indicating that the used rain model produces similar optical effects as real rain. An additional comparison of an original KITTI image with the same image augmented by our synthetic rain is illustrated in Fig. 5.

Fig. 5

Image from [112]

Synthetic rain augmentation technique on KITTI dataset [34].

In addition to the preceding qualitative realism evaluation, a quantitative evaluation is performed as well. Measurements of images containing real rain have shown that rain has a significant influence on basic image processing metrics such as Harris features [42]. Edge- and corner-detection algorithms (SURF, Canny, Harris, Sobel) are therefore used as a deliberately generic basis to validate the realism of this rain simulation: the influence of real rain on these features is compared to the influence of simulated rain for the exact same scene. If simulated rain and real rain have similar effects on these features, we consider our model realistic. Two sets of images of a well-structured scene containing edges and corners for the algorithms to detect were recorded for validation. The first set of images was recorded under heavy real rain (RefReal); the rain intensity, averaged over the recording period, was 52 mm h\(^{-1}\). The second set of images (RefClean) was recorded immediately after the rain had stopped. RefClean was used as input for the rain simulation with intensities of 10, 40, 70 and 100 mm h\(^{-1}\); the simulated rain is called SimX, with X specifying the simulated rain intensity. The effects of SimX and RefReal on Harris features were then compared. To this end, the 20 best Harris features of seven randomly chosen frames of RefReal and SimX were compared against RefClean. For RefClean, 16.27 correspondences were identified correctly on average, while for RefReal only 15.71 correct correspondences were found. Sim40 was closest to RefReal with an average of 15.57 correct correspondences. For Sim70 and Sim100, 13.81 and 13.14 correspondences were found, respectively. Another simulation run without rain streaks, Sim0, showed that the simulation does not produce unwanted side effects and yields exactly the same value as RefClean with 16.27 correspondences. Further validation results were in close agreement with the presented example for Harris features. This shows that the presented model for simulating synthetic rain variations produces effects similar to real rain. For more details on the validation we refer to [41].
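The exact validation protocol is described in [41]; the following sketch only outlines the underlying idea of counting Harris feature correspondences between a clean reference image and a (real or simulated) rain image, using OpenCV. The matching radius and corner count are illustrative assumptions.

import cv2
import numpy as np

def harris_correspondences(img_ref, img_query, n_corners=20, radius=3.0):
    def corners(img):
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        pts = cv2.goodFeaturesToTrack(gray, maxCorners=n_corners, qualityLevel=0.01,
                                      minDistance=10, useHarrisDetector=True, k=0.04)
        return pts.reshape(-1, 2) if pts is not None else np.empty((0, 2))

    ref, query = corners(img_ref), corners(img_query)
    # A reference corner counts as a correspondence if a query corner lies nearby.
    return sum(bool(np.any(np.linalg.norm(query - r, axis=1) <= radius)) for r in ref)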

2.2 Road Spray

In contrast to rain as presented in Sect. 2.1, road spray represents a rather locally occurring disturbance. It occurs behind the wheels of a vehicle driving on a surface that is covered with water. As road spray occurs directly behind a driving vehicle, it covers large parts of that vehicle, making it more difficult to detect by vision-based object detection algorithms. Therefore, a realistic simulation of road spray is important for performance characterization.

To simulate the droplets, physical properties were used to calculate the trajectory of each droplet. In the 2D case, neglecting the lateral distribution of the droplets, the drops have an initial velocity equal to the rotational velocity of the wheel [113]. After the spray detaches from the wheel, air resistance and gravity slow the droplets down until they reach the road surface again. In the 2D case, the trajectory of a single droplet thus follows a projectile motion curve. To transfer this simulation into 3D space, jitter was added to the droplet positions at every time step; the standard deviation of the jitter was increased with the time of flight of a single droplet. A result of the 3D positioning of droplets is illustrated in Fig. 6.
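A minimal sketch of this two-stage idea is given below: projectile motion with a simple linear drag term for the trajectory, plus horizontal jitter whose standard deviation grows with the time of flight. The drag constant, time step and jitter rate are illustrative assumptions, not the calibrated values of [113].

import numpy as np

def droplet_trajectory(v0, drag=0.8, dt=1e-3, jitter_rate=0.002, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    g = np.array([0.0, 0.0, -9.81])            # gravity [m/s^2]
    pos = np.zeros(3)                          # start at the wheel contact point
    vel = np.asarray(v0, dtype=float)          # initial velocity from the wheel rotation
    t, points = 0.0, [pos.copy()]
    while pos[2] >= 0.0:                       # until the droplet reaches the road again
        vel += (g - drag * vel) * dt           # air resistance slows the droplet down
        pos += vel * dt
        t += dt
        # Horizontal jitter whose standard deviation grows with flight time.
        pos[:2] += rng.normal(0.0, jitter_rate * t, size=2)
        points.append(pos.copy())
    return np.array(points)

# Usage: droplet_trajectory(v0=[2.0, 0.0, 3.0])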

Fig. 6

Image from [113]

Example drop distribution behind an imaginary wheel positioned at the origin. To maintain visibility, the number of drops was reduced in this plot. Colors indicate the longitudinal distance from the origin and aid spatial perception. The axes are given in [m].

As the wheel positions are known and the drop positions are calculated, the droplets are rendered as spheres. The mean diameter of the droplets was set to 200 \(\upmu \)m with a standard deviation of 10 \(\upmu \)m; this is just large enough for the droplets to influence visible light geometrically. Instead of using ray tracing for refraction and reflection calculations, reflection and refraction vectors were precalculated. To this end, many of these vectors were computed depending on the distance of a droplet to the camera and the location within a droplet where a ray would have hit it. With these look-up vectors, the only remaining step for rendering is to determine where the reflected ray would hit the environment; this was solved by generating an approximated cubemap of the 2D input image. Droplets too small to qualify for geometric reflection and refraction were generally treated as fog: instead of rendering this large number of micro droplets, the sky color is assumed to be the color sampled by an upward-pointing reflection vector and is mixed into the droplet color. The result of the presented road spray simulation can be seen in Fig. 7. For more details of the road spray simulation, we refer to the work from our team by von Bernuth et al. [113].

Fig. 7

Image from [113]

Qualitative comparison of real spray (taken from the realrain data set [112]) on the left, and our simulated spray on the right. Because of the lack of clean spray data sets, we can only compare the occlusion of the lower end of the vehicle. Here, we can observe similar behavior: parts of the wheel are not visible, as well as part of the rear end and parts of the rear lights. The spray color blends in with the background and the color of the street; it reaches the same height as the real spray.

2.3 Dust

Camera sensors are affected by different types of dust throughout the year, making object detection more difficult by partially obstructing the field of view. Dirt on the windshield ranges from pollen in the spring to dirt thrown onto the windshield from the tires of vehicles in front, to tire wear particles.

Our proposed simulation of dust consists of two steps, as presented by Hospach [41]. First, dust particles are distributed on a virtual windshield in front of the camera sensor. The size, number, transparency and color of the particles are configurable, as well as the distance and angle of the virtual windshield. Afterwards, a filter mask with the influence on each sensor pixel is calculated, respecting the geometry of the particles as well as the camera parameters. In contrast to the rain simulation in Sect. 2.1 or the snow simulation in Sect. 2.4, a complete scene reconstruction is not necessary, as dust is a rather static environmental influence restricted to the windshield. Hence, the filter mask can be precalculated once and applied to a complete video stream, saving computation time. In the second step, the calculated filter mask is applied pixel-wise to the input data. The result of the dust simulation on an image from CARLA [19] is illustrated in Fig. 8.
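The mask generation itself (particle geometry and camera model) follows [41] and is not reproduced here; the sketch below only illustrates the second step of applying a precomputed per-pixel attenuation mask to every frame of a stream, ignoring particle color for brevity.

import numpy as np

def apply_dust_mask(frame, mask):
    # frame: HxWx3 uint8 image; mask: HxW float in [0, 1]
    # (1.0 = unobstructed pixel, 0.0 = fully covered by a dust particle).
    out = frame.astype(np.float32) * mask[..., None]
    return out.clip(0, 255).astype(np.uint8)

# The mask is computed once and reused for the whole sequence:
# dusty_frames = [apply_dust_mask(f, mask) for f in frames]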

Fig. 8

Original image generated with CARLA on the left and image with dust on the right, with 60 simulated particles at a distance of 50 mm to the sensor

The dust simulation was validated by comparing the influence on Harris and SURF features as well as the number of edge points found by Canny edge detection [41]. The experimental setup was as follows: five black, round paper particles distributed on a glass pane were recorded with a real camera. Additionally, a single particle was recorded at different distances from the glass pane to investigate different particle sizes and edge blur effects. These real dust recordings are called RefReal in the following. The same scene without particles, denoted as RefClean, was recorded as baseline and as input for the dust simulation. Afterwards, the baseline image was augmented by the dust simulation, denoted as SimX, where X stands for the number of simulated particles. SimX is then compared against RefReal. If the influence of RefReal on basic image processing algorithms is similar to that of SimX, the dust simulation produces equivalent effects. For the evaluation based on Harris features, 20 and eight correct correspondences were found for RefClean and RefReal, respectively. Sim10000 was closest with an average of 9.35 found correspondences. Sim20000 resulted in 5.25 correct correspondences, and the simulation run with the lowest number of dust particles, Sim1000, resulted in 16.6 found correspondences. This shows that the higher the number of simulated particles, the lower the number of found Harris features. The number of found SURF features likewise decreases with an increasing amount of dust particles for SimX [41]. These results show effects similar to RefReal, which also reduces the average number of features found. Further simulation results with Canny edge detection were in close agreement. For more details on the dust validation we refer to [41].

2.4 Snow

Similar to the simulation of rain (see Sect. 2.1), the first step of the snow simulation is the reconstruction of the 3D scene. Either stereo images to calculate the depth image, a camera image together with LiDAR data, or simulation data from, e.g., CARLA with a perfect depth image can be used for the 3D scene reconstruction. After scene reconstruction, snowflakes have to be distributed in front of the camera sensor. To this end, the first step is to determine the number of snowflakes \(N_s\) which shall be simulated per volume:

$$\begin{aligned} N_s &= \frac{M_s}{{2}\,\textrm{mg}}. \end{aligned}$$
(1)

\(M_s\) represents the mass concentration in air according to Koh and Lacombe [55]:

$$\begin{aligned} M_s &= 0.30 \cdot R_s,\end{aligned}$$
(2a)
$$\begin{aligned} M_s &= 0.47 \cdot R_s. \end{aligned}$$
(2b)

Equation (2a) represents the mass concentration for dense snow such as in snow storms, whereas (2b) represents the regular snow mass concentration. \(R_s\) is the snow precipitation rate in [mm h\(^{-1}\)]. After the number of snowflakes per volume has been specified with (1), the size of the simulated snowflakes has to be determined. For a given snowflake diameter D in [mm] and \(R_s\), the frequency \(N_D\) of snowflakes having diameter D can be calculated as follows [35, 90]:

$$\begin{aligned} N_D &= N_0 \cdot e^{-\Lambda D}, \end{aligned}$$
(3a)
$$\begin{aligned} N_0 &= 2.50 \cdot 10^{3} \cdot R_s^{-0.94} &[\textrm{mm}^{-1}\,\textrm{m}^{-3}],\end{aligned}$$
(3b)
$$\begin{aligned} \Lambda &= 2.29 \cdot R_s^{-0.45} &[\textrm{mm}^{-1}]. \end{aligned}$$
(3c)

For each snowflake, an appropriate diameter is assigned using a piece-wise defined probability distribution function weighted by \(N_D\). Each snowflake is represented either by a flat crystal or by a three-dimensional crystal constructed out of three flat ones. The orientation of each flake is chosen randomly based on the velocity vectors given by gravity, the velocity of the car to which the camera sensor is attached, and additional wind speeds.
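The following sketch only illustrates Eqs. (1)-(3): it derives the flake count per volume and draws diameters from the exponential form of (3a). It assumes \(M_s\) in g m\(^{-3}\), regular (non-storm) snow as in (2b), and a plain exponential diameter distribution instead of the piece-wise weighted distribution described above.

import numpy as np

def snowflakes(precip_rate_mm_h, volume_m3, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    m_s = 0.47 * precip_rate_mm_h                    # Eq. (2b), assumed [g/m^3]
    n_s = m_s / 2e-3                                 # Eq. (1): 2 mg average flake mass
    n_flakes = int(n_s * volume_m3)
    lam = 2.29 * precip_rate_mm_h ** (-0.45)         # Eq. (3c) [mm^-1]
    # Eq. (3a) corresponds to exponentially distributed diameters [mm].
    return rng.exponential(scale=1.0 / lam, size=n_flakes)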

The result of the snow simulation can be seen in Fig. 9. Here, a comparison with real snow is illustrated showing the realism of the proposed simulation approach. For more details on our approach of snow simulation, we refer to [7].

Fig. 9

Images from [7]

Visual comparison of real and simulated snowflakes. The images on the left were taken during snowy weather. On the right, snow was simulated onto images of the exact same scene that were taken on days without any snow fall.

2.5 Fog

Similar to rain, fog consists of small water droplets. However, the number of water droplets per volume is extremely high (\(10^5\) times higher than for rain), and the droplets are very small (\(10^3\) times smaller compared to rain) [76]. Therefore, a simulation based on 3D reconstruction with trillions of particles and ray tracing would be extremely expensive in terms of computing power and time. Hence, our fog simulation uses light attenuation algorithms instead.

When light traverses fog, its rays are partially scattered or absorbed when hitting the small water droplets. It can be assumed that each ray passes a fixed number of fog particles for a specific traveled distance. The amount of light scattered or absorbed while passing through fog is described by the first term of (4), where \(I_i\) describes the incident light intensity, \(\alpha _{ext}\) in \([{\,\mathrm{\text {m}^{-1}}}]\) an extinction factor and d in \([{\,\mathrm{\text {m}}}]\) the distance the light travels through fog. Given the i-th pixel color \(I_i\) of an image and a sky color \(I_s\), every pixel with depth d is assigned its new color [7]

$$\begin{aligned} I = I_i e^{ - \alpha _{ext} d } + I_s (1 - e^{ - \alpha _{ext} d }). \end{aligned}$$
(4)

In Fig. 10 the resulting fog simulation on an image from the Cityscapes data set is depicted. It can be seen that, depending on the distance of a given pixel within the image, the scattering and absorbing effects of fog differ. Distant objects are harder to spot than closer ones, as they are affected more strongly by the fog. This results in a realistic fog simulation which takes the environment into account. For more information and results we refer to the work of von Bernuth et al. [7].
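A minimal per-pixel application of Eq. (4) is sketched below; the extinction coefficient and the constant sky color are assumptions, and a dense depth map from the scene reconstruction is expected as input.

import numpy as np

def apply_fog(image, depth, alpha_ext=0.05, sky_color=(200, 200, 200)):
    # image: HxWx3 uint8; depth: HxW in meters; alpha_ext in [1/m].
    transmission = np.exp(-alpha_ext * depth)[..., None]          # e^(-alpha_ext * d)
    sky = np.asarray(sky_color, dtype=np.float32)
    fogged = image.astype(np.float32) * transmission + sky * (1.0 - transmission)
    return fogged.clip(0, 255).astype(np.uint8)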

Fig. 10

Image from [7]

The upper image is from the Cityscapes data set [14], the lower image shows the image with our fog simulation applied.

3 Evaluation Metrics for Object Perception

To rate and compare object detection systems, different metrics exist. These metrics consider the accuracy of the perceived bounding boxes and indicate the perception rate. An overview is given in Sect. 3.1. As aforementioned, autonomous vehicles must be safe. Since performance and safety do not always correlate, a new metric to evaluate the safety of perception systems is presented in Sect. 3.2.

3.1 Common Metrics for Perception Evaluation

In existing benchmarks like COCO [60] or KITTI [34], simple performance measures such as precision, accuracy, recall, and mean average precision (mAP) are used to evaluate object detection [16, 23, 74]. These metrics are calculated from the number of true positive (TP) and false positive (FP) detections. The classification as TP or FP is based on the IoU of the detection and the ground truth (GT) bounding box. The IoU is a well-known metric in the field of object detection [83]. For its calculation, the area of the intersection and the union of detection D and the corresponding GT box G is used, as described by Rezatofighi et al. [83]:

$$\begin{aligned} \textrm{IoU} = \frac{|D \cap G|}{|D \cup G|}. \end{aligned}$$
(5)

IoU is used by object detection benchmarks like COCO [60] or Pascal VOC [23]. The threshold value to classify an object as TP can be parameterized; different threshold values such as 0.5 in Pascal VOC or 0.7 in KITTI are used. The aforementioned metrics concentrate on analyzing a single frame and are applicable to both 2D and 3D bounding-box-based object detection. However, none of these measures can evaluate object tracking techniques; they only take labeled GT objects into account.
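For reference, Eq. (5) for axis-aligned 2D boxes can be written as in the short sketch below; the [x1, y1, x2, y2] box format is an assumption.

def iou(d, g):
    # Intersection over Union (Eq. 5) for boxes in [x1, y1, x2, y2] format.
    ix1, iy1 = max(d[0], g[0]), max(d[1], g[1])
    ix2, iy2 = min(d[2], g[2]), min(d[3], g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((d[2] - d[0]) * (d[3] - d[1])
             + (g[2] - g[0]) * (g[3] - g[1]) - inter)
    return inter / union if union > 0 else 0.0

# A detection counts as TP if iou(d, g) exceeds the benchmark threshold,
# e.g. 0.5 (Pascal VOC) or 0.7 (KITTI cars).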

The performance metrics precision (P) and recall (R) [74] take the false positive (FP) and false negative (FN) detections into account: precision describes how accurate the detections are, while recall describes the percentage of GT objects that are detected:

$$\begin{aligned} \textrm{P} = \frac{ TP }{ TP + FP }, \quad \textrm{R} = \frac{ TP }{ TP + FN }. \end{aligned}$$
(6)

The accuracy (A) [74] additionally includes the true negative (TN) results and can be calculated as:

$$\begin{aligned} \textrm{A} = \frac{ TP + TN }{ TP + TN + FP + FN }. \end{aligned}$$
(7)

The average precision (AP) is equal to the area under the corresponding precision-recall curve (see (8)). The average accuracy (AA) is defined analogously. The mean average precision (mAP) describes the AP averaged over all available classes.

$$\begin{aligned} \textrm{AP} = \int _{0}^{1} P(R) dR \end{aligned}$$
(8)
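Numerically, Eq. (8) is usually approximated from sampled precision-recall pairs; a simple trapezoidal approximation is sketched below (benchmarks such as COCO or Pascal VOC use their own interpolation schemes, which are not reproduced here).

import numpy as np

def average_precision(recalls, precisions):
    # Trapezoidal approximation of the area under the precision-recall curve, Eq. (8).
    r = np.asarray(recalls, dtype=float)
    p = np.asarray(precisions, dtype=float)
    order = np.argsort(r)
    r, p = r[order], p[order]
    return float(np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(r)))

# Example: average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.6])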

The Classification of Events, Activities and Relationships (CLEAR) evaluation defined different metrics to evaluate object detection, tracking and head-pose estimation. For the detection/tracking evaluation, the Multiple-Object-Detection and Multiple-Object-Tracking precision (MODP/MOTP) and accuracy (MODA/MOTA) were defined [97].

With \(m_t\) as the number of misses, \( fp _t\) as the number of FPs, \(g_t\) as the number of GT objects at time t, the IoU of each mapped object, and \(N^{\textrm{mapped}}_t\) as the number of mapped object sets at t, MODA and MODP are defined as [97]:

$$\begin{aligned} \textrm{MODA}(t)=1-\frac{\sum _t (m_t + fp _t)}{\sum _t g_t}, \quad \textrm{MODP}(t)=\frac{\sum _{i=1}^{N^{\textrm{mapped}}_t} \textrm{IoU}_{i}}{N^{\textrm{mapped}}_t}. \end{aligned}$$
(9)

The tracking metrics include additional parameters: \( mme _t\) as the number of mismatches between GT and tracking hypothesis, \(d_{i,t}\) as the deviation between tracking hypothesis and GT, and \(c_t\) as the number of matches. Using these parameters, MOTA and MOTP are defined as [97]:

$$\begin{aligned} \textrm{MOTA}(t)=1-\frac{\sum _t (m_t + fp _t + mme _t)}{\sum _t g_t}, \quad \textrm{MOTP}(t)=\frac{\sum _{i,t} d_{i,t}}{\sum _t c_t}. \end{aligned}$$
(10)

The CLEAR metrics are used in the KITTI Multiple-Object-Tracking benchmark [34].
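A compact sketch of Eq. (9) from per-frame counts is given below; the sequence-level aggregation of MODP as the mean over frames is our assumption for the example.

import numpy as np

def moda(misses, false_positives, gt_counts):
    # Eq. (9), left: 1 - (sum of misses and false positives) / (sum of GT objects).
    return 1.0 - (np.sum(misses) + np.sum(false_positives)) / np.sum(gt_counts)

def modp(mapped_ious_per_frame):
    # Eq. (9), right: average IoU of the mapped pairs per frame, averaged over frames.
    frame_scores = [np.mean(ious) for ious in mapped_ious_per_frame if len(ious) > 0]
    return float(np.mean(frame_scores))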

The higher level of detail in the CLEAR metrics gives them a significant edge over more fundamental performance indicators like precision and accuracy. In contrast to the binary computation based on TP and FP counts, using the IoU or distance to determine the accuracy scores allows a more differentiated statement about the precision of a perception system.

3.2 Safety Metric

Since the semantics of a scenario are not taken into account by current performance measures, a metric is required that assesses the real-world safety of an object perception system.

Fig. 11

Exemplary scenario showing the necessity for a metric to evaluate safety

This can be shown by the scenario in Fig. 11. Based on the detections, the given perception system achieves a precision of 100% and a recall of 86% since 12 of 14 objects are correctly perceived. These results appear to be good, but the undetected vehicle in front of the ego vehicle or the one in the bottom right corner of the intersection could lead to an accident.

The goal is the development of a metric that allows evaluating the safety of various perception techniques in various traffic situations and weather conditions. For this purpose, the outcome must be a single value inside a specified range. Therefore, we propose the “Comprehensive Safety Metric (CSM)”.

The composition of the individual safety metric components and their relationship is presented in Fig. 12. It shows how our approach integrates the different factors to produce a single safety metric score, which makes it simple to compare perception algorithms.

For the assessment of safety, three criteria to consider were defined:  

Quality:

The effectiveness of perception is crucial for subsequent activities, such as trajectory planning.

Relevance:

It is important to recognize any objects that may be related to a collision. We must therefore discriminate between objects that are relevant and those that are not.

Time:

Time is always an important consideration in a real-time system. The longer the detection takes, the less reaction time remains and the fewer driving maneuvers are feasible.

Fig. 12

Image from [110]

Process overview of the single components and their relation to one another to determine a safety metric score S. Red areas around ego (black) indicate safety critical areas.

3.2.1 Basis of the Safety Metric

The accuracy of object perception is extremely important when assessing autonomous driving safety. Further activities, such as motion planning, will be carried out based on the perception. Low-quality detection or tracking may result in incorrect planning, which may put the occupants of the vehicle and other road users in danger.

Thus, perception quality is one main safety factor and will be used as basis of the CSM. To combine accuracy and precision we use the CLEAR metrics [97] (see Sect. 3.1). The choice of CLEAR metrics was based on the completeness of the metric, as it combines accuracy and precision for detection as well as tracking.

One issue with the MOTP score emerges when utilizing the CLEAR metrics to assess safety: a lower MOTP score indicates better tracking, which is contrary to the safety metric score, where a higher value indicates better safety. To invert the MOTP indication, a mapping to a MOTP safety metric score \(\textrm{MOTP}_s \in [0,1]\) is defined. With \(T_u\) as upper and \(T_l\) as lower threshold, \(\textrm{MOTP}_s\) can be determined using:

$$\begin{aligned} f_{norm}(x) = {\left\{ \begin{array}{ll} 1 & x < T_l,\\ 1 - \frac{x - T_l}{T_u - T_l} & T_l \le x \le T_u,\\ 0 & \textrm{otherwise}. \end{array}\right. } \end{aligned}$$
(11)

For our experiments it holds that \(T_l={0.8\,\mathrm{\text {m}}}\), as this value corresponds to a step width of a vulnerable road user (VRU) to avoid a collision. By similar reasoning we set \(T_u={2.5\,\mathrm{\text {m}}}\), which roughly corresponds to a misjudgment that could lead to a collision. A linear function is used because MOTP is metrically scaled.

The threshold values of \(f_{norm}\) can be parameterized based on the application domain and the accompanying requirements. This increases the variability and makes the metric applicable for the assessment of various systems.
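Equation (11) can be implemented directly; the sketch below uses the thresholds \(T_l = 0.8\) m and \(T_u = 2.5\) m from the text as defaults.

def f_norm(x, t_lower=0.8, t_upper=2.5):
    # Eq. (11): maps a metrically scaled value to a [0, 1] safety score.
    if x < t_lower:
        return 1.0
    if x <= t_upper:
        return 1.0 - (x - t_lower) / (t_upper - t_lower)
    return 0.0

# MOTP_s = f_norm(motp)  # lower MOTP (better tracking) yields a higher safety value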

Precision and accuracy are equally important for the proposed safety metric, so we use the accuracy and precision scores of detection and tracking to generate a second basic safety rating. The detection safety (\(S_D\)) and the tracking safety (\(S_T\)) are defined as:

$$\begin{aligned} S_D = \frac{\textrm{MODA}+\textrm{MODP}}{2} ,\quad S_T = \frac{\textrm{MOTA}+\textrm{MOTP}_s}{2}. \end{aligned}$$
(12)

This evaluation is just a baseline and further values must be evaluated to cover the three safety criteria, which were introduced in Sect. 3.2.

3.2.2 Distance-Based IoU Verification

A second, parallel assessment is carried out before the CLEAR metrics are computed. For objects closer to the ego vehicle, the perception must be more precise; the shorter time to react during motion planning is the reason for this stricter criterion for closer objects. We therefore need to differentiate the perception quality, since these objects exhibit a higher safety criticality.

The distance-based IoU verification uses the cover \(C_o\) of GT object G. For a detected object o with detection D the cover is defined as:

$$\begin{aligned} C_o =\frac{ {|D_o \cap G_o|} }{ |G_o| }\,. \end{aligned}$$
(13)

Using \( C_o \), a safety function \(f_s\) is defined as:

$$\begin{aligned} f_s(C_o)= {\left\{ \begin{array}{ll} \frac{1+ mC +(1- mC )\sin (\pi (C_o-\frac{1}{2}))}{2} & C_o \in [ mC , 1],\\ 1 & C_o \in (1, oT ],\\ \frac{1+\cos (\frac{\pi }{ mO - oT }(C_o- oT ))}{2} & C_o \in (oT, mO ],\\ 0 & \textrm{otherwise}. \end{array}\right. } \end{aligned}$$
(14)

This function guarantees a minimum detection precision \( mC \). Between the thresholds \( mC \) and \( mO \), trigonometric functions are used to obtain a smooth distance-based scaling factor depending on the precision of the detection. \( oT \) defines a threshold for how much larger than the GT an object is allowed to be detected without lowering the detection precision. If \(C_o\) is larger than \( oT \), \( mO \) represents the upper bound up to which \(f_s\) reduces the precision towards zero.

The distance-based score is calculated by function \(g : [0,1]^2 \rightarrow [-1,1]\), where

$$\begin{aligned} g(x, y) = x - (1-x) \cdot (1-y). \end{aligned}$$
(15)

The function \(g( f_s(C_o), d_o)\) must be transformed to [0, 1] to be used as a precision factor. The transformation is described by

$$\begin{aligned} f_v = \frac{g( f_s(C_o), d_o ) + 1}{2}. \end{aligned}$$
(16)

For each detected object o the IoU is multiplied by \(f_v\). This additional consideration leads to a stricter rating, which is preferable in the context of safety.
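The sketch below implements Eqs. (13)-(16); the threshold values for \(mC\), \(oT\) and \(mO\) are illustrative assumptions, and \(d_o\) is assumed to be the object distance normalized to [0, 1] (1 corresponding to a far, uncritical object).

import math

def f_s(cover, mC=0.5, oT=1.2, mO=1.5):
    # Eq. (14): smooth precision scaling depending on the cover C_o (Eq. 13).
    if mC <= cover <= 1.0:
        return (1.0 + mC + (1.0 - mC) * math.sin(math.pi * (cover - 0.5))) / 2.0
    if 1.0 < cover <= oT:
        return 1.0
    if oT < cover <= mO:
        return (1.0 + math.cos(math.pi / (mO - oT) * (cover - oT))) / 2.0
    return 0.0

def f_v(cover, d_norm, **thresholds):
    g = lambda x, y: x - (1.0 - x) * (1.0 - y)                    # Eq. (15)
    return (g(f_s(cover, **thresholds), d_norm) + 1.0) / 2.0      # Eq. (16)

# The IoU of each detected object is multiplied by f_v before the CLEAR scores
# are computed.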

3.2.3 Consideration of the Collision Relevance

The second criterion for assessing perception safety is the relevance of an object. A possibly safety-critical object has a higher importance than a non-safety-critical object.

First, it must be defined when an object is considered safety critical. An object is safety critical if its distance to the ego vehicle is less than a corresponding safety distance. To calculate the safety distance, we use the approach of the “Responsibility-Sensitive Safety” (RSS) model [93]. The RSS model is an attempt to formalize human judgment in different road scenarios in a mathematical sense. It consists of 34 definitions of different safety distances, times, and procedural rules. These rules specify how an autonomous vehicle should behave and provide a mathematical description of safe conduct.

We use the longitudinal safety distance with same direction of movement \(d_{long,s}\), with opposite direction of movement \(d_{long,o}\) and the lateral safety distance \(d_{lat}\) [93, Definition 1, 2, 6].

To evaluate the collision relevance of an object, the future position must be predicted. With the ego velocity \(v_0\) and the weather-dependent brake acceleration a, the prediction time frame \(t_p\) is defined as:

$$\begin{aligned} t_p= 1.1 \cdot \frac{v_0}{a}. \end{aligned}$$
(17)
Fig. 13

Image from [110]

Schematic identification of collision relevant objects from KITTI raw data set [33]. The right image represents the bird’s eye view of the camera image on the left. Blue boxes illustrate ground truth annotations, light blue boxes represent the predicted object positions. Red filled objects are collision relevant and white ones are not. The corresponding collision relevant objects in the camera image are marked in red.

For each time step in the position prediction phase, it is verified whether the distance between the ego vehicle and the object is larger than the corresponding safety distance. If this is not the case and the perception system did not perceive the object, it is marked as safety critical.
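The sketch below illustrates this check for the longitudinal, same-direction case; the safety distance formula follows [93, Definition 1], and the response time and acceleration parameters are illustrative assumptions rather than calibrated values.

def rss_long_distance(v_rear, v_front, rho=1.0,
                      a_accel=2.0, a_min_brake=4.0, a_max_brake=8.0):
    # Longitudinal RSS safety distance for two vehicles driving in the same
    # direction [93, Definition 1]; negative values are clipped to zero.
    d = (v_rear * rho + 0.5 * a_accel * rho ** 2
         + (v_rear + rho * a_accel) ** 2 / (2.0 * a_min_brake)
         - v_front ** 2 / (2.0 * a_max_brake))
    return max(d, 0.0)

def is_safety_critical(predicted_gaps, v_ego, v_obj, dt=0.1, brake_a=6.0):
    # predicted_gaps: longitudinal gaps [m] over the prediction horizon t_p (Eq. 17).
    t_p = 1.1 * v_ego / brake_a
    steps = int(t_p / dt)
    return any(gap < rss_long_distance(v_ego, v_obj)
               for gap in predicted_gaps[:steps])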

Figure 13 shows this process schematically. The red area in the bird’s eye view marks the safety critical area identified by lateral and longitudinal RSS safety distances. The collision relevant objects are marked in red. If they are not perceived, they are considered safety critical, as shown in Fig. 13.

To rate the relevance in the context of safety, we need to approximate the effect of a missed detection and a hypothetical resulting collision. The first step is an approximation of the impact velocity in case of an actual collision.

Since safety in automated driving affects not only the vehicle occupants but also other road users, these must also be taken into account. Road users can be categorized into VRUs and road users with a crumple zone, such as cars, vans or trucks.

The combination of the impact velocity and the road user category c of the collision-relevant object leads to a collision score \(s_{c,ro}\) for a relevant, non-detected safety-critical object ro. To assess \(s_{c,ro}\), a classification of the impact velocity with four levels is defined. The level definition is based on the common accident categories used in Germany. These categories are defined by the Ministry of the Interior of the state of North Rhine-Westphalia in Germany, ranging from UK 1 (fatality) to UK 3 (minor injuries only) [65]. Furthermore, an additional category UK 5 is used to include collisions with material damage only [65].

More about the effects of vehicle impact velocity in a collision can be found in the publications of Frederiksson et al. [27] and Han et al. [37].

The defined categories with their collision scores \(s_{c,ro}\) are:

$$\begin{aligned} s_{c,ro} :={\left\{ \begin{array}{ll} 0.9 & \text {no or almost no effect (UK5)},\\ 0.75 & \text {risk of minor injuries (UK3)},\\ 0.5 & \text {risk of serious injuries (UK2)},\\ 0 & \text {high probability of fatality (UK1)}.\\ \end{array}\right. } \end{aligned}$$
(18)

In our approach, \(s_{c,ro}\) is used as a factor for a single frame. A collision that is rated as having a high probability of fatalities is unacceptable and receives a score of 0. The case with almost no effect is still worse than no accident, thus a factor of 0.9 is defined. \(s_{c,ro}\) must not be too strict, otherwise no accurate differentiation of the final safety value would be possible.

For a single frame the worst case \(s_{c,ro}\) is calculated and used as factor \(f_c\) on \(S_T\) and \(S_D\).

3.2.4 Evaluation of Perception Time

Time is the third requirement for a safety-critical real-time perception system. The longer object identification takes, the less time remains to avert a life-threatening situation. The time requirement of the proposed safety metric is covered by the soft real-time approach of Kim et al. [58].

For the CSM, the perception time \(t_{d,o}\) of object o is defined as the time from falling below the safety distance (see Sect. 3.2.3) until its perception.

A weighted perception time is used to convert the detection time into a perception time factor. The weighting was introduced because a situation becomes more dangerous the longer it takes to identify it. The mean perception time is used to distinguish long from short durations for this purpose. With m as the number of all weights and \(\overline{t_d}\) as the mean perception time, the weighted perception time \(t_{dw}\) is defined as:

$$\begin{aligned} t_{dw} =\frac{1}{m} \sum _{o} {\left\{ \begin{array}{ll} t_{d,o} & t_{d,o} \le \overline{t_d},\\ 2\cdot t_{d,o} & \textrm{otherwise}.\\ \end{array}\right. } \end{aligned}$$
(19)

Similar to Kim et al. [58], the CLEAR scores are mapped by a function depending on \(t_{dw}\). The mapping of \(t_{dw}\) to \(f_t\) is done with (11). The parameter \(T_l\) is set to 0.1 s as the tolerable delay for the detection, and \(T_u\) is set to the ego braking time \(t_b\). If \(t_{dw}>t_b\), \(f_t\) has to be 0, since an emergency braking would no longer be possible.
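The sketch below combines Eq. (19) with the mapping of Eq. (11); m is interpreted here as the number of objects, and the thresholds \(T_l = 0.1\) s and \(T_u = t_b\) follow the text.

import numpy as np

def f_norm(x, t_lower, t_upper):
    # Eq. (11), reused here for the time mapping.
    if x < t_lower:
        return 1.0
    if x <= t_upper:
        return 1.0 - (x - t_lower) / (t_upper - t_lower)
    return 0.0

def weighted_perception_time(times):
    # Eq. (19): perception times above the mean are weighted twice.
    t = np.asarray(times, dtype=float)
    weighted = np.where(t <= t.mean(), t, 2.0 * t)
    return float(weighted.mean())

def f_t(times, t_brake):
    return f_norm(weighted_perception_time(times), t_lower=0.1, t_upper=t_brake)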

3.3 Comprehensive Safety Metric Score

The result of the CSM must be easily comparable. Hence, the safety metric score is a single value \(S \in [0,1]\), where 1 describes the maximum safety. Like the previously described performance metrics, S is determined for each frame of a scenario. To this end, \(S_D\) and \(S_T\), including the evaluation of collision relevance and perception time, are combined.

To achieve a high variability in the CSM, \(S_D\) and \(S_T\) can be weighted with \(w_{D}, w_{T} \in [0,1]: w_{D}+w_{T}=1\). The safety score S is defined as:

$$\begin{aligned} S = w_D S_D + w_T S_T. \end{aligned}$$
(20)
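
A minimal sketch of (20), assuming \(S_D\) and \(S_T\) already include the collision factor \(f_c\) and the perception time factor \(f_t\) described above:

def safety_score(S_D, S_T, w_D=0.5, w_T=0.5):
    """Per-frame comprehensive safety metric score S = w_D*S_D + w_T*S_T.
    The equal default weights are an illustrative choice."""
    assert 0.0 <= w_D <= 1.0 and abs(w_D + w_T - 1.0) < 1e-9
    return w_D * S_D + w_T * S_T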

In contrast to precision or accuracy, the comprehensive safety score is not a percentage value, which makes its interpretation less intuitive. It is therefore necessary to specify a categorization of S (see Table 1) in order to improve interpretability. The defined five-level classification is based on the evaluation of the individual CLEAR metric values as well as the specified influences of the collision relevance and detection time analysis.

Table 1 Rating of the safety metric score. Table from [110]

This classification offers a quick and easy comparison of the safety evaluation of different test scenarios and perception systems.

3.4 Data Set Evaluation with the Safety Metric

Initially, we motivated the safety metric by the scenario shown in Fig. 11. The resulting precision of 100% and recall of 86% indicate a very good perception. Depending on the velocity of the vehicles in an inner-city scenario, the result of the CSM would be in the range of 0.4, indicating a risk of minor to serious injuries, which is far from a safe state.

Table 2 Evaluation results for object detection with YOLOv3. Table from [110]

Table 2 shows the results of an image-plane object detection using YOLOv3 [79] on three VTD scenarios (freeway, crossing and rural) [1] and the KITTI raw data set [33]. As we can see, the precision is over 80% for the virtual scenarios, but recall and mAP are rather low at about 30% for freeway and crossing. The significant gap between the mAP for KITTI and the simulated scenarios can be explained by the number of objects and their positioning: multiple objects are occluded and thus cannot be perceived correctly. Compared to state-of-the-art performance, these results seem acceptable, but the CSM yields 0.14/0.20 for freeway/crossing. Using the corresponding classification of Table 1, this indicates a high risk of fatality due to undetected relevant objects. For the rural scenario the safety score S is higher than the corresponding recall and mAP. Even though the recall is not perfect, we can observe that the perception is close to a safe state. This is because the rural scenario consists of our ego vehicle following two further vehicles on a rural road. Single misdetections have no influence, since the distance between the objects is large enough that there is no significant risk of an accident. For the KITTI raw data set, the safety score S, the recall and the mAP are quite similar, but the interpretation of these values differs considerably. While a recall or mAP of 50–60% seems good, a safety score of 0.48 indicates that some missing detections could lead to accidents with a probability of injuries, which is not acceptable.

Further results for Faster-RCNN, RCC and a Birds-Eye 3D detection can be found in [110].

4 Optimization of Object Perception

This section addresses the optimization of local and cooperative perception. First, the need for robustness improvement is motivated by showing the influence of weather on vision-based object detection. Sections 1.3 and 4.2 present the used data sets and introduce our proposed robustness enhancement for local perception. Finally, the advantages of cooperative perception are introduced and an environment-aware optimization approach for the data fusion in CP is presented.

4.1 Influence of Weather on Perception

Object detection relying on camera sensors is prone to adverse weather conditions such as heavy rain or difficult lighting conditions. Therefore, vision-based object detection in particular needs to be resilient to adverse and varying weather conditions. In order to determine its resilience and robustness, the capabilities of vehicle-local perception under varying weather conditions are investigated. In the following, vision-based perception will be referred to simply as perception.

For the robustness assessment of perception, two different neural networks (Faster-RCNN [82] and YOLOv3 [79]) are evaluated. Both networks are trained on the KITTI data set [34] and the quality of object detection is assessed with the well-known average precision (AP) metric as presented in Sect. 3.1. To evaluate resilience against adverse weather conditions, a realistic synthetic rain augmentation is used to modify the KITTI data set. The augmentation consists of two steps: the generation of falling rain [43] followed by rendering raindrops on the windshield [6]. The exact process of simulating rain is explained in Sect. 2.1. The rain augmentation technique provides various parameters to adjust the simulated rain. For evaluation, the same parameter ranges as used for the optimization in Table 3 were applied. However, the ranges of rain intensity and image brightness have been adapted to cover a large variation in the evaluation phase:

  • rain intensity \(r_{i}\): \([0\,\mathrm{mm\,h^{-1}}, 80\,\mathrm{mm\,h^{-1}}]\)

  • brightness \(r_{b}\): \([25\,\%, 200\,\%]\)

The exact parameter values were randomly chosen within the parameter ranges defined above. The networks, which were initially trained on the original, non-augmented KITTI data set, are then evaluated on the distinct KITTI test set that was not used for training. The test set is augmented with synthetic rain and the perception capabilities are investigated.

Fig. 14
Mean average precision depending on varying rain and brightness intensities for a Faster-RCNN and b YOLOv3. Image from [112]

The influence of synthetic rain variations on the mAP of Faster-RCNN is illustrated in Fig. 14a and for YOLOv3 in Fig. 14b. The achieved mAP is plotted for different rain intensities and brightness levels. Drop radii are implicitly included in the varying rain intensities. The angle of the falling rain is not plotted separately as it had less influence than rain rate and brightness.

Faster-RCNN achieved a mean mAP of 45.45% while YOLOv3 achieved a mean mAP of 33.74% over all rain and brightness variations. The networks were not trained separately for cars, pedestrians and cyclists only, as is usually done for the KITTI benchmark; we rather used all present KITTI labels for training. Hence, the mAP of 50.42% (Faster-RCNN) and 48.42% (YOLOv3) without augmentation is not to be confused with the online available results. Additionally, the online available AP values are given per class, whereas we average the AP over all classes weighted by the number of objects per class.
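
The class-weighted averaging described above can be sketched as follows; the example values in the comment are hypothetical and only illustrate the intended weighting.

import numpy as np

def weighted_map(ap_per_class, objects_per_class):
    """Mean AP over all classes, weighted by the number of ground-truth
    objects per class, instead of the unweighted per-class averaging of the
    official KITTI benchmark."""
    classes = list(ap_per_class)
    ap = np.array([ap_per_class[c] for c in classes], dtype=float)
    n = np.array([objects_per_class[c] for c in classes], dtype=float)
    return float((ap * n).sum() / n.sum())

# Hypothetical example:
# weighted_map({"car": 0.62, "pedestrian": 0.35}, {"car": 900, "pedestrian": 150})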

Increasing rain intensities and brightness values below 100% drastically lower the mAP of the investigated neural networks. For Faster-RCNN, the most critical situation was observed at 80 mm h\(^{-1}\) rain intensity, 0\(^{\circ }\) rain angle and 25% brightness, which resulted in a drop of 94.21% compared to the non-augmented KITTI data. YOLOv3 had the worst detection rates at 80 mm h\(^{-1}\) rain intensity, \(-30^{\circ }\) rain angle and 25% brightness, which led to a detection drop of 99.61%. Hasirlioglu and Riener [39] found similar results in their investigation of the influence of rainy weather on object detection performance. The investigation shows that neural networks are not robust against adverse weather conditions. Data sets such as KITTI lack weather-influenced scenarios; therefore, it is not possible to obtain robust networks just by training on them.

4.2 Optimization of Local Perception

Vehicle-local perception is strongly affected by adverse weather conditions such as heavy rain (see Sect. 4.1). To optimize the perception capabilities of vision-based object detection, we introduce a methodology that uses the realistic augmentation techniques presented in Sect. 2 to diversify existing data sets with adverse weather conditions. This makes neural networks more robust by providing training data that is as diverse as possible. An overview of our proposed workflow is illustrated in Fig. 15.

Fig. 15
Workflow of local robustness optimization and evaluation by simulating rain variations. Image from [112]

The first step is to extend the KITTI training set [34] with augmented data. Next, the training of Faster-RCNN [82] and YOLOv3 [79] is performed again on this new and diversified data set. The KITTI training set was split as before into a training set consisting of 6800 images and a test set containing 468 images. Rain augmentation is performed for the whole training set of 6800 images and added to the original training data, resulting in a training data set of 13600 images. Hence, only half of the images from the training data are augmented while the other half are not. This prevents overfitting to adverse weather conditions, and the neural networks still perform well on the original data set. To validate the effectiveness of the proposed data augmentation through synthetic rain and brightness variation, additional data augmentation methods were compared against our approach. Therefore, the neural networks were also trained with data sets extended by Gaussian noise (GN), Salt-and-Pepper noise (SPN) and a combination of GN and SPN.

A large variation of different augmentations (see Table 3) has been used to extend the training data set. Six parameters for synthetic rain have been chosen as in the evaluation (see Sect. 4.1). For GN two parameters, for SPN one parameter and for the combination three parameters specify the noise intensity. The parameter ranges were chosen as follows: the evaluation identified that only a brightness below 100% has a strong negative effect on the neural networks, and for a higher brightness an increasing rain rate affects the neural networks less. The intervals for brightness augmentation and rain intensity have therefore been set to the ranges found to be critical in the evaluation phase. The lower bound of the brightness augmentation was set to 40%, as this proved more effective than lower brightness values. The lower bound of the rain intensity was raised to 30 mm h\(^{-1}\), as challenging situations only occurred above this rain intensity.

Table 3 Parameter ranges for data augmentation in the optimization phase of our workflow. Table from [112]

Similar to the evaluation phase, the exact parameter values for every augmentation technique were randomly chosen for each image within the specified parameter ranges to generate a training set of various conditions, except for \(d_{\mu }\) and \(d_{\sigma }\).

\(d_{\mu }\) and \(d_{\sigma }\) are calculated from the randomly chosen rain intensity with the equations introduced in Sect. 2.1. The random number generator was seeded to be able to generate reproducible results.
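
The per-image parameter sampling can be sketched as follows. The dictionary layout and parameter names are illustrative; only the two ranges quoted in the text are reproduced, and the drop-size equations of Sect. 2.1 for \(d_{\mu }\) and \(d_{\sigma }\) are not shown.

import random

PARAM_RANGES = {
    "rain_intensity_mm_h": (30.0, 80.0),   # lower bound raised to 30 mm/h
    "brightness_percent": (40.0, 100.0),   # lower bound raised to 40 %
}

def sample_augmentation_params(seed):
    """Draw one reproducible parameter set for a single image; the RNG is
    seeded so that the augmented data set can be regenerated identically."""
    rng = random.Random(seed)
    params = {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}
    # d_mu and d_sigma are not drawn randomly but derived from the sampled
    # rain intensity with the equations of Sect. 2.1 (not reproduced here).
    return params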

4.2.1 Results for Optimization of Local Perception

To evaluate the presented perception optimization approach, the realrain data set (see Sect. 1.3) was used. This data set was solely used for validation and not for training. The perception capabilities in terms of AP and AA were investigated for the baseline, the GN and SPN augmentation techniques and our optimization. A qualitative comparison is illustrated in Fig. 16.

Fig. 16
Comparison of GT (blue), Faster-RCNN baseline (red), optimization with GN and SPN (yellow) and our optimization with rain and brightness variations (green) on example images taken from our realrain data set. Images from [112]

Quite remarkable is the fact that only with our optimization approach the CNN was able to detect the vehicle obstructed by raindrops in Fig. 16b. A complete overview of the results is presented in Table 4. It can be seen that our optimization performs best for YOLOv3 as well as for Faster-RCNN considering AP. With our approach, the unoptimized detection was improved by 4.37 percentage points (p.p.) for Faster-RCNN and by 7.33 p.p. for YOLOv3. The second-best optimization in comparison achieved an improvement of 1.65 p.p. for Faster-RCNN and 2.18 p.p. for YOLOv3. Looking at AA instead of AP, it can be seen that AA decreased by 1.67 p.p. for Faster-RCNN but improved by 0.53 p.p. for YOLOv3. Optimization with SPN performs best for AA but worst when it comes to AP. For safety under adverse weather conditions, not perceiving an obstacle is more severe than false positive detections, which, e.g., could result in additional braking maneuvers. Therefore, the AP metric is more relevant than the AA metric for assessing perception performance because it considers recall as well as precision.

Furthermore, we compare our two optimized networks to the more robust neural network RRC [81]. RRC achieves a mean AP of 74.60% on the KITTI test set. This is a lower mean AP value compared to the online available results on the KITTI benchmark website, as RRC was trained on all present KITTI labels and not separately for cars, cyclists and pedestrians. However, on the realrain data set RRC only achieves an AP of 12.97%. This shows that even more robust networks are incapable of handling adverse weather conditions such as heavy rain. Both networks which were optimized with rain variations achieve a performance similar to RRC in AP on the realrain data set, although the unoptimized versions perform drastically worse.

Table 4 Average precision and accuracy results for Faster-RCNN, YOLOv3 and RRC on the evaluation of our realrain data set and the original KITTI test set. Table shortened from [112]

A disadvantage of many data augmentation techniques for enlarging training data sets is a decrease of performance on the original data set. Hence, we evaluated the performance on the original KITTI data set as well. The results are presented in Table 4. It can be observed that our optimization approach with synthetic rain variations has almost no negative effect on the performance on the original KITTI data set. For Faster-RCNN the AP was lowered by 0.47 p.p. and for YOLOv3 by 0.63 p.p. Compared with the GN augmentation, our approach increased the performance for Faster-RCNN and decreased it for YOLOv3. The remaining augmentation techniques including SPN lowered the AP slightly for Faster-RCNN but significantly for YOLOv3.

The presented approach shows that using realistic synthetic rain variations to extend existing data sets for the training of neural networks can improve the robustness of these networks against adverse weather conditions. It has been shown that the performance on the completely different realrain data set could be improved while maintaining the performance on the original data set.

4.3 Cooperative Perception

Cooperative perception describes a process in which the perception is distributed across multiple vehicles. Information about locally perceived objects is transmitted via V2X communication between different vehicles. The ETSI defined two message formats for this purpose. The first is the Cooperative Awareness Message (CAM) [22], which contains the state (position, velocity, orientation) of the ego vehicle. The second is the Collective Perception Message (CPM) [21], which contains the ego state as well as the states of the locally perceived objects. The ego vehicle must align all information of the local perception and the data from the cooperative vehicles to its ego vehicle coordinate system; afterwards all information must be matched before a fusion can be executed. The fusion is necessary to combine different information about the same object, as exactly one valid state per object is required.
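
As a rough illustration of the data exchanged in CP, the following sketch defines strongly simplified message containers. The field names and types are ours; they do not follow the actual ETSI CAM/CPM specifications [21, 22].

from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectState:
    x: float            # position in the sender's coordinate system [m]
    y: float
    velocity: float     # [m/s]
    heading: float      # orientation [rad]
    confidence: float   # trustworthiness of the state estimate

@dataclass
class CollectivePerceptionMessage:
    ego_state: ObjectState                               # state of the sending vehicle
    perceived_objects: List[ObjectState] = field(default_factory=list)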

Fig. 17
Process of cooperative perception including a weather simulation (b). Image from [108]

The advantages of CP are manifold. The main advantage is the increase of the perception range. Local perception can be limited through weather conditions (see Sect. 4.1), limited sensor ranges and occlusion. The CP, as shown in Fig. 17, enables the perception of objects that cannot be perceived locally. The ego vehicle (blue) can only locally detect the gray vehicle in front; the other objects are occluded by a building. The cooperative vehicle (red) can detect the gray vehicle in front of it and send this detection together with its state to the ego. The ego now knows about the existence of two further objects behind the corner. Furthermore, CP can lead to multiple detections of the same vehicle, which allows a more precise estimation of an object’s state.

The advantage of CP under different weather conditions was investigated by Volk et al. [108]. Their results can be used to quantify the above described advantage.

As shown in Table 5, they achieved remarkable results. For a freeway scenario without any rain, CP could increase the mAP from 10.63 to 28.27% with 40% cooperative vehicles. At higher rain rates of about 70–90 mm h\(^{-1}\), the local perception was not able to detect any object while CP still achieved a mAP of about 24%. Similar results could be observed for a rural and an intersection scenario. The rural scenario consists of only two vehicles besides the ego; one of them is a cooperative vehicle.

Table 5 Comparison of mean average precision of local perception (LP) against cooperative perception (CP) over different rain rates on a rural, intersection and freeway scenario. CP40 refers to a cooperative equipment rate of 40%. Table shortened from [108]

4.3.1 Optimization of Cooperative Perception

Cooperative perception complicates the measurement-to-track assignment problem, as well as data tracking and fusion. There are two basic methodologies for tracking and fusion. The first is to have a centralized tracking component that directly handles sensor data [77]. The second method, known as Track-to-Track Fusion (T2TF), employs decentralized tracking components and fuses preprocessed sensor data available as tracks (state vector and corresponding covariance/confidence). T2TF has the advantage of providing more information about object dynamics and compensating V2X transmission latencies for CP [77].

Covariance Intersection (CI) of Julier and Uhlmann [48] was one of the first fusion approaches considering unknown correlations.

The CI to determine a fused state \(\hat{\textrm{x}}_g\) with covariance matrix (CM) \(\textrm{P}_g\) for two track states \(\hat{\textrm{x}}_i, \hat{\textrm{x}}_j\) with their CMs \(\textrm{P}_i,\textrm{P}_j\) is defined by Julier and Uhlmann [48] as

$$\begin{aligned} \textrm{P}_g^{-1}&=\omega \textrm{P}_i^{-1}+ (1-\omega ) \textrm{P}_j^{-1}, \end{aligned}$$
(21)
$$\begin{aligned} \hat{\textrm{x}}_g&=\textrm{P}_g (\omega \textrm{P}_i^{-1}\hat{\textrm{x}}_i+ (1-\omega ) \textrm{P}_j^{-1}\hat{\textrm{x}}_j), \end{aligned}$$
(22)
$$\begin{aligned} \omega &= \mathop {\mathrm {arg\,min}}_{\omega \in [0,1]} \det \textrm{P}_g. \end{aligned}$$
(23)
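
A compact numerical sketch of (21)-(23) is given below. It determines \(\omega \) with a bounded scalar minimization of \(\det \textrm{P}_g\), which is one possible way to solve (23) and not necessarily the approximation used in the cited works.

import numpy as np
from scipy.optimize import minimize_scalar

def covariance_intersection(x_i, P_i, x_j, P_j):
    """Covariance Intersection of two track estimates, Eqs. (21)-(23)."""
    Pi_inv, Pj_inv = np.linalg.inv(P_i), np.linalg.inv(P_j)

    def det_Pg(omega):
        return np.linalg.det(np.linalg.inv(omega * Pi_inv + (1 - omega) * Pj_inv))

    omega = minimize_scalar(det_Pg, bounds=(0.0, 1.0), method="bounded").x
    P_g = np.linalg.inv(omega * Pi_inv + (1 - omega) * Pj_inv)          # Eq. (21)
    x_g = P_g @ (omega * Pi_inv @ x_i + (1 - omega) * Pj_inv @ x_j)     # Eq. (22)
    return x_g, P_g, omega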

Improvements of the CI regarding the sequential fusion of multiple data and the approximation of \(\omega \) were presented by Cong et al. [13] as well as Niehsen [71] and Fränken and Hüpper [26].

The CI has some disadvantages; for more than two tracks it was proven by Reinhardt et al. [80] that the CI does not necessarily deliver the optimal result. Furthermore, the CI does not consider inconsistent inputs. To address the problem of inconsistent inputs, Covariance Union (CU) was presented [105]. If the deviation between two inputs exceeds a defined threshold they are considered as inconsistent [11].

In addition to CI and CU, there exist many more approaches for the T2TF. More information about T2TF can be found in [11, 75, 77, 111].

However, the CI cannot fulfil the performance requirements for CP in autonomous driving. As a result, the robust but suboptimal CI must be optimized so that only accurate and trustworthy data contribute to the cooperatively perceived environmental model. A pre-evaluation analyzes the capabilities of local perception systems so that the T2TF algorithm can evaluate the trustworthiness and validity of cooperatively transmitted data before fusing it. Therefore, the assumption is made that the local perception system of each vehicle is known.

Figure 18 shows a schematic overview of our proposed optimization pipeline. The pre-evaluation is used to determine the reference data \(\textrm{R}_{\textrm{conf}}\) for the confidence and \(\textrm{R}_{\textrm{cov}}\) for the covariance. The reference data, combined with the corresponding tracks, are used in a track-validation module, which performs the suggested validation before the CI is used for T2TF.

Fig. 18
Process overview from the pre-evaluation of a local perception over the validation of the received tracks in the track-validation to the T2TF fusion. Image from [111]

Pre-Evaluation of Local Perception

For our approach we assume that the local perception system \(l_v\) (sensor configuration and processing pipeline) of a cooperative vehicle v is known. Additionally, the current weather condition e including its intensity must be known. Adverse weather is considered since it has a significant influence on the perception capabilities [112]. The pre-evaluation characterizes the local perception systems by their perception accuracy, measured with the CMs, and by their perception capabilities in terms of confidence, measured by the recall.

The local perception is analyzed under varying weather conditions and the objects are clustered in distance bins d of size \(s_{bin}=5\,\mathrm{m}\); this approximately corresponds to the length of an average car.

The results of the local perception are analyzed to obtain realistic and comparable confidences and CMs, which are used to determine whether a track from a cooperative vehicle is plausible and can be considered valid for fusion. Based on this evaluation, two weather-related lookup tables of the local perception capabilities are built for each specific perception system \(l_v\). These lookup tables are \(\textrm{R}_{\textrm{conf}}(l_v,e,d)\) (abbr. \(\textrm{R}_{\textrm{conf}}\)) and \(\textrm{R}_{\textrm{cov}}(l_v,e,d)\) (abbr. \(\textrm{R}_{\textrm{cov}}\)).

The recall [74] at a distance bin d for \(l_v\) is used as confidence. The IoU [83] must be greater than 0.5 for a classification as true positive.

A cloudy day is used as baseline for the evaluation. To include adverse weather conditions, the local perception is additionally evaluated under foggy conditions with densities from 0.01 \(\upmu \textrm{m}^{-3}\) to 0.15 \(\upmu \textrm{m}^{-3}\). To achieve reliable results, each weather condition is executed for 10 runs with random positioning of the vehicles. Even with 10 runs it can occur that no objects are present at specific distances, so no reference data can be calculated for those bins. To avoid missing values in \(\textrm{R}_{\textrm{conf}}\) and \(\textrm{R}_{\textrm{cov}}\), linear interpolation is used.
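
A possible construction of one row of \(\textrm{R}_{\textrm{conf}}\) for a fixed perception system \(l_v\) and weather condition e is sketched below; the maximum distance and the array-based interface are assumptions made for illustration.

import numpy as np

def build_confidence_lookup(distances, is_true_positive, bin_size=5.0, max_dist=100.0):
    """Recall per distance bin, with linear interpolation for bins that
    contained no objects over all runs."""
    distances = np.asarray(distances, dtype=float)
    is_true_positive = np.asarray(is_true_positive, dtype=float)

    edges = np.arange(0.0, max_dist + bin_size, bin_size)
    centers = (edges[:-1] + edges[1:]) / 2.0
    recall = np.full(centers.shape, np.nan)

    bins = np.digitize(distances, edges) - 1
    for b in range(centers.size):
        mask = bins == b
        if mask.any():
            recall[b] = is_true_positive[mask].mean()    # recall in this 5 m bin

    known = ~np.isnan(recall)                            # bins with observations
    recall[~known] = np.interp(centers[~known], centers[known], recall[known])
    return centers, recall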

Optimization Strategies

Based on the pre-evaluation, our proposed approach validates collectively received tracks by comparison to \(\textrm{R}_{\textrm{conf}}\) and \(\textrm{R}_{\textrm{cov}}\). This enables a sophisticated validation of perceived data in order to improve the resilience of unoptimized data fusion methods to harsh weather conditions as well as forged data. Two different validation approaches are investigated. First, a selection of tracks that reduces the number of tracks used for fusion is presented. Second, an advanced filtering approach based on the pre-evaluation is investigated.

Track Selection

Reinhardt et al. [80] have proven that the CI is not necessarily optimal for more than two tracks. As a result, one optimization strategy reduces the number of tracks used for fusion to two. For the selection of the tracks used for fusion, two approaches based on confidence and CM are considered. The first strategy only takes the two tracks with the highest confidence into account. The second strategy uses the two estimates with the smallest trace of their CM. The two advantages of this approach are its simplicity and a reduction of noise from inaccurate estimations; however, the method only works if more than two estimations exist and cannot avoid forged data.
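
A minimal sketch of the two selection strategies, assuming each track is represented by a covariance matrix 'P' and a scalar 'confidence' (an illustrative representation, not the original data structure):

import numpy as np

def select_two_tracks(tracks, strategy="confidence"):
    """Reduce the candidate tracks for CI fusion to two: either the two with
    the highest confidence or the two with the smallest CM trace."""
    if len(tracks) <= 2:
        return tracks                          # selection only applies to >2 tracks
    if strategy == "confidence":
        key = lambda t: -t["confidence"]       # highest confidence first
    else:
        key = lambda t: np.trace(t["P"])       # smallest trace first
    return sorted(tracks, key=key)[:2]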

Track Filtering

The second investigated optimization is a filtering based on validation using the pre-evaluated reference data. This technique addresses fusion precision as well as security; it is split into a confidence-based and a CM-based validation.

A received detection has an assigned confidence describing its trustworthiness. Distant objects are perceived less precisely [108]; thus, lower confidence values are expected. An attacker wants the forged information to be considered for fusion; thus, it is sent with a high confidence, which can be implausible. The corresponding reference confidence from \(\textrm{R}_{\textrm{conf}}\) can be used for validation. If the reference and the received confidence differ by more than a defined threshold, the received information is either inaccurate or possibly forged and thus considered invalid and not used for fusion. Therefore, we assume a standardized assessment of confidence values.

A similar approach is possible using the CMs of the received state estimations. The CMs of \(\textrm{R}_{\textrm{cov}}\) are used to validate the received CM by trace or element-wise. The CM's main diagonal consists of the variances of the track-state elements. Higher variances indicate a more inaccurate estimation, such as for occluded or distant objects. To avoid an inaccurate fusion, inaccurate estimates with a high variance must be discarded, even if their influence is small due to the calculation of \(\omega \).

If the received information’s trace exceeds the reference trace by a threshold \(t_{\textrm{trace}}\), the received information is deemed incorrect and discarded before the fusion. However, as some variation is acceptable, the threshold should not be set too low.

Not only the trace can be used for validation; an element-wise validation on the main diagonal is also possible. To do so, a threshold vector \(t_{\textrm{elem}}\) with the size n of the CM's main diagonal must be defined. Mathematically, the validation process of the two mentioned techniques for filtering inaccurate estimations can be formulated as:

$$\begin{aligned} \textrm{tr}(\textrm{P}_s) - \textrm{tr}(\textrm{R}_{\textrm{cov}}) &> t_{\textrm{trace}}, \text { or} \\ \textrm{P}_s(i,i) - \textrm{R}_{\textrm{cov}}(i,i) &> t_{\textrm{elem}}(i) \text { for } i=1,2,\ldots , n. \end{aligned}$$

If one of the conditions applies, the track \(s(\hat{x}_s, \textrm{P}_s)\) is considered inaccurate and discarded.

The two advantages of the element-wise approach are its higher flexibility and a more detailed validation. The element-wise approach allows a more specific filtering based on the requirements of the current system. Additionally, errors of single values can be detected.

To achieve an influence as high as possible, an attacker would send forged data with a significantly low CM. In contrast to the filtering of inaccurate estimations, \(\textrm{R}_{\textrm{cov}}\) must not exceed the received information by more than \(t_{\textrm{trace}}\) or \(t_{\textrm{elem}}\).

Mathematically this can be described with the two following conditions:

$$\begin{aligned} \textrm{tr}(\textrm{R}_{\textrm{cov}}) - \textrm{tr}(\textrm{P}_s) &> t_{\textrm{trace}}, \text { or} \\ \textrm{R}_{\textrm{cov}}(i,i) - \textrm{P}_s(i,i) &> t_{\textrm{elem}}(i) \text { for } i=1,2,\ldots , n. \end{aligned}$$

If one of the conditions is evaluated as true, the track must be considered as possibly forged and therefore will be discarded.
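
Both validation directions can be combined in a single check, as sketched below. In practice either the trace-based or the element-wise criterion would be configured; the sketch evaluates both alternatives only for brevity.

import numpy as np

def is_track_valid(P_s, R_cov, t_trace, t_elem):
    """Discard a received track if it is implausibly inaccurate (first pair
    of conditions) or implausibly accurate and thus possibly forged
    (second pair of conditions)."""
    diag_s, diag_r = np.diag(P_s), np.diag(R_cov)
    t_elem = np.asarray(t_elem)

    too_inaccurate = (np.trace(P_s) - np.trace(R_cov) > t_trace) or \
                     bool(np.any(diag_s - diag_r > t_elem))
    possibly_forged = (np.trace(R_cov) - np.trace(P_s) > t_trace) or \
                      bool(np.any(diag_r - diag_s > t_elem))
    return not (too_inaccurate or possibly_forged)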

Results for Optimization of Cooperative Perception

First results showed a precision increase for the detection. Table 6 shows an extract of the precision results for a Vires VTD freeway scenario with 36 vehicles in total for different optimization strategies. With 16.7% (CR1) and 30.6% (CR2), two different equipment rates for cooperative vehicles are investigated. To test the robustness of the approach under realistic environmental conditions, a fog simulation with three different densities is incorporated. Baseline describes the regular CI fusion using all tracks. We can observe a precision of about 55–69%. For the track selection strategies based on the confidence (2TracksConf) and the covariance matrix (2TracksCov), we can observe that the precision drops for no or low fog (0.01 \({\upmu \textrm{m}^{-3}}\)). For medium (0.07 \({\upmu \textrm{m}^{-3}}\)) and dense (0.13 \({\upmu \textrm{m}^{-3}}\)) fog, the precision is similar to the original CI fusion. Using a confidence deviation threshold of 0.2 for the track filtering leads to a minor increase for a cloudy day. For the different fog densities no effect on the precision can be observed. Filtering tracks by trace with \(t_{\textrm{trace}}=4.0\) leads to slightly better results for the cloudy day and low fog with an increase of about 2.5 p.p. For medium and dense fog the precision could be increased significantly by up to 9.5 p.p. for CR1 at dense fog. The element wise track filtering strategy with threshold \(t_{\textrm{elem}}=\{1.5,0.8,0.2,2.0,1.0,0.3\}\) achieved the best precision scores. For the cloudy day as well as for all fog densities, there is a significant increase of precision. Using this strategy the precision can be increased by at least 16.8 p.p. and up to 28.4 p.p. for CR2 and a fog density of 0.07 \({\upmu \textrm{m}^{-3}}\).

Table 6 Overall precision [%] for a cloudy day, varying fog densities and different rates of cooperative vehicles. CR1 refers to 16.7% cooperative vehicles, CR2 refers to 30.6% cooperative vehicles. Adapted from [111]

Further details about the confidence and covariance based optimization of the covariance intersection fusion and more results can be found in Volk et al. [111].

5 Conclusion and Outlook

In this chapter we presented environment-aware approaches for the robustness enhancement of local and cooperative perception. Vision-based object detection must be robust against harsh weather to ensure safety. To enhance existing data sets, which lack adverse weather scenarios, we presented physically correct image-based simulations for rain (including raindrops on the windshield and road spray of driving vehicles), snow and fog. With our proposed RESIST framework [67], a workflow exists to investigate different local and cooperative object perception setups. A wide range of possible input sources allows a comprehensive evaluation of the implemented algorithms. In addition, the realistic weather augmentations were used to study the effects of different weather conditions of varying intensity on vision-based object detection, showing a significant decrease in average precision as rain intensity increased. This leads to the statement that state-of-the-art neural networks are not robust against harsh weather. However, it was shown that training neural networks on data sets containing images with our proposed weather augmentations leads to an increase of the perception performance of up to 7.33 p.p. for the YOLOv3 network.

Additionally, we have shown that even with robust neural networks, the local perception is limited by different factors. To overcome these disadvantages, we considered vision-based cooperative perception. Gathering information with multiple vehicles allows perceiving objects outside the local sensor range or occluded in difficult inner-city scenarios. For different scenarios it was shown that cooperative perception can increase the mean average precision by about 18 p.p. compared to a local perception without any influence of adverse weather. Considering adverse weather, it could be shown that cooperative perception achieves a mean average precision of about 23% while the local perception was not able to detect any object. Hence, cooperative perception increases safety.

Moreover, we have shown that state-of-the-art evaluation metrics for object perception do not necessarily satisfy the safety constraint. Hence, we considered additional factors such as velocity and object class for the evaluation of object perception systems to determine the safety with a comprehensive safety metric.

Besides all achievements some further research topics are still open.

Extend Weather Simulation to Further Sensors

The influence of weather circumstances on vision-based object detection was investigated and presented in detail. To achieve a safe system, autonomous vehicles must have a redundancy in sensors to balance the advantages and disadvantages of different sensor types. LiDAR is a promising technology for object detection because it is highly accurate and has a high sensor range. Since LiDAR sensors emit light waves, they are affected by weather as well. Rain, snow or fog can scatter the light waves such that false detections occur or the sensor range decreases. A first approach to weather simulations for LiDAR perception has been proposed by Teufel et al. [100]. Similar to the camera-based object detection, the optimization of LiDAR-based object detection has been investigated to increase robustness of LiDAR perception [99].

Safety Metric for Environment Perception

Object perception is only one part of the perception of an autonomous vehicle. There are more subsystems such as lane detection, traffic sign recognition or motion planning. To achieve a safe autonomous vehicle, all subsystems are required to be safe. Thus, their safety must be evaluated. Since lane detection and traffic sign recognition are part of the perception, the proposed safety metric can be extended to these tasks. For both tasks some requirements exist; e.g., a lane detection should at least cover the distance required for an emergency brake to ensure safety.

Optimization of Cooperative Perception

An inaccurate local perception could lead to deviations in the state estimation of a cooperatively perceived object. In the worst case, the estimation error increases so much that the benefit of cooperative perception disappears. Thus, only valid and accurate information should be considered for fusion. Additionally, for cooperative perception, the communication channel should not be overloaded by transmitting erroneous information. Therefore, validation strategies at sender and receiver should be investigated to improve communication channel usage as well as fusion accuracy. Moreover, the concept of cooperative perception has been extended to lane detection by Gamerdinger et al. [28].