
1 Introduction and Framework

In this work, the focus is set on depth estimation of binocular and monocular camera systems for autonomous driving in urban areas. For monocular systems with a single input image, depth estimation is inherently ambiguous. To overcome this issue, predictive models can be used to learn the relationship between images and depth [5]. In stereo vision, two rectified input images provide a remedy, as epipolar geometry can be used to generate disparity maps. The disparity map can then be converted into a depth map via the known geometric camera relation. Predictive models are capable of learning stereo matching in an end-to-end fashion while increasing accuracy [6]. Two predictive models are central to this work: Monodepth2 [5] represents the monocular model, whereas 2D-MobileStereoNet (2D-MSNet) [6] is the chosen model for stereo matching. Both models are characterized by low hardware requirements and are therefore suitable for vehicle-related applications. The baseline of the analysis is a geometric approach which uses camera extrinsics and inverse projection to determine metric object depth [3]. For all depth estimation approaches we use a Python framework, the original algorithm repositories, and the KITTI multi-object tracking dataset [4] as image source. The 2D object detection information is taken from the dataset and used to locate and classify a filtered object class subset consisting of vehicles, pedestrians, and cyclists.
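The conversion from a stereo disparity map to metric depth mentioned above follows the standard rectified-stereo relation \(z = f \cdot B / d\). A minimal sketch, assuming a calibrated setup; the focal length and baseline values below are illustrative KITTI-like parameters, not values taken from this work:

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert a disparity map (in pixels) to metric depth (in meters).

    Standard rectified-stereo relation: depth = f * B / d, where f is the
    focal length in pixels and B the stereo baseline in meters, both known
    from the camera setup.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > 0  # zero disparity corresponds to points at infinity
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Illustrative KITTI-like parameters (f ≈ 721 px, baseline ≈ 0.54 m):
depth = disparity_to_depth(np.array([[36.05]]), 721.0, 0.54)  # ≈ 10.8 m
```

The same scaling applies pixel-wise to a full disparity map, which is how the deep-learning outputs become metric depth maps.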

2 Methodology

To infer metric object depth, any lens distortion must be removed from the input images. Afterwards, each image is processed by the depth estimation algorithms stated in Sect. 1. As an evaluation basis, we use pre-trained models: for Monodepth2 the model “mono+stereo \(640\times 192\)” is selected, and for 2D-MSNet the model “SF + DS + KITTI2015”. The resulting disparity maps of the deep-learning models are converted to metric depth maps using the scaling known from the camera setup. In parallel, the given ground truth object detection results are used to generate a list of reference points and bounding box properties per image. Using these object reference points as indices into the depth maps yields a metric depth per object. Figure 1 visualizes exemplary inference output for both deep-learning-based approaches and details the different reference points chosen. The lower reference point is only used for the rule-based inverse perspective mapping (IPM), since the used algorithm projects the pixel coordinates into the camera reference frame. The assumption that each reference point is located on the road plane resolves the ambiguity of the inverse projection. Since the rule-based distance estimation is affected by the vehicle's dynamics, we use the pitch and roll angles given by the dataset to compensate vehicle movement [3]. After depth estimation for each of the 21 data splits, we filter truncated and occluded objects and calculate the absolute error per chosen distance interval based on ground truth LiDAR depth maps.
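The ground-plane assumption of the IPM step can be sketched as follows. This is a simplified flat-road illustration, not the exact algorithm of [3]; the intrinsics, camera height, and axis convention (x right, y down, z forward) are assumptions chosen for the example:

```python
import numpy as np

def ipm_ground_distance(u, v, fx, fy, cx, cy, cam_height, pitch_rad=0.0):
    """Metric forward distance to a pixel assumed to lie on a flat road.

    The ray through pixel (u, v) is intersected with the ground plane
    located cam_height below the camera, which resolves the ambiguity of
    the inverse projection. A pitch rotation (e.g. from vehicle dynamics)
    is applied to the ray before intersection.
    """
    # Ray direction in the camera frame
    ray = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    # Compensate vehicle pitch (rotation about the x-axis)
    c, s = np.cos(pitch_rad), np.sin(pitch_rad)
    ray = np.array([[1, 0, 0], [0, c, -s], [0, s, c]]) @ ray
    if ray[1] <= 0:
        return np.inf  # ray points above the horizon, no ground intersection
    scale = cam_height / ray[1]  # intersect with the plane y = cam_height
    return scale * ray[2]        # forward (metric) distance

# Illustrative call: a point 100 px below the principal point,
# KITTI-like intrinsics, camera 1.65 m above the road:
dist = ipm_ground_distance(609.6, 272.9, 721.0, 721.0, 609.6, 172.9, 1.65)
```

With zero pitch this reduces to \(z = f \cdot h / (v - c_y)\), which makes the sensitivity to pitch errors and to the flat-road assumption at larger distances apparent.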

The simulated braking distances represent an automatic emergency brake (AEB) in the Euro NCAP Car-to-Pedestrian Nearside Adult (CPNA) [2] and Car-to-Car Rear stationary (CCRs) [1] test scenarios with a maximum velocity of 50 [km/h]. The simulation is based on a two-track model of the Institute of Automotive Engineering which is statistically validated with real driving data. A Magic Formula tire model of version 5.2 with parameters of current summer and winter tires is used, and road friction coefficients of \( {\mu _{R}} = 0.3, 0.6, 1\) for snow, wet, and dry road conditions are considered. The trigger distance is set to the braking distance to fulfill the collision unavoidable criterion [7]. Both the performance measures of the depth estimation and the results of the two-track model form the basis for the evaluation. Whereas current publications focus strongly on optimization, the outlined methodology closes the gap between algorithmic evaluation on depth metrics and real applications such as autonomous driving in urban scenarios. The results enable the definition of accuracy requirements and the impact assessment of estimation errors.
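For intuition, the relation between braking distance, depth estimation error, and collision velocity can be approximated with a constant-deceleration point-mass model. Note that the study itself uses a statistically validated two-track model with a Magic Formula tire model, so this simplified sketch will not reproduce its exact numbers:

```python
import math

G = 9.81  # gravitational acceleration [m/s^2]

def braking_distance(v0_kmh, mu):
    """Point-mass braking distance [m] at full braking (a = mu * g)."""
    v0 = v0_kmh / 3.6
    return v0**2 / (2 * mu * G)

def collision_velocity(v0_kmh, mu, depth_error_m):
    """Collision speed [km/h] when braking triggers late by depth_error_m.

    With the trigger distance set to the braking distance, a depth
    overestimation of depth_error_m leaves only (s_B - error) for braking.
    """
    v0 = v0_kmh / 3.6
    s_remaining = braking_distance(v0_kmh, mu) - depth_error_m
    v_col_sq = v0**2 - 2 * mu * G * max(s_remaining, 0.0)
    return math.sqrt(max(v_col_sq, 0.0)) * 3.6

# Point-mass braking distance from 50 km/h on snow (mu = 0.3): ~32.8 m
s_b = braking_distance(50.0, 0.3)
```

The point-mass distance of roughly 32.8 [m] is shorter than the 35.9 [m] obtained from the validated two-track model, which illustrates why the more detailed simulation is used for the actual evaluation.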

Fig. 1.

Depth map samples for image number 150 of split 20 in the KITTI Tracking dataset [4]. The upper image shows the depth map of the Monodepth2 inference; the lower image shows the results of the 2D-MSNet. A ground truth bounding box is highlighted with both reference points (center: Monodepth2 and 2D-MSNet, center bottom: IPM). In both color variations, darker colors represent distance, while lighter colors emphasize proximity.

3 Evaluation

For the full training dataset and the object classes car, van, cyclist, and pedestrian, mean absolute error values of the metric depth are given per ground truth distance interval of 10 m in Fig. 2. For example, objects evaluated with the IPM algorithm have a mean absolute error of 14.96 [m] for the interval spanning from 40 [m] to 50 [m]. The corresponding number of estimations (images \(\times \) objects) is detailed on the right. As the distance to the objects increases, the number of objects decreases, which reflects the extra-urban and urban environments of the dataset. In summary, the 2D-MSNet shows the lowest error values across all data, as the image data from the left and right cameras are used to solve the correspondence problem. Both mono-camera-based approaches have comparable error values up to 30 [m]. Whereas the error value of the IPM approach stagnates at approx. 15 [m] from the 50 [m] distance interval onward under the conditions used, the error values of Monodepth2 continue to increase.
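The interval-wise evaluation described above can be sketched as a simple binning routine. The 10 m bin width follows the text; the maximum depth and the input arrays are illustrative assumptions:

```python
import numpy as np

def mae_per_interval(gt_depth, est_depth, bin_width=10.0, max_depth=80.0):
    """Mean absolute error of estimated depth per ground-truth distance bin.

    Objects are grouped into bin_width-meter ground-truth intervals; for
    each non-empty bin the mean absolute error and the number of
    estimations are reported, mirroring the evaluation scheme above.
    """
    gt = np.asarray(gt_depth, dtype=float)
    est = np.asarray(est_depth, dtype=float)
    edges = np.arange(0.0, max_depth + bin_width, bin_width)
    results = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (gt >= lo) & (gt < hi)
        if mask.any():
            mae = float(np.mean(np.abs(est[mask] - gt[mask])))
            results.append((lo, hi, mae, int(mask.sum())))
    return results

# Toy example: three objects at 5, 15, and 45 m ground-truth depth
intervals = mae_per_interval([5.0, 15.0, 45.0], [6.0, 14.0, 60.0])
```

Reporting the count per bin alongside the error is what makes the decreasing object numbers at larger distances visible in Fig. 2.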

With regard to the application of autonomous driving, depth underestimation leads to an earlier braking event, whereas overestimation leads to a late one. The former can cause self-inflicted rear-end collisions; the latter is safety-critical for front collisions. Considering the collision unavoidable criterion for autonomous emergency braking, full braking triggers when the metric object depth corresponds to the required braking distance [8]. Hence, absolute error statistics are added to the braking distances, which leads to a collision with a simulated collision velocity \({v_{Collision}}\). Figure 3 shows the simulated velocity profile over the braking distance \({s_{B}}\) for a road friction coefficient of \( {\mu _{R}} = 0.3\) and winter tires. The braking distance shifted by the 75% error quantile, \(s_{B}+\epsilon _{s_{B}, q_{0.75}}\), is detailed for the Monodepth2 algorithm. The quantile is calculated based on a narrower distance interval which spans the simulated braking distance of 35.9 [m] with a tolerance of \(\pm 1\) [m]. The added error leads to a simulated collision velocity \({v_{Collision}}\) of 23.55 [km/h]. Based on the ISO/DIS 26262-3 standard and the severity levels of an automotive safety integrity level classification, this qualifies as severity level S2 (\(<40\) [km/h], severe injuries). Furthermore, the speed limits given by the standard can be used to calculate the required error reduction \(\varDelta _{s_{B},S}\) for the next lower severity level depending on the tire type and the road friction coefficient. For the parameters of Fig. 3, the severity level can be reduced to S1 (\(v_{Severity}=20\) [km/h]) if \(\epsilon _{s_{B},q_{0.75}}\) is reduced by \(\varDelta _{s_{B},S} = 2.22\) [m]. The simulated braking distance with applied error reduction, \(s_{B} + \epsilon _{s_{B},q_{0.75}} - \varDelta _{s_{B},S}\), is visualized with a dashed line. Since the standard does not detail a collision velocity for severity level S0, this level is only achievable if a crash is prevented entirely. Thus, the required error reduction for S0 equals the quantile \(\epsilon _{s_{B},q_{0.75}}\).
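The mapping from simulated collision velocity to severity level, as read from the thresholds above (S0 only if the crash is prevented, S1 up to 20 [km/h], S2 below 40 [km/h]), can be sketched as follows; the S3 fallback for higher speeds is an assumption not stated in the text:

```python
def severity_level(v_collision_kmh):
    """Map a simulated collision velocity [km/h] to a severity level.

    Thresholds as read from the discussion of ISO/DIS 26262-3 above:
    S0 requires the crash to be prevented entirely, S1 covers speeds up
    to 20 km/h, S2 speeds below 40 km/h. Treating anything faster as S3
    is an assumption of this sketch.
    """
    if v_collision_kmh <= 0.0:
        return "S0"
    if v_collision_kmh <= 20.0:
        return "S1"
    if v_collision_kmh < 40.0:
        return "S2"
    return "S3"
```

For the Monodepth2 example above, a collision velocity of 23.55 [km/h] falls into S2, and reducing the error quantile by 2.22 [m] brings the velocity to the S1 border.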

Fig. 2.

The mean absolute error per distance interval is shown on the left. On the right, the number of estimations per distance interval is shown.

Fig. 3.

Simulated velocity profile over the braking distance \(s_{B}\) and the braking distance shifted by the error quantile \(s_{B} \ + \ \epsilon _{s_{B},q_{0.75}}\) for winter tires and a road friction coefficient of \( \mu _{R} = 0.3\) (condition class: snow). The \(75\%\) quantile is based on the Monodepth2 algorithm. Collision velocity borders per severity level are indicated by horizontal lines.

Table 1 summarizes the simulation results for all parameter combinations and algorithms as well as the requirements for error reduction. The lower error values of the 2D-MSNet over the entire dataset are reflected in the resulting severity classes.

Table 1. Accuracy requirements formulated as \(\varDelta _{s_{B},S}\) for the resulting severity class for selected depth estimation algorithms. The lowest absolute error quantiles and delta values over all algorithms per tire type are given in bold.

4 Conclusion and Future Work

The proposed evaluation method of depth estimation algorithms as a function of real applications shows, based on the low severity levels, that the depth estimation of the 2D-MSNet algorithm can already be used in real driving scenarios. The IPM results motivate further usage for the road condition classes “dry” and “wet”. Moreover, accuracy requirements for achieving a lower severity level can be formulated based on the collision velocity limits given by standards, as shown in Table 1. These provide an application-based foundation for further algorithm optimization and self-trained neural networks. A decisive factor of the proposed methodology is the assumption that a braking event is triggered as soon as a collision can no longer be prevented. Future analysis will target the impact on comfort-oriented braking systems. Since the image data of the used dataset does not reflect the weather conditions underlying the simulated braking distances and the chosen road friction coefficients, future work will take into account the impact of weather conditions on depth estimation and object detection.