
1 Introduction

Autonomous driving systems rely on advanced perception, decision-making, and control technologies, using various sensors such as cameras and LiDAR to perceive the surrounding environment. While 3D object detection provides accurate descriptions of objects in the environment, achieving optimal results often requires the fusion of multiple sensors [1, 4]. However, monocular cameras are widely used in practice because of their low cost. This study therefore focuses on 3D object detection from videos captured by monocular cameras.

Current methods for monocular 3D object detection rely on depth inference from single images, demanding extensive depth annotation for model training [2]. Challenges arise from domain differences in training data, including variations in lighting, camera parameters, and road topography. Overcoming these domain disparities is crucial for deploying models trained on an existing domain directly to new domains without incurring additional annotation costs. To address this issue, we propose a transfer learning solution incorporating data alignment, 3D object detection, and dynamic result correction. Validation on the Near-Miss Incident Database demonstrates enhanced detection accuracy and robustness while significantly reducing the cost of annotating new domain data. Integration with OBD data aids in reconstructing information about traffic participants in hazardous scenarios, facilitating quantitative analysis of accident data.

Fig. 1. Overall pipeline of our proposed method.

2 Methodology

The pipeline of our proposed method is illustrated in Fig. 1. This approach achieves robust 3D object detection on new data that lacks annotated information. Specifically, vanishing point detection infers the camera angle of the new data, and preprocessing accounts for the camera's pitch angle and vanishing point position. Subsequently, deep learning algorithms perform depth estimation and 3D object detection in the monocular videos.
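To make the pitch inference concrete, the sketch below estimates the camera pitch from the vertical offset between the detected vanishing point of the road and the principal point, using the focal length; the function name and intrinsic values are hypothetical and serve only as a simplified illustration of this preprocessing step.

```python
import numpy as np

def pitch_from_vanishing_point(vp_y, cy, fy):
    """Estimate camera pitch (radians) from the row of the road's vanishing
    point. vp_y: vanishing-point row (px), cy: principal-point row (px),
    fy: vertical focal length (px). All values are illustrative assumptions."""
    # The vanishing point of the lane direction lies on the horizon; its
    # vertical offset from the principal point encodes the camera pitch.
    return np.arctan2(cy - vp_y, fy)

# Example: vanishing point 40 px above the principal point, fy = 720 px
pitch = pitch_from_vanishing_point(vp_y=320.0, cy=360.0, fy=720.0)
print(np.degrees(pitch))  # approximately 3.2 degrees
```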

2.1 Image Detection Technologies for Relative Position Estimation

In the field of autonomous and safe driving, recognizing and understanding the environment and targets around the vehicle is a crucial issue. The environment includes vehicles, pedestrians, bicycles, road lines, and traffic signals. In particular, predicting information such as the speed and distance of target vehicles is expected to help prevent accidents. Currently, various technologies are used to obtain this information, such as radar and ultrasonic sensors. However, these sensors are expensive and require complex installation, posing challenges for widespread adoption. Therefore, onboard cameras, which are inexpensive and easy to install, have recently attracted attention.

2.2 2D and 3D Bounding Box Detection

When detecting targets such as vehicles in images, 2D bounding box detection is often considered first. While 2D bounding box detection is a mature technology that achieves high-precision results in many fields, it only provides the upper-left and lower-right corner coordinates of the target in the image, making it difficult to recover detailed information such as the exact distance, rotation angle, and size of the target. Thus, 3D bounding box detection is introduced, which recovers the target's position in the world coordinate system together with its rotation and size, enabling prediction of how the target will move next.
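The difference in information content between the two representations can be made concrete with the parameterizations below; the field names are illustrative rather than the exact label format used by any particular detector.

```python
from dataclasses import dataclass

@dataclass
class Box2D:
    # Only the pixel coordinates of the upper-left and lower-right corners.
    x_min: float
    y_min: float
    x_max: float
    y_max: float

@dataclass
class Box3D:
    # Object center in the camera coordinate system (m), physical size (m),
    # and yaw rotation about the vertical axis (rad).
    x: float
    y: float
    z: float
    height: float
    width: float
    length: float
    yaw: float
```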

2.3 Depth Encoder for 3D Bounding Box Prediction

3D object detection using monocular cameras has long been considered a challenging task. Most existing methods follow conventional 2D object detectors and predict 3D attributes from features around the object's center. However, local features alone cannot capture the 3D spatial structure of the whole scene and ignore the depth relationships between objects in the image. We modified MonoDETR [3], a monocular image recognition framework using a depth-guided transformer, to perform 3D object detection from monocular images. MonoDETR adds a Depth Encoder alongside the conventional Visual Encoder to estimate depth information, achieving end-to-end 3D object detection from 2D images.
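The sketch below is only a conceptual illustration of this depth-guided idea, in which object queries attend to the outputs of both a visual encoder and a parallel depth encoder before predicting 3D attributes; it is not the MonoDETR implementation, and all module choices and shapes are simplifying assumptions.

```python
import torch
import torch.nn as nn

class DepthGuidedDetector(nn.Module):
    """Toy depth-guided detector: image features feed two encoders (visual
    and depth-oriented), and learned queries attend to both feature sets
    before a linear head predicts per-object 3D attributes."""
    def __init__(self, dim=256, num_queries=50):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify
        self.visual_encoder = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.depth_encoder = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.decoder = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.head_3d = nn.Linear(dim, 7)  # x, y, z, h, w, l, yaw

    def forward(self, img):
        feat = self.backbone(img).flatten(2).transpose(1, 2)  # (B, HW, dim)
        vis = self.visual_encoder(feat)        # appearance features
        dep = self.depth_encoder(feat)         # depth-oriented features
        memory = torch.cat([vis, dep], dim=1)  # queries see both feature sets
        q = self.queries.unsqueeze(0).expand(img.size(0), -1, -1)
        return self.head_3d(self.decoder(q, memory))  # (B, num_queries, 7)

boxes = DepthGuidedDetector()(torch.randn(1, 3, 128, 384))
print(boxes.shape)  # torch.Size([1, 50, 7])
```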

2.4 Transfer Learning to Reduce Manual 3D Labeling

In machine learning, preparing training data is essential: the algorithm uses this data to learn the parameters needed for prediction, so the quality of the training data is crucial for accurate predictions. Annotating 3D bounding boxes is a challenging task that requires accurately capturing the three-dimensional shape of objects, which is far more difficult than annotating 2D boxes in images.

Using public databases increases the diversity of datasets and improves the model's generalization performance. The KITTI dataset, widely used in the field of autonomous driving, covers tasks such as vehicle detection, 3D object detection, and object tracking. With its high-quality manual annotations, KITTI serves as an ideal dataset for developing autonomous driving technologies. However, for the 3D bounding box detection task, differences in the cameras used, such as variations in pitch angle, focal length, and assumed mounting height, can lead to suboptimal performance with the same model.

Our approach aligns the KITTI and TUAT near-miss databases based on their vanishing points. Specifically, we calculate the vanishing points for both databases, align the composition of the frames accordingly, and then crop and scale the TUAT near-miss database videos to closely match the framing of the KITTI database. Our experiments demonstrate that this preprocessing significantly enhances the accuracy of the predicted 3D bounding boxes and the precision of the target's pitch angle.
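A minimal sketch of this alignment step is given below, assuming the vanishing points of both datasets have already been estimated; the scaling and cropping strategy, the function name, and the reference frame size (KITTI frames are roughly 1242 × 375 pixels) are illustrative assumptions rather than the exact preprocessing code.

```python
import cv2
import numpy as np

def align_to_reference(img, vp, ref_vp, ref_size=(1242, 375)):
    """Scale and crop a TUAT frame so that its composition and vanishing
    point roughly match a KITTI-style reference.
    img: source frame (H x W x 3), vp: (x, y) vanishing point of the source,
    ref_vp: (x, y) vanishing point of the reference composition,
    ref_size: (width, height) of the reference frames."""
    ref_w, ref_h = ref_size
    # Scale the frame so its width matches the reference width.
    scale = ref_w / img.shape[1]
    resized = cv2.resize(img, None, fx=scale, fy=scale)
    vp_row = vp[1] * scale
    # Crop a window of the reference height placed so the source vanishing
    # point lands on the reference vanishing-point row.
    top = int(np.clip(round(vp_row - ref_vp[1]), 0, max(resized.shape[0] - ref_h, 0)))
    return resized[top:top + ref_h, :ref_w]
```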

Subsequently, the 3D object detection model trained on the KITTI database is applied to the TUAT database: we input the images from the TUAT near-miss database into the model for object detection. The bounding boxes predicted by this KITTI-trained model then serve as annotations for the images in the TUAT near-miss database. This eliminates the need for manual creation of bounding boxes, thereby improving workflow efficiency.
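This pseudo-labeling step can be summarized as in the sketch below; `detector`, `preprocess`, and the output format are placeholders standing in for the KITTI-trained model and the vanishing-point alignment described above, not the actual tooling used.

```python
import json
from pathlib import Path

def pseudo_label(frames_dir, detector, preprocess, out_path, score_thresh=0.5):
    """Run a KITTI-trained detector on aligned TUAT frames and store its
    3D boxes as annotations, removing the need for manual labeling.
    Assumes detector(image) returns a list of dicts with 'box_3d' and 'score'."""
    labels = {}
    for frame in sorted(Path(frames_dir).glob("*.png")):
        image = preprocess(frame)            # load + vanishing-point alignment
        detections = detector(image)
        labels[frame.name] = [d for d in detections if d["score"] >= score_thresh]
    Path(out_path).write_text(json.dumps(labels, indent=2))
```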

2.5 Dynamic Correction of Target Trajectories by LSTM

The purpose of this study is to track the trajectories of road targets. However, fluctuations in the detection results over time can make targets appear to undergo abnormal acceleration, which can negatively impact the decision-making behavior of autonomous vehicles. For example, a pedestrian who is walking normally may, due to fluctuations in the detection results, appear to be accelerating across the road, forcing the autonomous vehicle to take emergency evasive action. Therefore, obtaining reliable trajectories from image-based detection is crucial.

The simplest remedy is to filter out outliers, but in real-time processing, simple filtering algorithms respond sluggishly and may miss genuine abrupt changes. Moreover, to recover the actual position and speed of a target vehicle, the ego vehicle's own speed must be known; the TUAT near-miss database records this speed information. In this study, we propose an algorithm for dynamically correcting tracking results. As shown in Fig. 1, using the previous moment's detection result and vehicle speed together with the current frame's detection result and current speed, we employ a Long Short-Term Memory (LSTM) network to build a recurrent model that corrects the current result in real time. This approach outputs the corrections needed for the current detection results, improving detection accuracy and stabilizing the target trajectory over time.
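A minimal sketch of such a correction network is given below under simple assumptions: the input at each time step concatenates the detected relative position with the ego speed, and the LSTM predicts a residual correction added to the detection; the layer sizes and feature choices are hypothetical.

```python
import torch
import torch.nn as nn

class TrajectoryCorrector(nn.Module):
    """Recurrently corrects per-frame detections using the ego vehicle's speed.
    Assumed input per step: detected relative position (x, y, z) plus ego speed;
    output per step: a correction added to the detected position."""
    def __init__(self, in_dim=4, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 3)

    def forward(self, detections, ego_speed):
        # detections: (B, T, 3) relative positions, ego_speed: (B, T, 1)
        x = torch.cat([detections, ego_speed], dim=-1)
        h, _ = self.lstm(x)                  # hidden state carries history
        return detections + self.head(h)     # corrected trajectory

# Toy usage on a 20-frame sequence
corrected = TrajectoryCorrector()(torch.randn(1, 20, 3), torch.randn(1, 20, 1))
print(corrected.shape)  # torch.Size([1, 20, 3])
```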

Fig. 2. Experimental results. (a) 3D bounding box detection result and traffic scene restoration. (b) Relative distance from the object car. (c) Velocity of the ego car and the object car.

3 Experimental Results

The 3D bounding box detection results on the TUAT database are illustrated in Fig. 2(a), showing not only the object category and 2D box coordinates but also the object's rotation angle, size, and distance from the camera. To assess the model's effectiveness in estimating the relative motion states of vehicles within the TUAT Near-Miss Incident Database, we conduct both visual and quantitative tests, using a specific example to illustrate the results. In the chosen video, the ego vehicle turns right at an intersection while a stationary black oncoming vehicle accelerates into the intersection, and both vehicles apply emergency brakes. The trajectory of the detected relative position of the black vehicle is presented in Fig. 2(b), starting at a distance of 54 m. As the ego vehicle turns right, the X-direction relative distance decreases, and the Y-direction lateral distance also diminishes. Furthermore, the estimated velocity changes of the ego vehicle and the target vehicle are depicted in Fig. 2(c). The ego vehicle decelerates in stages from the 6th second, while the target vehicle begins accelerating from the 8th second. Both vehicles apply emergency brakes at the 10th second to avoid a collision, validating the effectiveness of the proposed method against the perceptual observations from the video.

4 Validation of Estimated Results

To quantitatively validate the estimated distances, we used Google Earth Pro to measure actual distances. By mapping the detected positions to the map and comparing them with the algorithm's estimated distances, we verified the accuracy of our method. For example, the Google Earth measurement showed a distance of 49.01 m, while our algorithm estimated 49.11 m, with an error of less than 0.1 m. Similar validation was performed for objects at closer distances, demonstrating the method's accuracy.

Fig. 3. The errors in the estimated results varying with distance.

To further validate the effectiveness of the proposed method, we need to statistically verify the detection results across multiple vehicles. However, the TUAT database contains only the OBD data of the host vehicle and the video data from the onboard camera, lacking accurate positional information about surrounding vehicles in the road environment. How, then, can we verify the accuracy of the predicted results? We target stationary vehicles in the road environment, using the predicted relative distance and the host vehicle's speed to estimate the speed of these road targets. Since stationary vehicles have a true speed of zero, comparing the predicted speed against zero allows us to assess the validity of our method. Additionally, we examine how the error depends on the actual distance of the targets.
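This check amounts to a few lines of arithmetic, sketched below; the finite-difference speed estimate and the variable names are simplified assumptions about how the target speed is derived from consecutive frames and the OBD speed.

```python
import numpy as np

def target_speed(rel_dist, ego_speed, dt):
    """Estimate a road target's absolute speed from the change in predicted
    relative distance and the ego vehicle's OBD speed.
    rel_dist: per-frame relative distance to the target (m),
    ego_speed: per-frame ego speed (m/s), dt: frame interval (s)."""
    rel_vel = np.gradient(rel_dist, dt)   # rate of change of the gap
    return ego_speed + rel_vel            # should be near 0 for parked cars

# Toy example: ego at 10 m/s approaching a parked car 50 m ahead (dt = 0.1 s)
t = np.arange(0, 3, 0.1)
rel_dist = 50 - 10 * t
speed = target_speed(rel_dist, np.full_like(t, 10.0), 0.1)
print(np.round(speed, 2))  # values close to 0
```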

We selected 50 stationary vehicles distributed on both sides of the road. The experimental results are illustrated in Fig. 3, showing the errors in the estimated results as they vary with distance. The graph plots the error values (in meters) on the y-axis against the distance (in meters) on the x-axis, with two sets of data: the “Original Error” (blue line with circular markers) and “Our Error” (orange line with circular markers).

From the graph, it is evident that the error of the original method grows significantly as the distance increases, reaching approximately −5 m at a distance of 35 m. In contrast, the error of our proposed method remains stable around zero across all distances.

This comparison highlights the effectiveness of our method in maintaining low and consistent error values, demonstrating improved accuracy over the original method, especially as the distance increases. This stability is crucial for reliable target tracking in autonomous driving systems, ensuring more precise detection and reduced likelihood of erroneous behavior in decision-making processes.

5 Summary and Future Directions

This study introduces a refined transfer learning framework for robust 3D object detection in the Near-Miss Incident Database, which lacks annotated information. The proposed method effectively enhances detection accuracy and robustness while significantly reducing the cost of annotating new domain data. The integration with OBD data further facilitates the reconstruction of traffic participant information in hazardous scenarios, providing valuable insights for the development of Advanced Driver Assistance Systems (ADAS).

Future work will focus on extending the range of detectable objects and improving the accuracy of 3D detection and tracking algorithms. Enhancing the system's performance under various environmental conditions such as nighttime and adverse weather will also be a priority. Additionally, further research will explore the integration of other sensor modalities to complement monocular camera data, aiming to achieve more comprehensive and reliable perception capabilities for autonomous vehicles.