1 Terminology

The following is a short explanation of technical terms that will be used throughout this paper:

Region of Interest (ROI): For the purpose of this paper, ROI refers to the active scene of the camera's view, where the vehicle's own body is not visible (i.e. road, pedestrians, other vehicles, etc.). This is a common computer vision term used to describe the useful areas of an image to be fed to algorithms.

Semantic Segmentation: Semantic segmentation is the labelling or classifying of every pixel in an image.

Vehicle Ego-Body: This refers to a vehicle’s own body, which is visible in wide field of view cameras such as fisheye cameras (derived from the Latin meaning of ego, which is “I”).

Mask: Mask is a term used to describe a binary image that defines where a particular object or region is in the image. Each mask represents a class.

Contour Values: Contour values refer to the coordinates of the boundary line between the Region of Interest and the vehicle ego-body.

Convolutional Neural Network (CNN): A convolutional neural network is a type of feed-forward neural network used in tasks such as image analysis, natural language processing, and other complex image classification problems.

Fisheye: Fisheye refers to wide field of view cameras that usually cover angles of about 180°.

Intersection over Union (IoU): IoU is an evaluation metric typically used for segmentation tasks. IoU is the area of overlap between the predicted segmentation and the ground truth divided by the area of union between the predicted segmentation and the ground truth.

2 Introduction

In recent years, improvements in deep learning techniques such as convolutional neural networks and recurrent neural networks have resulted in rapid growth in the area of autonomous driving. Deep learning models play a vital role in the operation of autonomous vehicles, but they are not without their faults and limitations [1].

First of all, false detections of objects, road markings, curbs, and pedestrians in the reflection on the bodywork of the vehicle can cause serious problems in autonomous vehicles. This occurs when the vehicle ego-body acts as a mirror-like surface and the network falsely detects the reflection of an object or a pedestrian on this surface. Examples of these false detections can be seen in Fig. 1.

This could lead to emergency braking and could result in the car being rear-ended and the passengers seriously injured, or could potentially be fatal if it occurred at motorway speeds. Several reputable car manufacturers have had issues exactly like this recently [14, 15]. In 2019, one large reputable car manufacturer announced a National Highway Traffic Safety Administration (NHTSA) investigation and a recall of one of its vehicles due to the automatic emergency braking engaging when there was no obstruction in the path of the vehicle [15]. Customers have reported accidents and injuries related to this issue, which could result in the loss of lives and cost a company billions.

Fig. 1. False detections on vehicle ego-body (Color figure online)

Secondly, each vehicle has different camera positions and configurations. Each vehicle SVS (Surround View System) has four different camera views, as seen in Fig. 2, and in each camera view the vehicle ego-body is in a different location. Finding the ROI (Region of Interest) by manually locating the position of the ego-body would therefore have to be done four times, repeated for every vehicle model in which the cameras are installed, and repeated again for every vehicle manufacturer that utilises the cameras, which would be tedious and leave room for human error. It is also hard to pinpoint exactly where the camera will be positioned in its housing by the manufacturer/assembler, and there could be some variability from vehicle to vehicle of the same model.

Over the lifetime of the camera, different issues can arise. First, the camera may move in its housing, which changes the ego-body position within the camera view. Secondly, cameras could even become fully misaligned, meaning that the camera may have to be re-positioned and re-calibrated.

The main objective of this paper is to tackle the issues mentioned above, chiefly the problem of false detections on the vehicle’s ego-body. We propose to solve these issues by detecting where the vehicle ego-body is in each image using semantic segmentation and then, through post-processing, extracting the coordinates of the boundary between the ROI and the ego-body. These coordinates can then be supplied to the main perception algorithms so that they focus on just the ROI.

Fig. 2. Camera views on vehicle

3 Literature Review

Semantic Segmentation: Semantic segmentation plays a very important role in scene understanding for autonomous driving. It involves the classification of every pixel of an image into its relevant class.

Yogamani et al. (2018) [2] carried out a comparative study of real-time semantic segmentation algorithms for autonomous driving. The study compared the performance of combinations of different encoders and decoders; those trialled were SkipNet, MobileNet, ShuffleNet, UNet, ResNet18 and Dilation Frontend. The experiments were carried out using the Cityscapes dataset [12] and the mIoU scores for each of the relevant classes were recorded. One of the main takeaways from the experiments was that the “UNet decoding method provides more accurate segmentation results” [2].

Fisheye: Currently there are very few studies which attempt to perform semantic segmentation directly on fisheye images using deep learning techniques, and virtually none could be found that use semantic segmentation for vehicle ego-body detection/ROI extraction on raw fisheye images. This is mainly due to two reasons: firstly, the strong distortion in fisheye images is difficult to manage, and secondly, there is a lack of large-scale fisheye-native datasets [3]. In the past, most studies based on fisheye datasets had to construct their own by taking existing datasets and projecting the images and labels to fisheye format [3].

In 2019, Valeo released the WoodScape dataset, the first extensive public automotive fisheye dataset, including over 10,000 semantically segmented and annotated images for public usage [11], along with a paper [4]. In the paper the authors detail the distinct advantage of using fisheye cameras in automotive applications: because of their wide field of view, a full 360° surround view of the vehicle can be obtained with a minimal number of sensors.

In a paper by Deng et al. [5], CNN-based semantic segmentation for urban traffic scenes is proposed using fisheye images. They first constructed a fisheye dataset from the well-known Cityscapes dataset [12]. To handle the complex scene in the fisheye image, local, global and pyramid local region features are integrated by an overlapping pyramid pooling (OPP) module. They found that, as the OPP module allows arbitrary-sized input, it keeps a good translation-invariance property and shows better performance than a sub-region pyramid pooling module. In this study they also implemented zoom augmentation, in which the focal length of the generated fisheye image is varied, and this improved the generalization of the system.

Mariotti and Eising (2021), in their paper “Spherical formulation of geometric motion segmentation constraints in fisheye cameras” [13], attempt to solve the problem of motion detection for fisheye cameras by reformulating it in spherical coordinates, which can address both the non-linearity and the large field of view. To solve motion segmentation with fisheye cameras, four geometric constraints were unified, namely epipolar, positive depth, positive height and anti-parallel, for the detection of moving obstacles in the scene. The results presented, based on dense optical flow, show that the geometric approaches described are effective at detecting arbitrary moving objects. They concluded that integrating these geometric constraints into a neural framework would yield optimal results.

Detecting Reflections: There are not many solutions available right now to automatically find reflections in an image. Problematic mirrors are typically disregarded in applied computer vision by manually drawing the ROI. Labelling the ROI requires human effort and relies on the camera and mirror maintaining a fixed perspective. There has been some work on the automatic detection of reflecting planes using geometric models of the image and its reflection [6, 7], but this was not explored in the context of segmentation and had a number of barriers to practical use.

In 2019, a paper [8] was published in which the authors attempt to solve the problem of false positive detections due to reflections using segmentation. They propose the use of semantic segmentation for better scene understanding and to reduce instance segmentation false positives. They found that, in their Mask R-CNN model, the fusion of both segmentation types decreased false positives in images by over half, and that this method was not limited to actual mirrors but could be applied to other glossy surfaces as well. Using this method, precision increased from 71% to 83%, while the increase in false negatives was very small.

Literature Review Conclusion. The problem at hand appears to be novel and has not, to our knowledge, been discussed in the literature. The false detection of pedestrians and objects in the ego-body reflection could be a serious problem which needs addressing. From the papers examined on segmentation of fisheye images, we have the advantage of a native fisheye dataset rather than having to generate fisheye images from normal images. Research also shows that semantic segmentation should be relatively straightforward using a robust architecture like UNet.

4 Implementation

Data Processing: The dataset consists of 13,184 images and masks in total; a 54:46 train/validation split was implemented. The data used to train the model consists of 7134 native fisheye internal Valeo images and ego-body masks, and the data for testing consists of 6050 native fisheye internal Valeo images and masks. The data contains images and ego-body masks from all vehicle surround view cameras: front view, rear view, mirror view left and mirror view right. The segmentation masks in the dataset are in RGB format. They were converted to a one-hot encoding where the ego-body mask was encoded 1 and the ROI was encoded 0, as we are performing binary semantic segmentation. The images and masks in the dataset are a mix of three resolutions: 1280×966 px, 1280×1536 px and 1280×1632 px. Transforms were then applied to each image and mask, resizing them to a resolution of 640×480 px and normalising them. Data augmentation was also implemented on the training data: rotation, horizontal flip, vertical flip and blur were all employed to help improve model performance.
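For illustration, the resizing, normalisation and augmentation steps described above could be expressed as follows. This is a minimal sketch assuming the albumentations library; the paper does not name its augmentation tooling, and the rotation and blur parameters are assumed values.

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Training-time pipeline: resize to 640x480, augment, normalise.
train_transform = A.Compose([
    A.Resize(height=480, width=640),
    A.Rotate(limit=15, p=0.5),      # rotation range is an assumed value
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.Blur(blur_limit=3, p=0.2),    # blur strength is an assumed value
    A.Normalize(),                  # ImageNet mean/std by default
    ToTensorV2(),
])

# Applied jointly so the image and its binary ego-body mask stay aligned:
# sample = train_transform(image=image, mask=mask)
# image_tensor, mask_tensor = sample["image"], sample["mask"]
```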

Architecture: The proposed architecture is based on a UNet model with a ResNet50 encoder pre-weighted on ImageNet. UNet is a semantic segmentation architecture originally developed for biomedical image segmentation. It consists of two paths, contracting and expanding. The contracting path (encoder) is made up of convolutional and max pooling layers for down-sampling, while the expanding path (decoder) provides precise localisation using transposed convolutions for up-sampling. Finally, the output of the network produces a binary encoded semantic segmentation map [9].

Fig. 3. Proposed architecture integration

In the implementation we propose some slight changes to the architecture. The original UNet encoder was replaced by a ResNet50 encoder pre-weighted on ImageNet to improve model accuracy. Residual networks, or ResNets, are a Convolutional Neural Network (CNN) architecture made up of a series of residual blocks (ResBlocks); the skip connections within these blocks differentiate ResNets from other CNNs [10].
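As a concrete illustration, such a model can be instantiated in a few lines. The sketch below assumes the segmentation_models_pytorch library; the paper does not name the framework used.

```python
import segmentation_models_pytorch as smp

# UNet decoder paired with a ResNet50 encoder pre-weighted on ImageNet,
# outputting a single-channel map for binary ego-body segmentation.
model = smp.Unet(
    encoder_name="resnet50",
    encoder_weights="imagenet",
    in_channels=3,
    classes=1,
)
```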

The overall purpose of the proposed network is to use semantic segmentation to extract the location of the vehicle ego-body; the generated mask is then post-processed in order to extract the contour values. This information is then provided to the other perception algorithms.
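A minimal sketch of this post-processing step is given below, assuming OpenCV (the paper does not name the library used): the predicted mask is thresholded and the boundary coordinates of the largest contour are extracted as the contour values.

```python
import cv2
import numpy as np

def extract_contour_values(pred_mask):
    """Extract boundary coordinates between the ROI and the ego-body
    from a predicted mask with values in [0, 1]."""
    binary = (pred_mask > 0.5).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return np.empty((0, 2), dtype=np.int32)
    largest = max(contours, key=cv2.contourArea)  # ego-body region
    return largest.squeeze(1)                     # (N, 2) array of x, y points
```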

Evaluation Metrics: Segmentation tasks require their own set of specific evaluation metrics, as other metrics like pixel accuracy can give misleading information for segmentation tasks due to class imbalance.

Dice loss was chosen to measure the model's loss. Dice loss is a loss function adapted from the Dice coefficient. The Dice coefficient, or F1 score, is in simple terms used to calculate the similarity between two images. The equation for the Dice coefficient D is shown in Fig. 4, where p_i and g_i stand for pairs of corresponding pixel values for the prediction and ground truth, respectively. In a boundary detection scenario, p_i and g_i values are either 0 or 1, indicating whether or not the pixel is a boundary. The Dice loss is then calculated as 1 - (Dice coefficient).

Fig. 4. Dice coefficient formula
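One common formulation of this loss in code is sketched below (PyTorch, with a small epsilon added for numerical stability; the exact variant shown in Fig. 4 may differ):

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Dice loss = 1 - Dice coefficient over flattened masks.
    pred: predicted probabilities in [0, 1]; target: binary ground truth."""
    pred = pred.reshape(-1)
    target = target.reshape(-1)
    intersection = (pred * target).sum()
    dice = (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
    return 1.0 - dice
```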

IoU, as seen in Fig. 5, is the area of overlap between the predicted segmentation and the ground truth divided by the area of union between the predicted segmentation and the ground truth. This metric ranges from 0 to 1 (0-100%), with 0 representing no overlap and 1 representing perfectly overlapping segmentation.

Fig. 5. Intersection over union formula
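The same definition translates directly into code; a minimal NumPy sketch for a pair of binary masks:

```python
import numpy as np

def iou_score(pred_mask, gt_mask, eps=1e-6):
    """IoU = area of overlap / area of union for two boolean masks."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return (intersection + eps) / (union + eps)
```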

5 Evaluation and Results

The pre-trained ResNet50 encoder and UNet decoder were trained for 20 epochs on the pre-processed data with a batch size of 4. Model parameters were optimised using the Adam optimizer with a learning rate of 0.0001. As mentioned previously, the dataset contains 13,184 images and masks in total, and a 54:46 train/validation split was implemented. Figure 6 shows the IoU score plot over the 20 epochs; the model performs well on both the training and validation sets, with high IoU scores between 0.975 and 0.981. The model was run for a greater number of epochs, but there was minimal increase in IoU and minimal decrease in dice loss beyond 20 epochs, so it was decided that 20 epochs were adequate for this proof-of-concept project.
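The training configuration described above could be reproduced roughly as follows; this is a hedged sketch (the model and loss instantiation assume the segmentation_models_pytorch library, and train_loader stands in for a DataLoader over the internal Valeo dataset):

```python
import torch
from segmentation_models_pytorch import Unet
from segmentation_models_pytorch.losses import DiceLoss

model = Unet(encoder_name="resnet50", encoder_weights="imagenet",
             in_channels=3, classes=1)
criterion = DiceLoss(mode="binary")  # operates on raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# train_loader: placeholder DataLoader yielding (image, mask) batches of size 4.
for epoch in range(20):              # 20 epochs, as reported
    model.train()
    for images, masks in train_loader:
        optimizer.zero_grad()
        logits = model(images)
        loss = criterion(logits, masks)
        loss.backward()
        optimizer.step()
```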

Fig. 6. IoU score plot

Figure 7 shows the dice loss plot over the 20 epochs; as can be seen, the loss drops quickly over the first few epochs and settles around the 0.01 mark. The high IoU score and low dice loss are possibly related to the low number of classes to segment and the large smooth boundary between the ROI and the ego-body, which makes it easier for the network to perform segmentation.

Fig. 7. Dice loss plot

The network model with the best validation IoU over the 20 epochs achieved an IoU score of 0.981 and a dice loss of 0.01. This is an excellent IoU score, which means highly accurate segmentation masks are being output from the network.

Table 1. Best network model

Model | IoU score | Dice loss
UNet + ResNet50 (ImageNet pre-weighted) | 0.981 | 0.01

Figure 8 shows the model's inference run on unseen data. From left to right: the original image, the ground truth mask, the predicted mask, and the predicted mask overlaid on the original image. Comparing the predicted masks and the ground truths in Fig. 8, it can be observed that they are very close in appearance and there are few misclassified pixels in the predicted mask.

Fig. 8. Inference

6 Conclusion and Future Work

A simple binary semantic segmentation model was proposed in this paper to recognise the location of the vehicle ego-body in fisheye images. This information can be very useful: it could potentially solve problems like false detections on the vehicle ego-body, thereby improving overall vehicle safety, enable camera misalignment detection, and reduce the amount of manual work it would take to find the ROI. The proposed model performed sufficiently well and the predicted masks it produces are of a high quality.

Future work would be to integrate the proposed system into a vehicle's main perception system. The system would be integrated as in Fig. 3: the proposed network performs semantic segmentation on the camera input, the output is post-processed to extract the contours, and these contours are passed on to the other perception algorithms, which then have the coordinates of the ROI they should be focusing on. The proposed network could be run at set intervals or at specific times to save computing power, e.g. each time the car starts, when the car is shut off, and at 2-minute intervals while the car is on. Running the proposed network at short intervals and when the car is running, starts and shuts down serves the purpose of checking whether the camera is misaligned. The system can store the previous contour values from the network and compare them with the new contour values; if they have changed beyond a certain threshold, the system raises an error telling the driver that the camera is misaligned, as in the sketch below.
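A minimal sketch of such a misalignment check is given below; the threshold value and the assumption that both contours are resampled to the same number of points are placeholders, not from the paper:

```python
import numpy as np

def camera_misaligned(stored_contour, new_contour, threshold_px=5.0):
    """Flag misalignment when the mean point-wise shift between the stored
    and newly extracted contour values exceeds a threshold (assumed value).
    Both contours are assumed resampled to the same number of (x, y) points."""
    shift = np.linalg.norm(stored_contour.astype(float)
                           - new_contour.astype(float), axis=1)
    return float(shift.mean()) > threshold_px
```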