Abstract
Deep learning has driven significant advances in depth estimation in recent years. Depth estimation uses pixel transformations in an image to recover the distance from each point in the scene to the camera and so generate a depth map. Object detection, by contrast, classifies and localizes the objects in a given image, identifying what they are and where they are. Depth estimation on its own predicts only the depth of each pixel and does not include object detection or recognition. To overcome this limitation and integrate object detection into the depth estimation process, this paper proposes a novel self-supervised monocular depth estimation algorithm that leverages an attention mechanism. By combining object detection and depth estimation, a real-time multi-task model is designed that detects objects and estimates their depth simultaneously. The framework comprises four components: an object detection sub-network, a depth estimation sub-network, a lateral sharing unit, and an attention loss. These components work together to improve both the accuracy of object distance estimation and object detection performance. Experiments show that the proposed approach effectively estimates distances to objects and improves object detection accuracy.
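To make the idea of attaching a distance to each detected object concrete, the following is a minimal sketch that assumes a depth map and a set of bounding boxes are already available. It is not the paper's method, which learns both tasks jointly in one network. The function name `object_distance`, the max-exclusive box convention, the use of the median, and the toy values are all assumptions made for illustration.

```python
from statistics import median


def object_distance(depth_map, box):
    """Return the median depth inside a bounding box.

    depth_map: list of rows, each a list of per-pixel depths (metres, assumed).
    box: (x_min, y_min, x_max, y_max) in pixels, with max values exclusive.
    """
    x_min, y_min, x_max, y_max = box
    values = [
        depth_map[y][x]
        for y in range(y_min, y_max)
        for x in range(x_min, x_max)
    ]
    if not values:
        raise ValueError("bounding box covers no pixels")
    # The median is less sensitive than the mean to background pixels
    # that fall inside the box.
    return median(values)


# Toy 4x4 depth map: a near object (about 5 m) in front of a 20 m background.
depth_map = [
    [20.0, 20.0, 20.0, 20.0],
    [20.0, 5.0, 5.2, 20.0],
    [20.0, 4.8, 5.0, 20.0],
    [20.0, 20.0, 20.0, 20.0],
]

# Hypothetical detector output: (label, box) pairs.
detections = [("car", (1, 1, 3, 3))]

distances = {label: object_distance(depth_map, box) for label, box in detections}
print(distances)
```

A post-hoc lookup like this treats detection and depth as independent outputs. The multi-task design described in the abstract instead shares features between the two sub-networks, so each task can inform the other during training.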
Acknowledgments
This project is supported by the Higher Education Teaching Reformation Project of Hubei Province of China (2022231, 2022216) and the Graduate Teaching Reformation Project of Wuhan University of Science and Technology (Yjg202202).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Liu, R., Chen, X., Tao, B. (2024). Object Detection with Depth Information in Road Scenes. In: Sun, F., Meng, Q., Fu, Z., Fang, B. (eds) Cognitive Systems and Information Processing. ICCSIP 2023. Communications in Computer and Information Science, vol 1919. Springer, Singapore. https://doi.org/10.1007/978-981-99-8021-5_15
DOI: https://doi.org/10.1007/978-981-99-8021-5_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8020-8
Online ISBN: 978-981-99-8021-5
eBook Packages: Computer Science (R0)