YOLOMM – You Only Look Once for Multi-modal Multi-tasking

Campos, Filipe; Cerqueira, Francisco Gonçalves; Cruz, Ricardo P. M.; Cardoso, Jaime S.

doi:10.1007/978-3-031-49018-7_40

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14469)

Included in the following conference series:

Iberoamerican Congress on Pattern Recognition

Abstract

Autonomous driving can reduce the number of road accidents caused by human error and make roads safer. An important part of such a system is the perception unit, which provides information about the environment surrounding the car. Currently, most manufacturers use not only RGB cameras, which are passive sensors that capture light already present in the environment, but also Lidar, an active sensor that emits laser pulses toward a surface or object and measures their reflection and time-of-flight. Previous work, YOLOP, proposed a model for object detection and semantic segmentation, but using only RGB. This work extends it to Lidar and evaluates performance on KITTI, a public autonomous driving dataset. The implementation shows improved precision across objects of different sizes. The full implementation is available at https://github.com/filipepcampos/yolomm.
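
As an illustration of the time-of-flight measurement mentioned in the abstract, the range to a reflecting surface can be recovered from the round-trip time of a laser pulse as d = c * t / 2. The sketch below is not taken from the paper or its repository; the function name and constant are hypothetical, and it assumes an idealized pulse travelling at the vacuum speed of light.

```python
# Illustrative only: idealized Lidar ranging from round-trip time-of-flight.
# The names below are hypothetical and do not come from the YOLOMM code base.

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # speed of light in vacuum


def range_from_time_of_flight(round_trip_seconds: float) -> float:
    """Return the one-way distance in metres for a measured round-trip time."""
    if round_trip_seconds < 0:
        raise ValueError("round-trip time must be non-negative")
    # The pulse travels to the target and back, so halve the total path.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2.0


if __name__ == "__main__":
    # A pulse returning after 200 nanoseconds corresponds to roughly 30 m.
    print(f"{range_from_time_of_flight(200e-9):.2f} m")
```

A real sensor also reports the intensity of the reflection and the beam angles, which together with the range give a 3D point per pulse.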

This work is supported by European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project no. 047264; Funding Reference: POCI-01-0247-FEDER-047264].

Author information

Authors and Affiliations

Faculty of Engineering, University of Porto, Porto, Portugal
Filipe Campos, Francisco Gonçalves Cerqueira, Ricardo P. M. Cruz & Jaime S. Cardoso
INESC TEC, Porto, Portugal
Ricardo P. M. Cruz & Jaime S. Cardoso

Corresponding author

Correspondence to Ricardo P. M. Cruz.

Editor information

Editors and Affiliations

Polytechnic Institute of Coimbra, Coimbra Institute of Engineering, Coimbra, Portugal
Verónica Vasconcelos
Polytechnic Institute of Coimbra, Coimbra Institute of Engineering, Coimbra, Portugal
Inês Domingues
Polytechnic Institute of Coimbra, Coimbra Institute of Engineering, Coimbra, Portugal
Simão Paredes

Cite this paper

Campos, F., Cerqueira, F.G., Cruz, R.P.M., Cardoso, J.S. (2024). YOLOMM – You Only Look Once for Multi-modal Multi-tasking. In: Vasconcelos, V., Domingues, I., Paredes, S. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2023. Lecture Notes in Computer Science, vol 14469. Springer, Cham. https://doi.org/10.1007/978-3-031-49018-7_40

DOI: https://doi.org/10.1007/978-3-031-49018-7_40
Published: 27 November 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-49017-0
Online ISBN: 978-3-031-49018-7
eBook Packages: Computer Science, Computer Science (R0)
