Abstract
Driver intention prediction allows drivers to perceive possible dangers as early as possible and has become one of the most important research topics in the field of self-driving in recent years. In this study, we propose a driver intention prediction method based on multi-dimensional cross-modality information interaction. First, an efficient video recognition network extracts channel-temporal features from the inside (driver) and outside (road) videos, respectively. Within this network, a cross-modality channel-spatial weight mechanism enables information interaction between the two modality-specific feature extraction branches, and a contrastive learning module forces the two branches to exchange structural knowledge. The resulting representations of the inside and outside videos are then fused by a ResLayer-based module to produce a preliminary prediction, which is corrected with GPS information to yield the final decision. The entire network is trained under a multi-task framework. We validate the proposed method on the public Brain4Cars dataset, and the results show that it achieves competitive accuracy while balancing performance and computation.
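To make the cross-modality channel weighting idea concrete, the following is a minimal, hypothetical numpy sketch (not the authors' implementation): each stream's global-average-pooled channel descriptor gates the channels of the *other* stream, so the driver and road branches exchange information.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_modality_channel_weights(feat_a, feat_b):
    """Illustrative cross-modality channel weighting (assumed sketch).

    feat_a, feat_b: feature maps of shape (C, H, W) from the two streams.
    Each stream's channel descriptor gates the opposite stream's channels,
    letting information flow between the driver and road branches.
    """
    # Squeeze: global average pool over spatial dims -> per-channel descriptor
    desc_a = feat_a.mean(axis=(1, 2))  # shape (C,)
    desc_b = feat_b.mean(axis=(1, 2))  # shape (C,)
    # Excite: reweight each stream's channels by the OTHER stream's descriptor
    gated_a = feat_a * sigmoid(desc_b)[:, None, None]
    gated_b = feat_b * sigmoid(desc_a)[:, None, None]
    return gated_a, gated_b

# Toy usage with random driver-/road-stream features
rng = np.random.default_rng(0)
inside = rng.standard_normal((8, 4, 4))   # driver-stream features
outside = rng.standard_normal((8, 4, 4))  # road-stream features
ga, gb = cross_modality_channel_weights(inside, outside)
```

In the paper this mechanism also incorporates spatial weights and sits inside the trained video backbones; the sketch only shows the channel-exchange pattern.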
Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Author information
Authors and Affiliations
Contributions
Zengkui Xu, Shaohua Qiao, and Jiannan Zheng wrote the main manuscript text; Dongliang Peng, Mengfan Xue, and Tao Li revised the article and checked its innovative points; and Yuerong Wang designed the contrastive learning module. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Communicated by H. Li.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xue, M., Xu, Z., Qiao, S. et al. Driver intention prediction based on multi-dimensional cross-modality information interaction. Multimedia Systems 30, 83 (2024). https://doi.org/10.1007/s00530-024-01282-3