Abstract
Driver intention prediction allows drivers to perceive possible dangers as early as possible and has become one of the most important research topics in the field of self-driving in recent years. In this study, we propose a driver intention prediction method based on multi-dimensional cross-modality information interaction. First, an efficient video recognition network extracts channel-temporal features from the inside (driver) and outside (road) videos, respectively. Within this network, a cross-modality channel-spatial weight mechanism enables information interaction between the two modality-specific feature extraction branches, and a contrastive learning module forces the two branches to exchange structural knowledge. The resulting representations of the inside and outside videos are then fused by a ResLayer-based module to produce a preliminary prediction, which is corrected with GPS information to yield the final decision. The entire network is trained under a multi-task framework. We validate the proposed method on the public Brain4Cars dataset, and the results show that it achieves competitive accuracy while balancing performance and computation.
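To make the cross-modality channel weighting idea concrete, the following is a minimal, hypothetical numpy sketch (not the authors' implementation): each stream's global-average-pooled channel descriptor gates the channels of the *other* stream, so the driver and road branches exchange information.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_modality_channel_weights(feat_a, feat_b):
    """Illustrative cross-modality channel weighting (assumed sketch).

    feat_a, feat_b: feature maps of shape (C, H, W) from the two streams.
    Each stream's channel descriptor gates the opposite stream's channels,
    letting information flow between the driver and road branches.
    """
    # Squeeze: global average pool over spatial dims -> per-channel descriptor
    desc_a = feat_a.mean(axis=(1, 2))  # shape (C,)
    desc_b = feat_b.mean(axis=(1, 2))  # shape (C,)
    # Excite: reweight each stream's channels by the OTHER stream's descriptor
    gated_a = feat_a * sigmoid(desc_b)[:, None, None]
    gated_b = feat_b * sigmoid(desc_a)[:, None, None]
    return gated_a, gated_b

# Toy usage with random driver-/road-stream features
rng = np.random.default_rng(0)
inside = rng.standard_normal((8, 4, 4))   # driver-stream features
outside = rng.standard_normal((8, 4, 4))  # road-stream features
ga, gb = cross_modality_channel_weights(inside, outside)
```

In the paper this mechanism also incorporates spatial weights and sits inside the trained video backbones; the sketch only shows the channel-exchange pattern.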
Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Author information
Authors and Affiliations
Contributions
Zengkui Xu, Shaohua Qiao, and Jiannan Zheng wrote the main manuscript text; Dongliang Peng, Mengfan Xue, and Tao Li revised the article and checked its innovative points; and Yuerong Wang designed the contrastive learning module. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Communicated by H. Li.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xue, M., Xu, Z., Qiao, S. et al. Driver intention prediction based on multi-dimensional cross-modality information interaction. Multimedia Systems 30, 83 (2024). https://doi.org/10.1007/s00530-024-01282-3