Abstract
Existing driver action recognition approaches face a fundamental trade-off between recognition accuracy and computational efficiency: high-capacity spatial-temporal deep learning models cannot achieve real-time driver action recognition on vehicle-mounted devices. To overcome this limitation, this paper proposes a novel driver action recognition solution suitable for embedded systems. The proposed ESDAR-Net is a multi-branch deep learning framework that operates directly on compressed videos. To reduce computational cost, a lightweight 2D/3D convolutional network is employed for spatial-temporal modeling. Two further strategies boost accuracy: (1) a cross-layer connection module (CLCM) and a spatial-temporal trilinear pooling module (STTPM) are designed to adaptively fuse appearance and motion information; (2) complementary knowledge from a high-capacity spatial-temporal deep learning model is distilled and transferred to ESDAR-Net. Experimental results show that ESDAR-Net achieves both high accuracy and real-time performance for driver action recognition: 98.7% on SEU-DAR-V1 and 96.5% on SEU-DAR-V2, with 2.19M learnable parameters, 0.253G FLOPs, and a throughput of 27 clips/s on a Jetson TX2.
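The abstract's second accuracy strategy, distilling knowledge from a high-capacity teacher into the lightweight ESDAR-Net student, is not detailed in this excerpt. As a minimal, hedged illustration of the general technique, the sketch below implements the classic temperature-scaled distillation objective of Hinton et al. (2015): the teacher's logits are softened with a temperature T and the student is penalized by the KL divergence to those soft targets. The function names (`softmax`, `distillation_loss`) and the temperature value are illustrative assumptions, not the paper's actual formulation.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: larger T yields a softer, more
    # informative distribution over classes.
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as in Hinton et al. (2015) so that gradient
    # magnitudes stay comparable across temperatures.
    p = softmax(teacher_logits, temperature)   # soft targets (teacher)
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return (temperature ** 2) * kl
```

In practice this term would be combined with the ordinary cross-entropy loss on ground-truth labels; when student and teacher logits agree, the distillation term vanishes.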
Data Availability
The data that support the findings of this study are not publicly available due to the portrait rights of the recorded drivers.
Acknowledgements
The authors would like to thank the editor and the anonymous reviewers for their valuable comments and constructive suggestions. This work was supported in part by the National Natural Science Foundation of China (No. 62203012, No. 61871123 and No. 61901221), the Open Research Fund of Anhui Key Laboratory of Detection Technology and Energy Saving Devices (No. JCKJ2022A07), the Anhui Polytechnic University Introduced Talent Research Startup Fund (No. 2022YQQ009) and the Youth Foundation of Anhui Polytechnic University (No. Xjky2022039).
Funding
This work was supported in part by the National Natural Science Foundation of China (No. 62203012, No. 61871123 and No. 61901221), the Open Research Fund of Anhui Key Laboratory of Detection Technology and Energy Saving Devices (No. JCKJ2022A07), the Anhui Polytechnic University Introduced Talent Research Startup Fund (No. 2022YQQ009), the Youth Foundation of Anhui Polytechnic University (No. Xjky2022039), the Anhui Province Higher Education Quality Engineering Project (No. 2022jyxm139 and No. 2022kcsz027), the Anhui University Collaborative Innovation Project (No. GXXT-2020-0069) and the Anhui Natural Science Foundation Project (No. 2108085MF220).
Author information
Authors and Affiliations
Corresponding authors
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hu, Y., Shuai, Z., Yang, H. et al. ESDAR-net: towards high-accuracy and real-time driver action recognition for embedded systems. Multimed Tools Appl 83, 18281–18307 (2024). https://doi.org/10.1007/s11042-023-15777-0