
A Dynamic Head Gesture Recognition Method for Real-time Intention Inference and Its Application to Visual Human-robot Interaction

  • Regular Papers
  • Robot and Applications
  • Published in International Journal of Control, Automation and Systems

Abstract

Head gestures are a natural, non-verbal communication channel for human-computer and human-robot interaction, conveying attitudes and intentions. However, existing vision-based recognition methods cannot meet the precision and robustness requirements of interaction. Because of limited computational resources, applying most high-accuracy methods to mobile and onboard devices is challenging, and approaches based on wearable devices are inconvenient and expensive. To address these problems, an end-to-end two-stream fusion network named TSIR3D is proposed to identify head gestures from videos for analyzing human attitudes and intentions. Inspired by the Inception and ResNet architectures, the width and depth of the network are increased to capture motion features sufficiently, and the convolutional kernels are expanded from the spatial domain to the spatiotemporal domain for temporal feature extraction. The fusion position of the two streams is chosen under an accuracy/complexity trade-off. Furthermore, a dynamic head gesture dataset named DHG and a behavior tree are designed for human-robot interaction. Experimental results show that the proposed method runs in real time on both a remote server and an onboard computer. Its accuracy on DHG surpasses most state-of-the-art vision-based methods and even exceeds most previous approaches based on head-mounted sensors. Finally, TSIR3D is deployed on a Pepper robot equipped with a Jetson TX2.
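To make the architecture described in the abstract concrete, below is a minimal sketch in PyTorch, not the authors' released TSIR3D code, of a two-stream network with an RGB stream and an optical-flow stream, each built from 3D-convolutional residual blocks and fused at an intermediate layer before a shared classifier. The layer widths, fusion point, clip length, and class count are illustrative assumptions, not values from the paper.

# Minimal sketch (assumptions noted above), not the authors' implementation.
import torch
import torch.nn as nn


class Res3DBlock(nn.Module):
    """3D-convolutional residual block (spatial kernels extended to time)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # residual (ResNet-style) connection


class TwoStreamFusion3D(nn.Module):
    """RGB + optical-flow streams fused mid-network, then jointly classified."""
    def __init__(self, num_classes=6):  # assumed number of head gesture classes
        super().__init__()
        def stem(in_ch):
            return nn.Sequential(
                nn.Conv3d(in_ch, 32, kernel_size=(3, 7, 7),
                          stride=(1, 2, 2), padding=(1, 3, 3)),
                nn.BatchNorm3d(32), nn.ReLU(inplace=True),
                Res3DBlock(32),
            )
        self.rgb_stream = stem(3)    # RGB clips: (B, 3, T, H, W)
        self.flow_stream = stem(2)   # optical-flow clips: (B, 2, T, H, W)
        self.fusion = nn.Sequential( # fusion position is the tunable trade-off
            nn.Conv3d(64, 64, kernel_size=1),
            Res3DBlock(64),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(64, num_classes)

    def forward(self, rgb, flow):
        fused = torch.cat([self.rgb_stream(rgb), self.flow_stream(flow)], dim=1)
        return self.fc(self.fusion(fused).flatten(1))


if __name__ == "__main__":
    model = TwoStreamFusion3D()
    rgb = torch.randn(1, 3, 16, 112, 112)   # 16-frame RGB clip
    flow = torch.randn(1, 2, 16, 112, 112)  # matching optical-flow clip
    print(model(rgb, flow).shape)            # torch.Size([1, 6])

In a design of this kind, fusing the streams earlier generally reduces computation at the cost of stream-specific features, while fusing later does the reverse; the abstract indicates the paper explores exactly this accuracy/complexity trade-off when choosing the fusion position.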



Author information

Corresponding author

Correspondence to Botao Zhang.

Additional information

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported in part by the Key Research and Development Project of Zhejiang Province [No. 2019C04018]; the Fundamental Research Funds for the Provincial Universities of Zhejiang under grant [No. GK229909299001-004]; the National Natural Science Foundation of China [No. 62073108]; and the Zhejiang Provincial Natural Science Foundation [No. LZ23F030004].

Jialong Xie received his B.Eng. degree from the School of Automation, Hangzhou Dianzi University, Hangzhou, China, in 2019. He is currently working toward an M.S. degree in control science and engineering at Hangzhou Dianzi University. His research interests include robot vision, robot control, and human-robot interaction.

Botao Zhang received his Ph.D. degree in control engineering from East China University of Science and Technology, Shanghai, China, in 2012. He is presently an Associate Professor at the School of Automation, Hangzhou Dianzi University, Hangzhou, China. His current research interests include machine vision, intelligent perception, and navigation of mobile robots.

Qiang Lu received his B.Eng. and Ph.D. degrees in electrical engineering from the East China University of Science and Technology, Shanghai, China, in 2000 and 2007, respectively, and a Ph.D. degree in computer science from Central Queensland University, Rockhampton, QLD, Australia, in 2013. In 2007, he joined Hangzhou Dianzi University, Hangzhou, China, where he is currently a Professor at the School of Automation. His research interests include cooperative control of multi-robot systems, swarm intelligence, and their applications to human security.

Oleg Borisov received his Ph.D. degree in system analysis, control and signal processing (in Technical Systems) from ITMO University, St. Petersburg, in 2017. He is presently an Associate Professor at the Faculty of Control Systems and Robotics, ITMO University, St. Petersburg, Russia. His current research interests include adaptive and robust control, geometric control, nonlinear systems, multivariable systems, robotic systems, and control applications.


About this article


Cite this article

Xie, J., Zhang, B., Lu, Q. et al. A Dynamic Head Gesture Recognition Method for Real-time Intention Inference and Its Application to Visual Human-robot Interaction. Int. J. Control Autom. Syst. 22, 252–264 (2024). https://doi.org/10.1007/s12555-022-0051-6

