Multimodal Gesture Recognition Using Multi-stream Recurrent Neural Network

  • Noriki Nishida
  • Hideki Nakayama
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9431)


In this paper, we present a novel neural-network method for multimodal gesture recognition. Our multi-stream recurrent neural network (MRNN) is a completely data-driven model that can be trained end to end without domain-specific hand engineering. The MRNN extends recurrent neural networks with Long Short-Term Memory cells (LSTM-RNNs), which facilitate the handling of variable-length gestures. We propose a recurrent approach that fuses multiple temporal modalities using multiple streams of LSTM-RNNs. In addition, we propose alternative fusion architectures and empirically evaluate the performance and robustness of these fusion strategies. Experimental results demonstrate that the proposed MRNN outperforms other state-of-the-art methods on the Sheffield Kinect Gesture (SKIG) dataset and is highly robust to noisy inputs.
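To make the idea of fusing several temporal modalities with multiple LSTM streams more concrete, the following is a minimal forward-pass sketch in plain Python. It is not the authors' implementation: the abstract does not specify layer sizes, the fusion wiring, or the input modalities, so the choice of one LSTM per modality feeding a further fusion LSTM at every time step, the names `LSTMCell` and `MultiStreamRNN`, the two input streams, and all dimensions are assumptions made for illustration. Weights are random and untrained.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


class LSTMCell:
    """Minimal LSTM cell with random, untrained weights (illustration only)."""

    def __init__(self, n_in, n_hidden, rng):
        self.n_hidden = n_hidden
        # One weight row per gate unit over [input, previous hidden, bias].
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hidden + 1)]
                  for _ in range(4 * n_hidden)]

    def step(self, x, h, c):
        z = list(x) + list(h) + [1.0]
        pre = [sum(wi * zi for wi, zi in zip(row, z)) for row in self.w]
        n = self.n_hidden
        i = [sigmoid(v) for v in pre[0:n]]          # input gate
        f = [sigmoid(v) for v in pre[n:2 * n]]      # forget gate
        o = [sigmoid(v) for v in pre[2 * n:3 * n]]  # output gate
        g = [math.tanh(v) for v in pre[3 * n:]]     # candidate cell values
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


class MultiStreamRNN:
    """Hypothetical wiring: one LSTM per modality plus one fusion LSTM."""

    def __init__(self, input_dims, n_hidden, n_classes, seed=0):
        rng = random.Random(seed)
        self.n_hidden = n_hidden
        self.streams = [LSTMCell(d, n_hidden, rng) for d in input_dims]
        self.fusion = LSTMCell(n_hidden * len(input_dims), n_hidden, rng)
        self.out = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                    for _ in range(n_classes)]

    def predict(self, sequences):
        """sequences: one list of frames per modality, all of equal length."""
        n = self.n_hidden
        states = [([0.0] * n, [0.0] * n) for _ in self.streams]
        hf, cf = [0.0] * n, [0.0] * n
        for t in range(len(sequences[0])):
            # Advance every modality stream by one time step.
            states = [cell.step(seq[t], h, c)
                      for cell, seq, (h, c) in zip(self.streams, sequences, states)]
            # Concatenate the stream outputs and feed the fusion LSTM.
            fused_in = [v for h, _ in states for v in h]
            hf, cf = self.fusion.step(fused_in, hf, cf)
        logits = [sum(w * h for w, h in zip(row, hf)) for row in self.out]
        return softmax(logits)


# Usage with two made-up modalities of 5 and 3 features per frame.
rng = random.Random(1)
n_frames = 7  # any length works, since the recurrence runs frame by frame
stream_a = [[rng.random() for _ in range(5)] for _ in range(n_frames)]
stream_b = [[rng.random() for _ in range(3)] for _ in range(n_frames)]
model = MultiStreamRNN(input_dims=[5, 3], n_hidden=8, n_classes=4)
probs = model.predict([stream_a, stream_b])
```

Because the loop consumes one frame per step and only the final fusion state is classified, the same model accepts gestures of any length, which is the property the abstract attributes to LSTM-RNNs. Training (for example, backpropagation through time) is omitted from this sketch.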


Keywords: Multimodal gesture recognition, Recurrent neural networks, Long short-term memory, Convolutional neural networks



This work was supported by JST CREST, JSPS KAKENHI Grant Number 26730085. We would like to thank the three anonymous reviewers for their valuable feedback on this work.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Machine Perception Group, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
