Deep Dynamic Neural Networks for Gesture Segmentation and Recognition

  • Di Wu
  • Ling Shao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8925)


This paper describes a novel method called Deep Dynamic Neural Networks (DDNN) for Track 3 of the ChaLearn Looking at People 2014 challenge [1]. A generalised semi-supervised hierarchical dynamic framework is proposed for simultaneous gesture segmentation and recognition, taking both skeleton and depth images as input modalities. First, Deep Belief Networks (DBN) and 3D Convolutional Neural Networks (3DCNN) are adopted for the skeletal and depth data, respectively, to extract high-level spatio-temporal features. The learned representations are then used to estimate the emission probabilities of a Hidden Markov Model, which infers the action sequence. The framework can easily be extended by including an ergodic state to segment and recognise video sequences frame by frame, making online segmentation and recognition possible for diverse input modalities. Some normalisation details pertaining to preprocessing the raw features are also discussed. This purely data-driven approach achieves a score of 0.8162 in the gesture spotting challenge. The performance is on par with a variety of state-of-the-art hand-tuned-feature approaches and other learning-based methods, opening the door to using deep learning techniques to explore multimodal time series data.
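The pipeline sketched above follows the standard hybrid neural-network/HMM recipe: the deep networks produce per-frame class posteriors, these are turned into scaled emission likelihoods, and the most likely state sequence is recovered by Viterbi decoding. As a rough illustration of the decoding step only, here is a minimal NumPy sketch; the function and variable names are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def viterbi(log_emis, log_trans, log_prior):
    """Most likely state path given per-frame emission log-probabilities
    log_emis (T x S), a state-transition matrix log_trans (S x S) and an
    initial state distribution log_prior (S,)."""
    T, S = log_emis.shape
    delta = log_prior + log_emis[0]           # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)        # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # (S, S): previous state -> current state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emis[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):             # backtrack from the best final state
        path[t - 1] = back[t, path[t]]
    return path
```

In the hybrid setting, `log_emis[t, s]` would be the scaled likelihood `log p(s | x_t) - log p(s)`, with the posterior `p(s | x_t)` read off the DBN or 3DCNN output for frame `x_t` and `p(s)` the class prior, following Bayes' rule.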


Keywords: Deep Belief Networks · 3D Convolutional Neural Networks · Gesture recognition · ChaLearn


References

  1. Escalera, S., Baró, X., González, J., Bautista, M., Madadi, M., Reyes, M., Ponce, V., Escalante, H., Shotton, J., Guyon, I.: ChaLearn Looking at People Challenge 2014: dataset and results. In: European Conference on Computer Vision Workshops (2014)
  2. Liu, L., Shao, L., Zheng, F., Li, X.: Realistic action recognition via sparsely-constructed Gaussian processes. Pattern Recognition (2014). doi:10.1016/j.patcog.2014.07.006
  3. Shao, L., Zhen, X., Li, X.: Spatio-temporal Laplacian pyramid coding for action recognition. IEEE Transactions on Cybernetics 44(6), 817–827 (2014)
  4. Wu, D., Shao, L.: Silhouette analysis-based action recognition via exploiting human poses. IEEE Transactions on Circuits and Systems for Video Technology 23(2), 236–243 (2013)
  5. Laptev, I.: On space-time interest points. International Journal of Computer Vision (2005)
  6. Dollár, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: Visual Surveillance and Performance Evaluation of Tracking and Surveillance. IEEE (2005)
  7. Willems, G., Tuytelaars, T., Van Gool, L.: An efficient dense and scale-invariant spatio-temporal interest point detector. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 650–663. Springer, Heidelberg (2008)
  8. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional SIFT descriptor and its application to action recognition. In: International Conference on Multimedia. ACM (2007)
  9. Kläser, A., Marszałek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: British Machine Vision Conference (2008)
  10. Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision (2013)
  11. Zhou, T., Tao, D.: Double shrinking sparse dimension reduction. IEEE Transactions on Image Processing 22(1), 244–257 (2013)
  12. Xu, C., Tao, D.: Large-margin multi-view information bottleneck. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(8), 1559–1572 (2014)
  13. Wang, H., Ullah, M.M., Kläser, A., Laptev, I., Schmid, C., et al.: Evaluation of local spatio-temporal features for action recognition. In: British Machine Vision Conference (2009)
  14. Yuan, J., Bae, E., Tai, X.-C., Boykov, Y.: A continuous max-flow approach to Potts model. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part VI. LNCS, vol. 6316, pp. 379–392. Springer, Heidelberg (2010)
  15. Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: IEEE Conference on Computer Vision and Pattern Recognition (2011)
  16. Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., Baskurt, A.: Spatio-temporal convolutional sparse auto-encoder for sequence classification. In: British Machine Vision Conference (2012)
  17. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Computation (2006)
  18. Schmidhuber, J.: Deep learning in neural networks: an overview (2014). arXiv preprint arXiv:1404.7828
  19. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Neural Information Processing Systems (2012)
  20. Ciresan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition (2012)
  21. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. In: International Conference on Machine Learning (2010)
  22. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2013)
  23. Mohamed, A., Dahl, G.E., Hinton, G.: Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing (2012)
  24. Wu, D., Shao, L.: Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2014)
  25. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth images. In: IEEE Conference on Computer Vision and Pattern Recognition (2011)
  26. Han, J., Shao, L., Shotton, J.: Enhanced computer vision with Microsoft Kinect sensor: a review. IEEE Transactions on Cybernetics 43(5), 1317–1333 (2013)
  27. Escalera, S., González, J., Baró, X., Reyes, M., Lopes, O., Guyon, I., Athitsos, V., Escalante, H.J.: Multi-modal gesture recognition challenge 2013: dataset and results. In: ACM ChaLearn Multi-Modal Gesture Recognition Grand Challenge and Workshop (2013)
  28. Fothergill, S., Mentis, H.M., Kohli, P., Nowozin, S.: Instructing people for training gestural interactive systems. In: ACM Conference on Human Factors in Computing Systems (2012)
  29. Guyon, I., Athitsos, V., Jangyodsuk, P., Hamner, B., Escalante, H.J.: ChaLearn gesture challenge: design and first results. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (2012)
  30. Wang, J., Liu, Z., Wu, Y., Yuan, J.: Mining actionlet ensemble for action recognition with depth cameras. In: IEEE Conference on Computer Vision and Pattern Recognition (2012)
  31. Lehrmann, A., Gehler, P., Nowozin, S.: A non-parametric Bayesian network prior of human pose. In: International Conference on Computer Vision (2013)
  32. Nowozin, S., Shotton, J.: Action points: a representation for low-latency online human action recognition. Technical report (2012)
  33. Bishop, C.: Pattern Recognition and Machine Learning. Springer (2006)
  34. Chaudhry, R., Ofli, F., Kurillo, G., Bajcsy, R., Vidal, R.: Bio-inspired dynamic 3D discriminative skeletal features for human action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (2013)
  35. Müller, M., Röder, T.: Motion templates for automatic classification and retrieval of motion capture data. In: SIGGRAPH/Eurographics Symposium on Computer Animation. Eurographics Association (2006)
  36. Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., Bajcsy, R.: Sequence of the most informative joints (SMIJ): a new representation for human skeletal action recognition. Journal of Visual Communication and Image Representation (2013)
  37. Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: IEEE Conference on Computer Vision and Pattern Recognition (2011)
  38. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning (2013). arXiv preprint arXiv:1312.5602
  39. Wu, D., Zhu, F., Shao, L.: One shot learning gesture recognition from RGBD images. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (2012)
  40. Lewis, J.: Fast normalized cross-correlation. Vision Interface 10, 120–123 (1995)
  41. Bradski, G.: The OpenCV Library. Dr. Dobb's Journal of Software Tools (2000)
  42. Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference (SciPy) (2010)

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. The University of Sheffield, Sheffield, UK
