Multi-scale Deep Learning for Gesture Detection and Localization

  • Natalia Neverova
  • Christian Wolf
  • Graham W. Taylor
  • Florian Nebout
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8925)


We present a method for gesture detection and localization based on multi-scale and multi-modal deep learning. Each visual modality captures spatial information at a particular spatial scale (such as motion of the upper body or a hand), and the whole system operates at two temporal scales. Key to our technique is a training strategy which exploits i) careful initialization of individual modalities; and ii) gradual fusion of modalities from strongest to weakest cross-modality structure. We present experiments on the ChaLearn 2014 Looking at People Challenge gesture recognition track, in which we placed first out of 17 teams.


Keywords: Gesture recognition, Multi-modal systems, Deep learning



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Natalia Neverova (1, 2)
  • Christian Wolf (1, 2)
  • Graham W. Taylor (3)
  • Florian Nebout (4)

  1. Université de Lyon, CNRS, Lyon, France
  2. INSA-Lyon, LIRIS, UMR5205, Villeurbanne cedex, France
  3. University of Guelph, Guelph, Canada
  4. Awabot, Lyon, France
