Multi-modality Gesture Detection and Recognition with Un-supervision, Randomization and Discrimination

  • Guang Chen
  • Daniel Clarke
  • Manuel Giuliani
  • Andre Gaschler
  • Di Wu
  • David Weikersdorfer
  • Alois Knoll
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8925)


This paper describes our gesture detection and recognition system for the 2014 ChaLearn Looking at People challenge (Track 3: Gesture Recognition), organized by ChaLearn in conjunction with the ECCV 2014 conference. The competition's task was to learn a vocabulary of 20 types of Italian gestures and to detect them in continuous sequences. Our system adopts a multi-modality approach for both detecting and recognizing the gestures. The goal of the approach is to identify semantically meaningful content in a densely sampled spatio-temporal feature space. To achieve this, we develop three concepts under the random forest framework: un-supervision, discrimination, and randomization. Un-supervision learns spatio-temporal features from two channels (grayscale and depth) of RGB-D video in an unsupervised way. Discrimination extracts the information in the densely sampled spatio-temporal space effectively. Randomization explores that feature space efficiently. In our evaluation, the approach achieves a mean Jaccard Index of \(0.6489\) and a mean average accuracy of \(90.3\,\%\) on the test dataset.
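The Jaccard Index reported above measures the overlap between predicted and ground-truth gesture frames. The following is a minimal sketch of that idea, not the challenge's official scoring code: it assumes gestures are given as sets of frame numbers per gesture label, and that the score is averaged over labels within one sequence. The function names `jaccard_index` and `mean_jaccard`, and the labels in the usage example, are hypothetical.

```python
def jaccard_index(truth_frames, pred_frames):
    """Frame-level Jaccard index: |intersection| / |union| of two frame sets."""
    truth, pred = set(truth_frames), set(pred_frames)
    union = truth | pred
    if not union:
        # Assumption: a label absent from both truth and prediction scores 0.
        return 0.0
    return len(truth & pred) / len(union)


def mean_jaccard(truth_by_label, pred_by_label):
    """Average the per-label Jaccard index over every label seen in either input."""
    labels = set(truth_by_label) | set(pred_by_label)
    if not labels:
        return 0.0
    scores = [
        jaccard_index(truth_by_label.get(label, ()), pred_by_label.get(label, ()))
        for label in labels
    ]
    return sum(scores) / len(scores)


# Usage: one sequence with two hypothetical gesture labels.
truth = {"gesture_a": range(10, 20), "gesture_b": range(40, 50)}
pred = {"gesture_a": range(15, 25), "gesture_b": range(40, 50)}
score = mean_jaccard(truth, pred)
```

In this example `gesture_a` overlaps on 5 of 15 frames and `gesture_b` matches exactly, so the mean is (1/3 + 1) / 2 = 2/3. A label that is predicted but never performed contributes a score of 0, so false detections lower the mean.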


Keywords: Multi-modality gesture; Unsupervised learning; Random forest; Discriminative training



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Guang Chen (1, 2)
  • Daniel Clarke (2)
  • Manuel Giuliani (2)
  • Andre Gaschler (2)
  • Di Wu (3)
  • David Weikersdorfer (2)
  • Alois Knoll (1)

  1. Technische Universität München, Garching bei München, Germany
  2. fortiss GmbH, Munich, Germany
  3. University of Sheffield, Sheffield, UK
