A Multi-scale Boosted Detector for Efficient and Robust Gesture Recognition

  • Camille Monnier
  • Stan German
  • Andrey Ost
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8925)

Abstract

We present an approach to detecting and recognizing gestures in a stream of multi-modal data. Our approach combines a sliding-window gesture detector with features drawn from skeleton data, color imagery, and depth data produced by a first-generation Kinect sensor. The detector consists of a set of one-versus-all boosted classifiers, each tuned to a specific gesture. Features are extracted at multiple temporal scales, and include descriptive statistics of normalized skeleton joint positions, angles, and velocities, as well as image-based hand descriptors. The full set of gesture detectors may be trained in under two hours on a single machine, and is extremely efficient at runtime, operating at 1700 fps using only skeletal data, or at 100 fps using fused skeleton and image features. Our method achieved a Jaccard Index score of 0.834 on the ChaLearn-2014 Gesture Recognition Test dataset, and was ranked 2nd overall in the competition.
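The two core ideas of the abstract — pooling descriptive statistics of skeleton data over windows at several temporal scales, and scoring each window with a bank of one-versus-all detectors — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the window scales, and the reduction of skeleton data to a single scalar joint coordinate are all simplifying assumptions.

```python
import math

def window_stats(values):
    """Descriptive statistics (mean, std, min, max) of a list of floats."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [mean, math.sqrt(var), min(values), max(values)]

def multiscale_features(joint_track, center, scales=(8, 16, 32)):
    """Pool position and velocity statistics over windows of several
    temporal scales centered on frame index `center`.

    `joint_track` is a list of per-frame values for one normalized joint
    coordinate (a real feature vector would cover all joints and angles).
    """
    feats = []
    for scale in scales:
        lo = max(0, center - scale // 2)
        hi = min(len(joint_track), center + scale // 2)
        window = joint_track[lo:hi]
        # Frame-to-frame differences approximate joint velocities.
        velocities = [b - a for a, b in zip(window, window[1:])]
        feats += window_stats(window)
        feats += window_stats(velocities) if velocities else [0.0] * 4
    return feats

def detect(feature_vec, detectors, threshold=0.0):
    """One-versus-all decision: each detector is a (gesture name, scoring
    function) pair; return the best-scoring gesture, or None when no
    score clears the threshold (i.e., no gesture is present)."""
    name, score = max(((n, f(feature_vec)) for n, f in detectors),
                      key=lambda p: p[1])
    return name if score > threshold else None
```

With the default scales, each window contributes 4 position and 4 velocity statistics, so a single joint coordinate yields a 24-dimensional feature vector. In the paper each per-gesture scoring function is a boosted classifier; here any callable stands in for it.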

Keywords

Gesture recognition · Boosting methods · One-vs-all · Multi-modal fusion · Feature pooling

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Charles River Analytics, Cambridge, USA