
Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video

Published in: International Journal of Computer Vision

Abstract

Recent studies have demonstrated the power of recurrent neural networks for machine translation, image captioning, and speech recognition. For the task of capturing temporal structure in video, however, numerous open research questions remain. Much current work accounts for the temporal aspect of video with a simple temporal feature-pooling strategy. We demonstrate that this method is insufficient for gesture recognition, where temporal information is more discriminative than in general video classification tasks. We explore deep architectures for gesture recognition in video and propose a new end-to-end trainable neural network architecture incorporating temporal convolutions and bidirectional recurrence. Our main contributions are twofold: first, we show that recurrence is crucial for this task; second, we show that adding temporal convolutions leads to significant improvements. We evaluate the different approaches on the Montalbano gesture recognition dataset, where we achieve state-of-the-art results.
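The architecture the abstract describes pairs temporal convolutions with bidirectional recurrence on top of per-frame features. Below is a minimal sketch of that idea in PyTorch; the class name TemporalConvBiRNN, the layer sizes, and the 21 output classes are illustrative assumptions, not the authors' exact configuration.

    # Illustrative sketch only: temporal convolutions over per-frame CNN
    # features, followed by a bidirectional LSTM for frame-level gesture
    # classification. Layer sizes and names are assumptions, not the
    # paper's exact network.
    import torch
    import torch.nn as nn

    class TemporalConvBiRNN(nn.Module):
        def __init__(self, feat_dim=512, hidden=256, num_classes=21):
            super().__init__()
            # Conv1d slides along the time axis, mixing neighbouring frames.
            self.temporal_conv = nn.Sequential(
                nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
                nn.ReLU(),
            )
            # A bidirectional LSTM captures longer-range dependencies in
            # both temporal directions.
            self.birnn = nn.LSTM(hidden, hidden, batch_first=True,
                                 bidirectional=True)
            self.classifier = nn.Linear(2 * hidden, num_classes)

        def forward(self, x):
            # x: (batch, time, feat_dim) per-frame features from a 2D CNN
            h = self.temporal_conv(x.transpose(1, 2)).transpose(1, 2)
            h, _ = self.birnn(h)       # (batch, time, 2 * hidden)
            return self.classifier(h)  # per-frame class scores

    # Example: a batch of 8 clips, 64 frames each, 512-dim frame features.
    scores = TemporalConvBiRNN()(torch.randn(8, 64, 512))
    print(scores.shape)  # torch.Size([8, 64, 21])

By contrast, a temporal-pooling baseline would collapse the time axis (e.g., by averaging frame features) before classification, discarding exactly the frame ordering that the abstract argues is discriminative for gestures.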



Acknowledgments

We would like to thank NVIDIA Corporation for the donation of a GPU used for this research. The research leading to these results has received funding from the Agency for Innovation by Science and Technology in Flanders (IWT).

Author information


Corresponding author

Correspondence to Lionel Pigou.

Additional information

Communicated by Greg Mori.

A. van den Oord and S. Dieleman: Now at Google DeepMind.


About this article


Cite this article

Pigou, L., van den Oord, A., Dieleman, S. et al. Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video. Int J Comput Vis 126, 430–439 (2018). https://doi.org/10.1007/s11263-016-0957-7

