
Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video

Published in: International Journal of Computer Vision

Abstract

Recent studies have demonstrated the power of recurrent neural networks for machine translation, image captioning, and speech recognition. For the task of capturing temporal structure in video, however, numerous open research questions remain. Much current work accounts for the temporal aspect of video with a simple temporal feature-pooling strategy. We demonstrate that this method is insufficient for gesture recognition, where temporal information is more discriminative than in general video classification tasks. We explore deep architectures for gesture recognition in video and propose a new end-to-end trainable neural network architecture incorporating temporal convolutions and bidirectional recurrence. Our main contributions are twofold: first, we show that recurrence is crucial for this task; second, we show that adding temporal convolutions leads to significant improvements. We evaluate the different approaches on the Montalbano gesture recognition dataset, where we achieve state-of-the-art results.
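The architecture the abstract describes pairs temporal convolutions with bidirectional recurrence on top of per-frame features. Below is a minimal sketch of that idea in PyTorch; the class name TemporalConvBiRNN, the layer sizes, and the 21 output classes are illustrative assumptions, not the authors' exact configuration.

    # Illustrative sketch only: temporal convolutions over per-frame CNN
    # features, followed by a bidirectional LSTM for frame-level gesture
    # classification. Layer sizes and names are assumptions, not the
    # paper's exact network.
    import torch
    import torch.nn as nn

    class TemporalConvBiRNN(nn.Module):
        def __init__(self, feat_dim=512, hidden=256, num_classes=21):
            super().__init__()
            # Conv1d slides along the time axis, mixing neighbouring frames.
            self.temporal_conv = nn.Sequential(
                nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
                nn.ReLU(),
            )
            # A bidirectional LSTM captures longer-range dependencies in
            # both temporal directions.
            self.birnn = nn.LSTM(hidden, hidden, batch_first=True,
                                 bidirectional=True)
            self.classifier = nn.Linear(2 * hidden, num_classes)

        def forward(self, x):
            # x: (batch, time, feat_dim) per-frame features from a 2D CNN
            h = self.temporal_conv(x.transpose(1, 2)).transpose(1, 2)
            h, _ = self.birnn(h)       # (batch, time, 2 * hidden)
            return self.classifier(h)  # per-frame class scores

    # Example: a batch of 8 clips, 64 frames each, 512-dim frame features.
    scores = TemporalConvBiRNN()(torch.randn(8, 64, 512))
    print(scores.shape)  # torch.Size([8, 64, 21])

By contrast, a temporal-pooling baseline would collapse the time axis (e.g., by averaging frame features) before classification, discarding exactly the frame ordering that the abstract argues is discriminative for gestures.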



Acknowledgments

We would like to thank NVIDIA Corporation for the donation of a GPU used for this research. The research leading to these results has received funding from the Agency for Innovation by Science and Technology in Flanders (IWT).

Author information


Corresponding author

Correspondence to Lionel Pigou.

Additional information

Communicated by Greg Mori.

A. van den Oord and S. Dieleman: Now at Google DeepMind.


About this article


Cite this article

Pigou, L., van den Oord, A., Dieleman, S. et al. Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video. Int J Comput Vis 126, 430–439 (2018). https://doi.org/10.1007/s11263-016-0957-7

