Skip to main content

Using Deep Convolutional LSTM Networks for Learning Spatiotemporal Features

Part of the Lecture Notes in Computer Science book series (LNIP,volume 12047)

Abstract

This paper explores the use of convolutional LSTMs to simultaneously learn spatial- and temporal-information in videos. A deep network of convolutional LSTMs allows the model to access the entire range of temporal information at all spatial scales. We describe our experiments involving convolutional LSTMs for lipreading that demonstrate the model is capable of selectively choosing which spatiotemporal scales are most relevant for a particular dataset. The proposed deep architecture holds promise in other applications where spatiotemporal features play a vital role without having to specifically cater the design of the network for the particular spatiotemporal features existent within the problem. Our model has comparable performance with the current state of the art achieving 83.4% on the Lip Reading in the Wild (LRW) dataset. Additional experiments indicate convolutional LSTMs may be particularly data hungry considering the large performance increases when fine-tuning on LRW after pretraining on larger datasets like LRS2 (85.2%) and LRS3-TED (87.1%). However, a sensitivity analysis providing insight on the relevant spatiotemporal temporal features allows certain convolutional LSTM layers to be replaced with 2D convolutions decreasing computational cost without performance degradation indicating their usefulness in accelerating the architecture design process when approaching new problems.

Keywords

  • Convolutional LSTMs
  • Deep learning
  • Video analytics
  • Lip Reading
  • Action recognition

This is a preview of subscription content, access via your institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • DOI: 10.1007/978-3-030-41299-9_24
  • Chapter length: 14 pages
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
eBook
USD   89.00
Price excludes VAT (USA)
  • ISBN: 978-3-030-41299-9
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
Softcover Book
USD   119.99
Price excludes VAT (USA)
Fig. 1.
Fig. 2.
Fig. 3.
Fig. 4.
Fig. 5.
Fig. 6.

References

  1. Afouras, T., Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Deep audio-visual speech recognition. arXiv e-prints, September 2018

    Google Scholar 

  2. Afouras, T., Chung, J.S., Zisserman, A.: Deep lip reading: a comparison of models and an online application. arXiv e-prints, June 2018

    Google Scholar 

  3. Afouras, T., Chung, J.S., Zisserman, A.: LRS3-TED: a large-scale dataset for visual speech recognition. arXiv e-prints, September 2018

    Google Scholar 

  4. Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning (2009)

    Google Scholar 

  5. Bian, Y., et al.: Revisiting the effectiveness of off-the-shelf temporal modeling approaches for large-scale video classification. CoRR abs/1708.03805 (2017)

    Google Scholar 

  6. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4724–4733 (2017)

    Google Scholar 

  7. Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Lip reading sentences in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)

    Google Scholar 

  8. Chung, J.S., Zisserman, A.: Lip reading in the wild. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10112, pp. 87–103. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54184-6_6

    CrossRef  Google Scholar 

  9. Chung, J.S., Zisserman, A.: Lip reading in profile. In: British Machine Vision Conference (2017)

    Google Scholar 

  10. Conneau, A., Schwenk, H., Barrault, L., LeCun, Y.: Very deep convolutional networks for natural language processing. CoRR abs/1606.01781 (2016)

    Google Scholar 

  11. Cotterell, R., Mielke, S.J., Eisner, J., Roark, B.: Are all languages equally hard to language-model? In: NAACL-HLT (2018)

    Google Scholar 

  12. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, ICML 2006, pp. 369–376 (2006)

    Google Scholar 

  13. Hanson, A., Pnvr, K., Krishnagopal, S., Davis, L.: Bidirectional convolutional LSTM for the detection of violence in videos. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11130, pp. 280–295. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11012-3_24

    CrossRef  Google Scholar 

  14. Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and imagenet? CoRR abs/1711.09577 (2017)

    Google Scholar 

  15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015)

    Google Scholar 

  16. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

    CrossRef  Google Scholar 

  17. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of International Computer Vision and Pattern Recognition (CVPR 2014) (2014)

    Google Scholar 

  18. Kay, W., et al.: The kinetics human action video dataset. CoRR abs/1705.06950 (2017)

    Google Scholar 

  19. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2014)

    Google Scholar 

  20. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc. (2012)

    Google Scholar 

  21. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. CoRR abs/1701.04128 (2017)

    Google Scholar 

  22. Ng, J.Y.H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: deep networks for video classification. In: Computer Vision and Pattern Recognition (2015)

    Google Scholar 

  23. Pascanu, R., Mikolov, T., Bengio, Y.: Understanding the exploding gradient problem. CoRR abs/1211.5063 (2012)

    Google Scholar 

  24. Paszke, A., et al.: Automatic differentiation in pytorch (2017)

    Google Scholar 

  25. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. (IJCV) 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y

    MathSciNet  CrossRef  Google Scholar 

  26. Shi, X., Chen, Z., Wang, H., Yeung, D., Wong, W., Woo, W.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. CoRR abs/1506.04214 (2015)

    Google Scholar 

  27. Shillingford, B., et al.: Large-scale visual speech recognition. CoRR abs/1807.05162 (2018)

    Google Scholar 

  28. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27, pp. 568–576. Curran Associates, Inc. (2014)

    Google Scholar 

  29. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)

    Google Scholar 

  30. Stafylakis, T., Khan, M.H., Tzimiropoulos, G.: Pushing the boundaries of audiovisual word recognition using residual networks and LSTMs. CoRR abs/1811.01194 (2018)

    Google Scholar 

  31. Stafylakis, T., Tzimiropoulos, G.: Combining residual networks with LSTMs for lipreading. CoRR abs/1703.04105 (2017)

    Google Scholar 

  32. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27, pp. 3104–3112. Curran Associates, Inc. (2014)

    Google Scholar 

  33. Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, inception-resnet and the impact of residual connections on learning. CoRR abs/1602.07261 (2016)

    Google Scholar 

  34. Szegedy, C., et al.: Going deeper with convolutions (2015)

    Google Scholar 

  35. Tran, D., Bourdev, L.D., Fergus, R., Torresani, L., Paluri, M.: C3D: generic features for video analysis. CoRR abs/1412.0767 (2014)

    Google Scholar 

  36. Vaswani, A., et al.: Attention is all you need. CoRR abs/1706.03762 (2017)

    Google Scholar 

  37. Yang, Z., Yang, D., Dyer, C., He, X., Smola, A.J., Hovy, E.H.: Hierarchical attention networks for document classification. In: HLT-NAACL (2016)

    Google Scholar 

  38. Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. CoRR abs/1409.2329 (2014)

    Google Scholar 

  39. Zhang, L., Zhu, G., Shen, P., Song, J.: Learning spatiotemporal features using 3DCNN and convolutional LSTM for gesture recognition, pp. 3120–3128, October 2017. https://doi.org/10.1109/ICCVW.2017.369

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Logan Courtney .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and Permissions

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Verify currency and authenticity via CrossMark

Cite this paper

Courtney, L., Sreenivas, R. (2020). Using Deep Convolutional LSTM Networks for Learning Spatiotemporal Features. In: Palaiahnakote, S., Sanniti di Baja, G., Wang, L., Yan, W. (eds) Pattern Recognition. ACPR 2019. Lecture Notes in Computer Science(), vol 12047. Springer, Cham. https://doi.org/10.1007/978-3-030-41299-9_24

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-41299-9_24

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-41298-2

  • Online ISBN: 978-3-030-41299-9

  • eBook Packages: Computer ScienceComputer Science (R0)