
Memory-Augmented Dense Predictive Coding for Video Representation Learning

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12348)

Abstract

The objective of this paper is self-supervised learning from video, in particular learning representations for action recognition. We make the following contributions: (i) We propose a new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), for the task. It is trained with a predictive attention mechanism over a set of compressed memories, such that any future state can always be constructed as a convex combination of the condensed representations, allowing multiple hypotheses to be made efficiently. (ii) We investigate visual-only self-supervised video representation learning from RGB frames, from unsupervised optical flow, or from both. (iii) We thoroughly evaluate the quality of the learnt representation on four different downstream tasks: action recognition, video retrieval, learning with scarce annotations, and unintentional action classification. In all cases, we demonstrate state-of-the-art or comparable performance against other approaches, while using orders of magnitude less training data.
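To make the memory mechanism described above concrete, here is a minimal sketch (not the authors' released code) of a predictive-attention read over a compressed memory bank. All names and sizes (MemoryAddressing, num_slots, dim) are illustrative assumptions; the point is that softmax attention weights are non-negative and sum to one, so the predicted future state always lies in the convex hull of the memory slots.

```python
# Minimal sketch, assuming a learnable memory bank of "condensed"
# representations; names and sizes are illustrative, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAddressing(nn.Module):
    def __init__(self, num_slots: int = 1024, dim: int = 256):
        super().__init__()
        # Compressed memory bank: num_slots learnable condensed representations.
        self.memory = nn.Parameter(torch.randn(num_slots, dim) * 0.01)

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, dim), e.g. predicted from the aggregated past context.
        attn = F.softmax(query @ self.memory.t(), dim=-1)  # (batch, num_slots)
        # Softmax weights are non-negative and sum to 1, so the returned
        # future state is a convex combination of the memory slots.
        return attn @ self.memory                          # (batch, dim)

# Usage: hypothesise a future embedding from a context vector.
mem = MemoryAddressing()
context = torch.randn(8, 256)
future_hat = mem(context)  # (8, 256); lies in the convex hull of the memory
```

Because every prediction is constrained to this convex hull, the model can express several plausible futures simply by shifting attention mass across slots, rather than regressing a single point estimate.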


Acknowledgements

Funding for this research is provided by a Google-DeepMind Graduate Scholarship, and by the EPSRC Programme Grant Seebibyte EP/M013774/1. We would like to thank João F. Henriques, Samuel Albanie and Triantafyllos Afouras for helpful discussions.

Supplementary material

504435_1_En_19_MOESM1_ESM.pdf — Supplementary material 1 (PDF, 3051 KB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

1. Visual Geometry Group, Department of Engineering Science, University of Oxford, Oxford, UK
