
Static and Dynamic Concepts for Self-supervised Video Representation Learning

  • Conference paper
  • Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

In this paper, we propose a novel learning scheme for self-supervised video representation learning. Motivated by how humans understand videos, we propose to first learn general visual concepts and then attend to discriminative local areas for video understanding. Specifically, we utilize the static frame and the frame difference to help decouple static and dynamic concepts, and align the two concept distributions in the latent space accordingly. We add diversity and fidelity regularizations to guarantee that we learn a compact set of meaningful concepts. We then employ a cross-attention mechanism to aggregate detailed local features under different concepts, and filter out redundant concepts with low activations to perform local concept contrast. Extensive experiments demonstrate that our method distills meaningful static and dynamic concepts to guide video understanding, and obtains state-of-the-art results on UCF-101, HMDB-51, and Diving-48.
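The two core ideas of the abstract, decoupling static and dynamic cues via the static frame and frame differences, and aggregating local features per concept with cross-attention while filtering low-activation concepts, can be sketched as follows. This is a minimal illustration with hypothetical shapes and an assumed activation threshold, not the authors' implementation:

```python
import numpy as np

def decompose_clip(clip):
    """Split a video clip into static and dynamic cues.

    clip: (T, H, W, C) array of frames.
    Returns a single static frame (here simply the first frame; which
    frame to sample is an assumption) and the differences between
    consecutive frames, which carry the motion signal.
    """
    static = clip[0]                  # static appearance cue
    dynamic = clip[1:] - clip[:-1]    # frame differences: dynamic cue
    return static, dynamic

def concept_cross_attention(features, concepts, tau=0.2):
    """Toy cross-attention aggregating local features under concepts.

    features: (N, D) local feature vectors (keys/values).
    concepts: (K, D) concept prototypes (queries).
    Returns (K, D) aggregated features and a boolean mask marking
    concepts whose peak attention weight exceeds tau; thresholding the
    peak weight is a stand-in for the paper's activation-based
    filtering, and the value of tau is an assumption.
    """
    # scaled dot-product attention of each concept over local features
    logits = concepts @ features.T / np.sqrt(features.shape[1])  # (K, N)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                      # rows sum to 1
    aggregated = attn @ features                                 # (K, D)
    keep = attn.max(axis=1) > tau   # drop diffuse, low-activation concepts
    return aggregated, keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.random((8, 4, 4, 3))            # T=8 toy frames
    static, dynamic = decompose_clip(clip)
    feats = rng.random((16, 8))                # N=16 local features, D=8
    protos = rng.random((4, 8))                # K=4 concept prototypes
    agg, keep = concept_cross_attention(feats, protos)
    print(static.shape, dynamic.shape, agg.shape, keep.shape)
```

In the actual method the static and dynamic streams would each pass through a backbone before the concept distributions are aligned; here only the input construction and the aggregation step are shown.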


Notes

  1. For simplicity, we use the same symbol to denote the backbone for sdv.


Acknowledgements

This work is supported by GRF 14205719, TRS T41-603/20-R, Centre for Perceptual and Interactive Intelligence, and CUHK Interdisciplinary AI Research Institute.

Author information


Corresponding author

Correspondence to Dahua Lin.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 180 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Qian, R., Ding, S., Liu, X., Lin, D. (2022). Static and Dynamic Concepts for Self-supervised Video Representation Learning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13686. Springer, Cham. https://doi.org/10.1007/978-3-031-19809-0_9

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-19809-0_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19808-3

  • Online ISBN: 978-3-031-19809-0

  • eBook Packages: Computer Science (R0)
