MaCLR: Motion-Aware Contrastive Learning of Representations for Videos

  • Conference paper
  • Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

We present MaCLR, a novel method that explicitly performs cross-modal self-supervised video representation learning from visual and motion modalities. Whereas previous video representation learning methods mostly learn motion cues implicitly from RGB inputs, MaCLR enriches standard contrastive learning objectives for RGB video clips with a cross-modal learning objective between a Motion pathway and a Visual pathway. We show that the representation learned with MaCLR focuses more on foreground motion regions and therefore generalizes better to downstream tasks. To demonstrate this, we evaluate MaCLR on five datasets covering both action recognition and action detection, achieving state-of-the-art self-supervised performance on all of them. Furthermore, we show that the MaCLR representation can be as effective as representations learned with full supervision for action recognition on UCF101 and HMDB51, and that it even outperforms the supervised representation for action recognition on VidSitu and SSv2, and for action detection on AVA.
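The cross-modal objective described above treats the Visual-pathway and Motion-pathway embeddings of the same clip as a positive pair, with the other clips in the batch serving as negatives. Below is a minimal PyTorch sketch of such a symmetric cross-modal InfoNCE loss; it illustrates the general technique rather than the authors' implementation, and the function name, embedding shapes, and temperature value are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a symmetric cross-modal
# InfoNCE loss between per-clip visual and motion embeddings.
import torch
import torch.nn.functional as F

def cross_modal_nce(z_visual: torch.Tensor,
                    z_motion: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """z_visual, z_motion: (B, D) embeddings of the same B clips."""
    z_visual = F.normalize(z_visual, dim=1)
    z_motion = F.normalize(z_motion, dim=1)
    # (B, B) cosine-similarity logits; diagonal entries are the positives.
    logits = z_visual @ z_motion.t() / temperature
    targets = torch.arange(z_visual.size(0), device=z_visual.device)
    # Symmetrize over both directions: visual->motion and motion->visual.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    zv, zm = torch.randn(8, 128), torch.randn(8, 128)
    print(cross_modal_nce(zv, zm).item())
```

In practice this term would be added to a standard RGB-only contrastive loss, as the abstract describes.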

F. Xiao—Work done while at Amazon, now at Meta AI.


Notes

  1. Sometimes also referred to as “RGB” in the literature.


Author information

Corresponding author

Correspondence to Davide Modolo.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Xiao, F., Tighe, J., Modolo, D. (2022). MaCLR: Motion-Aware Contrastive Learning of Representations for Videos. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13695. Springer, Cham. https://doi.org/10.1007/978-3-031-19833-5_21


  • DOI: https://doi.org/10.1007/978-3-031-19833-5_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19832-8

  • Online ISBN: 978-3-031-19833-5

  • eBook Packages: Computer Science, Computer Science (R0)
