Expanding Language-Image Pretrained Models for General Video Recognition

Ni, Bolin; Peng, Houwen; Chen, Minghao; Zhang, Songyang; Meng, Gaofeng; Fu, Jianlong; Xiang, Shiming; Ling, Haibin

doi:10.1007/978-3-031-19772-7_1

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13664)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Contrastive language-image pretraining has shown great success in learning joint visual-textual representations from web-scale data, demonstrating remarkable “zero-shot” generalization ability on various image tasks. However, how to effectively extend such language-image pretraining methods to the video domain remains an open problem. In this work, we present a simple yet effective approach that adapts pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. Such a module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information to generate discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and generalizes to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using 12× fewer FLOPs than Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms the previous best methods by +32.1% and +23.1% when labeled data is extremely limited. Code and models are available here.
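As a rough illustration of the cross-frame attention idea named in the abstract, the sketch below lets one summary vector per frame attend to every frame of the same video. It is a toy under stated assumptions, not the authors' module: the function name `cross_frame_attention`, the use of a single token per frame, and the absence of learned query/key/value projections are all hypothetical simplifications.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_frame_attention(frame_tokens):
    """Mix information across the frames of one video.

    frame_tokens: list of T vectors (lists of floats), one per frame.
    Returns T vectors of the same size: each input vector plus an
    attention-weighted average of all frame vectors (residual connection).
    Assumption: queries, keys and values are the raw tokens themselves.
    """
    dim = len(frame_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    mixed = []
    for query in frame_tokens:
        # Scaled dot-product similarity between this frame and every frame.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in frame_tokens]
        weights = softmax(scores)
        # Weighted average of all frame vectors, one dimension at a time.
        context = [sum(w * value[d] for w, value in zip(weights, frame_tokens))
                   for d in range(dim)]
        mixed.append([x + c for x, c in zip(query, context)])
    return mixed


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy frame vectors
video_tokens = cross_frame_attention(frames)
```

Each output vector now depends on all frames rather than on its own frame alone, which is the property the abstract attributes to cross-frame attention. A practical module would add learned projections and multiple heads.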

B. Ni and M. Chen: Work done during an internship at Microsoft Research.

Acknowledgements

This research was supported in part by the National Key Research and Development Program of China under Grant No. 2018AAA0100400, and the National Natural Science Foundation of China under Grants 61976208, 62071466 and 62076242, and the InnoHK project. HL was not supported by any fund for this research.

Author information

Authors and Affiliations

NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Bolin Ni, Gaofeng Meng & Shiming Xiang
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Bolin Ni, Gaofeng Meng & Shiming Xiang
CAIR, HK Institute of Science and Innovation, Chinese Academy of Sciences, Hong Kong, China
Gaofeng Meng
Microsoft Research Asia, Beijing, China
Houwen Peng & Jianlong Fu
Stony Brook University, Stony Brook, NY, USA
Minghao Chen & Haibin Ling
University of Rochester, Rochester, USA
Songyang Zhang

Corresponding authors

Correspondence to Houwen Peng or Gaofeng Meng.

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 215 KB)

Cite this paper

Ni, B. et al. (2022). Expanding Language-Image Pretrained Models for General Video Recognition. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13664. Springer, Cham. https://doi.org/10.1007/978-3-031-19772-7_1

DOI: https://doi.org/10.1007/978-3-031-19772-7_1
Published: 28 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19771-0
Online ISBN: 978-3-031-19772-7
eBook Packages: Computer Science, Computer Science (R0)
