K-centered Patch Sampling for Efficient Video Recognition

Park, Seong Hyeon; Tack, Jihoon; Heo, Byeongho; Ha, Jung-Woo; Shin, Jinwoo

doi:10.1007/978-3-031-19833-5_10

Seong Hyeon Park¹²,
Jihoon Tack¹²,
Byeongho Heo¹³,
Jung-Woo Ha¹³ &
…
Jinwoo Shin¹²

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13695))

Included in the following conference series:

European Conference on Computer Vision

2047 Accesses
5 Citations

Abstract

For decades, it has been a common practice to choose a subset of video frames for reducing the computational burden of a video understanding model. In this paper, we argue that this popular heuristic might be sub-optimal under recent transformer-based models. Specifically, inspired by that transformers are built upon patches of video frames, we propose to sample patches rather than frames using the greedy K-center search, i.e., the farthest patch to what has been chosen so far is sampled iteratively. We then show that a transformer trained with the selected video patches can outperform its baseline trained with the video frames sampled in the traditional way. Furthermore, by adding a certain spatiotemporal structuredness condition, the proposed K-centered patch sampling can be even applied to the recent sophisticated video transformers, boosting their performance further. We demonstrate the superiority of our method on Something–Something and Kinetics datasets.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
Denotes a vector distance between patches; not a distance between patch’s coordinates in the video.
2.
We omit the last dimension D for simplicity.
3.
We utilize the original ViT-B weights to ensure the fair comparison, unlike Table 1 that utilizes DeiT-base.
4.
Grayscale is only used for sampling, i.e., we use RGB patches for network input.

References

Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: VIVIT: a video vision transformer. In: IEEE International Conference on Computer Vision (2021)
Google Scholar
Baevski, A., Zhou, H., Mohamed, A., Auli, M.: wav2vec 2.0: a framework for self-supervised learning of speech representations. In: Advances in Neural Information Processing Systems (2020)
Google Scholar
Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: International Conference on Machine Learning (2021)
Google Scholar
Bulat, A., Perez-Rua, J.M., Sudhakaran, S., Martinez, B., Tzimiropoulos, G.: Space-time mixing attention for video transformer. In: Advances in Neural Information Processing Systems (2021)
Google Scholar
Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)
Google Scholar
Chen, B., et al.: PSViT: better vision transformer via token pooling and attention sharing. arXiv preprint arXiv:2108.03428 (2021)
Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: RandAugment: practical automated data augmentation with a reduced search space. In: Advances in Neural Information Processing Systems (2020)
Google Scholar
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (2009)
Google Scholar
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Annual Conference of the North American Chapter of the Association for Computational Linguistics (2019)
Google Scholar
Donahue, J., et al.: Long-term recurrent convolutional networks for visual recognition and description. In: IEEE Conference on Computer Vision and Pattern Recognition (2015)
Google Scholar
Dosovitskiy, A., et al.: An image is worth 16 \(\times \) 16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2021)
Google Scholar
Fan, H., et al.: Multiscale vision transformers. In: IEEE International Conference on Computer Vision (2021)
Google Scholar
Feichtenhofer, C.: X3D: expanding architectures for efficient video recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2020)
Google Scholar
Feichtenhofer, C., Fan, H., Malik, J., He, K.: SlowFast networks for video recognition. In: IEEE International Conference on Computer Vision (2019)
Google Scholar
Feichtenhofer, C., Pinz, A., Wildes, R.P.: Spatiotemporal residual networks for video action recognition. In: Advances in Neural Information Processing Systems (2016)
Google Scholar
Gonzalez, T.F.: Clustering to minimize the maximum intercluster distance. Theoret. Comput. Sci. 38, 293–306 (1985)
Article MathSciNet MATH Google Scholar
Gowda, S.N., Rohrbach, M., Sevilla-Lara, L.: Smart frame selection for action recognition. In: AAAI Conference on Artificial Intelligence (2021)
Google Scholar
Goyal, R., et al.: The “something something” video database for learning and evaluating visual common sense. In: IEEE International Conference on Computer Vision (2017)
Google Scholar
Har-Peled, S.: Geometric approximation algorithms. No. 173, American Mathematical Soc. (2011)
Google Scholar
Heo, B., Yun, S., Han, D., Chun, S., Choe, J., Oh, S.J.: Rethinking spatial dimensions of vision transformers. In: IEEE International Conference on Computer Vision (2021)
Google Scholar
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (2014)
Google Scholar
Kay, W., et al.: The kinetics human action video dataset (2017)
Google Scholar
Li, K., et al.: Uniformer: unified transformer for efficient spatiotemporal representation learning. arXiv preprint arXiv:2201.04676 (2022)
Liang, Y., Ge, C., Tong, Z., Song, Y., Wang, J., Xie, P.: Not all patches are what you need: expediting vision transformers via token reorganizations. In: International Conference on Learning Representations (2022)
Google Scholar
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: IEEE International Conference on Computer Vision (2021)
Google Scholar
Liu, Z., et al.: Video swin transformer. arXiv preprint arXiv:2106.13230 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019)
Google Scholar
Marin, D., Chang, J.H.R., Ranjan, A., Prabhu, A., Rastegari, M., Tuzel, O.: Token pooling in vision transformers. arXiv preprint arXiv:2110.03860 (2021)
Meng, L., et al.: AdaViT: adaptive vision transformers for efficient image recognition. arXiv preprint arXiv:2111.15668 (2021)
Meng, Y., et al.: AR-Net: adaptive frame resolution for efficient action recognition. In: European Conference on Computer Vision (2020)
Google Scholar
Meng, Y., et al.: AdaFuse: adaptive temporal fusion network for efficient action recognition. In: International Conference on Learning Representations (2021)
Google Scholar
Micikevicius, P., et al.: Mixed precision training. In: International Conference on Learning Representations (2018)
Google Scholar
Naseer, M., Ranasinghe, K., Khan, S., Hayat, M., Khan, F.S., Yang, M.H.: Intriguing properties of vision transformers. In: Advances in Neural Information Processing Systems (2021)
Google Scholar
Neimark, D., Bar, O., Zohar, M., Asselmann, D.: Video transformer network. arXiv preprint arXiv:2102.00719 (2021)
Patrick, M., et al.: Keeping your eye on the ball: trajectory attention in video transformers. In: Advances in Neural Information Processing Systems (2021)
Google Scholar
Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems (2017)
Google Scholar
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)
Google Scholar
Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: DynamicViT: efficient vision transformers with dynamic token sparsification. In: Advances in Neural Information Processing Systems (2021)
Google Scholar
Schwartz, R., Dodge, J., Smith, N.A., Etzioni, O.: Green AI. arXiv preprint arXiv:1907.10597 (2019)
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems (2014)
Google Scholar
Sun, X., Panda, R., Chen, C.F., Oliva, A., Feris, R., Saenko, K.: Dynamic network quantization for efficient video inference. In: IEEE International Conference on Computer Vision (2021)
Google Scholar
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
Google Scholar
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jegou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning (2021)
Google Scholar
Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: IEEE International Conference on Computer Vision (2015)
Google Scholar
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems (2017)
Google Scholar
Wang, J., et al.: Removing the background by adding the background: Towards background robust self-supervised video representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11804–11813 (2021)
Google Scholar
Wang, J., Yang, X., Li, H., Wu, Z., Jiang, Y.G.: Efficient video transformers with spatial-temporal token selection. arXiv preprint arXiv:2111.11591 (2021)
Wu, Z., Xiong, C., Ma, C.Y., Socher, R., Davis, L.S.: AdaFrame: adaptive frame selection for fast video recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2019)
Google Scholar
Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 305–321 (2018)
Google Scholar
Yan, W., Zhang, Y., Abbeel, P., Srinivas, A.: VideoGPT: video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157 (2021)
Yeung, S., Russakovsky, O., Mori, G., Fei-Fei, L.: End-to-end learning of action detection from frame glimpses in videos. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
Google Scholar
Zhi, Y., Tong, Z., Wang, L., Wu, G.: MGSampler: an explainable sampling strategy for video action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1513–1522 (2021)
Google Scholar

Download references

Acknowledgement

This work was mainly supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub; No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)). This work was partly supported by KAIST-NAVER Hypercreative AI Center.

Author information

Authors and Affiliations

Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
Seong Hyeon Park, Jihoon Tack & Jinwoo Shin
NAVER AI LAB, Paris, France
Byeongho Heo & Jung-Woo Ha

Authors

Seong Hyeon Park
View author publications
You can also search for this author in PubMed Google Scholar
Jihoon Tack
View author publications
You can also search for this author in PubMed Google Scholar
Byeongho Heo
View author publications
You can also search for this author in PubMed Google Scholar
Jung-Woo Ha
View author publications
You can also search for this author in PubMed Google Scholar
Jinwoo Shin
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Seong Hyeon Park .

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 4661 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Park, S.H., Tack, J., Heo, B., Ha, JW., Shin, J. (2022). K-centered Patch Sampling for Efficient Video Recognition. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13695. Springer, Cham. https://doi.org/10.1007/978-3-031-19833-5_10

Download citation

DOI: https://doi.org/10.1007/978-3-031-19833-5_10
Published: 04 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19832-8
Online ISBN: 978-3-031-19833-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

K-centered Patch Sampling for Efficient Video Recognition