Compound Prototype Matching for Few-Shot Action Recognition

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Few-shot action recognition aims to recognize novel action classes using only a small number of labeled training samples. In this work, we propose a novel approach that first summarizes each video into compound prototypes consisting of a group of global prototypes and a group of focused prototypes, and then compares video similarity based on the prototypes. Each global prototype is encouraged to summarize a specific aspect from the entire video, e.g., the start/evolution of the action. Since no clear annotation is provided for the global prototypes, we use a group of focused prototypes to focus on certain timestamps in the video. We compare video similarity by matching the compound prototypes between the support and query videos. The global prototypes are directly matched to compare videos from the same perspective, e.g., to compare whether two actions start similarly. For the focused prototypes, since actions have various temporal variations in the videos, we apply bipartite matching to allow the comparison of actions with different temporal positions and shifts. Experiments demonstrate that our proposed method achieves state-of-the-art results on multiple benchmarks.

Y. Huang and L. Yang—Equal contribution.
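The matching scheme described in the abstract has two parts: global prototypes are compared one-to-one (the i-th support prototype against the i-th query prototype, since each index summarizes the same aspect of a video), while focused prototypes are compared via bipartite matching so that the same action content can match across different temporal positions. The sketch below illustrates this idea only; it is not the authors' implementation, and all function names are hypothetical. Bipartite matching is shown by brute-force search over permutations, which is adequate for the small prototype groups involved (a Hungarian-algorithm solver would be used at scale).

```python
from itertools import permutations
import math


def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def global_similarity(support, query):
    # Global prototypes are matched by index: prototype i in both videos
    # is assumed to summarize the same aspect (e.g., how the action starts),
    # so they are compared directly, one-to-one.
    return sum(cosine(s, q) for s, q in zip(support, query)) / len(support)


def focused_similarity(support, query):
    # Focused prototypes attend to particular timestamps, and the same
    # content may appear at different temporal positions in two videos.
    # Bipartite matching finds the one-to-one assignment of query
    # prototypes to support prototypes that maximizes total similarity.
    n = len(support)
    best = -float("inf")
    for perm in permutations(range(n)):
        total = sum(cosine(support[i], query[perm[i]]) for i in range(n))
        best = max(best, total)
    return best / n
```

For example, if the two focused-prototype groups contain the same two vectors but in swapped order (a temporal shift), index-wise comparison scores them as dissimilar, while bipartite matching recovers the perfect correspondence.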



Acknowledgement

This work is supported by JSPS KAKENHI Grant Numbers JP22K17905 and JP20H04205, and JST AIP Acceleration Research Grant Number JPMJCR20U1.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yifei Huang.

Editor information

Editors and Affiliations

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1968 KB)

Rights and permissions

Reprints and permissions

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Huang, Y., Yang, L., Sato, Y. (2022). Compound Prototype Matching for Few-Shot Action Recognition. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13664. Springer, Cham. https://doi.org/10.1007/978-3-031-19772-7_21

  • DOI: https://doi.org/10.1007/978-3-031-19772-7_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19771-0

  • Online ISBN: 978-3-031-19772-7

  • eBook Packages: Computer Science (R0)
