Few-Shot Action Recognition with Permutation-Invariant Attention

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12350)


Many few-shot learning models focus on recognising images. In contrast, we tackle the challenging task of few-shot action recognition from videos. We build on a C3D encoder for spatio-temporal video blocks to capture short-range action patterns. The encoded blocks are aggregated by permutation-invariant pooling, which makes our approach robust to varying action lengths and to long-range temporal dependencies whose patterns are unlikely to repeat even in clips of the same class. The pooled representations are then combined into simple relation descriptors that encode so-called query and support clips. Finally, the relation descriptors are fed to a comparator that learns the similarity between query and support clips. Importantly, to re-weight block contributions during pooling, we exploit spatial and temporal attention modules and self-supervision. In naturalistic clips of the same class there exists a temporal distribution shift: the locations of discriminative temporal action hotspots vary. Thus, we permute the blocks of a clip and align the resulting attention regions with the similarly permuted attention regions of the non-permuted clip, which trains the attention mechanism to be invariant to block (and thus long-term hotspot) permutations. Our method outperforms the state of the art on the HMDB51, UCF101 and miniMIT datasets.
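
The pooling and attention-alignment ideas in the abstract can be illustrated with a small NumPy sketch. It is only a toy under stated assumptions, not the method of the paper: the abstract does not specify the attention modules or the alignment loss, so the sigmoid attention with a positional bias (`attention_scores`), the attention-weighted mean (`pooled_descriptor`) and the squared-error `alignment_loss` are all hypothetical stand-ins, and random vectors replace C3D block features.

```python
import numpy as np

def attention_scores(blocks, w, pos_bias):
    """Toy temporal attention (hypothetical): sigmoid of a linear score per block
    plus a bias tied to the block's position in the sequence."""
    return 1.0 / (1.0 + np.exp(-(blocks @ w + pos_bias)))

def pooled_descriptor(blocks, attn):
    """Attention-weighted mean over blocks. Reordering blocks together with
    their weights leaves the result unchanged (permutation invariance)."""
    return (attn[:, None] * blocks).sum(axis=0) / attn.sum()

def alignment_loss(attn_of_permuted, attn_original, perm):
    """Mean squared gap between the attention computed on the permuted clip
    and the identically permuted attention of the original clip (assumed form)."""
    return float(np.mean((attn_of_permuted - attn_original[perm]) ** 2))

rng = np.random.default_rng(0)
blocks = rng.normal(size=(6, 8))        # 6 temporal blocks, 8-dim features (stand-in for C3D outputs)
w = rng.normal(size=8)                  # toy attention parameters
perm = np.array([2, 0, 1, 5, 3, 4])     # a block permutation with no fixed points

# Case 1: attention ignores position, so it already commutes with permutation.
no_bias = np.zeros(6)
attn = attention_scores(blocks, w, no_bias)
attn_perm = attention_scores(blocks[perm], w, no_bias)
loss_free = alignment_loss(attn_perm, attn, perm)

# Case 2: attention depends on position, so the alignment loss is positive
# and minimising it would push the attention towards permutation invariance.
bias = np.linspace(-1.0, 1.0, 6)
attn_b = attention_scores(blocks, w, bias)
attn_b_perm = attention_scores(blocks[perm], w, bias)
loss_biased = alignment_loss(attn_b_perm, attn_b, perm)

# Pooling itself is order-independent once the weights follow the blocks.
desc = pooled_descriptor(blocks, attn)
desc_perm = pooled_descriptor(blocks[perm], attn[perm])
```

In this toy setting the position-free attention gives an alignment loss of essentially zero, while the position-dependent attention gives a strictly positive loss, which is the quantity a self-supervised alignment term of this kind would drive down during training.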



This research is supported in part by the Australian Research Council through the Australian Centre for Robotic Vision (CE140100016), an Australian Research Council grant (DE140100180), and the China Scholarship Council (CSC Student ID 201603170283). Hongdong Li is funded in part by ARC-DP (190102261) and ARC-LE (190100080). We thank CSIRO Scientific Computing, NVIDIA (GPU grant) and the National University of Defense Technology.

Supplementary material

504441_1_En_31_MOESM1_ESM.pdf (135 kb)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Australian National University, Canberra, Australia
  2. University of Oxford, Oxford, UK
  3. Data61/CSIRO, Canberra, Australia
  4. The University of Hong Kong, Pokfulam, Hong Kong, China
  5. Australian Centre for Robotic Vision, Brisbane City, Australia
