Zero-shot action recognition by clustered representation with redundancy-free features

  • Original Paper
  • Published in Machine Vision and Applications

Abstract

Zero-shot action recognition (ZSAR) is a practical yet challenging problem: it compensates for a key shortcoming of conventional action recognition by recognizing action classes that have no visual examples during training. However, existing zero-shot action recognition methods overlook the fact that the generated features contain many outliers, which harm recognition. We propose a new ZSAR method that suppresses this defect through a clustered representation built on redundancy-free features. In addition, a generative adversarial network (GAN) with a gradient penalty is trained to synthesize stable features, addressing the data-imbalance problem and alleviating the instability of features generated by existing methods. To reduce feature dimensionality and the subsequent computation, redundancy-free features are introduced into ZSAR. Experiments on the public Olympic Sports, HMDB51, and UCF101 datasets show that our method outperforms state-of-the-art approaches in zero-shot action recognition, with absolute gains of 1.8%, 0.3%, and 1.7%, respectively.
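The "GAN with gradient penalty" described in the abstract corresponds to the standard WGAN-GP formulation (Gulrajani et al.). The following is a minimal PyTorch sketch of that penalty term applied to feature vectors, offered purely as an illustration of the technique: the names `critic`, `real_feats`, and `fake_feats`, and the weight `lambda_gp = 10.0`, are assumptions for this sketch, not the authors' implementation.

```python
import torch

def gradient_penalty(critic, real_feats, fake_feats, lambda_gp=10.0):
    """Standard WGAN-GP term: penalize the critic's gradient norm on
    random interpolates between real and synthesized features.
    A generic sketch, not the paper's exact code."""
    batch_size = real_feats.size(0)
    # One random interpolation coefficient per sample, broadcast over features.
    eps = torch.rand(batch_size, 1, device=real_feats.device)
    interpolates = (eps * real_feats + (1.0 - eps) * fake_feats).requires_grad_(True)
    scores = critic(interpolates)
    # Gradient of the critic's output with respect to the interpolated inputs.
    grads = torch.autograd.grad(
        outputs=scores,
        inputs=interpolates,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,
        retain_graph=True,
    )[0]
    # Penalize deviation of the per-sample gradient norm from 1.
    return lambda_gp * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

# Typical use in the critic's training step (hypothetical variable names):
#   d_loss = critic(fake_feats).mean() - critic(real_feats).mean() \
#            + gradient_penalty(critic, real_feats, fake_feats)
```

The penalty stabilizes feature synthesis by keeping the critic approximately 1-Lipschitz, which is one standard way to obtain the "stable features" the abstract refers to.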


Data availability

The public Olympic Sports dataset used in this study is available in the Stanford repository, http://vision.stanford.edu/Datasets/OlympicSports/. The UCF101 dataset is available in the University of Central Florida repository, https://www.crcv.ucf.edu/data/UCF101.php. The HMDB51 dataset is available in the Serre Lab repository, https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/.


Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 51678075) and the Science and Technology Project of Hunan (Grant No. 2017GK2271).

Author information

Authors and Affiliations

Authors

Contributions

LX proposed the research topic, guided the design of the research proposal and the conduct of the experiments, completed part of the writing, and proofread the whole text. XW participated in the experimental design, carried out the research, and wrote part of the manuscript. All authors reviewed the manuscript.

Corresponding author

Correspondence to Xin Wen.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethics approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xia, L., Wen, X. Zero-shot action recognition by clustered representation with redundancy-free features. Machine Vision and Applications 34, 116 (2023). https://doi.org/10.1007/s00138-023-01470-7
