Improving zero-shot action recognition using human instruction with text description

Abstract

Zero-shot action recognition, which recognizes actions in videos without any training examples of those actions, is attracting wide attention because it can save labeling costs and training time. Nevertheless, the accuracy of zero-shot learning is still unsatisfactory, which limits its practical application. To address this problem, this study proposes a framework that improves zero-shot action recognition using human instruction with text descriptions. The proposed framework relies on manually provided descriptions of video content, which incurs some labor cost; in many situations, this cost is worthwhile. We manually annotate a text feature for each action, which can be a word, phrase, or sentence. Then, by computing the matching degree between the video and every text feature, we predict the class of the video. Furthermore, the proposed model can be combined with other models to improve their accuracy. In addition, our model can be continuously optimized by repeating the human instruction step, further improving accuracy. Experiments on UCF101 and HMDB51 showed that our model achieved the best accuracy and improved the accuracy of other models.
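
To make the matching step concrete, the following is a minimal sketch rather than the authors' implementation: it assumes each video and each manually annotated text feature has already been mapped into a common embedding space (for example by CLIP-style video and text encoders), uses cosine similarity as the matching degree, and predicts the class whose text feature matches best. The function name, embedding dimension, and similarity choice are illustrative assumptions.

```python
import numpy as np

def predict_action(video_embedding, class_text_embeddings):
    """Predict the action class whose annotated text feature best matches the video.

    video_embedding:       (d,) array, the embedding of the input video.
    class_text_embeddings: dict mapping class name -> (d,) array, the embedding of the
                           manually annotated word, phrase, or sentence for that class.
    """
    # Normalize so that dot products equal cosine similarity.
    v = video_embedding / np.linalg.norm(video_embedding)

    # Matching degree between the video and every annotated text feature.
    scores = {cls: float(v @ (t / np.linalg.norm(t)))
              for cls, t in class_text_embeddings.items()}

    # The predicted class is the one with the highest matching degree.
    return max(scores, key=scores.get)


# Toy usage with random vectors standing in for real video/text encoders.
rng = np.random.default_rng(0)
texts = {"archery": rng.normal(size=512), "biking": rng.normal(size=512)}
video = texts["biking"] + 0.1 * rng.normal(size=512)  # a video resembling "biking"
print(predict_action(video, texts))  # -> "biking"
```
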

Data availability

The UCF101 and HMDB51 datasets used in this study are publicly available on the Internet. The data presented in this study are available on request from the corresponding author.

Funding

This work was supported by the Japan Society for the Promotion of Science KAKENHI Grant Numbers JP19K12039 and JP22H03658.

Author information

Contributions

N. Wu: Conceptualisation, Methodology, Software, Writing - original draft. H. Kera: Writing - review & editing, Supervision. K. Kawamoto: Conceptualisation, Writing - review & editing, Supervision. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Kazuhiko Kawamoto.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical and informed consent for data used

Not applicable.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wu, N., Kera, H. & Kawamoto, K. Improving zero-shot action recognition using human instruction with text description. Appl Intell 53, 24142–24156 (2023). https://doi.org/10.1007/s10489-023-04808-w
