Abstract
Zero-shot action recognition, which recognizes actions in videos without any training examples of those actions, is attracting wide attention because it saves labeling costs and training time. Nevertheless, the accuracy of zero-shot learning remains unsatisfactory, which limits its practical application. To address this problem, this study proposes a framework that improves zero-shot action recognition through human instruction with text descriptions. The framework requires video contents to be described manually, which incurs some labor cost; in many situations, this cost is worthwhile. Specifically, we manually annotate each action with text features, each of which can be a word, a phrase, or a sentence. The class of a video is then predicted by computing the matching degree between the video and every text feature. Furthermore, the proposed model can be combined with other models to improve their accuracy, and it can be continuously refined by repeating the human instruction. Experiments on UCF101 and HMDB51 showed that our model achieved the best accuracy and improved the accuracy of other models.
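To make the matching step concrete, the sketch below shows one way the classification rule described above could look in code: each class carries a set of manually annotated text features, and the predicted class is the one whose features best match the video. This is a minimal illustration under stated assumptions, not the paper's implementation: the encoders are left abstract, the matching degree is assumed to be mean cosine similarity, and the names (predict_class, class_text_embeddings) are hypothetical. In practice the embeddings could come from, for example, a CLIP-style joint vision-language model.

```python
import numpy as np

def predict_class(video_embedding, class_text_embeddings):
    """Predict an action class by matching a video against annotated text.

    video_embedding: array of shape (d,), from some pretrained video encoder.
    class_text_embeddings: dict mapping class name -> array of shape (k, d),
        the embeddings of the k text features (words, phrases, or sentences)
        manually annotated for that class.
    """
    v = video_embedding / np.linalg.norm(video_embedding)
    scores = {}
    for cls, feats in class_text_embeddings.items():
        t = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        # Matching degree (assumed here): mean cosine similarity between
        # the video and all text features annotated for this class.
        scores[cls] = float((t @ v).mean())
    return max(scores, key=scores.get)

# Toy usage with random vectors standing in for real encoder outputs.
rng = np.random.default_rng(0)
annotations = {
    "archery": rng.normal(size=(3, 512)),  # e.g., "bow", "aim", "shoot an arrow"
    "bowling": rng.normal(size=(2, 512)),  # e.g., "roll a ball", "knock down pins"
}
print(predict_class(rng.normal(size=512), annotations))
```

Because the per-class text sets are editable, repeating the human instruction, that is, adding or revising text features for classes the model confuses, directly changes the scores without retraining any encoder.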
Data availability
The UCF101 and HMDB51 datasets used in this study are publicly available on the Internet. The data presented in this study are available on request from the corresponding author.
References
Lampert CH, Nickisch H, Harmeling S (2009) Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp 951–958. IEEE
Lampert CH, Nickisch H, Harmeling S (2013) Attribute-based classification for zero-shot visual object categorization. IEEE Trans Pattern Anal Mach Intell 36(3):453–465
Qin J, Liu L, Shao L, Shen F, Ni B, Chen J, Wang Y (2017) Zero-shot action recognition with error-correcting output codes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2833–2842
Zhu Y, Yang L, Yu G, Newsam S, Shao L (2018) Towards universal representation for unseen action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 9436–9445
Gao J, Zhang T, Xu C (2019) I know the relationships: Zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, vol 33, pp 8303–8311
Soomro K, Zamir AR, Shah M (2012) UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402
Wang H, Schmid C (2013) Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, pp 3551–3558
Wang H, Kläser A, Schmid C, Liu C-L (2013) Dense trajectories and motion boundary descriptors for action recognition. Int J Comput Vis 103(1):60–79
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pp 568–576
Yang C, Xu Y, Shi J, Dai B, Zhou B (2020) Temporal pyramid network for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 591–600
Li Y, Ji B, Shi X, Zhang J, Kang B, Wang L (2020) TEA: Temporal excitation and aggregation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 909–918
Majd M, Safabakhsh R (2019) A motion-aware ConvLSTM network for action recognition. Appl Intell 49(7):2515–2521
Elharrouss O, Almaadeed N, Al-Maadeed S, Bouridane A, Beghdadi A (2021) A combined multiple action recognition and summarization for surveillance video sequences. Appl Intell 51(2):690–712
Gao R, Oh T-H, Grauman K, Torresani L (2020) Listen to look: Action recognition by previewing audio. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10457–10467
Shi L, Zhang Y, Cheng J, Lu H (2019) Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 7912–7921
Cheng K, Zhang Y, He X, Chen W, Cheng J, Lu H (2020) Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 183–192
Li M, Chen S, Xu C, Zhang Y, Wang Y, Tian Q (2019) Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3595–3603
Liu Z, Zhang H, Chen Z, Wang Z, Ouyang W (2020) Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 143–152
Franco A, Magnani A, Maio D (2020) A multimodal approach for human activity recognition based on skeleton and RGB data. Pattern Recogn Lett 131:293–299
Wu N, Kawamoto K (2021) Zero-shot action recognition with three-stream graph convolutional networks. Sensors 21(11):3793
Frome A, Corrado GS, Shlens J, Bengio S, Dean J, Ranzato MA, Mikolov T (2013) DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pp 2121–2129
Kodirov E, Xiang T, Gong S (2017) Semantic autoencoder for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3174–3183
Kong D, Li X, Wang S, Li J, Yin B (2022) Learning visual-and-semantic knowledge embedding for zero-shot image classification. Appl Intell, pp 1–15
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
Pennington J, Socher R, Manning CD (2014) GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 1532–1543
Ye M, Guo Y (2019) Progressive ensemble networks for zero-shot recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 11728–11736
Xu W, Xian Y, Wang J, Schiele B, Akata Z (2022) VGSE: Visually-grounded semantic embeddings for zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 9316–9325
Huang P, Han J, Cheng D, Zhang D (2022) Robust region feature synthesizer for zero-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 7622–7631
Hahn M, Silva A, Rehg JM (2019) Action2vec: A crossmodal embedding approach to action learning. arXiv preprint arXiv:1901.00484
Ji S, Xu W, Yang M, Yu K (2012) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Liu K, Liu W, Ma H, Huang W, Dong X (2019) Generalized zero-shot learning for action recognition with web-scale video data. World Wide Web 22(2):807–824
Kerrigan A, Duarte K, Rawat Y, Shah M (2021) Reformulating zero-shot action recognition for multi-label actions. Adv Neural Inf Proces Syst 34:25566–25577
Xing M, Feng Z, Su Y, Peng W, Zhang J (2021) Ventral & dorsal stream theory based zero-shot action recognition. Pattern Recogn 116:107953
Hirschman L, Gaizauskas R (2001) Natural language question answering: the view from here. Nat Lang Eng 7(4):275–300
Choi E, He H, Iyyer M, Yatskar M, Yih W-T, Choi Y, Liang P, Zettlemoyer L (2018) QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp 2174–2184. Association for Computational Linguistics
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D (2015) VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp 2425–2433
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, pp 1–14
Dancette C, Cadene R, Teney D, Cord M (2021) Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1574–1583
Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6077–6086
Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6299–6308
Mettes P, Koelma DC, Snoek CGM (2016) The ImageNet shuffle: Reorganized pre-training for video event detection. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, pp 175–182
Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp 248–255. IEEE
Cao Z, Hidalgo Martinez G, Simon T, Wei S, Sheikh YA (2019) OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans Pattern Anal Mach Intell:1–1
Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-second AAAI Conference on Artificial Intelligence, 32(1)
Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J et al (2021) Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning. PMLR, pp 8748–8763
Wang X, Wu J, Chen J, Li L, Wang Y-F, Wang WY (2019) VaTeX: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 4581–4591
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp 4171–4186. Association for Computational Linguistics
Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: A large video database for human motion recognition. In 2011 International Conference on Computer Vision, pp 2556–2563. IEEE
Carreira J, Noland E, Banki-Horvath A, Hillier C, Zisserman A (2018) A short note about Kinetics-600. arXiv preprint arXiv:1808.01340
Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P et al (2017) The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950
Romera-Paredes B, Torr P (2015) An embarrassingly simple approach to zero-shot learning. In: Proceedings of the 32nd International Conference on Machine Learning, pp 2152–2161. PMLR
Mandal D, Narayan S, Dwivedi SK, Gupta V, Ahmed S, Khan FS, Shao L (2019) Out-of-distribution detection for generalized zero-shot action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 9985–9993
Mishra A, Pandey A, Murthy HA (2020) Zero-shot learning for action recognition using synthesized features. Neurocomputing 390:117–130
Su Y, Xing M, An S, Peng W, Feng Z (2021) VDARN: Video disentangling attentive relation network for few-shot and zero-shot action recognition. Ad Hoc Netw 113:102380
Chen S, Dong H (2021) Elaborative rehearsal for zero-shot action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 13638–13647
Gao Z, Hou Y, Li W, Guo Z, Yu B (2022) Learning using privileged information for zero-shot action recognition. In Proceedings of the Asian Conference on Computer Vision, pp 773–788
Lin C-C, Lin K, Wang L, Liu Z, Li L (2022) Cross-modal representation learning for zero-shot action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19978–19988
Funding
This work was supported by the Japan Society for the Promotion of Science KAKENHI Grant Numbers JP19K12039 and JP22H03658.
Author information
Contributions
N. Wu: Conceptualisation, Methodology, Software, Writing - original draft. H. Kera: Writing - review & editing, Supervision. K. Kawamoto: Conceptualisation, Writing - review & editing, Supervision. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical and informed consent for data used
Not applicable.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wu, N., Kera, H. & Kawamoto, K. Improving zero-shot action recognition using human instruction with text description. Appl Intell 53, 24142–24156 (2023). https://doi.org/10.1007/s10489-023-04808-w