International Journal of Computer Vision, Volume 120, Issue 1, pp 61–77

Recognizing an Action Using Its Name: A Knowledge-Based Approach

  • Chuang Gan
  • Yi Yang
  • Linchao Zhu
  • Deli Zhao
  • Yueting Zhuang


Abstract

Existing action recognition algorithms require a set of positive exemplars to train a classifier for each action. However, the number of action classes is very large and users' queries vary dramatically, so it is impractical to pre-define all possible action classes beforehand. To address this issue, we propose to perform action recognition with no positive exemplars, a setting often known as zero-shot learning. Current zero-shot learning paradigms usually train a series of attribute classifiers and then recognize the target actions based on the attribute representation. To ensure maximum coverage of ad-hoc action classes, attribute-based approaches require large numbers of reliable and accurate attribute classifiers, which are often unavailable in the real world. In this paper, we propose an approach that takes only an action name as input and recognizes the action of interest without any pre-trained attribute classifiers or positive exemplars. Given an action name, we first build an analogy pool according to an external ontology; each action in the analogy pool is related to the target action to a different degree. Because the correlation information inferred from the external ontology may be noisy, we propose an algorithm, adaptive multi-model rank-preserving mapping (AMRM), to train a classifier for action recognition that adaptively evaluates the relatedness of each video in the analogy pool. As multiple mapping models are employed, our algorithm is better able to bridge the gap between visual features and the semantic information inferred from the ontology. Extensive experiments demonstrate that our method achieves promising performance for action recognition using only action names, with no attributes or positive exemplars available.
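The pipeline described above, ranking related actions by semantic relatedness to form an analogy pool, then learning a mapping from visual features to the semantic space, can be sketched minimally as follows. This is an illustrative assumption, not the paper's AMRM: the toy embedding vectors stand in for ontology-derived relatedness scores, and a single ridge-regression mapping replaces the adaptive multi-model rank-preserving objective.

```python
import numpy as np

def build_analogy_pool(target_vec, class_names, class_vecs, k=3):
    """Rank candidate actions by cosine similarity to the target action's
    semantic vector and keep the top-k as the analogy pool.
    (Stands in for relatedness inferred from an external ontology.)"""
    sims = class_vecs @ target_vec / (
        np.linalg.norm(class_vecs, axis=1) * np.linalg.norm(target_vec))
    order = np.argsort(-sims)[:k]
    return [(class_names[i], float(sims[i])) for i in order]

def ridge_mapping(X, S, lam=1.0):
    """Learn a single linear map W from visual features X (n x d) to semantic
    vectors S (n x m) by ridge regression: W = (X^T X + lam*I)^(-1) X^T S.
    (A simplification: AMRM instead learns multiple mappings with adaptive
    per-video weights that preserve the relatedness ranking.)"""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ S)

if __name__ == "__main__":
    # Hypothetical 2-D semantic embeddings for three known actions.
    names = ["ride horse", "ride bike", "swim"]
    vecs = np.array([[0.9, 0.1], [0.8, 0.3], [0.0, 1.0]])
    target = np.array([1.0, 0.0])  # embedding of the unseen action name
    print(build_analogy_pool(target, names, vecs, k=2))
```

A test video would then be classified by projecting its features through `W` and comparing the result with the target name's semantic vector, weighted by the analogy-pool similarities.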


Keywords: Action recognition · Semantic correlation · Adaptive multi-model rank-preserving mapping (AMRM)



Acknowledgments

This work was partially supported by the 973 Program (No. 2012CB316400), the National Natural Science Foundation of China (Grants 61033001 and 61361136003), the ARC DECRA (DE130101311), and the ARC DP (DP150103008). This work was done while Chuang Gan was a visiting student at Zhejiang University.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Chuang Gan (1)
  • Yi Yang (2)
  • Linchao Zhu (2)
  • Deli Zhao (3)
  • Yueting Zhuang (4)

  1. IIIS, Tsinghua University, Beijing, China
  2. QCIS, University of Technology Sydney, Australia
  3. HTC Research, Beijing, China
  4. Zhejiang University, Hangzhou, China
