International Journal of Computer Vision, Volume 123, Issue 3, pp 309–333

Transductive Zero-Shot Action Recognition by Word-Vector Embedding



The number of categories for action recognition is growing rapidly, and it has become increasingly hard to label sufficient training data for learning conventional models for all categories. Instead of collecting ever more data and labelling it exhaustively for all categories, an attractive alternative approach is "zero-shot learning" (ZSL). To that end, in this study we construct a mapping between visual features and a semantic descriptor of each action category, allowing new categories to be recognised in the absence of any visual training data. Existing ZSL studies focus primarily on still images and attribute-based semantic representations. In this work, we explore word-vectors as the shared semantic space in which to embed videos and category labels for ZSL action recognition. This is a more challenging problem than existing ZSL with still images and/or attributes, because the mapping between video space-time features of actions and the semantic space is more complex and harder to learn so that it generalises across cross-category domain shift. To solve this generalisation problem in ZSL action recognition, we investigate a series of synergistic strategies to improve upon the standard ZSL pipeline. Most of these strategies are transductive in nature, meaning that they require access to the testing data during the training phase. First, we significantly enhance the semantic space mapping by proposing manifold-regularized regression and data augmentation strategies. Second, we evaluate two existing post-processing strategies (transductive self-training and hubness correction), and show that they are complementary. We evaluate our model extensively on a wide range of human action datasets, including HMDB51, UCF101 and Olympic Sports, and event datasets, including CCV and TRECVID MED 13. The results demonstrate that our approach achieves state-of-the-art zero-shot action recognition performance with a simple and efficient pipeline, and without supervised annotation of attributes.
Finally, we present an in-depth analysis of why and when zero-shot recognition works, including demonstrating the ability to predict cross-category transferability in advance.
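As context for the pipeline described above, the following is a minimal sketch of the standard word-vector-embedding approach to zero-shot recognition: a closed-form ridge regression maps visual features into the semantic word-vector space, and unseen categories are recognised by nearest-neighbour matching against their word vectors. All dimensions, class names and data here are illustrative toy assumptions; they stand in for the paper's actual space-time video descriptors and word2vec vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_sem, n_per = 20, 5, 30  # toy visual/semantic dims, videos per class

# Hypothetical per-class word vectors; in the paper these come from word2vec.
classes = ["run", "jump", "kick", "throw", "wave", "swim", "climb"]
seen, unseen = classes[:5], classes[5:]
wv = {c: rng.normal(size=d_sem) for c in classes}

# Synthetic "videos": visual features generated linearly from word vectors,
# standing in for real space-time descriptors of action videos.
B = rng.normal(size=(d_sem, d_vis))
def make_videos(label, n):
    return wv[label] @ B + 0.05 * rng.normal(size=(n, d_vis))

X = np.vstack([make_videos(c, n_per) for c in seen])       # (150, 20) visual
S = np.vstack([np.tile(wv[c], (n_per, 1)) for c in seen])  # (150, 5) semantic

# Closed-form ridge regression from visual space to semantic space:
#   W = (X^T X + lam*I)^{-1} X^T S
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(d_vis), X.T @ S)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(x):
    z = x @ W  # embed an unseen-class video into the word-vector space
    return max(unseen, key=lambda c: cosine(z, wv[c]))  # NN over unseen labels

print(predict(make_videos("swim", 1)[0]))
```

Note that the regressor is trained only on seen-class videos; zero-shot recognition rests entirely on the test video's embedding landing nearer the correct unseen label's word vector than the others'.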


Keywords: Zero-shot action recognition · Zero-shot learning · Semantic embedding · Semi-supervised learning · Transfer learning · Action recognition
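Of the two post-processing strategies mentioned in the abstract, hubness correction is the simpler to sketch. In high-dimensional nearest-neighbour matching, a few "hub" prototypes become the nearest neighbour of a disproportionate share of test points. A transductive remedy normalises each prototype's similarities over the whole test set; the z-scoring variant below is one common such correction on toy data, offered as an illustrative assumption rather than the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 200, 5                         # toy: 200 embedded test videos, 5-dim space
names = ["swim", "climb", "dive"]     # hypothetical unseen-class labels
P = rng.normal(size=(len(names), d))  # class word-vector prototypes (toy)
Z = rng.normal(size=(m, d))           # test videos already mapped to this space

# Cosine similarity of every test video to every class prototype.
Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
Pn = P / np.linalg.norm(P, axis=1, keepdims=True)
sim = Zn @ Pn.T                       # shape (m, 3)

# Transductive hubness correction: z-score each prototype's similarity
# column over ALL test samples, so a "hub" prototype that is close to
# everything no longer dominates the argmax.
corrected = (sim - sim.mean(axis=0)) / sim.std(axis=0)

pred_plain = [names[i] for i in sim.argmax(axis=1)]
pred_corr = [names[i] for i in corrected.argmax(axis=1)]
```

The correction is transductive because the column statistics are computed over the entire test set at once, which matches the abstract's note that most of the strategies require access to the testing data during training.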



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Queen Mary, University of London, London, UK
