International Journal of Computer Vision, Volume 126, Issue 2–4, pp 390–409

Transferring Deep Object and Scene Representations for Event Recognition in Still Images

  • Limin Wang
  • Zhe Wang
  • Yu Qiao
  • Luc Van Gool


Abstract

This paper addresses the problem of image-based event recognition by transferring deep representations learned from object and scene datasets. First, we empirically investigate the correlation among the concepts of object, scene, and event, which motivates our representation transfer methods. Based on this empirical study, we propose an iterative selection method to identify a subset of object and scene classes deemed most relevant for representation transfer. We then develop three transfer techniques: (1) initialization-based transfer, (2) knowledge-based transfer, and (3) data-based transfer. These transfer techniques exploit multitask learning frameworks to incorporate extra knowledge from other networks or additional datasets into the fine-tuning procedure of event CNNs. These multitask learning frameworks prove effective at reducing over-fitting and improving the generalization ability of the learned CNNs. We perform experiments on four event recognition benchmarks: the ChaLearn LAP Cultural Event Recognition dataset, the Web Image Dataset for Event Recognition, the UIUC Sports Event dataset, and the Photo Event Collection dataset. The experimental results show that our proposed algorithm successfully transfers object and scene representations to the event datasets and achieves state-of-the-art performance on all considered datasets.
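The multitask fine-tuning idea the abstract describes — a shared backbone whose event-classification loss is regularized by an auxiliary objective from another task — can be sketched with a toy combined loss. All names, shapes, and the weighting scheme below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, label):
    # Negative log-likelihood of the true class.
    return -np.log(softmax(logits)[label])

# Hypothetical sizes: a 4-D shared feature, 3 event classes, 2 auxiliary classes.
rng = np.random.default_rng(0)
feat = rng.normal(size=4)            # shared backbone feature for one image
W_event = rng.normal(size=(3, 4))    # event classification head
W_aux = rng.normal(size=(2, 4))      # auxiliary head (e.g. scene or object task)

lam = 0.5                            # weight of the auxiliary task
loss_event = cross_entropy(W_event @ feat, label=1)
loss_aux = cross_entropy(W_aux @ feat, label=0)
total_loss = loss_event + lam * loss_aux
```

Because gradients of `total_loss` flow through the shared feature from both heads, the auxiliary task acts as a regularizer on the backbone during fine-tuning, which is the mechanism the paper credits with reducing over-fitting on small event datasets.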


Keywords: Event recognition · Deep learning · Transfer learning · Multitask learning



Acknowledgements

This work is partially supported by the ERC Advanced Grant VarCity, the Toyota Research Project TRACE-Zurich, the National Key Research and Development Program of China (2016YFC1400704), the National Natural Science Foundation of China (U1613211, 61633021), and the External Cooperation Program of BIC Chinese Academy of Sciences (172644KYSB20150019, 172644KYSB20160033).



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. Computer Vision Laboratory, ETH Zurich, Zurich, Switzerland
  2. Department of Computer Science, University of California, Irvine, USA
  3. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
