Large Scale Holistic Video Understanding

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12350)


Video recognition has advanced in recent years thanks to benchmarks with rich annotations. However, research is still mainly limited to human action or sports recognition, a highly specific video understanding task, which leaves a significant gap towards describing the overall content of a video. We fill this gap by presenting a large-scale “Holistic Video Understanding Dataset” (HVU). HVU is organized hierarchically in a semantic taxonomy that treats multi-label and multi-task video understanding as a comprehensive problem encompassing the recognition of multiple semantic aspects in a dynamic scene. HVU contains approximately 572k videos in total, with 9 million annotations across the training, validation and test sets, spanning 3142 labels. HVU covers semantic aspects defined on categories of scenes, objects, actions, events, attributes and concepts, which naturally capture real-world scenarios.
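To make the multi-label, multi-task framing concrete, the following minimal sketch encodes the annotations of one video as a multi-hot vector per semantic category. The six category names come from the abstract; the toy vocabulary, the label names and the `encode` helper are hypothetical and are not taken from the HVU release.

```python
from typing import Dict, List

# The six semantic categories named in the abstract.
CATEGORIES = ["scene", "object", "action", "event", "attribute", "concept"]

# Hypothetical toy vocabulary; the real HVU taxonomy spans 3142 labels.
VOCAB: Dict[str, List[str]] = {
    "scene": ["beach", "kitchen"],
    "object": ["ball", "knife"],
    "action": ["running", "cutting"],
    "event": ["tournament"],
    "attribute": ["wet"],
    "concept": ["competition"],
}


def encode(annotations: Dict[str, List[str]]) -> Dict[str, List[int]]:
    """Return one multi-hot target vector per category for a single video."""
    targets: Dict[str, List[int]] = {}
    for cat in CATEGORIES:
        present = set(annotations.get(cat, []))
        unknown = present - set(VOCAB[cat])
        if unknown:
            raise ValueError(f"unknown {cat} labels: {sorted(unknown)}")
        # 1 where the label is annotated for this video, 0 elsewhere.
        targets[cat] = [1 if label in present else 0 for label in VOCAB[cat]]
    return targets


# A video may carry several labels at once, and may lack labels in a category.
video = {
    "scene": ["beach"],
    "object": ["ball"],
    "action": ["running"],
    "concept": ["competition"],
}
video_targets = encode(video)
```

Each category acts as its own task, and a category with no annotation for a given video simply yields an all-zero vector.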

We demonstrate the generalisation capability of HVU on three challenging tasks: 1) video classification, 2) video captioning and 3) video clustering. In particular, for video classification we introduce a new spatio-temporal deep neural network architecture called “Holistic Appearance and Temporal Network” (HATNet), which fuses 2D and 3D architectures into one by combining intermediate representations of appearance and temporal cues. HATNet addresses the multi-label and multi-task learning problem and is trained end-to-end. Our experiments validate the idea that holistic representation learning is complementary and can play a key role in enabling many real-world applications.
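As an illustration of what a multi-label, multi-task objective can look like, the sketch below sums a per-category binary cross-entropy over all tasks. This is a common choice for such problems and is an assumption here: the abstract does not state which loss HATNet uses, and the function names, task weights and toy values are hypothetical.

```python
import math
from typing import Dict, List, Optional


def bce(logits: List[float], targets: List[int]) -> float:
    """Mean binary cross-entropy over the labels of one task.

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def multitask_loss(
    logits: Dict[str, List[float]],
    targets: Dict[str, List[int]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of per-task losses; tasks without a weight default to 1.0."""
    weights = weights or {}
    return sum(weights.get(t, 1.0) * bce(logits[t], targets[t]) for t in targets)


# Toy example with two tasks and two labels each (values are illustrative).
example_logits = {"scene": [2.0, -1.0], "action": [0.0, 0.0]}
example_targets = {"scene": [1, 0], "action": [1, 0]}
loss = multitask_loss(example_logits, example_targets)
```

Because every label gets its own sigmoid-style term, several labels can be active for the same video, and per-task weights let one category be emphasised without changing the others.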



This work was supported by a DBOF PhD scholarship, the GC4 Flemish AI project, and the ERC Starting Grant ARCA (677650). We would also like to thank Sensifai for giving us access to the Video Tagging API for dataset preparation.




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. KU Leuven, Leuven, Belgium
  2. University of Bonn, Bonn, Germany
  3. KIT, Karlsruhe, Germany
  4. ETH Zürich, Zürich, Switzerland
  5. Sensifai, Brussels, Belgium
