Large Scale Holistic Video Understanding

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12350)

Abstract

Video recognition has advanced in recent years thanks to benchmarks with rich annotations. However, research is still mainly limited to human action or sports recognition, focusing on highly specific video understanding tasks and thus leaving a significant gap towards describing the overall content of a video. We fill this gap by presenting the large-scale “Holistic Video Understanding Dataset” (HVU). HVU is organized hierarchically in a semantic taxonomy that treats multi-label and multi-task video understanding as a comprehensive problem, encompassing the recognition of multiple semantic aspects in a dynamic scene. HVU contains approximately 572k videos with 9 million annotations across the training, validation and test sets, spanning 3142 labels. HVU covers semantic aspects defined over categories of scenes, objects, actions, events, attributes and concepts, which naturally capture real-world scenarios.
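
To make the annotation structure concrete, the following is a minimal Python sketch, not the official HVU release format, of how a multi-label, multi-task annotation for a single clip could be represented: tags grouped by the six semantic categories and converted to per-category multi-hot target vectors. The tag names, vocabularies and helper function are hypothetical.

# Hypothetical per-category label vocabularies (the real HVU taxonomy has 3142 labels).
import torch

CATEGORIES = ["scene", "object", "action", "event", "attribute", "concept"]

vocab = {
    "scene": ["beach", "forest", "kitchen"],
    "object": ["ball", "person", "dog"],
    "action": ["volleyball", "surfing"],
    "event": ["tournament"],
    "attribute": ["sunny"],
    "concept": ["team sport"],
}

# Hypothetical annotation for one video clip: several tags per category.
annotation = {
    "scene": ["beach"],
    "object": ["ball", "person"],
    "action": ["volleyball"],
    "event": ["tournament"],
    "attribute": ["sunny"],
    "concept": ["team sport"],
}

def to_multihot(tags, vocabulary):
    # Turn per-category tag lists into per-category multi-hot target vectors.
    targets = {}
    for cat in CATEGORIES:
        vec = torch.zeros(len(vocabulary[cat]))
        for tag in tags.get(cat, []):
            vec[vocabulary[cat].index(tag)] = 1.0
        targets[cat] = vec
    return targets

targets = to_multihot(annotation, vocab)  # e.g. targets["object"] == tensor([1., 1., 0.])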

We demonstrate the generalisation capability of HVU on three challenging tasks: 1) video classification, 2) video captioning and 3) video clustering. In particular, for video classification we introduce a new spatio-temporal deep neural network architecture called “Holistic Appearance and Temporal Network” (HATNet), which fuses 2D and 3D architectures into one network by combining intermediate representations of appearance and temporal cues. HATNet targets the multi-label and multi-task learning problem and is trained in an end-to-end manner. Through our experiments, we validate the idea that holistic representation learning is complementary and can play a key role in enabling many real-world applications. The project page is available at https://holistic-video-understanding.github.io/.
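
As an illustration of the fusion idea described above, here is a minimal PyTorch-style sketch of a HATNet-like model: a 2D branch for frame-level appearance, a 3D branch for temporal cues, an intermediate fusion layer, and one binary cross-entropy head per HVU category for multi-label, multi-task training. This is not the authors' implementation; the layer sizes, fusion point and per-category head sizes are illustrative assumptions.

import torch
import torch.nn as nn

class HATNetSketch(nn.Module):
    def __init__(self, head_sizes=None):
        super().__init__()
        # Illustrative per-category head sizes, not the paper's exact label counts.
        self.head_sizes = head_sizes or {"scene": 100, "object": 1000, "action": 700,
                                         "event": 60, "attribute": 100, "concept": 300}
        # Appearance branch: 2D convolutions applied frame by frame.
        self.branch2d = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        # Temporal branch: 3D convolutions over the whole clip.
        self.branch3d = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(), nn.AdaptiveAvgPool3d(1))
        # Fuse the two intermediate representations, then one classifier head per task.
        self.fuse = nn.Linear(64 + 64, 256)
        self.heads = nn.ModuleDict(
            {cat: nn.Linear(256, n) for cat, n in self.head_sizes.items()})

    def forward(self, clip):                       # clip: (B, 3, T, H, W)
        b, c, t, h, w = clip.shape
        frames = clip.transpose(1, 2).reshape(b * t, c, h, w)
        appearance = self.branch2d(frames).flatten(1).reshape(b, t, -1).mean(dim=1)
        temporal = self.branch3d(clip).flatten(1)
        fused = torch.relu(self.fuse(torch.cat([appearance, temporal], dim=1)))
        return {cat: head(fused) for cat, head in self.heads.items()}  # raw logits

def multi_task_loss(logits, targets):
    # Multi-label, multi-task objective: one binary cross-entropy term per category.
    bce = nn.BCEWithLogitsLoss()
    return sum(bce(logits[cat], targets[cat]) for cat in logits)

# Example forward pass on a random 16-frame clip:
# model = HATNetSketch(); out = model(torch.randn(2, 3, 16, 112, 112))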

A. Diba, M. Fayyaz and V. Sharma contributed equally to this work and are listed in alphabetical order.

Acknowledgements

This work was supported by a DBOF PhD scholarship, the GC4 Flemish AI project, and the ERC Starting Grant ARCA (677650). We would also like to thank Sensifai for giving us access to their Video Tagging API for dataset preparation.

Author information

Corresponding author

Correspondence to Ali Diba.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 12227 KB)

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Diba, A. et al. (2020). Large Scale Holistic Video Understanding. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol. 12350. Springer, Cham. https://doi.org/10.1007/978-3-030-58558-7_35

  • DOI: https://doi.org/10.1007/978-3-030-58558-7_35

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58557-0

  • Online ISBN: 978-3-030-58558-7

  • eBook Packages: Computer Science (R0)
