
Deep video representation learning: a survey

Published in Multimedia Tools and Applications

Abstract

This paper provides a review of representation learning for videos. Building effective features for videos is a fundamental problem in computer vision tasks involving video analysis and understanding. We classify recent spatio-temporal feature learning methods for sequential visual data and compare their pros and cons for general video analysis. Existing features can be broadly categorized into spatial and temporal features, and we discuss their effectiveness under variations in illumination, occlusion, viewpoint, and background. Finally, we discuss the remaining challenges in existing deep video representation learning studies.
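To make the spatial-versus-temporal distinction concrete, the following is a minimal NumPy sketch (not from the paper): it computes a simple spatial feature (a per-frame intensity histogram, capturing appearance) and a simple temporal feature (mean absolute frame difference, a crude motion cue standing in for optical flow) on a synthetic video. The array shapes and function names are illustrative assumptions only.

```python
import numpy as np

# Synthetic grayscale video: (T, H, W) = 8 frames of 32x32 pixels.
rng = np.random.default_rng(0)
video = rng.random((8, 32, 32))

def spatial_features(frames, bins=16):
    """Spatial feature: one intensity histogram per frame (appearance)."""
    return np.stack([np.histogram(f, bins=bins, range=(0.0, 1.0))[0]
                     for f in frames])

def temporal_features(frames):
    """Temporal feature: mean absolute difference between consecutive
    frames -- a crude motion descriptor."""
    return np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))

s = spatial_features(video)   # shape (8, 16): one histogram per frame
t = temporal_features(video)  # shape (7,): one motion score per frame pair
print(s.shape, t.shape)
```

Deep methods surveyed in the paper learn far richer versions of both kinds of feature (e.g. CNN appearance features and learned optical flow), but the same two-stream decomposition underlies many of them.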


Data Availability

All data supporting the findings of this study are available within the paper.


Author information


Corresponding author

Correspondence to Xin Li.

Ethics declarations

Competing interests

We do not have any conflict of interest related to the manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ravanbakhsh, E., Liang, Y., Ramanujam, J. et al. Deep video representation learning: a survey. Multimed Tools Appl (2023). https://doi.org/10.1007/s11042-023-17815-3

