Abstract
This paper provides a review of representation learning for videos. We classify recent spatio-temporal feature learning methods for sequential visual data and compare their strengths and weaknesses for general video analysis. Building effective features for videos is a fundamental problem in computer vision tasks involving video analysis and understanding. Existing features can be broadly categorized into spatial and temporal features, and we discuss their effectiveness under variations in illumination, occlusion, viewpoint, and background. Finally, we discuss the remaining challenges in existing deep video representation learning studies.
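To make the spatial/temporal distinction concrete, the following is a minimal illustrative sketch (not taken from any surveyed method): a toy "video" is reduced to a spatial descriptor (per-frame appearance statistics) and a temporal descriptor (motion between consecutive frames via frame differencing). The function name and feature choices here are hypothetical simplifications of what learned spatial and temporal features capture.

```python
import numpy as np

def spatio_temporal_features(video):
    """Split a video into toy spatial and temporal descriptors.

    video: array of shape (T, H, W) holding T grayscale frames.
    Returns (spatial, temporal):
      spatial  -- per-frame mean intensity, shape (T,): an appearance cue
      temporal -- mean absolute frame difference, shape (T-1,): a motion cue
    """
    video = np.asarray(video, dtype=np.float64)
    # Spatial feature: summarize each frame's appearance independently.
    spatial = video.mean(axis=(1, 2))
    # Temporal feature: summarize change between consecutive frames.
    temporal = np.abs(np.diff(video, axis=0)).mean(axis=(1, 2))
    return spatial, temporal

# Example: a 4-frame, 2x2 clip whose brightness increases by 1 each frame.
clip = np.stack([np.full((2, 2), float(t)) for t in range(4)])
spatial, temporal = spatio_temporal_features(clip)
print(spatial)   # [0. 1. 2. 3.]
print(temporal)  # [1. 1. 1.]
```

Learned representations replace these hand-picked statistics with features optimized end-to-end (e.g., 2D CNN activations for the spatial stream and optical-flow or 3D-convolution responses for the temporal stream), but the division of labor is the same.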
Data Availability
All data supporting the findings of this study are available within the paper.
Ethics declarations
Competing interests
The authors do not have any conflict of interest related to the manuscript.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Ravanbakhsh, E., Liang, Y., Ramanujam, J. et al. Deep video representation learning: a survey. Multimed Tools Appl (2023). https://doi.org/10.1007/s11042-023-17815-3