Abstract
Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still exhibit two problems. First, they apply a complicated and redundant two-extractor pipeline to draw cues from more reference frames, which increases the model's parameters and complexity. Second, most of these methods neglect the spatial relationships within each frame and do not fully model the temporal relationships among different frames, i.e., they lack adequate modeling of spatial-temporal relationships. In this paper, to address these two problems, we propose a unified transformer-based framework for VOS: a compact, single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the commonly used two-extractor pipeline while keeping the model's effectiveness and flexibility, we design a single dynamic feature extractor with an ingenious dynamic input adapter that encodes the two significant inputs, i.e., the reference sets (historical frames with predicted masks) and the query frame (current frame), respectively. Moreover, the relationships among different frames and within every frame are crucial for this task, so we introduce a vision transformer to exploit and model both the temporal and spatial relationships simultaneously. Through the cascaded design of the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation head, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both the DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model's efficiency: on the DAVIS17 validation set, we achieve ~50% faster inference with only a slight 0.2% drop in segmentation quality (J&F). Code is available at https://github.com/sallymmx/TransVOS.git.
Acknowledgements
This work was supported by NSFC 62088101 Autonomous Intelligent Unmanned Systems.
Cite this article
Mei, J., Wang, M., Yang, Y. et al. Learning spatiotemporal relationships with a unified framework for video object segmentation. Appl Intell (2024). https://doi.org/10.1007/s10489-024-05486-y