
Exploring the Semi-Supervised Video Object Segmentation Problem from a Cyclic Perspective

  • Published in: International Journal of Computer Vision

Abstract

Modern video object segmentation (VOS) algorithms achieve remarkably high performance in a sequential processing order, yet most prevailing pipelines still exhibit obvious shortcomings such as accumulated error, unknown robustness, and a lack of proper interpretation tools. In this paper, we place the semi-supervised video object segmentation problem into a cyclic workflow and find that the defects above can be collectively addressed via the inherent cyclic property of semi-supervised VOS systems. First, a cyclic mechanism incorporated into the standard sequential flow produces more consistent representations for pixel-wise correspondence; relying on the accurate reference mask of the starting frame, we show that the error-propagation problem can be mitigated. Next, a simple gradient correction module, which naturally extends the offline cyclic pipeline to an online manner, highlights the high-frequency, detailed parts of the results to further improve segmentation quality at feasible computational cost; such correction also protects the network from the severe performance degradation caused by interference signals. Finally, we develop the cycle effective receptive field (cycle-ERF), based on the gradient correction process, to provide a new perspective for analyzing object-specific regions of interest. We conduct comprehensive comparisons and detailed analysis on the challenging DAVIS16, DAVIS17, and YouTube-VOS benchmarks, demonstrating that the cyclic mechanism helps enhance segmentation quality, improves the robustness of VOS systems, and further provides qualitative comparison and interpretation of how different VOS algorithms work. The code of this project can be found at https://github.com/lyxok1/STM-Training.
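The cyclic idea described above (propagate the reference mask forward through the sequence, then propagate the final prediction back to the starting frame and compare it against the trusted reference annotation) can be sketched as follows. This is a minimal illustration only, not the paper's actual model: `propagate` stands in for a learned mask-propagation network and is implemented here as a trivial identity stub so the sketch is runnable; all names are illustrative.

```python
import numpy as np

def propagate(mask, src_frame, dst_frame):
    # Placeholder for a learned mask-propagation step (e.g. a matching
    # or memory network). The identity copy keeps the sketch runnable.
    return mask.copy()

def cycle_consistency_loss(frames, ref_mask):
    """Run a forward pass t = 0..T-1, then a backward (cyclic) pass
    to t = 0; the pixel-wise disagreement with the accurate reference
    mask of the starting frame serves as the cyclic loss."""
    mask = ref_mask
    for t in range(1, len(frames)):               # forward propagation
        mask = propagate(mask, frames[t - 1], frames[t])
    for t in range(len(frames) - 2, -1, -1):      # backward, closing the cycle
        mask = propagate(mask, frames[t + 1], frames[t])
    # disagreement with the ground-truth reference annotation
    return float(np.mean(np.abs(mask - ref_mask)))

frames = [np.zeros((4, 4)) for _ in range(3)]     # toy 3-frame clip
ref_mask = np.zeros((4, 4))
ref_mask[1:3, 1:3] = 1.0                          # toy reference mask
loss = cycle_consistency_loss(frames, ref_mask)   # 0.0 for the identity stub
```

Because the reference mask is ground truth, this loss penalizes any drift the propagation module accumulates over the cycle, which is the intuition behind mitigating error propagation.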


Notes

  1. This manuscript is an extended version of our conference paper published at the Conference on Neural Information Processing Systems (NeurIPS) 2020, "Delving into the cyclic mechanism in semi-supervised video object segmentation" (Li et al. 2020). We cite that paper in the manuscript and have extended it substantially, including but not limited to the following aspects: (1) a smoothness regularization term and insights from the frequency domain are added to the method section; (2) an in-depth analysis of the robustness of VOS models and the effect of our correction methods is included in the method section; (3) more comprehensive experiments are added to demonstrate the generality of our study, including comparisons under different baseline models and backbones, results with COCO pretraining, and more qualitative results on the effect of the core components.

  2. https://github.com/Jia-Research-Lab/AGSS-VOS.

References

  • Bansal, A., Ma, S., Ramanan, D., & Sheikh, Y. (2018). Recycle-GAN: Unsupervised video retargeting. In: European conference on computer vision (ECCV)

  • Caelles S., Maninis, K.K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., & Van Gool, L. (2017). One-shot video object segmentation. In: Computer Vision and Pattern Recognition (CVPR)

  • Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., & Murphy, K. (2018). Tracking emerges by colorizing videos. In: European conference on computer vision (ECCV)

  • Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., Hu, X., & Li, J. (2018). Boosting adversarial attacks with momentum. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 9185–9193

  • Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 111(1), 98–136.


  • Goodfellow, I.J., Shlens, J., & Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2018). Mask R-CNN. In: International conference on computer vision (ICCV)

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  • Jabri, A., Owens, A., & Efros, A.A. (2020). Space-time correspondence as a contrastive random walk. Advances in Neural Information Processing Systems

  • Johnander, J., Danelljan, M., Brissman, E., Khan, F.S., & Felsberg, M. (2019). A generative appearance model for end-to-end video object segmentation. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 8953–8962

  • Khoreva, A., Benenson, R., Ilg, E., Brox, T., & Schiele, B. (2019). Lucid data dreaming for video object segmentation. International Journal of Computer Vision (IJCV), 127, 1175–1197.


  • Kingma, D.P., & Ba, J. (2014). Adam: A method for stochastic optimization. In: international conference on learning representation (ICLR)

  • Li, Y., Shen, Z., & Shan, Y. (2020). Fast video object segmentation using the global context module. In: European conference on computer vision (ECCV), pp. 735–750

  • Li, Y., Xu, N., Peng, J., See, J., & Lin, W. (2020). Delving into the cyclic mechanism in semi-supervised video object segmentation. In: Neural Information Processing Systems (NeurIPS)

  • Lin, H., Qi, X., & Jia, J. (2019). Agss-vos: Attention guided single-shot video object segmentation. In: The IEEE international conference on computer vision (ICCV)

  • Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014). Microsoft coco: Common objects in context. In: European conference on computer vision (ECCV). Zürich

  • Luiten, J., Voigtlaender, P., & Leibe, B. (2018). Premvos: Proposal-generation, refinement and merging for video object segmentation. In: Asian conference on computer vision (ACCV)

  • Meister, S., Hur, J., & Roth, S. (2018). UnFlow: Unsupervised learning of optical flow with a bidirectional census loss. In: AAAI. New Orleans, Louisiana

  • Oh, S.W., Lee, J.Y., Xu, N., & Kim, S.J. (2019). Video object segmentation using space-time memory networks. In: The IEEE international conference on computer vision (ICCV)

  • Peng, J., Wang, C., Wan, F., Wu, Y., Wang, Y., Tai, Y., Wang, C., Li, J., Huang, F., & Fu, Y. (2020). Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In: The European conference on computer vision (ECCV)

  • Pretraining code of space-time-memory network on coco for video object segmentation. https://github.com/haochenheheda/Training-Code-of-STM

  • Perazzi, F., Khoreva, A., Benenson, R., Schiele, B., & Sorkine-Hornung, A. (2017). Learning video object segmentation from static images. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  • Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., & Sorkine-Hornung, A. (2016). A benchmark dataset and evaluation methodology for video object segmentation. In: computer vision and pattern recognition (CVPR)

  • Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., & Van Gool, L. (2017). The 2017 davis challenge on video object segmentation. arXiv:1704.00675

  • Robinson, A., Lawin, F.J., Danelljan, M., Khan, F.S., & Felsberg, M. (2020). Learning fast and robust target models for video object segmentation. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR)

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252.


  • Seong, H., Hyun, J., & Kim, E. (2020). Kernelized memory network for video object segmentation. In: European conference on computer vision (ECCV), pp. 629–645

  • Shi, J., Yan, Q., Xu, L., & Jia, J. (2016). Hierarchical image saliency detection on extended cssd. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(4), 717–729.


  • Zhou, T., Krähenbühl, P., Aubry, M., Huang, Q., & Efros, A.A. (2016). Learning dense correspondence via 3D-guided cycle consistency. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  • Ventura, C., Bellver, M., Girbau, A., Salvador, A., Marques, F., & Giro-i Nieto, X. (2019). Rvos: End-to-end recurrent network for video object segmentation. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  • Voigtlaender, P., Chai, Y., Schroff, F., Adam, H., Leibe, B., & Chen, L.C. (2019). Feelvos: Fast end-to-end embedding learning for video object segmentation. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  • Voigtlaender, P., & Leibe, B. (2017). Online adaptation of convolutional neural networks for video object segmentation. In: British machine vision conference (BMVC)

  • Wang, X., Jabri, A., & Efros, A.A. (2019). Learning correspondence from the cycle-consistency of time. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  • Oh, S.W., Lee, J.Y., Sunkavalli, K., & Kim, S.J. (2018). Fast video object segmentation by reference-guided mask propagation. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  • Xu, N., Yang, L., Fan, Y., Yang, J., Yue, D., Liang, Y., Price, B., Cohen, S., & Huang, T. (2018). Youtube-vos: Sequence-to-sequence video object segmentation. In: the European conference on computer vision (ECCV)

  • Yang, Z., Wei, Y., & Yang, Y. (2021). Associating objects with transformers for video object segmentation. Advances in Neural Information Processing Systems, 34, 2491–2502.

  • Zeng, X., Liao, R., Gu, L., Xiong, Y., Fidler, S., & Urtasun, R. (2019). Dmm-net: Differentiable mask-matching network for video object segmentation. In: The IEEE international conference on computer vision (ICCV)

  • Zhang, Y., Wu, Z., Peng, H., & Lin, S. (2020). A transductive approach for video object segmentation. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  • Zhu, J.Y., Park, T., Isola, P., & Efros, A.A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In: IEEE international conference on computer vision (ICCV)


Acknowledgements

The paper is supported in part by the following grants: National Key Research and Development Program of China Grant (No.2018AAA0100400), National Natural Science Foundation of China (No. U21B2013, 61971277).

Author information

Correspondence to Weiyao Lin.

Additional information

Communicated by Karteek Alahari.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, Y., Xu, N., Yang, W. et al. Exploring the Semi-Supervised Video Object Segmentation Problem from a Cyclic Perspective. Int J Comput Vis 130, 2408–2424 (2022). https://doi.org/10.1007/s11263-022-01655-z
