
Exploring the Semi-Supervised Video Object Segmentation Problem from a Cyclic Perspective

  • Published in: International Journal of Computer Vision

Abstract

Modern video object segmentation (VOS) algorithms achieve remarkably high performance in a sequential processing order, yet most prevailing pipelines still exhibit obvious shortcomings such as accumulated error, unknown robustness, and a lack of proper interpretation tools. In this paper, we place the semi-supervised video object segmentation problem into a cyclic workflow and find that the defects above can be collectively addressed via the inherent cyclic property of semi-supervised VOS systems. First, a cyclic mechanism incorporated into the standard sequential flow produces more consistent representations for pixel-wise correspondence; relying on the accurate reference mask of the starting frame, we show that the error-propagation problem can be mitigated. Next, a simple gradient correction module, which naturally extends the offline cyclic pipeline to an online manner, highlights the high-frequency, detailed parts of the results to further improve segmentation quality at feasible computational cost; such correction also protects the network from the severe performance degradation caused by interference signals. Finally, we develop the cycle effective receptive field (cycle-ERF), based on the gradient correction process, to provide a new perspective for analyzing object-specific regions of interest. We conduct comprehensive comparisons and detailed analysis on the challenging DAVIS16, DAVIS17, and YouTube-VOS benchmarks, demonstrating that the cyclic mechanism helps enhance segmentation quality, improves the robustness of VOS systems, and further provides qualitative comparison and interpretation of how different VOS algorithms work. The code of this project can be found at https://github.com/lyxok1/STM-Training.
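The cyclic idea described above (propagate the reference mask forward through the sequence, then propagate the final prediction back to the starting frame and compare it against the trusted reference annotation) can be sketched as follows. This is a minimal illustration only, not the paper's actual model: `propagate` stands in for a learned mask-propagation network and is implemented here as a trivial identity stub so the sketch is runnable; all names are illustrative.

```python
import numpy as np

def propagate(mask, src_frame, dst_frame):
    # Placeholder for a learned mask-propagation step (e.g. a matching
    # or memory network). The identity copy keeps the sketch runnable.
    return mask.copy()

def cycle_consistency_loss(frames, ref_mask):
    """Run a forward pass t = 0..T-1, then a backward (cyclic) pass
    to t = 0; the pixel-wise disagreement with the accurate reference
    mask of the starting frame serves as the cyclic loss."""
    mask = ref_mask
    for t in range(1, len(frames)):               # forward propagation
        mask = propagate(mask, frames[t - 1], frames[t])
    for t in range(len(frames) - 2, -1, -1):      # backward, closing the cycle
        mask = propagate(mask, frames[t + 1], frames[t])
    # disagreement with the ground-truth reference annotation
    return float(np.mean(np.abs(mask - ref_mask)))

frames = [np.zeros((4, 4)) for _ in range(3)]     # toy 3-frame clip
ref_mask = np.zeros((4, 4))
ref_mask[1:3, 1:3] = 1.0                          # toy reference mask
loss = cycle_consistency_loss(frames, ref_mask)   # 0.0 for the identity stub
```

Because the reference mask is ground truth, this loss penalizes any drift the propagation module accumulates over the cycle, which is the intuition behind mitigating error propagation.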


Notes

  1. This manuscript is an extended version of our conference paper published at the Conference on Neural Information Processing Systems (NeurIPS) 2020, "Delving into the cyclic mechanism in semi-supervised video object segmentation" (Li et al. 2020). We cite that paper in the manuscript and have extended it substantially, including but not limited to the following aspects: (1) a smoothness regularization term and insights from the frequency domain are added to the method section; (2) an in-depth analysis of the robustness of VOS models and the effect of our correction methods is included in the method section; (3) more comprehensive experiments are added to demonstrate the generality of our study, including comparisons under different baseline models and backbones, results with COCO pretraining, and more qualitative results on the effect of the core components.

  2. https://github.com/Jia-Research-Lab/AGSS-VOS.

References

  • Bansal, A., Ma, S., Ramanan, D., & Sheikh, Y. (2018). Recycle-GAN: Unsupervised video retargeting. In: European conference on computer vision (ECCV)

  • Caelles S., Maninis, K.K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., & Van Gool, L. (2017). One-shot video object segmentation. In: Computer Vision and Pattern Recognition (CVPR)

  • Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., & Murphy, K. (2018). Tracking emerges by colorizing videos. In: European conference on computer vision (ECCV)

  • Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., Hu, X., & Li, J. (2018). Boosting adversarial attacks with momentum. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 9185–9193

  • Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 111(1), 98–136.


  • Goodfellow, I.J., Shlens, J., & Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2018). Mask R-CNN. In: International conference on computer vision (ICCV)

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  • Jabri, A., Owens, A., & Efros, A.A. (2020). Space-time correspondence as a contrastive random walk. Advances in Neural Information Processing Systems

  • Johnander, J., Danelljan, M., Brissman, E., Khan, F.S., & Felsberg, M. (2019). A generative appearance model for end-to-end video object segmentation. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 8953–8962

  • Khoreva, A., Benenson, R., Ilg, E., Brox, T., & Schiele, B. (2019). Lucid data dreaming for video object segmentation. International Journal of Computer Vision (IJCV), 127, 1175–1197.


  • Kingma, D.P., & Ba, J. (2014). Adam: A method for stochastic optimization. In: international conference on learning representation (ICLR)

  • Li, Y., Shen, Z., & Shan, Y. (2020). Fast video object segmentation using the global context module. In: European conference on computer vision (ECCV), pp. 735–750

  • Li, Y., Xu, N., Peng, J., See, J., & Lin, W. (2020). Delving into the cyclic mechanism in semi-supervised video object segmentation. In: Neural Information Processing Systems (NeurIPS)

  • Lin, H., Qi, X., & Jia, J. (2019). Agss-vos: Attention guided single-shot video object segmentation. In: The IEEE international conference on computer vision (ICCV)

  • Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014). Microsoft coco: Common objects in context. In: European conference on computer vision (ECCV). Zürich

  • Luiten, J., Voigtlaender, P., & Leibe, B. (2018). Premvos: Proposal-generation, refinement and merging for video object segmentation. In: Asian conference on computer vision (ACCV)

  • Meister, S., Hur, J., & Roth, S. (2018). UnFlow: Unsupervised learning of optical flow with a bidirectional census loss. In: AAAI. New Orleans, Louisiana

  • Oh, S.W., Lee, J.Y., Xu, N., & Kim, S.J. (2019). Video object segmentation using space-time memory networks. In: The IEEE international conference on computer vision (ICCV)

  • Peng, J., Wang, C., Wan, F., Wu, Y., Wang, Y., Tai, Y., Wang, C., Li, J., Huang, F., & Fu, Y. (2020). Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In: The European conference on computer vision (ECCV)

  • Pretraining code of space-time-memory network on coco for video object segmentation. https://github.com/haochenheheda/Training-Code-of-STM

  • Perazzi, F., Khoreva, A., Benenson, R., Schiele, B., & Sorkine-Hornung, A. (2017). Learning video object segmentation from static images. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  • Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., & Sorkine-Hornung, A. (2016). A benchmark dataset and evaluation methodology for video object segmentation. In: computer vision and pattern recognition (CVPR)

  • Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., & Van Gool, L. (2017). The 2017 davis challenge on video object segmentation. arXiv:1704.00675

  • Robinson, A., Lawin, F.J., Danelljan, M., Khan, F.S., & Felsberg, M. (2020). Learning fast and robust target models for video object segmentation. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR)

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252.


  • Seong, H., Hyun, J., & Kim, E. (2020). Kernelized memory network for video object segmentation. In: European conference on computer vision (ECCV), pp. 629–645

  • Shi, J., Yan, Q., Xu, L., & Jia, J. (2016). Hierarchical image saliency detection on extended cssd. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(4), 717–729.


  • Zhou, T., Krähenbühl, P., Aubry, M., Huang, Q., & Efros, A.A. (2016). Learning dense correspondence via 3D-guided cycle consistency. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  • Ventura, C., Bellver, M., Girbau, A., Salvador, A., Marques, F., & Giro-i Nieto, X. (2019). Rvos: End-to-end recurrent network for video object segmentation. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  • Voigtlaender, P., Chai, Y., Schroff, F., Adam, H., Leibe, B., & Chen, L.C. (2019). Feelvos: Fast end-to-end embedding learning for video object segmentation. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  • Voigtlaender, P., & Leibe, B. (2017). Online adaptation of convolutional neural networks for video object segmentation. In: British machine vision conference (BMVC)

  • Wang, X., Jabri, A., & Efros, A.A. (2019). Learning correspondence from the cycle-consistency of time. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  • Oh, S.W., Lee, J.Y., Sunkavalli, K., & Kim, S.J. (2018). Fast video object segmentation by reference-guided mask propagation. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  • Xu, N., Yang, L., Fan, Y., Yang, J., Yue, D., Liang, Y., Price, B., Cohen, S., & Huang, T. (2018). Youtube-vos: Sequence-to-sequence video object segmentation. In: the European conference on computer vision (ECCV)

  • Yang, Z., Wei, Y., & Yang, Y. (2021). Associating objects with transformers for video object segmentation. Advances in Neural Information Processing Systems, 34, 2491–2502.

  • Zeng, X., Liao, R., Gu, L., Xiong, Y., Fidler, S., & Urtasun, R. (2019). Dmm-net: Differentiable mask-matching network for video object segmentation. In: The IEEE international conference on computer vision (ICCV)

  • Zhang, Y., Wu, Z., Peng, H., & Lin, S. (2020). A transductive approach for video object segmentation. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  • Zhu, J.Y., Park, T., Isola, P., & Efros, A.A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In: IEEE international conference on computer vision (ICCV)


Acknowledgements

The paper is supported in part by the following grants: National Key Research and Development Program of China Grant (No.2018AAA0100400), National Natural Science Foundation of China (No. U21B2013, 61971277).

Author information

Correspondence to Weiyao Lin.

Additional information

Communicated by Karteek Alahari.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, Y., Xu, N., Yang, W. et al. Exploring the Semi-Supervised Video Object Segmentation Problem from a Cyclic Perspective. Int J Comput Vis 130, 2408–2424 (2022). https://doi.org/10.1007/s11263-022-01655-z
