Video Object Segmentation

Part of the book series: Synthesis Lectures on Computer Vision (SLCV)

Abstract

In this chapter, we elaborate on the task of video object segmentation (VOS), which aims to partition the pixels of a video into disjoint subsets, each of which usually represents either a target object or the background. The VOS task takes on different problem settings depending on its input and output requirements.

  159. Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.

  160. Alessandro Prest, Christian Leistner, Javier Civera, Cordelia Schmid, and Vittorio Ferrari. Learning object class detectors from weakly annotated video. In CVPR, 2012.

    Google Scholar 

  161. Siyuan Qiao, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Vip-deeplab: Learning visual perception with depth-aware video panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3997–4008, 2021.

    Google Scholar 

  162. Zheyun Qin, Xiankai Lu, Xiushan Nie, Xiantong Zhen, and Yilong Yin. Learning hierarchical embedding for video instance segmentation. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1884–1892, 2021.

    Google Scholar 

  163. Rong Quan, Junwei Han, Dingwen Zhang, and Feiping Nie. Object co-segmentation via graph optimized-flexible manifold ranking. In CVPR, 2016.

    Google Scholar 

  164. Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of speech recognition. Prentice-Hall, Inc., 1993.

    Google Scholar 

  165. Kate Rakelly*, Evan Shelhamer*, Trevor Darrell, Alexei A. Efros, and Sergey Levine. Meta-learning to guide segmentation. In ICLR, 2019.

    Google Scholar 

  166. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.

    Google Scholar 

  167. Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, volume 37, pages 1530–1538. PMLR, 2015.

    Google Scholar 

  168. Mrigank Rochan, Linwei Ye, and Yang Wang. Video summarization using fully convolutional sequence networks. In Proceedings of the European conference on computer vision (ECCV), pages 347–363, 2018.

    Google Scholar 

  169. Pau Rodríguez, Jordi Gonzalez, Guillem Cucurull, Josep M Gonfaus, and Xavier Roca. Regularizing cnns with locally constrained decorrelations. In ICLR, 2017.

    Google Scholar 

  170. Michael Rubinstein, Armand Joulin, Johannes Kopf, and Ce Liu. Unsupervised joint object discovery and segmentation in internet images. In CVPR, 2013.

    Google Scholar 

  171. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

    Google Scholar 

  172. Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, 2016.

    Google Scholar 

  173. Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE TNNLS, 20(1):61–80, 2009.

    Google Scholar 

  174. Seonguk Seo, Joon-Young Lee, and Bohyung Han. Urvos: Unified referring video object segmentation network with a large-scale benchmark. In ECCV, 2020.

    Google Scholar 

  175. Yaser Sheikh, Omar Javed, and Takeo Kanade. Background subtraction for freely moving cameras. In 2009 IEEE 12th International Conference on Computer Vision, pages 1219–1225. IEEE, 2009.

    Google Scholar 

  176. Evan Shelhamer, Kate Rakelly, Judy Hoffman, and Trevor Darrell. Clockwork convnets for video semantic segmentation. In ECCV, pages 852–868, 2016.

    Google Scholar 

  177. Jianping Shi, Qiong Yan, Li Xu, and Jiaya Jia. Hierarchical image saliency detection on extended cssd. TPAMI, 38(4):717–729, 2015.

    Google Scholar 

  178. Jae Shin Yoon, Francois Rameau, Junsik Kim, Seokju Lee, Seunghak Shin, and In So Kweon. Pixel-level matching for video object segmentation using convolutional neural networks. In CVPR, 2017.

    Google Scholar 

  179. Mennatullah Siam, Chen Jiang, Steven Lu, Laura Petrich, Mahmoud Gamal, Mohamed Elhoseiny, and Martin Jagersand. Video segmentation using teacher-student adaptation in a human robot interaction (hri) setting. In ICRA, 2019.

    Google Scholar 

  180. Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, pages 746–760, 2012.

    Google Scholar 

  181. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

  182. Hongmei Song, Wenguan Wang, Sanyuan Zhao, Jianbing Shen, and Kin-Man Lam. Pyramid dilated deeper convlstm for video salient object detection. In ECCV, 2018.

    Google Scholar 

  183. Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. NeurIPS, 2015.

    Google Scholar 

  184. Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.

    Google Scholar 

  185. Yifan Sun, Liang Zheng, Weijian Deng, and Shengjin Wang. Svdnet for pedestrian retrieval. In ICCV, 2017.

    Google Scholar 

  186. Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.

    Google Scholar 

  187. Brian Taylor, Vasiliy Karasev, and Stefano Soatto. Causal video object segmentation from persistence of occlusions. In CVPR, 2015.

    Google Scholar 

  188. Zhi Tian, Chunhua Shen, and Hao Chen. Conditional convolutions for instance segmentation. arXiv preprint arXiv:2003.05664, 2020.

  189. Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning motion patterns in videos. In CVPR, 2017.

    Google Scholar 

  190. Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning video object segmentation with visual memory. In ICCV, 2017.

    Google Scholar 

  191. Pavel Tokmakov, Cordelia Schmid, and Karteek Alahari. Learning to segment moving objects. IJCV, 127(3):282–301, 2019.

    Google Scholar 

  192. Antonio Torralba. Contextual priming for object detection. IJCV, 53(2):169–191, 2003.

    Google Scholar 

  193. David Tsai, Matthew Flagg, Atsushi Nakazawa, and James M Rehg. Motion coherent tracking using multi-label mrf optimization. International journal of computer vision, 100(2):190–202, 2012.

    Google Scholar 

  194. Yi-Hsuan Tsai, Ming-Hsuan Yang, and Michael J Black. Video segmentation via object flow. In CVPR, 2016.

    Google Scholar 

  195. Yi-Hsuan Tsai, Guangyu Zhong, and Ming-Hsuan Yang. Semantic co-segmentation in videos. In ECCV, 2016.

    Google Scholar 

  196. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

    Google Scholar 

  197. Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.

    Google Scholar 

  198. Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques, and Xavier Giro-i Nieto. Rvos: End-to-end recurrent network for video object segmentation. In CVPR, pages 5277–5286, 2019.

    Google Scholar 

  199. Sara Vicente, Carsten Rother, and Vladimir Kolmogorov. Object cosegmentation. In CVPR, 2011.

    Google Scholar 

  200. Paul Voigtlaender, Yuning Chai, Florian Schroff, Hartwig Adam, Bastian Leibe, and Liang-Chieh Chen. Feelvos: Fast end-to-end embedding learning for video object segmentation. In CVPR, 2019.

    Google Scholar 

  201. Paul Voigtlaender and Bastian Leibe. Online adaptation of convolutional neural networks for video object segmentation. In BMVC, 2017.

    Google Scholar 

  202. Paul Voigtlaender, Jonathon Luiten, and Bastian Leibe. Boltvos: Box-level tracking for video object segmentation. arXiv preprint arXiv:1904.04552, 2019.

  203. John Y. A. Wang and Edward H. Adelson. Layered representation for motion analysis. In Conference on Computer Vision and Pattern Recognition, CVPR 1993, 15-17 June, 1993, New York, NY, USA, pages 361–366. IEEE, 1993.

    Google Scholar 

  204. Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.

    Google Scholar 

  205. Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip HS Torr. Fast online object tracking and segmentation: A unifying approach. In CVPR, 2019.

    Google Scholar 

  206. Tao Wang, Ning Xu, Kean Chen, and Weiyao Lin. End-to-end video instance segmentation via spatial-temporal graph neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10797–10806, October 2021.

    Google Scholar 

  207. Wenguan Wang, Xiankai Lu, Jianbing Shen, David J Crandall, and Ling Shao. Zero-shot video object segmentation via attentive graph neural networks. In ICCV, 2019.

    Google Scholar 

  208. Wenguan Wang, Jianbing Shen, and Fatih Porikli. Saliency-aware geodesic video object segmentation. In CVPR, 2015.

    Google Scholar 

  209. Wenguan Wang, Jianbing Shen, Ruigang Yang, and Fatih Porikli. Saliency-aware video object segmentation. IEEE transactions on pattern analysis and machine intelligence, 40(1):20–33, 2017.

    Google Scholar 

  210. Wenguan Wang, Hongmei Song, Shuyang Zhao, Jianbing Shen, Sanyuan Zhao, Steven CH Hoi, and Haibin Ling. Learning unsupervised video object segmentation through visual attention. In CVPR, pages 3064–3074, 2019.

    Google Scholar 

  211. Wenguan Wang, Tianfei Zhou, Fatih Porikli, David Crandall, and Luc Van Gool. A survey on deep learning technique for video segmentation. arXiv preprint arXiv:2107.01153, 2021.

  212. Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.

    Google Scholar 

  213. Yongxin Wang, Xinshuo Weng, and Kris Kitani. Joint detection and multi-object tracking with graph neural networks. arXiv preprint arXiv:2006.13164, 2020.

  214. Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. arXiv preprint arXiv:2011.14503, 2020.

  215. Ziqin Wang, Jun Xu, Li Liu, Fan Zhu, and Ling Shao. Ranet: Ranking attention network for fast video object segmentation. In ICCV, 2019.

    Google Scholar 

  216. Mark Weber, Jun Xie, Maxwell Collins, Yukun Zhu, Paul Voigtlaender, Hartwig Adam, Bradley Green, Andreas Geiger, Bastian Leibe, Daniel Cremers, et al. Step: Segmenting and tracking every pixel. arXiv preprint arXiv:2102.11859, 2021.

  217. Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.

  218. Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In ICIP. IEEE, 2017.

    Google Scholar 

  219. Sanghyun Woo, Dahun Kim, Joon-Young Lee, and In So Kweon. Learning to associate every segment for video panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2705–2714, 2021.

    Google Scholar 

  220. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

  221. Seoung Wug Oh, Joon-Young Lee, Kalyan Sunkavalli, and Seon Joo Kim. Fast video object segmentation by reference-guided mask propagation. In CVPR, pages 7376–7385, 2018.

    Google Scholar 

  222. Huaxin Xiao, Jiashi Feng, Guosheng Lin, Yu Liu, and Maojun Zhang. Monet: Deep motion exploitation for video object segmentation. In CVPR, 2018.

    Google Scholar 

  223. Guo-Sen Xie, Li Liu, Xiaobo Jin, Fan Zhu, Zheng Zhang, Jie Qin, Yazhou Yao, and Ling Shao. Attentive region embedding network for zero-shot learning. In CVPR, 2019.

    Google Scholar 

  224. SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pages 802–810, 2015.

    Google Scholar 

  225. Caiming Xiong, Victor Zhong, and Richard Socher. Dynamic coattention networks for question answering. In ICLR, 2017.

    Google Scholar 

  226. Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, and Raquel Urtasun. Upsnet: A unified panoptic segmentation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8818–8826, 2019.

    Google Scholar 

  227. Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas Huang. Deep grabcut for object selection. arXiv preprint arXiv:1707.00243, 2017.

  228. Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas S Huang. Deep interactive object selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 373–381, 2016.

    Google Scholar 

  229. Ning Xu, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas Huang. Youtube-vos: Sequence-to-sequence video object segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 585–601, 2018.

    Google Scholar 

  230. Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018.

  231. Bin Yan, Yi Jiang, Peize Sun, Dong Wang, Zehuan Yuan, Ping Luo, and Huchuan Lu. Towards grand unification of object tracking. arXiv preprint arXiv:2207.07078, 2022.

  232. Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. Saliency detection via graph-based manifold ranking. In CVPR, 2013.

    Google Scholar 

  233. Linjie Yang, Yuchen Fan, and Ning Xu. Video instance segmentation. In ICCV, pages 5188–5197, 2019.

    Google Scholar 

  234. Linjie Yang, Xuehan Xiong, Yanran Wang, Jianchao Yang, and Aggelos K. Katsaggelos. Efficient video object segmentation via network modulation. In CVPR, 2018.

    Google Scholar 

  235. Tianyu Yang and Antoni B. Chan. Learning dynamic memory networks for object tracking. In ECCV, 2018.

    Google Scholar 

  236. Zhao Yang, Qiang Wang, Luca Bertinetto, Song Bai, Weiming Hu, and Philip H.S. Torr. Anchor diffusion for unsupervised video object segmentation. In ICCV, 2019.

    Google Scholar 

  237. Zongxin Yang, Yunchao Wei, and Yi Yang. Collaborative video object segmentation by foreground-background integration. arXiv preprint arXiv:2003.08333, 2020.

  238. Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In CVPR, pages 2636–2645, 2020.

    Google Scholar 

  239. Jiaqian Yu and Matthew B. Blaschko. Learning submodular losses with the lovász hinge. In ICML, 2015.

    Google Scholar 

  240. Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In ECCV, 2016.

    Google Scholar 

  241. Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In ECCV, 2020.

    Google Scholar 

  242. Ze-Huan Yuan, Tong Lu, and Yirui Wu. Deep-dense conditional random fields for object co-segmentation. In IJCAI, 2017.

    Google Scholar 

  243. Xiaohui Zeng, Renjie Liao, Li Gu, Yuwen Xiong, Sanja Fidler, and Raquel Urtasun. Dmm-net: Differentiable mask-matching network for video object segmentation. In ICCV, 2019.

    Google Scholar 

  244. Dong Zhang, Omar Javed, and Mubarak Shah. Video object segmentation through spatially accurate and temporally dense extraction of primary object regions. In CVPR, 2013.

    Google Scholar 

  245. Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.

  246. Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Haibin Lin, Zhi Zhang, Yue Sun, Tong He, Jonas Mueller, R Manmatha, et al. Resnest: Split-attention networks. arXiv preprint arXiv:2004.08955, 2020.

  247. Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

    Google Scholar 

  248. Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 633–641, 2017.

    Google Scholar 

  249. Qiang Zhou, Zilong Huang, Lichao Huang, Yongchao Gong, Han Shen, Wenyu Liu, and Xinggang Wang. Motion-guided spatial time attention for video object segmentation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 0–0, 2019.

    Google Scholar 

  250. Xiaowei Zhou, Can Yang, and Weichuan Yu. Moving object detection by detecting contiguous outliers in the low-rank representation. IEEE transactions on pattern analysis and machine intelligence, 35(3):597–610, 2012.

    Google Scholar 

  251. Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Tracking objects as points. In European Conference on Computer Vision, pages 474–490. Springer, 2020.

    Google Scholar 

  252. Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.

  253. Zhishan Zhou, Lejian Ren, Pengfei Xiong, Yifei Ji, Peisen Wang, Haoqiang Fan, and Si Liu. Enhanced memory network for video segmentation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 0–0, 2019.

    Google Scholar 

  254. Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In CVPR, pages 2349–2358, 2017.

    Google Scholar 

  255. Tao Zhuo, Zhiyong Cheng, Peng Zhang, Yongkang Wong, and Mohan Kankanhalli. Unsupervised online video object segmentation with motion property understanding. IEEE TIP, 29:237–249, 2019.

    Google Scholar 

Author information

Corresponding author

Correspondence to Ning Xu.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Xu, N., Lin, W., Lu, X., Wei, Y. (2024). VOS. In: Video Object Segmentation. Synthesis Lectures on Computer Vision. Springer, Cham. https://doi.org/10.1007/978-3-031-44656-6_2

  • DOI: https://doi.org/10.1007/978-3-031-44656-6_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44655-9

  • Online ISBN: 978-3-031-44656-6

  • eBook Packages: Synthesis Collection of Technology (R0)
