Abstract
Video analysis has been moving towards more detailed interpretation (e.g., segmentation) with encouraging progress. These tasks, however, increasingly rely on training data that is densely annotated in both space and time. Since such annotation is labor-intensive, few densely annotated video datasets with detailed region boundaries exist. This work aims to resolve this dilemma by learning to automatically generate region boundaries for all frames of a video from sparsely annotated bounding boxes of target regions. We achieve this with a Volumetric Graph Convolutional Network (VGCN), which learns to iteratively find keypoints on the region boundaries using the spatio-temporal volume of surrounding appearance and motion. We show that the global optimization of VGCN leads to more accurate annotation that generalizes better. Experimental results on three recent datasets (two real and one synthetic), including ablation studies, demonstrate the effectiveness and superiority of our method.
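To make the core idea concrete, the following is a minimal sketch of iterative boundary-keypoint refinement with graph convolutions, in the spirit of the VGCN described above. All names, shapes, and weights here are illustrative assumptions, not the authors' actual architecture: keypoints are modeled as nodes of a cycle graph along the region contour, and each graph-convolution step aggregates neighbor features and predicts a 2-D offset per keypoint. The real method additionally draws node features from a spatio-temporal volume of appearance and motion, which is replaced here by random stand-in features.

```python
import numpy as np

def normalized_adjacency(n):
    # Cycle graph over n boundary keypoints: each keypoint is
    # connected to its two neighbours along the region contour.
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i - 1) % n] = A[i, (i + 1) % n] = 1.0
    A += np.eye(n)               # self-loops
    return A / A.sum(1, keepdims=True)  # row-normalize

def gcn_step(points, features, A_hat, W_feat, W_off):
    # One graph-convolution step: aggregate neighbour features,
    # then predict a 2-D coordinate offset per keypoint.
    h = np.tanh(A_hat @ features @ W_feat)  # message passing
    return points + h @ W_off, h            # updated (n, 2) keypoints

rng = np.random.default_rng(0)
n, f = 8, 4
points = rng.uniform(0, 1, (n, 2))    # initial keypoints (e.g. sampled from a box)
features = rng.normal(size=(n, f))    # stand-in for appearance/motion features
A_hat = normalized_adjacency(n)
W_feat = rng.normal(scale=0.1, size=(f, f))
W_off = rng.normal(scale=0.1, size=(f, 2))

for _ in range(3):                    # iterative refinement
    points, features = gcn_step(points, features, A_hat, W_feat, W_off)

print(points.shape)  # (8, 2)
```

In the paper's setting, the weights would be learned so that the predicted offsets move the keypoints onto the true region boundary, and the graph would span multiple frames rather than a single contour.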
Acknowledgements
This work was in part supported by JSPS KAKENHI 17K20143 and SenseTime Japan.
Additional information
Communicated by Moi Hoon Yap.
About this article
Cite this article
Xu, Y., Wu, Y., binti Zuraimi, N.S. et al. Video Region Annotation with Sparse Bounding Boxes. Int J Comput Vis 131, 717–731 (2023). https://doi.org/10.1007/s11263-022-01719-0