
Video Region Annotation with Sparse Bounding Boxes

International Journal of Computer Vision

Abstract

Video analysis has been moving towards more detailed interpretation (e.g., segmentation) with encouraging progress. These tasks, however, increasingly rely on training data that is densely annotated in both space and time. Since such annotation is labor-intensive, few densely annotated video datasets with detailed region boundaries exist. This work aims to resolve this dilemma by learning to automatically generate region boundaries for all frames of a video from sparsely annotated bounding boxes of target regions. We achieve this with a Volumetric Graph Convolutional Network (VGCN), which learns to iteratively find keypoints on the region boundaries using the spatio-temporal volume of surrounding appearance and motion. We show that the global optimization of VGCN leads to more accurate annotations that generalize better. Experimental results on three recent datasets (two real and one synthetic), including ablation studies, demonstrate the effectiveness and superiority of our method.
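To make the mechanism described above concrete, the following is a minimal, illustrative sketch of iterative boundary-keypoint refinement with graph convolutions, written in PyTorch. It is not the authors' VGCN: the layer sizes, the single-frame feature map, the ring adjacency, and all names (BoundaryGCNLayer, KeypointRefiner) are assumptions introduced here solely to show how a graph convolutional network can repeatedly move contour keypoints using features sampled around their current positions.

    # Illustrative sketch only; hypothetical names and sizes, not the authors' VGCN.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BoundaryGCNLayer(nn.Module):
        # One graph-convolution layer over boundary keypoints.
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.fc_self = nn.Linear(in_dim, out_dim)
            self.fc_neigh = nn.Linear(in_dim, out_dim)

        def forward(self, x, adj):
            # x: (N, in_dim) keypoint features; adj: (N, N) row-normalized adjacency.
            return F.relu(self.fc_self(x) + self.fc_neigh(adj @ x))

    class KeypointRefiner(nn.Module):
        # Iteratively predicts 2D offsets that move keypoints toward the boundary.
        def __init__(self, feat_dim=64, hidden=128, n_layers=3):
            super().__init__()
            dims = [feat_dim + 2] + [hidden] * n_layers  # sampled features + (x, y)
            self.layers = nn.ModuleList(
                BoundaryGCNLayer(d_in, d_out) for d_in, d_out in zip(dims, dims[1:])
            )
            self.offset_head = nn.Linear(hidden, 2)

        def forward(self, points, feat_map, adj, n_iters=3):
            # points: (N, 2) coordinates in [-1, 1]; feat_map: (1, C, H, W) features.
            for _ in range(n_iters):
                grid = points.view(1, -1, 1, 2)
                sampled = F.grid_sample(feat_map, grid, align_corners=True)
                sampled = sampled.squeeze(0).squeeze(-1).t()      # (N, C)
                h = torch.cat([sampled, points], dim=1)
                for layer in self.layers:
                    h = layer(h, adj)
                points = points + self.offset_head(h)             # move keypoints
            return points

    # Toy usage: 16 keypoints on a closed contour (ring adjacency), random features.
    N, C = 16, 64
    adj = torch.zeros(N, N)
    for i in range(N):
        adj[i, (i - 1) % N] = adj[i, (i + 1) % N] = 0.5           # two neighbors each
    points = torch.rand(N, 2) * 2 - 1                             # initial guess
    refined = KeypointRefiner(feat_dim=C)(points, torch.randn(1, C, 32, 32), adj)

Note that the sketch keeps a single frame and appearance-only features for brevity; per the abstract, the actual VGCN instead operates on a spatio-temporal volume and uses both appearance and motion to place the keypoints.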

Acknowledgements

This work was supported in part by JSPS KAKENHI Grant 17K20143 and by SenseTime Japan.

Author information

Corresponding author

Correspondence to Yuzheng Xu.

Additional information

Communicated by Moi Hoon Yap.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xu, Y., Wu, Y., binti Zuraimi, N.S. et al. Video Region Annotation with Sparse Bounding Boxes. Int J Comput Vis 131, 717–731 (2023). https://doi.org/10.1007/s11263-022-01719-0
